LVBERT is the first publicly available monolingual BERT language model pre-trained for Latvian. For training we used the original implementation of BERT on TensorFlow with the whole-word masking and the next sentence prediction objectives. We used BERT-BASE configuration with 12 layers, 768 hidden units, 12 heads, 128 sequence length, 128 mini-batch size and 32,000 token vocabulary.
Activities of CLARIN Latvia are supported by project of the European Regional Development Fund Nr. 220.127.116.11/18/I/016 University of Latvia and institutes in the European Research Area - Excellency, activity, mobility, capacity.