Files in this item

This item is publicly available and licensed under the GNU General Public Licence, version 3.
Name: lvbert_tf.tar.gz
Size: 1.13 GB
Format: application/gzip
Description: TensorFlow model
MD5: 0112f2ad7eb39ed57ef301734dbb6057
File preview:
    • model.ckpt-10000000.data-00000-of-00001 (1 GB)
    • bert_config.json (504 B)
    • vocab.txt (273 kB)
    • model.ckpt-10000000.index (9 kB)
    • model.ckpt-10000000.meta (4 MB)

Name: lvbert_pytorch.tar.gz
Size: 393.96 MB
Format: application/gzip
Description: PyTorch model
MD5: 156dd075d71807a761db29169f2f5d73
File preview:
    • pytorch_model.bin (424 MB)
    • bert_config.json (504 B)
    • vocab.txt (273 kB)

Name: readme LVBERT.txt
Size: 980 bytes
Format: Text file
Description: Documentation of Latvian BERT model.
MD5: 7c8d2c2e4474166144671fffca642488
File preview:

LVBERT is a Latvian BERT model, trained on 0.5 billion Latvian tokens from the Latvian Balanced Corpus, Wikipedia, and comments and articles from various news portals.

LVBERT uses the BERT-Base configuration with 12 transformer layers, each with 768 hidden units and 12 attention heads, a sequence length of 128, and a mini-batch size of 128. It was trained using the original TensorFlow code for BERT with the whole-word masking and next sentence prediction objectives.
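
As a rough illustration of the configuration above, the following Python sketch records the stated hyperparameters and derives two quantities from them. The key names follow the original BERT config format and are an assumption about what the shipped "bert_config.json" contains; the helper names are hypothetical.

```python
# Hyperparameters as stated in the readme. The key names follow the
# original BERT config format (assumed, not read from bert_config.json).
LVBERT_CONFIG = {
    "num_hidden_layers": 12,
    "hidden_size": 768,
    "num_attention_heads": 12,
}

# Training-time settings from the readme (not part of the model config).
TRAIN_SEQ_LENGTH = 128
TRAIN_BATCH_SIZE = 128


def head_size(config):
    """Return the width of one attention head (hidden size / head count)."""
    hidden = config["hidden_size"]
    heads = config["num_attention_heads"]
    if hidden % heads != 0:
        raise ValueError("hidden_size must be divisible by num_attention_heads")
    return hidden // heads


def positions_per_batch(seq_length, batch_size):
    """Return the number of token positions (padding included) in one batch."""
    return seq_length * batch_size


print(head_size(LVBERT_CONFIG))
print(positions_per_batch(TRAIN_SEQ_LENGTH, TRAIN_BATCH_SIZE))
```

With these values each attention head is 64 units wide, and one mini-batch covers at most 16384 token positions.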

A SentencePiece model tokenizes text into subword tokens that are used by the BERT model. The file "vocab.txt" lists all the subword tokens. Tokens prefixed with "##" are used only for continuation, i.e. they are always combined with one or more preceding tokens to form a full word. The total vocabulary of the model consists of 32000 subword tokens.
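
The "##" continuation convention described above can be sketched in Python as follows. This is a minimal illustration, not the tokenizer shipped with the model: the function names are hypothetical, and the example tokens are invented for demonstration rather than taken from "vocab.txt".

```python
def merge_wordpieces(tokens, prefix="##"):
    """Join continuation tokens (those starting with prefix) onto the
    preceding token, returning a list of full words."""
    words = []
    for token in tokens:
        if token.startswith(prefix) and words:
            # Continuation piece: strip the prefix and append to the last word.
            words[-1] += token[len(prefix):]
        else:
            # Word-initial piece (or a stray continuation with nothing
            # before it, which is kept unchanged).
            words.append(token)
    return words


def count_continuation_tokens(vocab_lines, prefix="##"):
    """Count vocabulary entries that can only continue a word."""
    return sum(1 for line in vocab_lines if line.strip().startswith(prefix))


# Hypothetical subword split of the phrase "latviešu valoda".
pieces = ["lat", "##vie", "##šu", "val", "##oda"]
print(merge_wordpieces(pieces))
```

The second helper could be applied to the lines of "vocab.txt" to see how many of the 32000 entries are continuation-only tokens.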

For practical use, the LVBERT model is available on Hugging Face: https://huggingface.co/AiLab-IMCS-UL/lvbert
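
A minimal Python sketch of loading the model from Hugging Face is given here. It assumes that the "transformers" and "torch" packages are installed, that network access is available, and that the hosted repository can be loaded through the generic AutoTokenizer and AutoModel classes; the function names load_lvbert and embed are hypothetical.

```python
MODEL_ID = "AiLab-IMCS-UL/lvbert"  # repository name taken from the URL above


def load_lvbert(model_id=MODEL_ID):
    """Download (or load from cache) the tokenizer and the encoder."""
    # Imported lazily so that this module can be read without transformers.
    from transformers import AutoModel, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModel.from_pretrained(model_id)
    model.eval()
    return tokenizer, model


def embed(text, tokenizer, model, max_length=128):
    """Return the last-layer hidden states for one piece of text.

    max_length defaults to 128, the sequence length stated in the readme.
    """
    import torch

    inputs = tokenizer(
        text, return_tensors="pt", truncation=True, max_length=max_length
    )
    with torch.no_grad():
        outputs = model(**inputs)
    return outputs.last_hidden_state


# Example usage (requires network access on first run):
#     tokenizer, model = load_lvbert()
#     states = embed("Rīga ir Latvijas galvaspilsēta.", tokenizer, model)
```

The returned tensor holds one vector per subword token, which for a BERT-Base encoder is expected to have 768 dimensions.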

The LVBERT model is published under the GNU General Public Licence, version 3.
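
The MD5 checksums listed with each file in this item can be used to check a download. The Python sketch below shows one way to do this with the standard library; the function names are hypothetical, and the file is assumed to have been saved locally under its original name.

```python
import hashlib
import os

# Checksums as listed for the files in this item.
EXPECTED_MD5 = {
    "lvbert_tf.tar.gz": "0112f2ad7eb39ed57ef301734dbb6057",
    "lvbert_pytorch.tar.gz": "156dd075d71807a761db29169f2f5d73",
    "readme LVBERT.txt": "7c8d2c2e4474166144671fffca642488",
}


def md5_of_file(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in chunks so that
    large archives do not have to fit in memory."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_download(path):
    """Return True if the file's MD5 matches the listed checksum.

    The expected value is looked up by the file's base name.
    """
    name = os.path.basename(path)
    if name not in EXPECTED_MD5:
        raise KeyError(f"no checksum listed for {name!r}")
    return md5_of_file(path) == EXPECTED_MD5[name]
```

For example, verify_download("lvbert_pytorch.tar.gz") should return True for an intact copy of the PyTorch archive.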