Show simple item record
dc.contributor.author | Znotiņš, Artūrs |
dc.date.accessioned | 2021-05-27T14:13:26Z |
dc.date.available | 2021-05-27T14:13:26Z |
dc.date.issued | 2020 |
dc.identifier.uri | http://hdl.handle.net/20.500.12574/43 |
dc.description | LVBERT is the first publicly available monolingual BERT language model pre-trained for Latvian. For training we used the original implementation of BERT on TensorFlow with the whole-word masking and the next sentence prediction objectives. We used BERT-BASE configuration with 12 layers, 768 hidden units, 12 heads, 128 sequence length, 128 mini-batch size and 32,000 token vocabulary. |
dc.language.iso | lav |
dc.publisher | AiLab IMCS UL |
dc.relation.isreferencedby | https://ebooks.iospress.nl/volumearticle/55531 |
dc.rights | GNU General Public Licence, version 3 |
dc.rights.uri | http://opensource.org/licenses/GPL-3.0 |
dc.rights.label | PUB |
dc.source.uri | https://github.com/LUMII-AILab/LVBERT |
dc.subject | BERT |
dc.subject | language model |
dc.title | LVBERT - Latvian BERT |
dc.type | toolService |
metashare.ResourceInfo#ContentInfo.detailedType | other |
metashare.ResourceInfo#ResourceComponentType#ToolServiceInfo.languageDependent | true |
has.files | yes |
branding | CLARIN Centre of Latvian language resources and tools |
contact.person | Artūrs Znotiņš arturs.znotins@lumii.lv IMCS UL |
sponsor | Latvian Council of Science lzp-2018/2-0216 Latvian Language Understanding and Generation in Human-Computer Interaction nationalFunds |
files.size | 1626129463 |
files.count | 3 |
Files in this item
- Name: lvbert_tf.tar.gz
  - Size: 1.13 GB
  - Format: application/gzip
  - Description: TensorFlow model
  - MD5: 0112f2ad7eb39ed57ef301734dbb6057
- Name: lvbert_pytorch.tar.gz
  - Size: 393.96 MB
  - Format: application/gzip
  - Description: PyTorch model
  - MD5: 156dd075d71807a761db29169f2f5d73
- Name: readme LVBERT.txt
  - Size: 980 bytes
  - Format: Text file
  - Description: Documentation of Latvian BERT model.
  - MD5: 7c8d2c2e4474166144671fffca642488
LVBERT is a Latvian BERT model trained on 0.5 billion Latvian tokens from the Latvian Balanced Corpus, Wikipedia, and comments and articles from various news portals. LVBERT uses the BERT-Base configuration with 12 transformer layers, each with 768 hidden units, 12 attention heads, a sequence length of 128 and a mini-batch size of 128. It was trained using the original TensorFlow code for BERT with the whole-word masking and next sentence prediction objectives.

A SentencePiece model tokenizes text into subword tokens that are used by the BERT model. The file "vocab.txt" lists all the subword tokens. Tokens prefixed with "##" are used only for continuation, i.e. they are always combined with one or more preceding tokens to form a full word. The total vocabulary of the model comprises 32,000 subword tokens.

For practical use, the LVBERT model is available at Hugging Face: https://huggingface.co/AiLab-IMCS-UL/lvbert

The LVBERT model is published under the GNU General Public License, version 3.
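The "##" continuation convention described above can be illustrated with a short sketch. This is not the model's tokenizer code, just a minimal example of how such subword tokens are merged back into words; the sample token sequence is a hypothetical segmentation, not taken from vocab.txt:

```python
def merge_subwords(tokens):
    """Join WordPiece-style subword tokens into words: a token starting
    with '##' continues the preceding token; any other token starts a
    new word."""
    words = []
    for tok in tokens:
        if tok.startswith("##") and words:
            # Strip the '##' marker and append to the previous word.
            words[-1] += tok[2:]
        else:
            words.append(tok)
    return words

# Hypothetical segmentation of two Latvian words:
print(merge_subwords(["valod", "##as", "modelis"]))  # → ['valodas', 'modelis']
```

Detokenizing in this direction loses the original subword boundaries, which is why the vocabulary keeps continuation tokens distinct from word-initial tokens.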