Show simple item record

 
dc.contributor.author Znotiņš, Artūrs
dc.date.accessioned 2021-05-27T14:13:26Z
dc.date.available 2021-05-27T14:13:26Z
dc.date.issued 2020
dc.identifier.uri http://hdl.handle.net/20.500.12574/43
dc.description LVBERT is the first publicly available monolingual BERT language model pre-trained for Latvian. For training we used the original implementation of BERT on TensorFlow with the whole-word masking and the next sentence prediction objectives. We used BERT-BASE configuration with 12 layers, 768 hidden units, 12 heads, 128 sequence length, 128 mini-batch size and 32,000 token vocabulary.
dc.language.iso lav
dc.publisher AiLab IMCS UL
dc.relation.isreferencedby https://ebooks.iospress.nl/volumearticle/55531
dc.rights GNU General Public Licence, version 3
dc.rights.uri http://opensource.org/licenses/GPL-3.0
dc.rights.label PUB
dc.source.uri https://github.com/LUMII-AILab/LVBERT
dc.subject BERT
dc.subject language model
dc.title LVBERT - Latvian BERT
dc.type toolService
metashare.ResourceInfo#ContentInfo.detailedType other
metashare.ResourceInfo#ResourceComponentType#ToolServiceInfo.languageDependent true
has.files yes
branding CLARIN Centre of Latvian language resources and tools
contact.person Artūrs Znotiņš arturs.znotins@lumii.lv IMCS UL
sponsor Latvian Council of Science lzp-2018/2-0216 Latvian Language Understanding and Generation in Human-Computer Interaction nationalFunds
files.size 1626129463
files.count 3


Files in this item

This item is Publicly Available and is licensed under:
GNU General Public Licence, version 3
Name: lvbert_tf.tar.gz
Size: 1.13 GB
Format: application/gzip
Description: TensorFlow model
MD5: 0112f2ad7eb39ed57ef301734dbb6057
File preview:
    • model.ckpt-10000000.data-00000-of-00001 (1 GB)
    • bert_config.json (504 B)
    • vocab.txt (273 kB)
    • model.ckpt-10000000.index (9 kB)
    • model.ckpt-10000000.meta (4 MB)
Name: lvbert_pytorch.tar.gz
Size: 393.96 MB
Format: application/gzip
Description: PyTorch model
MD5: 156dd075d71807a761db29169f2f5d73
File preview:
    • pytorch_model.bin (424 MB)
    • bert_config.json (504 B)
    • vocab.txt (273 kB)
Name: readme LVBERT.txt
Size: 980 bytes
Format: Text file
Description: Documentation of the Latvian BERT model.
MD5: 7c8d2c2e4474166144671fffca642488
File preview:
LVBERT is a Latvian BERT model trained on 0.5 billion Latvian tokens from the Latvian Balanced Corpus, Wikipedia, and comments and articles from various news portals.

LVBERT uses the BERT-Base configuration: 12 transformer layers, a hidden size of 768, 12 attention heads, a sequence length of 128, and a mini-batch size of 128. It was trained using the original TensorFlow implementation of BERT with the whole-word masking and next sentence prediction objectives.
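As a sketch, the hyperparameters stated above can be written out using the field names from the original BERT repository's bert_config.json; the authoritative values are in the bert_config.json file shipped inside the archives in this record, so the snippet below is an illustration, not a copy of that file.

```python
# Sketch (assumption): the stated LVBERT hyperparameters, expressed with the
# field names used by the original BERT repository's bert_config.json.
# The authoritative file is bert_config.json inside the distributed archives.
lvbert_config = {
    "num_hidden_layers": 12,    # 12 transformer layers
    "hidden_size": 768,         # 768 hidden units per layer
    "num_attention_heads": 12,  # 12 attention heads
    "vocab_size": 32000,        # 32,000 subword tokens
}

seq_length = 128  # training sequence length
batch_size = 128  # training mini-batch size

# Each attention head covers hidden_size / num_attention_heads dimensions,
# so the hidden size must divide evenly among the heads.
head_dim = lvbert_config["hidden_size"] // lvbert_config["num_attention_heads"]
print(head_dim)  # 64
```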

A SentencePiece model tokenizes text into the subword tokens used by the BERT model. The file "vocab.txt" lists all the subword tokens. Tokens prefixed with "##" are used only as continuations, i.e. they are always combined with one or more preceding tokens to form a full word. The model's total vocabulary comprises 32,000 subword tokens.
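The "##" continuation convention can be illustrated with a small sketch that re-joins subword tokens into whole words. This is a generic WordPiece-style detokenizer written for illustration, not the project's own code, and the example token sequence is made up rather than taken from vocab.txt.

```python
def join_subwords(tokens):
    """Re-join subword tokens: a token starting with '##' continues the
    previous word; any other token starts a new word."""
    words = []
    for tok in tokens:
        if tok.startswith("##") and words:
            words[-1] += tok[2:]  # strip the '##' marker and append
        else:
            words.append(tok)
    return words

# Hypothetical token sequence in the style the tokenizer produces:
print(join_subwords(["valod", "##as", "model", "##is"]))  # ['valodas', 'modelis']
```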

For practical use, the LVBERT model is available on Hugging Face: https://huggingface.co/AiLab-IMCS-UL/lvbert

The LVBERT model is published under the GNU General Public Licence, version 3.
                                            
