LVBERT - Latvian BERT

Name: LVBERT - Latvian BERT
License: http://opensource.org/licenses/GPL-3.0

Znotiņš, Artūrs

CLARIN Centre of Latvian language resources and tools

Authors: Znotiņš, Artūrs

Item identifier: http://hdl.handle.net/20.500.12574/43

Project URL: https://github.com/LUMII-AILab/LVBERT

Reference: https://ebooks.iospress.nl/volumearticle/55531

Date issued: 2020

Type: toolService

Language(s): Latvian

Description: LVBERT is the first publicly available monolingual BERT language model pre-trained for Latvian. For training we used the original TensorFlow implementation of BERT with the whole-word masking and next sentence prediction objectives. We used the BERT-BASE configuration with 12 layers, 768 hidden units, 12 heads, a sequence length of 128, a mini-batch size of 128, and a 32,000-token vocabulary.

Publisher: AiLab IMCS UL

Subject(s): BERT language model

Collection(s): Language resources and tools of AiLab IMCS UL

Files in this item

This item is Publicly Available and is licensed under: GNU General Public Licence, version 3

Name: lvbert_tf.tar.gz
Size: 1.13 GB
Format: application/gzip
Description: TensorFlow model
MD5: 0112f2ad7eb39ed57ef301734dbb6057

File preview

- model.ckpt-10000000.data-00000-of-00001 (1 GB)
- bert_config.json (504 B)
- vocab.txt (273 kB)
- model.ckpt-10000000.index (9 kB)
- model.ckpt-10000000.meta (4 MB)

Name: lvbert_pytorch.tar.gz
Size: 393.96 MB
Format: application/gzip
Description: PyTorch model
MD5: 156dd075d71807a761db29169f2f5d73

File preview

- pytorch_model.bin (424 MB)
- bert_config.json (504 B)
- vocab.txt (273 kB)
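
The MD5 values listed with the two model archives can be used to check a download for corruption. The following is a minimal sketch, assuming the archives were saved under their listed names; the helper names `md5_of_file` and `verify_download` are illustrative and not part of the LVBERT distribution.

```python
import hashlib
from pathlib import Path

# MD5 values copied from the file records in this listing.
EXPECTED_MD5 = {
    "lvbert_tf.tar.gz": "0112f2ad7eb39ed57ef301734dbb6057",
    "lvbert_pytorch.tar.gz": "156dd075d71807a761db29169f2f5d73",
}


def md5_of_file(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_download(path):
    """Return True if the file's MD5 matches the value listed for its name."""
    expected = EXPECTED_MD5[Path(path).name]
    return md5_of_file(path) == expected
```

Calling `verify_download("lvbert_pytorch.tar.gz")` on a downloaded archive returns `False` if the file is incomplete or corrupted. Reading in chunks keeps memory use low for the 1.13 GB TensorFlow archive.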

Name: readme LVBERT.txt
Size: 980 bytes
Format: Text file
Description: Documentation of Latvian BERT model.
MD5: 7c8d2c2e4474166144671fffca642488

File preview

LVBERT is a Latvian BERT model trained on 0.5 billion Latvian tokens from the Latvian Balanced corpus, Wikipedia, and comments and articles from various news portals.

LVBERT uses the BERT-Base configuration with 12 transformer layers, each with 768 hidden units and 12 heads, a sequence length of 128, and a mini-batch size of 128. It was trained using the original TensorFlow code for BERT with the whole-word masking and next sentence prediction objectives.

A SentencePiece model tokenizes text into subword tokens that are used by the BERT model. The file "vocab.txt" lists all the subword tokens. Tokens prefixed with "##" are used only as continuations, i.e. they are always combined with one or more preceding tokens to form a full word. The total vocabulary of the model consists of 32,000 subword tokens.
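
To illustrate the "##" continuation convention, here is a minimal sketch that merges a sequence of subword tokens back into full words. The function name `merge_subwords` and the example split are hypothetical; the actual segmentation of any given word depends on the contents of "vocab.txt".

```python
def merge_subwords(tokens):
    """Join '##'-prefixed continuation tokens onto the preceding token."""
    words = []
    for token in tokens:
        if token.startswith("##") and words:
            # Continuation piece: strip the marker and append to the last word.
            words[-1] += token[2:]
        else:
            # A token without the marker starts a new word.
            words.append(token)
    return words


# Hypothetical split, for illustration only.
example = ["lat", "##vie", "##šu", "valoda"]
merged = merge_subwords(example)  # ["latviešu", "valoda"]
```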

For practical use, the LVBERT model is available on Hugging Face: https://huggingface.co/AiLab-IMCS-UL/lvbert
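
One way to load the Hugging Face copy is through the `transformers` library. This is a sketch, not an official usage recipe: it assumes `transformers` and PyTorch are installed, that the repository is compatible with the `AutoTokenizer` and `AutoModel` classes, and that truncating inputs to 128 tokens (the training sequence length stated above) is appropriate. The helper names `load_lvbert` and `embed` are illustrative.

```python
MODEL_ID = "AiLab-IMCS-UL/lvbert"  # taken from the Hugging Face URL above


def load_lvbert(model_id=MODEL_ID):
    """Download (or load from cache) the tokenizer and encoder."""
    # Imported lazily so that defining this helper does not require the library.
    from transformers import AutoModel, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModel.from_pretrained(model_id)
    model.eval()
    return tokenizer, model


def embed(text, tokenizer, model, max_length=128):
    """Return the final-layer hidden states for one piece of text."""
    import torch

    # Assumption: truncate to the 128-token sequence length used in training.
    inputs = tokenizer(
        text, return_tensors="pt", truncation=True, max_length=max_length
    )
    with torch.no_grad():
        outputs = model(**inputs)
    # Shape: (1, number_of_tokens, hidden_size)
    return outputs.last_hidden_state
```

A typical call would be `tokenizer, model = load_lvbert()` followed by `embed("Rīga ir Latvijas galvaspilsēta.", tokenizer, model)`; the first call needs network access to fetch the model files.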

The LVBERT model is published under the GNU General Public Licence (version 3).
