LVBERT - Latvian BERT

Name: LVBERT - Latvian BERT
License: http://opensource.org/licenses/GPL-3.0

Znotiņš, Artūrs

dc.contributor.author	Znotiņš, Artūrs
dc.date.accessioned	2021-05-27T14:13:26Z
dc.date.available	2021-05-27T14:13:26Z
dc.date.issued	2020
dc.identifier.uri	http://hdl.handle.net/20.500.12574/43
dc.description	LVBERT is the first publicly available monolingual BERT language model pre-trained for Latvian. For training, we used the original TensorFlow implementation of BERT with the whole-word masking and next sentence prediction objectives. We used the BERT-Base configuration with 12 layers, 768 hidden units, 12 attention heads, a sequence length of 128, a mini-batch size of 128, and a 32,000-token vocabulary.
dc.language.iso	lav
dc.publisher	AiLab IMCS UL
dc.relation.isreferencedby	https://ebooks.iospress.nl/volumearticle/55531
dc.rights	GNU General Public Licence, version 3
dc.rights.uri	http://opensource.org/licenses/GPL-3.0
dc.rights.label	PUB
dc.source.uri	https://github.com/LUMII-AILab/LVBERT
dc.subject	BERT
dc.subject	language model
dc.title	LVBERT - Latvian BERT
dc.type	toolService
metashare.ResourceInfo#ContentInfo.detailedType	other
metashare.ResourceInfo#ResourceComponentType#ToolServiceInfo.languageDependent	true
has.files	yes
branding	CLARIN Centre of Latvian language resources and tools
contact.person	Artūrs Znotiņš arturs.znotins@lumii.lv IMCS UL
sponsor	Latvian Council of Science lzp-2018/2-0216 Latvian Language Understanding and Generation in Human-Computer Interaction nationalFunds
files.size	1626129463
files.count	3

Files in this item

This item is publicly available and licensed under:
GNU General Public Licence, version 3

Name: lvbert_tf.tar.gz
Size: 1.13 GB
Format: application/gzip
Description: TensorFlow model
MD5: 0112f2ad7eb39ed57ef301734dbb6057

File Preview

- model.ckpt-10000000.data-00000-of-00001 (1 GB)
- bert_config.json (504 B)
- vocab.txt (273 kB)
- model.ckpt-10000000.index (9 kB)
- model.ckpt-10000000.meta (4 MB)

Name: lvbert_pytorch.tar.gz
Size: 393.96 MB
Format: application/gzip
Description: PyTorch model
MD5: 156dd075d71807a761db29169f2f5d73

File Preview

- pytorch_model.bin (424 MB)
- bert_config.json (504 B)
- vocab.txt (273 kB)

Name: readme LVBERT.txt
Size: 980 bytes
Format: Text file
Description: Documentation of Latvian BERT model.
MD5: 7c8d2c2e4474166144671fffca642488

File Preview

LVBERT is a Latvian BERT model trained on 0.5 billion Latvian tokens from the Latvian Balanced corpus, Wikipedia, and comments and articles from various news portals.

LVBERT uses the BERT-Base configuration with 12 transformer layers, each with 768 hidden units and 12 attention heads, a sequence length of 128, and a mini-batch size of 128. It was trained using the original TensorFlow code for BERT with the whole-word masking and next sentence prediction objectives.
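
To give a sense of what this configuration implies, the sketch below estimates the encoder's parameter count from the figures stated above (12 layers, 768 hidden units, 32,000-token vocabulary). The feed-forward size (3072), the number of position embeddings (512) and the two segment types are standard BERT-Base defaults and are assumptions here; they are not taken from LVBERT's bert_config.json, and the pre-training heads are not counted.

```python
# Values stated in this record.
VOCAB_SIZE = 32_000
HIDDEN = 768
LAYERS = 12

# Assumptions: standard BERT-Base defaults, not read from bert_config.json.
INTERMEDIATE = 3072
MAX_POSITIONS = 512
TYPE_VOCAB = 2


def linear(n_in, n_out):
    """Weights plus biases of a dense layer."""
    return n_in * n_out + n_out


def layer_norm(n):
    """Scale and shift vectors of a LayerNorm."""
    return 2 * n


def embedding_params():
    # Token, position and segment embeddings, followed by a LayerNorm.
    tables = (VOCAB_SIZE + MAX_POSITIONS + TYPE_VOCAB) * HIDDEN
    return tables + layer_norm(HIDDEN)


def encoder_layer_params():
    # Query, key, value and output projections, then a LayerNorm.
    attention = 4 * linear(HIDDEN, HIDDEN) + layer_norm(HIDDEN)
    # Two-layer feed-forward block, then a LayerNorm.
    feed_forward = (
        linear(HIDDEN, INTERMEDIATE)
        + linear(INTERMEDIATE, HIDDEN)
        + layer_norm(HIDDEN)
    )
    return attention + feed_forward


def total_params():
    pooler = linear(HIDDEN, HIDDEN)
    return embedding_params() + LAYERS * encoder_layer_params() + pooler


if __name__ == "__main__":
    print(f"approx. {total_params():,} parameters")
```

Under these assumptions the estimate comes to roughly 110 million parameters. The number of attention heads does not enter the count, because the heads split the 768 hidden units rather than adding to them.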

A SentencePiece model tokenizes text into the subword tokens used by the BERT model. The file "vocab.txt" lists all the subword tokens. Tokens prefixed with "##" are used only as continuations, i.e. they are always combined with one or more preceding tokens to form a full word. The model's total vocabulary consists of 32,000 subword tokens.
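
The "##" convention can be illustrated with a short sketch that joins continuation tokens back onto the preceding token. The function name merge_continuations and the example tokens are hypothetical; the tokens were chosen for illustration and are not taken from vocab.txt.

```python
def merge_continuations(tokens):
    """Rebuild full words from subword tokens that use the "##" prefix."""
    words = []
    for token in tokens:
        if token.startswith("##") and words:
            # Continuation piece: append it to the word being built.
            words[-1] += token[2:]
        elif token.startswith("##"):
            # Continuation with nothing before it: keep the bare piece.
            words.append(token[2:])
        else:
            # A token without the prefix starts a new word.
            words.append(token)
    return words


if __name__ == "__main__":
    # Hypothetical segmentation, for illustration only.
    print(merge_continuations(["lat", "##vie", "##šu", "valoda"]))
```

With the hypothetical input above, the three pieces "lat", "##vie" and "##šu" are joined into "latviešu", while "valoda" stays a word of its own.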

For practical use, the LVBERT model is available on Hugging Face: https://huggingface.co/AiLab-IMCS-UL/lvbert
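
A minimal sketch of loading the hosted model follows. It assumes that the transformers and torch packages are installed and that the repository id AiLab-IMCS-UL/lvbert (taken from the URL above) loads through the generic Auto classes. The helper names load_lvbert and embed are illustrative, and the 128-token limit mirrors the training sequence length stated above rather than a documented restriction of the hosted model.

```python
MODEL_ID = "AiLab-IMCS-UL/lvbert"  # repository id from the Hugging Face URL


def load_lvbert(model_id=MODEL_ID):
    """Download (or load from the local cache) the tokenizer and encoder."""
    # Imported lazily so this module can be imported without transformers.
    from transformers import AutoModel, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModel.from_pretrained(model_id)
    model.eval()  # inference mode: disables dropout
    return tokenizer, model


def embed(text, tokenizer, model, max_length=128):
    """Return the last-layer hidden states for one piece of text."""
    import torch

    # Truncate to the sequence length used in pre-training (assumption:
    # longer inputs are not needed for this illustration).
    inputs = tokenizer(
        text, return_tensors="pt", truncation=True, max_length=max_length
    )
    with torch.no_grad():
        outputs = model(**inputs)
    return outputs.last_hidden_state


if __name__ == "__main__":
    tokenizer, model = load_lvbert()
    states = embed("Rīga ir Latvijas galvaspilsēta.", tokenizer, model)
    print(states.shape)
```

If the hosted checkpoint matches the configuration described above, the last dimension of the returned tensor should be 768, one vector per subword token.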

The LVBERT model is published under the GNU General Public Licence (version 3). . . .
