Show simple item record

 
dc.contributor.author Deksne, Daiga
dc.date.accessioned 2025-09-16T14:56:25Z
dc.date.available 2025-09-16T14:56:25Z
dc.date.issued 2025-08-27
dc.identifier.uri http://hdl.handle.net/20.500.12574/136
dc.description The Dataset for Embedding Model Fine-Tuning was created within the framework of the National Research Program project "Analysis of the applicability of artificial intelligence methods in the field of EU fund projects". For this project, we fine-tuned the bge-m3 model developed by BAAI (Chen et al., 2024). To build the fine-tuning data, we collected DOCX procurement documents from the Electronic Procurement System (https://www.eis.gov.lv/EKEIS/Supplier/), allocating 7,083 files to the training set and 50 files to the validation set. The text of these documents was extracted and segmented. For each text segment, we used the OpenAI gpt-4o model to generate statements or questions whose correctness can be verified against that specific segment.
License and Attribution: The dataset is distributed under the CC BY-NC-SA 4.0 license: https://creativecommons.org/licenses/by-nc-sa/4.0/. When using this dataset, please cite as: Project "Analysis of the Applicability of Artificial Intelligence Methods in the Field of European Union Fund Projects" (VPP-CFLA-Mākslīgais intelekts-2024/1-0003). Dataset for Embedding Model Fine-Tuning. Licensed under CC BY-NC-SA 4.0.
dc.language.iso lav
dc.publisher University of Latvia
dc.rights Creative Commons - Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0)
dc.rights.uri http://creativecommons.org/licenses/by-nc-sa/4.0/
dc.rights.label PUB
dc.source.uri https://www.lzp.gov.lv/lv/projekts/maksliga-intelekta-metozu-piemerotibas-analize-eiropas-savienibas-fondu-projektu-joma
dc.subject Embedding Model
dc.subject Fine-Tuning
dc.title Embedding Model Fine-Tuning Dataset
dc.type corpus
metashare.ResourceInfo#ContentInfo.mediaType text
has.files yes
branding CLARIN Centre of Latvian language resources and tools
contact.person Raivis Skadiņš raivis.skadins@lu.lv University of Latvia
sponsor Latvian Council of Science VPP-CFLA-Mākslīgais intelekts-2024/1-0003 Mākslīgā intelekta metožu piemērotības analīze Eiropas Savienības fondu projektu jomā nationalFunds
size.info 432895 items
files.size 169057845
files.count 4
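The record does not document the layout of the JSON files, so a first step after downloading is to inspect them without assuming a schema. The sketch below is one possible way to do that in Python: it reads a JSON member directly from a zip archive and reports only the top-level type, length, and (for a list of objects) the keys of the first item. The helper names `load_json_member` and `summarize` are hypothetical, and the assumption that each archive holds a single JSON file named after the archive comes from the file preview in this record, not from the README.

```python
import json
import zipfile


def load_json_member(zip_path, member):
    """Parse one JSON file stored inside a zip archive without extracting it to disk."""
    with zipfile.ZipFile(zip_path) as archive:
        with archive.open(member) as handle:
            return json.load(handle)


def summarize(data):
    """Describe the top-level JSON value without assuming any record schema."""
    summary = {"type": type(data).__name__}
    if isinstance(data, (list, dict)):
        summary["length"] = len(data)
    if isinstance(data, list) and data:
        first = data[0]
        summary["first_item_type"] = type(first).__name__
        if isinstance(first, dict):
            # Keys of the first item only; other items may differ.
            summary["first_item_keys"] = sorted(first)
    return summary
```

For example, `summarize(load_json_member("val_dataset.zip", "val_dataset.json"))` would be a cheap check on the smaller validation archive before loading the much larger training set, assuming the archive has been downloaded to the working directory.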


 Files in this item

 Download all files in item (161.23 MB)
Name README.md
Size 2.41 KB
Format Unknown
Description English readme file describing the dataset
MD5 79428de63cf91b48c50e9193d4a95eec

Name README-LV.md
Size 2.47 KB
Format Unknown
Description Latvian readme file describing the dataset
MD5 cc70c10ab77ace2285103aa08720decc

Name train_dataset.zip
Size 160.18 MB
Format application/zip
Description Training set
MD5 c892299c10f18de8dd40ca872edb0a4d
File preview train_dataset.json

Name val_dataset.zip
Size 1.04 MB
Format application/zip
Description Validation set
MD5 52d694b5b7dc9b299f443fff39c9e888
File preview val_dataset.json
