dc.contributor.author Deksne, Daiga
dc.date.accessioned 2025-09-16T14:56:25Z
dc.date.available 2025-09-16T14:56:25Z
dc.date.issued 2025-08-27
dc.identifier.uri http://hdl.handle.net/20.500.12574/136
dc.description The Dataset for Embedding Model Fine-Tuning was created within the framework of the National Research Program project "Analysis of the applicability of artificial intelligence methods in the field of EU fund projects". For the purposes of this project, we fine-tuned the bge-m3 model developed by BAAI (Chen et al., 2024). For fine-tuning, we collected DOCX procurement documents from the Electronic Procurement System (https://www.eis.gov.lv/EKEIS/Supplier/), allocating 7,083 files to the training set and 50 files to the validation set. The text from these documents was extracted and segmented. For each text segment, we used the OpenAI gpt-4o model to generate statements or questions whose correctness can be verified against that specific segment. License and Attribution: The dataset is distributed under the CC BY-NC-SA 4.0 license: https://creativecommons.org/licenses/by-nc-sa/4.0/. When using this dataset, please cite as: Project "Analysis of the Applicability of Artificial Intelligence Methods in the Field of European Union Fund Projects" (VPP-CFLA-Mākslīgais intelekts-2024/1-0003). Dataset for Embedding Model Fine-Tuning. Licensed under CC BY-NC-SA 4.0.
dc.language.iso lav
dc.publisher University of Latvia
dc.rights Creative Commons - Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0)
dc.rights.uri http://creativecommons.org/licenses/by-nc-sa/4.0/
dc.rights.label PUB
dc.source.uri https://www.lzp.gov.lv/lv/projekts/maksliga-intelekta-metozu-piemerotibas-analize-eiropas-savienibas-fondu-projektu-joma
dc.subject Embedding Model
dc.subject Fine-Tuning
dc.title Embedding Model Fine-Tuning Dataset
dc.type corpus
metashare.ResourceInfo#ContentInfo.mediaType text
has.files yes
branding CLARIN Centre of Latvian language resources and tools
contact.person Raivis Skadiņš raivis.skadins@lu.lv University of Latvia
sponsor Latvian Council of Science VPP-CFLA-Mākslīgais intelekts-2024/1-0003 Mākslīgā intelekta metožu piemērotības analīze Eiropas Savienības fondu projektu jomā nationalFunds
size.info 432895 items
files.size 169057845
files.count 4
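
The description states that the document text was extracted and segmented, but it does not say how. Purely as an illustration, the following is a minimal sketch of one possible approach, a fixed-size word window. The function name `segment_text` and the `max_words` parameter are hypothetical, and the segmentation rule actually used in the project may well differ.

```python
def segment_text(text: str, max_words: int = 200) -> list[str]:
    """Split text into consecutive segments of at most max_words words.

    Assumption: a plain word-count window. The dataset record does not
    document the segmentation rule that was actually used.
    """
    if max_words < 1:
        raise ValueError("max_words must be at least 1")
    words = text.split()  # whitespace tokenisation; collapses line breaks
    return [
        " ".join(words[i:i + max_words])
        for i in range(0, len(words), max_words)
    ]


if __name__ == "__main__":
    # Tiny illustrative input, not taken from the dataset.
    for segment in segment_text("one two three four five", max_words=2):
        print(segment)
```

A word window is the simplest choice to reason about; splitting on paragraph or sentence boundaries would be an equally plausible alternative.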


Files in this item

Total size of all files: 161.23 MB
This item is Publicly Available and is licensed under:
Creative Commons - Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0)

Name: README.md
Size: 2.41 KB
Format: Unknown
Description: English readme file describing the dataset
MD5: 79428de63cf91b48c50e9193d4a95eec

Name: README-LV.md
Size: 2.47 KB
Format: Unknown
Description: Latvian readme file describing the dataset
MD5: cc70c10ab77ace2285103aa08720decc

Name: train_dataset.zip
Size: 160.18 MB
Format: application/zip
Description: Training set
MD5: c892299c10f18de8dd40ca872edb0a4d
Contents: train_dataset.json

Name: val_dataset.zip
Size: 1.04 MB
Format: application/zip
Description: Validation set
MD5: 52d694b5b7dc9b299f443fff39c9e888
Contents: val_dataset.json
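
The MD5 values listed for the four files can be used to check that a download completed intact. The sketch below assumes the files were saved under their listed names in a single local directory; the helper names `md5_of_file` and `verify_downloads` are illustrative and not part of the dataset or the repository.

```python
import hashlib
from pathlib import Path

# MD5 checksums exactly as listed in this record.
EXPECTED_MD5 = {
    "README.md": "79428de63cf91b48c50e9193d4a95eec",
    "README-LV.md": "cc70c10ab77ace2285103aa08720decc",
    "train_dataset.zip": "c892299c10f18de8dd40ca872edb0a4d",
    "val_dataset.zip": "52d694b5b7dc9b299f443fff39c9e888",
}


def md5_of_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the hex MD5 digest of a file, read in chunks to limit memory use."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_downloads(directory: Path) -> dict[str, bool]:
    """Map each expected file name to True if it exists and its MD5 matches."""
    results = {}
    for name, expected in EXPECTED_MD5.items():
        path = directory / name
        results[name] = path.is_file() and md5_of_file(path) == expected
    return results


if __name__ == "__main__":
    # Assumption: the files were downloaded into the current directory.
    for name, ok in verify_downloads(Path(".")).items():
        print(f"{name}: {'OK' if ok else 'missing or checksum mismatch'}")
```

MD5 is adequate here for detecting accidental corruption during transfer; it should not be relied on as protection against deliberate tampering.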