dc.contributor.author | Deksne, Daiga |
dc.date.accessioned | 2025-09-16T14:56:25Z |
dc.date.available | 2025-09-16T14:56:25Z |
dc.date.issued | 2025-08-27 |
dc.identifier.uri | http://hdl.handle.net/20.500.12574/136 |
dc.description | The Dataset for Embedding Model Fine-Tuning was created within the framework of the National Research Program project "Analysis of the applicability of artificial intelligence methods in the field of EU fund projects". For the purposes of this project, we fine-tuned the bge-m3 model developed by BAAI (Chen et al., 2024). For fine-tuning, we collected DOCX procurement documents from the Electronic Procurement System (https://www.eis.gov.lv/EKEIS/Supplier/), allocating 7,083 files to the training set and 50 files to the validation set. The text from these documents was extracted and segmented. For each text segment, we used the OpenAI gpt-4o model to generate statements or questions whose correctness can be verified against that specific segment. License and Attribution: The dataset is distributed under the CC-BY-NC-SA license: https://creativecommons.org/licenses/by-nc-sa/4.0/. When using this dataset, please cite as: Project "Analysis of the Applicability of Artificial Intelligence Methods in the Field of European Union Fund Projects" (VPP-CFLA-Mākslīgais intelekts-2024/1-0003). Dataset for Embedding Model Fine-Tuning. Licensed under CC BY-NC-SA 4.0. |
dc.language.iso | lav |
dc.publisher | University of Latvia |
dc.rights | Creative Commons - Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0) |
dc.rights.uri | http://creativecommons.org/licenses/by-nc-sa/4.0/ |
dc.rights.label | PUB |
dc.source.uri | https://www.lzp.gov.lv/lv/projekts/maksliga-intelekta-metozu-piemerotibas-analize-eiropas-savienibas-fondu-projektu-joma |
dc.subject | Embedding Model |
dc.subject | Fine-Tuning |
dc.title | Embedding Model Fine-Tuning Dataset |
dc.type | corpus |
metashare.ResourceInfo#ContentInfo.mediaType | text |
has.files | yes |
branding | CLARIN Centre of Latvian language resources and tools |
contact.person | Raivis Skadiņš, raivis.skadins@lu.lv, University of Latvia |
sponsor | Latvian Council of Science; VPP-CFLA-Mākslīgais intelekts-2024/1-0003; Mākslīgā intelekta metožu piemērotības analīze Eiropas Savienības fondu projektu jomā; nationalFunds |
size.info | 432895 items |
files.size | 169057845 |
files.count | 4 |
Files in this item
Total download size: 161.23 MB
This item is publicly available and is licensed under: Creative Commons - Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0)

- Name: README.md
- Size: 2.41 KB
- Format: Unknown
- Description: English readme file describing the dataset
- MD5: 79428de63cf91b48c50e9193d4a95eec

- Name: README-LV.md
- Size: 2.47 KB
- Format: Unknown
- Description: Latvian readme file describing the dataset
- MD5: cc70c10ab77ace2285103aa08720decc

- Name: train_dataset.zip
- Size: 160.18 MB
- Format: application/zip
- Description: Training set
- MD5: c892299c10f18de8dd40ca872edb0a4d

- Name: val_dataset.zip
- Size: 1.04 MB
- Format: application/zip
- Description: Validation set
- MD5: 52d694b5b7dc9b299f443fff39c9e888
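
The MD5 values listed for each file can be used to check that a download completed without corruption. The following is a minimal sketch of such a check, not part of the dataset itself. It assumes the four files were saved under their listed names into one local directory; the directory name `downloads` and the helper names `md5_of_file` and `verify_downloads` are hypothetical.

```python
import hashlib
from pathlib import Path

# MD5 checksums copied from the file listing of this record.
EXPECTED_MD5 = {
    "README.md": "79428de63cf91b48c50e9193d4a95eec",
    "README-LV.md": "cc70c10ab77ace2285103aa08720decc",
    "train_dataset.zip": "c892299c10f18de8dd40ca872edb0a4d",
    "val_dataset.zip": "52d694b5b7dc9b299f443fff39c9e888",
}


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # Read fixed-size chunks so large archives do not need to fit in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_downloads(directory, expected=EXPECTED_MD5):
    """Map each expected file name to 'ok', 'mismatch', or 'missing'."""
    results = {}
    for name, checksum in expected.items():
        path = Path(directory) / name
        if not path.is_file():
            results[name] = "missing"
        elif md5_of_file(path) == checksum:
            results[name] = "ok"
        else:
            results[name] = "mismatch"
    return results


if __name__ == "__main__":
    # "downloads" is a hypothetical local directory; adjust as needed.
    for name, status in verify_downloads("downloads").items():
        print(f"{name}: {status}")
```

A `mismatch` result usually indicates an incomplete or altered download, in which case fetching the file again is the simplest remedy. MD5 is suitable here only as an integrity check against accidental corruption, not as a security guarantee.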