dc.contributor.author | Deksne, Daiga |
dc.date.accessioned | 2025-09-16T14:56:25Z |
dc.date.available | 2025-09-16T14:56:25Z |
dc.date.issued | 2025-08-27 |
dc.identifier.uri | http://hdl.handle.net/20.500.12574/136 |
dc.description | The Dataset for Embedding Model Fine-Tuning was created within the framework of the National Research Program project "Analysis of the applicability of artificial intelligence methods in the field of EU fund projects". For the purposes of this project, we fine-tuned the bge-m3 model developed by BAAI (Chen et al., 2024). For fine-tuning, we collected DOCX procurement documents from the Electronic Procurement System (https://www.eis.gov.lv/EKEIS/Supplier/), allocating 7,083 files to the training set and 50 files to the validation set. The text from these documents was extracted and segmented. For each text segment, we used the OpenAI gpt-4o model to generate statements or questions whose correctness can be verified against that specific segment. License and Attribution: The dataset is distributed under the CC-BY-NC-SA license: https://creativecommons.org/licenses/by-nc-sa/4.0/. When using this dataset, please cite as: Project "Analysis of the Applicability of Artificial Intelligence Methods in the Field of European Union Fund Projects" (VPP-CFLA-Mākslīgais intelekts-2024/1-0003). Dataset for Embedding Model Fine-Tuning. Licensed under CC BY-NC-SA 4.0. |
dc.language.iso | lav |
dc.publisher | University of Latvia |
dc.rights | Creative Commons - Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0) |
dc.rights.uri | http://creativecommons.org/licenses/by-nc-sa/4.0/ |
dc.rights.label | PUB |
dc.source.uri | https://www.lzp.gov.lv/lv/projekts/maksliga-intelekta-metozu-piemerotibas-analize-eiropas-savienibas-fondu-projektu-joma |
dc.subject | Embedding Model |
dc.subject | Fine-Tuning |
dc.title | Embedding Model Fine-Tuning Dataset |
dc.type | corpus |
metashare.ResourceInfo#ContentInfo.mediaType | text |
has.files | yes |
branding | CLARIN Centre of Latvian language resources and tools |
contact.person | Raivis Skadiņš raivis.skadins@lu.lv University of Latvia |
sponsor | Latvian Council of Science VPP-CFLA-Mākslīgais intelekts-2024/1-0003 Mākslīgā intelekta metožu piemērotības analīze Eiropas Savienības fondu projektu jomā nationalFunds |
size.info | 432895 items |
files.size | 169057845 |
files.count | 4 |
Files in this item
Total size of all files: 161.23 MB
This item is publicly available and licensed under: Creative Commons - Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0)

Name | Size | Format | Description | MD5 |
README.md | 2.41 KB | Unknown | English readme file describing the dataset | 79428de63cf91b48c50e9193d4a95eec |
README-LV.md | 2.47 KB | Unknown | Latvian readme file describing the dataset | cc70c10ab77ace2285103aa08720decc |
train_dataset.zip | 160.18 MB | application/zip | Training set | c892299c10f18de8dd40ca872edb0a4d |
val_dataset.zip | 1.04 MB | application/zip | Validation set | 52d694b5b7dc9b299f443fff39c9e888 |
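
The MD5 values listed for each file can be used to check that a download arrived intact. The following is a minimal sketch, assuming the four files were saved under their listed names in a single directory; the helper names `md5_of_file` and `verify_downloads` are illustrative and not part of the dataset.

```python
import hashlib
from pathlib import Path

# Checksums copied from the file listing of this item.
EXPECTED_MD5 = {
    "README.md": "79428de63cf91b48c50e9193d4a95eec",
    "README-LV.md": "cc70c10ab77ace2285103aa08720decc",
    "train_dataset.zip": "c892299c10f18de8dd40ca872edb0a4d",
    "val_dataset.zip": "52d694b5b7dc9b299f443fff39c9e888",
}


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_downloads(directory="."):
    """Map each listed file name to True (match), False (mismatch),
    or None (file not found in the directory)."""
    results = {}
    for name, expected in EXPECTED_MD5.items():
        path = Path(directory) / name
        if path.is_file():
            results[name] = md5_of_file(path) == expected
        else:
            results[name] = None
    return results


if __name__ == "__main__":
    labels = {True: "OK", False: "MISMATCH", None: "not found"}
    for name, status in verify_downloads().items():
        print(f"{name}: {labels[status]}")
```

Reading in chunks keeps memory use low for the 160.18 MB training archive. A `MISMATCH` result suggests a corrupted or incomplete download, in which case the file should be fetched again.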