dc.contributor.author | Lasmanis, Viesturs Jūlijs |
dc.contributor.author | Grūzītis, Normunds |
dc.date.accessioned | 2023-05-29T08:28:48Z |
dc.date.available | 2023-05-29T08:28:48Z |
dc.date.issued | 2023-05 |
dc.identifier.uri | http://hdl.handle.net/20.500.12574/85 |
dc.description | The CSV dataset contains sentence pairs for a text-to-text transformation task: given a sentence that contains 0..n abbreviations, rewrite (normalize) the sentence so that every abbreviation is expanded into its full word form. Training dataset: 64,665 sentence pairs. Validation dataset: 7,185 sentence pairs. Testing dataset: 7,984 sentence pairs. All sentences are extracted from a public web corpus (https://korpuss.lv/id/Tīmeklis2020) and contain at least one medical term. |
dc.language.iso | lav |
dc.publisher | AiLab IMCS UL |
dc.rights | Creative Commons - Attribution-ShareAlike 4.0 International (CC BY-SA 4.0) |
dc.rights.uri | http://creativecommons.org/licenses/by-sa/4.0/ |
dc.rights.label | PUB |
dc.source.uri | https://gitlab.com/ailab/lvmed |
dc.subject | medical domain |
dc.subject | text normalization |
dc.title | LVMED: Dataset of Latvian text normalisation samples for the medical domain |
dc.type | corpus |
metashare.ResourceInfo#ContentInfo.mediaType | text |
has.files | yes |
branding | CLARIN Centre of Latvian language resources and tools |
demo.uri | https://huggingface.co/AiLab-IMCS-UL/lvmed |
contact.person | Viesturs Jūlijs Lasmanis viesturs.lasmanis@lumii.lv Institute of Mathematics and Computer Science, University of Latvia |
sponsor | Recovery and Resilience Facility (RRF) 2.3.1.1.i.0/1/22/I/CFLA/002 Language Technology Initiative euFunds |
size.info | 79834 sentences |
files.size | 7054499 |
files.count | 1 |
Files in this item
This item is publicly available and licensed under: Creative Commons - Attribution-ShareAlike 4.0 International (CC BY-SA 4.0)
- Name: dataset.zip
- Size: 6.73 MB
- Format: application/zip
- Description: Sentence pairs provided in a two-column CSV format.
- MD5: 4f02442c2684d7a64cea9bdc8ddcde3b
- dataset
  - train.csv (23 MB)
  - test.csv (2 MB)
  - statistics
    - term_counter.csv (8 kB)
    - abbrev_counter.csv (6 kB)
    - pair_counter.csv (14 kB)
  - valid.csv (2 MB)
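The record describes the split files as two-column CSV files of sentence pairs. A minimal sketch of reading one split with the Python standard library might look as follows. The record does not state the column order, whether a header row is present, or the delimiter, so the comma delimiter, the `has_header` flag, the abbreviated-then-normalized column order, and the path `dataset/train.csv` are all assumptions to verify against the actual files.

```python
import csv
from pathlib import Path

# Split sizes as stated in the record description.
EXPECTED_PAIRS = {"train.csv": 64665, "valid.csv": 7185, "test.csv": 7984}


def read_pairs(path, has_header=False):
    """Read (first column, second column) tuples from a two-column CSV.

    Assumption: the first column holds the sentence with abbreviations
    and the second holds its normalized form; the record does not
    confirm this order.
    """
    pairs = []
    with open(path, newline="", encoding="utf-8") as handle:
        reader = csv.reader(handle)
        if has_header:
            next(reader, None)  # skip the header row if there is one
        for row in reader:
            if len(row) < 2:
                continue  # skip blank or malformed lines
            pairs.append((row[0], row[1]))
    return pairs


# Assumed extraction path (hypothetical); adjust to the unpacked archive.
train_path = Path("dataset") / "train.csv"
if train_path.exists():
    pairs = read_pairs(train_path)
    print(len(pairs), "pairs read; expected", EXPECTED_PAIRS["train.csv"])
```

The three stated split sizes add up to 79,834, which matches the size.info field, so comparing the row count of each file against `EXPECTED_PAIRS` is a quick way to spot a wrong `has_header` setting.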