
 
dc.contributor.author Lasmanis, Viesturs Jūlijs
dc.contributor.author Grūzītis, Normunds
dc.date.accessioned 2023-05-29T08:28:48Z
dc.date.available 2023-05-29T08:28:48Z
dc.date.issued 2023-05
dc.identifier.uri http://hdl.handle.net/20.500.12574/85
dc.description The CSV dataset contains sentence pairs for a text-to-text transformation task: given a sentence that contains 0..n abbreviations, rewrite (normalize) the sentence using full words (word forms). Training dataset: 64,665 sentence pairs. Validation dataset: 7,185 sentence pairs. Testing dataset: 7,984 sentence pairs. All sentences are extracted from a public web corpus (https://korpuss.lv/id/Tīmeklis2020) and contain at least one medical term.
dc.language.iso lav
dc.publisher AiLab IMCS UL
dc.rights Creative Commons - Attribution-ShareAlike 4.0 International (CC BY-SA 4.0)
dc.rights.uri http://creativecommons.org/licenses/by-sa/4.0/
dc.rights.label PUB
dc.source.uri https://gitlab.com/ailab/lvmed
dc.subject medical domain
dc.subject text normalization
dc.title LVMED: Dataset of Latvian text normalisation samples for the medical domain
dc.type corpus
metashare.ResourceInfo#ContentInfo.mediaType text
has.files yes
branding CLARIN Centre of Latvian language resources and tools
demo.uri https://huggingface.co/AiLab-IMCS-UL/lvmed
contact.person Viesturs Jūlijs Lasmanis viesturs.lasmanis@lumii.lv Institute of Mathematics and Computer Science, University of Latvia
sponsor Recovery and Resilience Facility (RRF) 2.3.1.1.i.0/1/22/I/CFLA/002 Language Technology Initiative euFunds
size.info 79834 sentences
files.size 7054499
files.count 1
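
The split sizes given in dc.description can be cross-checked against size.info with a few lines of arithmetic. The sketch below uses only the numbers stated in this record; the dictionary keys are illustrative labels, not names taken from the dataset.

```python
# Split sizes as stated in dc.description (sentence pairs per split).
splits = {"train": 64_665, "valid": 7_185, "test": 7_984}

# The total should agree with the size.info field of this record.
total = sum(splits.values())

# Share of each split in percent, rounded to one decimal place.
shares = {name: round(100 * n / total, 1) for name, n in splits.items()}

print(total)   # 79834
print(shares)  # {'train': 81.0, 'valid': 9.0, 'test': 10.0}
```

The three splits add up to the 79834 reported in size.info and correspond to roughly an 81/9/10 partition.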


 Files in this item

This item is publicly available and licensed under Creative Commons - Attribution-ShareAlike 4.0 International (CC BY-SA 4.0).
Name dataset.zip
Size 6.73 MB
Format application/zip
Description Sentence pairs provided in a two-column CSV format.
MD5 4f02442c2684d7a64cea9bdc8ddcde3b
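
The MD5 checksum listed for dataset.zip can be used to confirm that a download is intact. The following is a minimal sketch using Python's standard hashlib module; the function names and the local file path passed to them are hypothetical, and only the expected digest comes from this record.

```python
import hashlib

# Digest published in this record for dataset.zip.
EXPECTED_MD5 = "4f02442c2684d7a64cea9bdc8ddcde3b"


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        # Read fixed-size chunks so large files are not loaded at once.
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_record(path):
    """True if the file at `path` has the digest listed in the record."""
    return md5_of_file(path) == EXPECTED_MD5
```

Calling matches_record("dataset.zip") on the downloaded archive (the path is an assumption about where the file was saved) should return True if the file is unmodified.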
 File Preview  
  • dataset
    • train.csv (23 MB)
    • test.csv (2 MB)
    • statistics
      • term_counter.csv (8 kB)
      • abbrev_counter.csv (6 kB)
      • pair_counter.csv (14 kB)
    • valid.csv (2 MB)
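
Because the splits are described as two-column CSV files of sentence pairs, they can likely be read with Python's standard csv module. The sketch below rests on several assumptions that this record does not confirm: UTF-8 encoding, a comma delimiter, the abbreviated sentence in the first column and the normalized sentence in the second, and an optional header row controlled by the hypothetical has_header parameter.

```python
import csv


def read_pairs(path, has_header=False):
    """Return a list of (source, target) tuples from a two-column CSV.

    Assumptions (not confirmed by the record): UTF-8 text, comma
    delimiter, source sentence first, normalized sentence second.
    """
    pairs = []
    with open(path, newline="", encoding="utf-8") as fh:
        reader = csv.reader(fh)
        if has_header:
            next(reader, None)  # skip the header row if one is present
        for row in reader:
            if not row:
                continue  # ignore blank lines
            if len(row) != 2:
                raise ValueError(f"expected 2 columns, got {len(row)}")
            pairs.append((row[0], row[1]))
    return pairs
```

If the assumptions hold, len(read_pairs("dataset/train.csv")) should equal the training split size stated in the description, adjusted for a header row if the file has one.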
