
 
dc.contributor.author Darģis, Roberts
dc.contributor.author Znotiņš, Artūrs
dc.contributor.author Auziņa, Ilze
dc.contributor.author Rābante-Buša, Guna
dc.date.accessioned 2024-03-25T14:44:20Z
dc.date.available 2024-03-25T14:44:20Z
dc.date.issued 2024-03
dc.identifier.uri http://hdl.handle.net/20.500.12574/99
dc.description A Latvian speech corpus for the development (validation), testing and comparison of ASR models. The audio data is segmented and aligned with the corresponding human-verified orthographic transcriptions. The LATE-media subset contains both verbatim (raw) and formatted transcriptions (with punctuation, capitalisation, numbers, abbreviations, etc.), while the LATE-conversations subset currently contains only verbatim transcriptions (no punctuation, capitalisation, etc.). The dataset consists of:
- 5 hours of broadcast media recordings, both spontaneous and prepared speech (2.5h dev set, 2.5h test set);
- 5 hours of conversational speech recordings, spontaneous speech (2.5h dev set, 2.5h test set).
dc.language.iso lav
dc.publisher AiLab IMCS UL
dc.relation.isreferencedby https://korpuss.lv/id/LATE-mediji
dc.relation.isreferencedby https://korpuss.lv/id/LATE-sarunas
dc.rights CLARIN ACA
dc.rights.uri https://www.kielipankki.fi/wp-content/uploads/CLARIN_ACA_AFFIL-EDU_NC_NORED_en.html
dc.rights.label ACA
dc.source.uri http://www.digitalhumanities.lv/projects/vpp-late/
dc.subject ASR
dc.title LATE Dev&Test Set V1 for Latvian ASR
dc.type corpus
metashare.ResourceInfo#ContentInfo.mediaType audio
has.files yes
branding CLARIN Centre of Latvian language resources and tools
demo.uri https://late.ailab.lv
contact.person Ilze Auziņa ilze.auzina@lumii.lv IMCS at University of Latvia
sponsor Ministry of Education and Science VPP-LETONIKA-2021/1-0006 Research on Modern Latvian Language and Development of Language Technology nationalFunds
size.info 10 hours
files.size 1002359686
files.count 4


 Files in this item

 Download all files in item (955.92 MB)
This item is for Academic Use and licensed under: CLARIN ACA (Noncommercial)
Name: late-media-v1-test.zip
Size: 234.72 MB
Format: application/zip
Description: A test set of orthographically transcribed speech segments from media content. The verbatim and formatted transcriptions are stored in a self-explanatory JSON file.
MD5: b312198af484c23a5cbc74e04388991e

Name: late-conversations-v1-test.zip
Size: 241.42 MB
Format: application/zip
Description: A test set of orthographically transcribed conversational speech segments. The verbatim transcriptions are stored in a self-explanatory JSON file.
MD5: ed9085d33f715ec6a8d2ab60a0816218

Name: late-conversations-v1-dev.zip
Size: 244.9 MB
Format: application/zip
Description: A development set of orthographically transcribed conversational speech segments. The verbatim transcriptions are stored in a self-explanatory JSON file.
MD5: e5c32717cf925d04989876426e51fb9c

Name: late-media-v1-dev.zip
Size: 234.88 MB
Format: application/zip
Description: A development set of orthographically transcribed speech segments from media content. The verbatim and formatted transcriptions are stored in a self-explanatory JSON file.
MD5: fd0632512e4fea6b503c3cb734862722
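
The MD5 values listed for the four archives can be used to check that a download completed without corruption. The following is a minimal Python sketch, not part of the original record. It assumes the archives were saved under their listed names in one local directory. The helper names `md5_of_file` and `verify_downloads` are hypothetical and chosen only for illustration.

```python
import hashlib
from pathlib import Path

# MD5 checksums copied from the file listing of this record.
EXPECTED_MD5 = {
    "late-media-v1-test.zip": "b312198af484c23a5cbc74e04388991e",
    "late-conversations-v1-test.zip": "ed9085d33f715ec6a8d2ab60a0816218",
    "late-conversations-v1-dev.zip": "e5c32717cf925d04989876426e51fb9c",
    "late-media-v1-dev.zip": "fd0632512e4fea6b503c3cb734862722",
}


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_downloads(directory="."):
    """Map each archive name to True if it exists and its MD5 matches."""
    results = {}
    for name, expected in EXPECTED_MD5.items():
        path = Path(directory) / name  # assumed local location of the download
        results[name] = path.is_file() and md5_of_file(path) == expected
    return results


if __name__ == "__main__":
    for name, ok in verify_downloads().items():
        print(f"{name}: {'OK' if ok else 'missing or checksum mismatch'}")
```

Reading in chunks keeps memory use small, since each archive is over 230 MB. An archive reported as a mismatch should be downloaded again before use.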
