LATE Dev&Test Set V1 for Latvian ASR

Name: LATE Dev&Test Set V1 for Latvian ASR
License: https://www.kielipankki.fi/wp-content/uploads/CLARIN_ACA_AFFIL-EDU_NC_NORED_en.html
Keywords: ASR

Darģis, Roberts; Znotiņš, Artūrs; Auziņa, Ilze; Rābante-Buša, Guna

dc.contributor.author	Darģis, Roberts
dc.contributor.author	Znotiņš, Artūrs
dc.contributor.author	Auziņa, Ilze
dc.contributor.author	Rābante-Buša, Guna
dc.date.accessioned	2024-03-25T14:44:20Z
dc.date.available	2024-03-25T14:44:20Z
dc.date.issued	2024-03
dc.identifier.uri	http://hdl.handle.net/20.500.12574/99
dc.description	A Latvian speech corpus for the development (validation), testing and comparison of ASR models. The audio data is segmented and aligned with the corresponding orthographic transcriptions, which are human-verified. The LATE-media subset contains both verbatim (raw) and formatted transcriptions (with punctuation, capitalisation, numbers, abbreviations, etc.), while the LATE-conversations subset currently contains only verbatim transcriptions (no punctuation, capitalisation, etc.). The dataset consists of: (1) 5 hours of broadcast media recordings, with both spontaneous and prepared speech (2.5h dev set, 2.5h test set); (2) 5 hours of conversational recordings of spontaneous speech (2.5h dev set, 2.5h test set).
dc.language.iso	lav
dc.publisher	AiLab IMCS UL
dc.relation.isreferencedby	https://korpuss.lv/id/LATE-mediji
dc.relation.isreferencedby	https://korpuss.lv/id/LATE-sarunas
dc.rights	CLARIN ACA
dc.rights.uri	https://www.kielipankki.fi/wp-content/uploads/CLARIN_ACA_AFFIL-EDU_NC_NORED_en.html
dc.rights.label	ACA
dc.source.uri	http://www.digitalhumanities.lv/projects/vpp-late/
dc.subject	ASR
dc.title	LATE Dev&Test Set V1 for Latvian ASR
dc.type	corpus
metashare.ResourceInfo#ContentInfo.mediaType	audio
has.files	yes
branding	CLARIN Centre of Latvian language resources and tools
demo.uri	https://late.ailab.lv
contact.person	Ilze Auziņa ilze.auzina@lumii.lv IMCS at University of Latvia
sponsor	Ministry of Education and Science VPP-LETONIKA-2021/1-0006 Research on Modern Latvian Language and Development of Language Technology nationalFunds
size.info	10 hours
files.size	1002359686
files.count	4

Files in this item

Download all files in this item (955.92 MB)

This item is for Academic Use and is licensed under: CLARIN ACA

Name: late-media-v1-test.zip
Size: 234.72 MB
Format: application/zip
Description: A test set of orthographically transcribed speech segments from media content. The verbatim and formatted transcriptions are stored in a self-explanatory JSON file.
MD5: b312198af484c23a5cbc74e04388991e

Name: late-conversations-v1-test.zip
Size: 241.42 MB
Format: application/zip
Description: A test set of orthographically transcribed conversational speech segments. The verbatim and formatted transcriptions are stored in a self-explanatory JSON file.
MD5: ed9085d33f715ec6a8d2ab60a0816218

Name: late-conversations-v1-dev.zip
Size: 244.9 MB
Format: application/zip
Description: A development set of orthographically transcribed conversational speech segments. The verbatim and formatted transcriptions are stored in a self-explanatory JSON file.
MD5: e5c32717cf925d04989876426e51fb9c

Name: late-media-v1-dev.zip
Size: 234.88 MB
Format: application/zip
Description: A development set of orthographically transcribed speech segments from media content. The verbatim and formatted transcriptions are stored in a self-explanatory JSON file.
MD5: fd0632512e4fea6b503c3cb734862722
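
Each archive in this item is published with an MD5 checksum, so a download can be checked before use. The following is a minimal sketch, assuming the four zip files were saved under their original names in a single directory; the function names `md5_of_file` and `verify_downloads` are illustrative and not part of the dataset or the repository.

```python
import hashlib
from pathlib import Path

# MD5 values copied from the file entries of this item.
EXPECTED_MD5 = {
    "late-media-v1-test.zip": "b312198af484c23a5cbc74e04388991e",
    "late-conversations-v1-test.zip": "ed9085d33f715ec6a8d2ab60a0816218",
    "late-conversations-v1-dev.zip": "e5c32717cf925d04989876426e51fb9c",
    "late-media-v1-dev.zip": "fd0632512e4fea6b503c3cb734862722",
}


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_downloads(directory="."):
    """Map each expected archive name to 'ok', 'mismatch' or 'missing'."""
    results = {}
    for name, expected in EXPECTED_MD5.items():
        path = Path(directory) / name
        if not path.is_file():
            results[name] = "missing"
        elif md5_of_file(path) == expected:
            results[name] = "ok"
        else:
            results[name] = "mismatch"
    return results


if __name__ == "__main__":
    # Assumption: the archives sit in the current working directory.
    for name, status in verify_downloads().items():
        print(f"{name}: {status}")
```

An archive reported as `mismatch` was most likely truncated or corrupted in transit and should be downloaded again before it is unpacked.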
