Tilde Language resources and tools

http://hdl.handle.net/20.500.12574/73
2026-05-06T07:45:10Z

http://hdl.handle.net/20.500.12574/74
2022-12-20T11:59:54Z
2022-12-09T00:00:00Z

Evaluation and development data sets for speech translation for meetings

Pinnis, Mārcis; Pole, Megija; Kapočiūtė-Dzikienė, Jurgita; Nicmanis, Dāvis; Salimbajevs, Askars; Skadiņš, Raivis; Miķelsons, Mārtiņš; Lizanders, Kristaps; Bērziņš, Aivars; Vasiļevskis, Artūrs; Rozis, Roberts; Kornikaite, Nida

The evaluation and development data sets for speech translation for meetings were created within the microproject "Multi-layer evaluation sets for speech translation of web-based meetings" of the project "HumanE AI Network". The data sets contain recordings of various public-domain meetings (organised by public administration and publicly disseminated) in English, Latvian, and Lithuanian, together with their transcriptions and translations into Latvian and English. The data sets feature multiple layers of annotation: raw orthographic transcription, normalised transcription (with spoken-language words and phrases replaced by their written-language equivalents), truncated transcription (with spoken-language elements that have no written-language equivalent deleted), reordered transcription (with words reordered to better follow the syntax norms of written language), and translation. The English and Latvian data were annotated by linguistics students. The Lithuanian data were annotated by a professional linguist. The data are intended for developing and evaluating speech translation systems and the individual components of pipeline-based speech translation systems (speaker diarisation, speech segmentation, automatic speech recognition, punctuation restoration, spoken language normalisation, and machine translation).

2022-12-09T00:00:00Z