Evaluation and development data sets for speech translation for meetings

Authors: Pinnis, Mārcis ; Pole, Megija ; Kapočiūtė-Dzikienė, Jurgita ; Nicmanis, Dāvis ; Salimbajevs, Askars ; Skadiņš, Raivis ; Miķelsons, Mārtiņš ; Lizanders, Kristaps ; Bērziņš, Aivars ; Vasiļevskis, Artūrs ; Rozis, Roberts ; Kornikaite, Nida

License: Creative Commons - Attribution-ShareAlike 4.0 International (CC BY-SA 4.0)

Description of data formats

The repository consists of the following:
1) "Speech annotation guidelines - public.docx" - the guidelines that the annotators followed when annotating the evaluation and development data sets for speech translation for meetings
2) "lt-en.zip" - the Lithuanian->English evaluation and development data
3) "lv-en-dev.zip" - the Latvian->English development data
4) "lv-en-eval.zip" - the Latvian->English evaluation data
5) "en-lv-dev.zip" - the English->Latvian development data
6) "en-lv-eval.zip" - the English->Latvian evaluation data

The directory structure of the ZIP archives is as follows:
1) "lt-en.zip":
    dev
        *.antx
        *.wav
    eval
        *.antx
        *.wav
    lt_data_dev.tsv
    lt_data_eval.tsv
2) "lv-en-dev.zip":
    dev
        *.antx
        *.wav
    inter-annotator-agreement
        a1
            *.antx
        a2
            *.antx
    lv_data_dev.tsv
3) "lv-en-eval.zip":
    eval
        *.antx
        *.wav
    lv_data_eval.tsv
4) "en-lv-dev.zip":
    dev
        *.antx
        *.wav
    en_data_dev.tsv
5) "en-lv-eval.zip":
    eval
        *.antx
        *.wav
    en_data_eval.tsv

Each audio file (in the WAV data format) has been annotated by one annotator. The annotation was performed using Annotation Pro (http://annotationpro.org/). The annotations for each WAV file were saved in an Annotation Pro XML Annotation File (ANTX file). The two files of each *.antx and *.wav pair share the same file name (apart from the extension).

For each translation direction, we have also extracted all annotations and stored them in tab-separated values (TSV) files (i.e., the [lt|en|lv]_data_[dev|eval].tsv files). Each line in the TSV files corresponds to one annotated segment/sentence.
Each segment/sentence consists of the following tab-separated values:
1) Name of the WAV/ANTX file (without the file extension).
2) ID of the annotator (annotator IDs are reset for each translation direction).
3) Start time (in seconds) of the segment (in the particular WAV file).
4) End time (in seconds) of the segment (in the particular WAV file).
5) ID of the speaker (speaker IDs are reset for each WAV file; i.e., they always start from S1, S2, ...).
6) Orthographic transcription.
7) Normalised transcription (with spoken-language slang and mispronounced words normalised into written language).
8) Trimmed transcription (with verbal noise, partially spoken words, and phenomena that do not exist in written language deleted).
9) Reordered transcription (with the word order changed to better adhere to the correct word order of the source language).
10) Translation (either into English (for Latvian and Lithuanian source data) or into Latvian (for English source data)).
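
Example of reading the TSV files

The ten tab-separated values listed above can be loaded with a short Python script. The following is only a minimal sketch and not part of the data set: the names Segment, parse_segments and read_segments are hypothetical, and the script assumes that the TSV files are UTF-8 encoded, have no header row, contain no tab characters inside a field, and use a decimal point in the start and end times. Please verify these assumptions against the extracted files before relying on the script.

```python
import csv
from dataclasses import dataclass
from typing import Iterable, Iterator, List


@dataclass
class Segment:
    """One annotated segment/sentence, i.e. one line of a TSV file."""
    file_name: str      # 1) name of the WAV/ANTX file, without extension
    annotator_id: str   # 2) ID of the annotator
    start: float        # 3) start time in seconds
    end: float          # 4) end time in seconds
    speaker_id: str     # 5) ID of the speaker (S1, S2, ...)
    orthographic: str   # 6) orthographic transcription
    normalised: str     # 7) normalised transcription
    trimmed: str        # 8) trimmed transcription
    reordered: str      # 9) reordered transcription
    translation: str    # 10) translation

    @property
    def duration(self) -> float:
        """Length of the segment in seconds."""
        return self.end - self.start


def parse_segments(lines: Iterable[str], skip_header: bool = False) -> Iterator[Segment]:
    """Yield one Segment per non-empty line.

    Assumption: quote characters are ordinary text (QUOTE_NONE).
    """
    reader = csv.reader(lines, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row_number, row in enumerate(reader, start=1):
        if skip_header and row_number == 1:
            continue  # only needed if a file turns out to have a header row
        if not row:
            continue  # ignore blank lines
        if len(row) != 10:
            raise ValueError(
                f"line {row_number}: expected 10 values, found {len(row)}"
            )
        name, annotator, start, end, speaker, *texts = row
        yield Segment(name, annotator, float(start), float(end), speaker, *texts)


def read_segments(path: str, encoding: str = "utf-8", skip_header: bool = False) -> List[Segment]:
    """Read all segments from one TSV file (the encoding is an assumption)."""
    with open(path, encoding=encoding, newline="") as handle:
        return list(parse_segments(handle, skip_header))
```

For example, read_segments("lv_data_dev.tsv") would return the segments of the Latvian->English development data, provided that the path points to the place where "lv-en-dev.zip" was extracted. The file_name value of each segment can then be combined with the ".wav" or ".antx" extension to locate the corresponding audio or annotation file.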