
Balanced Corpus of Modern Latvian (LVK2022)

 
CLARIN Centre of Latvian language resources and tools
  Authors
Levāne-Petrova, Kristīne; Darģis, Roberts; Pokratniece, Kristīne; Lasmanis, Viesturs Jūlijs
  Item identifier
http://hdl.handle.net/20.500.12574/84
 Project URL
https://korpuss.lv/id/LVK2022
 Demo URL
https://nosketch.korpuss.lv/#dashboard?corpname=LVK2022
 Date issued
2023
 Type
corpus, text
 Size
122,877,749 tokens
 Language(s)
Latvian
 Description
The Balanced Corpus of Modern Latvian (LVK2022) contains unique texts not yet included in the previously developed balanced corpora (LVK2013 and LVK2018). The corpus largely follows the design principles of the previous balanced corpora. It contains authentic contemporary texts (mostly created after 2000) of various genres, with metadata. Unlike its predecessors, this balanced corpus contains texts in the original language as well as translations. When texts are selected from the web for inclusion in the corpus, all current pages from one domain are first collected and the content relevant to the corpus is extracted. The text is then divided into paragraphs, and duplicates and paragraphs irrelevant to the corpus (texts in foreign languages, tables, etc.) are deleted. Paragraphs in some fiction documents have been rearranged alphabetically to comply with contractual obligations to publishing companies. The balanced corpus is composed of the processed documents in the following genre proportions: journalism (60%), fiction (10%), scientific (10%), Wikipedia (7%), legal (7%), parliamentary transcripts (3%) and subtitles (3%).
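The paragraph-level steps mentioned in the description (splitting into paragraphs, deleting duplicates, and rearranging paragraphs alphabetically in some fiction documents) can be illustrated with a minimal Python sketch. This is not the corpus authors' pipeline: the function names, the blank-line paragraph delimiter, and the exact-match definition of a duplicate are all assumptions made for illustration, and the filtering of foreign-language text and tables is left out.

```python
# Illustrative sketch only; all names and rules here are hypothetical,
# not the actual LVK2022 processing code.

def split_paragraphs(text: str) -> list[str]:
    """Split raw text into paragraphs on blank lines (assumed delimiter)."""
    parts = [p.strip() for p in text.split("\n\n")]
    return [p for p in parts if p]


def drop_duplicates(paragraphs: list[str]) -> list[str]:
    """Keep the first occurrence of each paragraph, preserving order.

    Duplicates are detected by exact string equality (an assumption).
    """
    seen: set[str] = set()
    unique: list[str] = []
    for p in paragraphs:
        if p not in seen:
            seen.add(p)
            unique.append(p)
    return unique


def process_document(text: str, rearrange: bool = False) -> list[str]:
    """Split, deduplicate, and optionally sort paragraphs alphabetically.

    `rearrange=True` mimics the alphabetical rearrangement described for
    some fiction documents.
    """
    paragraphs = drop_duplicates(split_paragraphs(text))
    if rearrange:
        paragraphs = sorted(paragraphs)
    return paragraphs
```

For example, `process_document("b\n\na\n\nb")` keeps the paragraphs in their original order with the repeated one removed, while passing `rearrange=True` additionally sorts them, so the original reading order of the document can no longer be recovered.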
 Publisher
AiLab IMCS UL
 Acknowledgement

Latvian Language Agency

Project code: grant agreement No. 4.6/2019-029

Project name: Enlargement and Development of the Latvian National Text Corpus

 Subject(s)
text general representative morphology reference corpus
 Collection(s)
Language resources and tools of AiLab IMCS UL