# ConLoan-LV: A Contrastive Dataset for Latvian Language Loanwords, Code-switching, and Named Entities

## Description
ConLoan-LV is a machine-readable, sentence-context-based dataset for the
detection and analysis of non-native lexis in the Latvian language. It
replicates the ConLoan contrastive methodology (Ahmadi et al., 2025) and
extends it with contrastive classes (Code-switching and Named Entities) to
reduce model confusion in known error cases.

## Dataset Structure
The dataset is provided in JSON format. Each object represents a single
sentence with the following fields:

- `source_annotated_loanwords`: Sentence with inline XML-style tags for the
  target categories.
- `source_annotated_loanwords_replaced`: Sentence where loanwords are replaced
  with native equivalents.
- `target`: Semantic equivalent/translation in English.
- `source_plain`: The original sentence sourced from the LVK2022 corpus.
- `source_annotated_plain`: The original sentence with loanwords replaced where
  possible.
- `words_in_L_tags`: Dictionary mapping tag IDs to specific loanwords.
- `words_in_N_tags`: Dictionary mapping tag IDs to specific native replacements.
- `corresponding_words`: Mapping of loanwords to their native replacements.
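
As a minimal sketch of how the fields above might be read and checked, the
snippet below loads a file and reports missing keys per record. It assumes the
top level of the JSON file is an array of sentence objects, which this README
does not state explicitly; `load_records` and `missing_fields` are hypothetical
helper names, not part of the dataset.

```python
import json

# Field names as listed in this README.
EXPECTED_FIELDS = {
    "source_annotated_loanwords",
    "source_annotated_loanwords_replaced",
    "target",
    "source_plain",
    "source_annotated_plain",
    "words_in_L_tags",
    "words_in_N_tags",
    "corresponding_words",
}


def missing_fields(record):
    """Return the set of expected field names absent from one record."""
    return EXPECTED_FIELDS - set(record)


def load_records(path):
    """Load a dataset file. Assumption: the top level is a JSON array."""
    with open(path, encoding="utf-8") as f:
        data = json.load(f)
    if not isinstance(data, list):
        raise ValueError("expected a JSON array of sentence objects")
    return data
```

For example, `[missing_fields(r) for r in load_records("Latvian.json")]` would
yield one (ideally empty) set per sentence.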

## Annotation Labels
- `<Lx>`: Material borrowings (Loanwords).
- `<CSx>`: Intra-sentential code-switching.
- `<NEx>`: Named entities (locations, organizations, persons, etc.).
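
The labels can be pulled out of an annotated sentence with a regular
expression. The sketch below assumes each span is wrapped in a matching pair
such as `<L1>...</L1>`, where `x` is a numeric ID; the closing-tag form is an
assumption, since only the opening tags are shown above. `extract_spans` is a
hypothetical helper, and the sample string uses placeholder tokens rather than
real dataset content.

```python
import re

# Matches <L1>...</L1>, <CS2>...</CS2>, <NE3>...</NE3>.
# Assumption: closing tags repeat the label and ID of the opening tag.
TAG_RE = re.compile(r"<(L|CS|NE)(\d+)>(.*?)</\1\2>")


def extract_spans(annotated):
    """Return (label, tag_id, text) for every tagged span, in order."""
    return [
        (m.group(1), int(m.group(2)), m.group(3))
        for m in TAG_RE.finditer(annotated)
    ]


# Placeholder tokens only; not a sentence from the dataset.
sample = "aa <L1>bb</L1> cc <NE1>Dd Ee</NE1> <CS1>ff gg</CS1>."
spans = extract_spans(sample)
```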

## Dataset Variants
- `Latvian.json`: Baseline dataset containing only `<Lx>` labels (353 sentences).
- `Latvian_ext.json`: Extended dataset containing `<Lx>`, `<CSx>`, and `<NEx>` labels (676 sentences).

## Composition and Sources
- Sentences were sourced from LVK2022 (Balanced Corpus of Modern Latvian).
- Loanword candidates were selected from Wiktionary (Latvian dump) and the
  "Latvian Etymological Dictionary" (K. Karulis).
- All sentences underwent manual expert validation for linguistic accuracy and
  label consistency.

## Usage for NLP
The dataset is designed for NER-style token classification. We recommend the
BIO (Beginning, Inside, Outside) tagging scheme when training transformer-based
models (e.g., BERT, XLM-R).
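
One possible way to derive BIO labels from an annotated sentence is sketched
below. It is not the thesis pipeline: it assumes paired tags of the form
`<L1>...</L1>`, no nested spans, and plain whitespace tokenization, so
punctuation stays attached to the neighbouring word. `to_bio` is a hypothetical
helper name, and the sample uses placeholder tokens.

```python
import re

# Opening or closing tag for any of the three label types.
TAG_RE = re.compile(r"<(/?)(L|CS|NE)\d+>")


def to_bio(annotated):
    """Convert an inline-tagged sentence to (tokens, BIO labels)."""
    tokens, labels = [], []
    current = None   # label type of the currently open span, if any
    first = False    # True until the first token of the span is emitted

    def emit(text):
        nonlocal first
        for tok in text.split():  # whitespace tokenization (simplification)
            if current is None:
                labels.append("O")
            else:
                labels.append(("B-" if first else "I-") + current)
                first = False
            tokens.append(tok)

    pos = 0
    for m in TAG_RE.finditer(annotated):
        emit(annotated[pos:m.start()])  # text before this tag
        if m.group(1) == "/":           # closing tag ends the span
            current, first = None, False
        else:                           # opening tag starts a span
            current, first = m.group(2), True
        pos = m.end()
    emit(annotated[pos:])               # text after the last tag
    return tokens, labels


# Placeholder tokens only; not a sentence from the dataset.
tokens, labels = to_bio("aa <L1>bb</L1> cc <NE1>Dd Ee</NE1> ff")
```

When fine-tuning a subword model, these word-level labels still need to be
aligned to subword tokens, which this sketch leaves out.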

## Citation
If you use this dataset, please cite the associated bachelor's thesis:
Štekeļs, J. (2026). *Kontekstuāla pieeja latviešu valodas aizguvumu noteikšanā:
datu kopas veidošana un klasifikācijas eksperimenti*. University of Latvia.
