Sinhala/Tamil NLP benchmark suite

Lanka NLP Bench

A reproducible Sri Lanka-focused benchmark project
with transparent data cards, source manifests,
baseline scripts, and explicit limitations.


Pilot entities: 300
Retrieval records: 600
CI status: Passing
Candidate intent records: 24
Baseline families: 4

Current Datasets

Task	Status	License
Wikidata entity-name retrieval	Pilot distant supervision	CC0-1.0
Service intent smoke fixture	Pipeline test only	CC-BY-4.0 fixture
Service intent v0 seed	Needs human annotation	CC-BY-4.0 candidate

Baseline Results

Metric	Value
Top-1	0.10
Top-5	0.20
MRR	0.1891
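
For readers unfamiliar with the three metrics, the sketch below shows one common way to compute Top-k accuracy and mean reciprocal rank (MRR) from ranked candidate lists. It is an illustration only: the function names and the toy data are hypothetical and are not taken from the project's scripts, and the project may define MRR with a rank cutoff that this sketch does not apply.

```python
def rank_of_gold(ranked_ids, gold_id):
    """Return the 1-based rank of gold_id in ranked_ids, or None if it is absent."""
    for position, candidate in enumerate(ranked_ids, start=1):
        if candidate == gold_id:
            return position
    return None


def retrieval_metrics(rankings, gold_ids, ks=(1, 5)):
    """Compute Top-k accuracy for each k in ks, plus MRR, over all queries.

    rankings: one ranked list of candidate ids per query (best first).
    gold_ids: the correct candidate id for each query.
    A query whose gold id is missing from its ranking contributes 0 to every metric.
    """
    if len(rankings) != len(gold_ids):
        raise ValueError("rankings and gold_ids must have the same length")
    if not rankings:
        raise ValueError("at least one query is required")

    ranks = [rank_of_gold(r, g) for r, g in zip(rankings, gold_ids)]
    n = len(ranks)
    metrics = {
        f"top_{k}": sum(1 for r in ranks if r is not None and r <= k) / n
        for k in ks
    }
    metrics["mrr"] = sum(1.0 / r for r in ranks if r is not None) / n
    return metrics


# Toy data, made up for illustration (not project results).
toy_rankings = [["a", "b", "c"], ["x", "y", "z"], ["p", "q", "r"], ["m", "n"]]
toy_gold = ["a", "z", "q", "missing"]
toy_metrics = retrieval_metrics(toy_rankings, toy_gold)
print(toy_metrics)
```

In the toy data the gold items sit at ranks 1, 3, and 2, and one is not retrieved at all, so Top-1 is 1/4, Top-5 is 3/4, and MRR is (1 + 1/3 + 1/2) / 4.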

Research Workflow

Stage	Status
Source manifest and licensing	Implemented
Annotation and adjudication tooling	Implemented
Human-adjudicated gold data	Requires human review

Baseline Families

TF-IDF + Logistic Regression classification
Character TF-IDF cross-lingual retrieval
XLM-R / mBERT sequence classification scripts
XLM-R / mBERT mean-pooled retrieval scripts
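
As a rough illustration of the character TF-IDF retrieval family, the sketch below ranks candidate names by cosine similarity over character n-gram TF-IDF vectors, using only the standard library. It is not the project's implementation: the n-gram range, the smoothed IDF formula, and all function names are assumptions chosen for the example. It also only matches strings that share characters, so it says nothing about how the actual baseline bridges Sinhala and Tamil scripts.

```python
import math
from collections import Counter


def char_ngrams(text, n_min=2, n_max=4):
    """Return character n-grams of a space-padded, lowercased string."""
    padded = f" {text.strip().lower()} "
    return [
        padded[i:i + n]
        for n in range(n_min, n_max + 1)
        for i in range(len(padded) - n + 1)
    ]


def fit_idf(docs):
    """Smoothed inverse document frequency for every n-gram seen in docs."""
    doc_freq = Counter()
    for doc in docs:
        doc_freq.update(set(char_ngrams(doc)))
    n_docs = len(docs)
    return {g: math.log((1 + n_docs) / (1 + c)) + 1.0 for g, c in doc_freq.items()}


def tfidf_vector(text, idf):
    """L2-normalised sparse TF-IDF vector; n-grams unseen at fit time are dropped."""
    counts = Counter(g for g in char_ngrams(text) if g in idf)
    vec = {g: c * idf[g] for g, c in counts.items()}
    norm = math.sqrt(sum(v * v for v in vec.values()))
    return {g: v / norm for g, v in vec.items()} if norm else {}


def cosine(a, b):
    """Dot product of two L2-normalised sparse vectors."""
    if len(a) > len(b):
        a, b = b, a
    return sum(v * b.get(g, 0.0) for g, v in a.items())


def rank_candidates(query, candidates):
    """Return candidates sorted from most to least similar to the query."""
    idf = fit_idf(candidates)
    query_vec = tfidf_vector(query, idf)
    scored = [(cosine(query_vec, tfidf_vector(c, idf)), c) for c in candidates]
    return [c for _, c in sorted(scored, key=lambda pair: -pair[0])]


# Illustrative same-script example; the strings are not taken from the dataset.
candidates = ["මහනුවර", "කොළඹ", "ගාල්ල"]
ranked = rank_candidates("කොළඹ නගරය", candidates)
print(ranked[0])
```

Character n-grams are a common choice for names because they tolerate spelling variants and need no tokenizer, which is presumably why this family is cheap to run as a first baseline.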

Reproduce Locally

uv sync --extra dev
uv run python scripts/build_wikidata_lk_entities.py --limit 300
uv run python scripts/validate_project.py
uv run pytest
uv run python scripts/run_retrieval_baseline.py --task wikidata_lk_entity_name_retrieval --split test --output reports/pilot/char_tfidf_wikidata_lk_entity_name_retrieval.json

Claim Boundary

This project is positioned as a unified, Sri Lanka-focused Sinhala/Tamil benchmark suite with reproducible baselines. It does not claim to be the first Sinhala or Tamil benchmark. The Wikidata pilot is built by distant supervision and is not human-adjudicated gold data.