Sinhala/Tamil NLP benchmark suite

Lanka NLP Bench

A reproducible Sri Lanka-focused benchmark project
with transparent data cards, source manifests,
baseline scripts, and explicit limitations.

Pilot entities: 300
Retrieval records: 600
CI status: Passing
Candidate intent records: 24
Baseline families: 4

Current Datasets

Task                            Status                     License
Wikidata entity-name retrieval  Pilot distant supervision  CC0-1.0
Service intent smoke fixture    Pipeline test only         CC-BY-4.0 fixture
Service intent v0 seed          Needs human annotation     CC-BY-4.0 candidate

Baseline Results

Top-1: 0.10
Top-5: 0.20
MRR: 0.1891
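
For readers unfamiliar with these metrics, here is a minimal sketch of how Top-k accuracy and mean reciprocal rank (MRR) are conventionally computed, assuming exactly one correct target per query. The function name `retrieval_metrics` and the toy data are illustrative assumptions, not the project's scripts; the project's own evaluation code may differ in details such as tie handling or rank cutoffs.

```python
from typing import Dict, Sequence


def retrieval_metrics(
    ranked: Sequence[Sequence[str]],
    gold: Sequence[str],
    ks: Sequence[int] = (1, 5),
) -> Dict[str, float]:
    """Compute Top-k accuracy and MRR.

    ranked[i] is the ordered candidate list returned for query i
    (best first); gold[i] is the single correct candidate id.
    """
    n = len(gold)
    hits = {k: 0 for k in ks}
    reciprocal_rank_sum = 0.0
    for candidates, target in zip(ranked, gold):
        if target not in candidates:
            continue  # a miss contributes 0 to every metric
        rank = list(candidates).index(target) + 1  # 1-based rank
        reciprocal_rank_sum += 1.0 / rank
        for k in ks:
            if rank <= k:
                hits[k] += 1
    metrics = {f"top_{k}": hits[k] / n for k in ks}
    metrics["mrr"] = reciprocal_rank_sum / n
    return metrics


# Toy data (hypothetical ids): the correct answer sits at ranks 1, 2, 3,
# and is missing entirely for the fourth query.
ranked = [["a", "b", "c"], ["x", "y", "z"], ["p", "q", "r"], ["m", "n", "o"]]
gold = ["a", "y", "r", "missing"]
metrics = retrieval_metrics(ranked, gold)
print(metrics)
```

Because MRR credits correct answers found below rank 1, it is always at least as large as Top-1 accuracy, which matches the relationship between the figures reported above.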

Research Workflow

Stage                                Status
Source manifest and licensing        Implemented
Annotation and adjudication tooling  Implemented
Human-adjudicated gold data          Requires human review

Baseline Families

  • TF-IDF + Logistic Regression classification
  • Character TF-IDF cross-lingual retrieval
  • XLM-R / mBERT sequence classification scripts
  • XLM-R / mBERT mean-pooled retrieval scripts
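
To make the character TF-IDF retrieval family concrete, the following is a minimal standard-library sketch of the general technique: character n-gram TF-IDF vectors ranked by cosine similarity. It is not the project's implementation. All function names, the n-gram range (2 to 3), the smoothed IDF formula, and the Latin-script toy names are assumptions chosen for illustration; how the project bridges Sinhala, Tamil, and other scripts is not described here.

```python
import math
from collections import Counter
from typing import Dict, List


def char_ngrams(text: str, n_min: int = 2, n_max: int = 3) -> List[str]:
    """Return character n-grams of a space-padded, lowercased string."""
    padded = f" {text.strip().lower()} "
    return [
        padded[i : i + n]
        for n in range(n_min, n_max + 1)
        for i in range(len(padded) - n + 1)
    ]


def fit_idf(docs: List[str]) -> Dict[str, float]:
    """Smoothed inverse document frequency over character n-grams."""
    df = Counter()
    for doc in docs:
        df.update(set(char_ngrams(doc)))
    n = len(docs)
    return {g: math.log((1 + n) / (1 + c)) + 1.0 for g, c in df.items()}


def vectorize(text: str, idf: Dict[str, float]) -> Dict[str, float]:
    """L2-normalised sparse TF-IDF vector; unseen n-grams are dropped."""
    tf = Counter(g for g in char_ngrams(text) if g in idf)
    vec = {g: count * idf[g] for g, count in tf.items()}
    norm = math.sqrt(sum(v * v for v in vec.values()))
    return {g: v / norm for g, v in vec.items()} if norm else {}


def cosine(a: Dict[str, float], b: Dict[str, float]) -> float:
    """Dot product of two already-normalised sparse vectors."""
    return sum(v * b.get(g, 0.0) for g, v in a.items())


def rank(query: str, candidates: List[str], idf: Dict[str, float]) -> List[str]:
    """Order candidates by descending cosine similarity to the query."""
    q = vectorize(query, idf)
    scored = [(cosine(q, vectorize(c, idf)), c) for c in candidates]
    return [c for _, c in sorted(scored, key=lambda pair: -pair[0])]


# Toy example: a misspelled query against a small candidate list.
candidates = ["Colombo", "Kandy", "Galle", "Jaffna"]
idf = fit_idf(candidates)
ranking = rank("Kolombo", candidates, idf)
print(ranking[0])
```

Character n-grams tolerate spelling variation within one script, which is why this kind of baseline is a common low-cost reference point for name retrieval. It cannot match strings that share no characters, so any cross-script matching needs an additional step such as transliteration.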

Reproduce Locally

uv sync --extra dev
uv run python scripts/build_wikidata_lk_entities.py --limit 300
uv run python scripts/validate_project.py
uv run pytest
uv run python scripts/run_retrieval_baseline.py --task wikidata_lk_entity_name_retrieval --split test --output reports/pilot/char_tfidf_wikidata_lk_entity_name_retrieval.json

Claim Boundary

This project is positioned as a unified Sri Lanka-focused Sinhala/Tamil benchmark suite with reproducible baselines. It does not claim to be the first Sinhala or Tamil benchmark, and the Wikidata pilot is not human-adjudicated gold data.