Sinhala/Tamil NLP benchmark suite

A reproducible Sri Lanka-focused benchmark project with transparent data cards, source manifests, baseline scripts, and explicit limitations.

Current Datasets

| Task | Status | License |
|---|---|---|
| Wikidata entity-name retrieval | Pilot distant supervision | CC0-1.0 |
| Service intent smoke fixture | Pipeline test only | CC-BY-4.0 fixture |
| Service intent v0 seed | Needs human annotation | CC-BY-4.0 candidate |

| Stage | Status |
|---|---|
| Source manifest and licensing | Implemented |
| Annotation and adjudication tooling | Implemented |
| Human-adjudicated gold data | Requires human review |

    uv sync --extra dev
    uv run python scripts/build_wikidata_lk_entities.py --limit 300
    uv run python scripts/validate_project.py
    uv run pytest
    uv run python scripts/run_retrieval_baseline.py --task wikidata_lk_entity_name_retrieval --split test --output reports/pilot/char_tfidf_wikidata_lk_entity_name_retrieval.json

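The report filename above (`char_tfidf_...`) suggests the retrieval baseline scores candidates with character-level TF-IDF. The sketch below illustrates that general technique using only the Python standard library. It is not the code in `scripts/run_retrieval_baseline.py`: the function names (`char_ngrams`, `build_index`, `search`), the n-gram range, the IDF smoothing, and the sample names are all assumptions made for illustration.

```python
import math
from collections import Counter


def char_ngrams(text, n_min=2, n_max=3):
    """Return character n-grams of a lowercased, space-padded string."""
    padded = f" {text.lower()} "
    return [
        padded[i:i + n]
        for n in range(n_min, n_max + 1)
        for i in range(len(padded) - n + 1)
    ]


def normalise(vec):
    """Scale a sparse vector (dict) to unit L2 norm; leave empty vectors alone."""
    norm = math.sqrt(sum(v * v for v in vec.values()))
    return {g: v / norm for g, v in vec.items()} if norm else vec


def build_index(names):
    """Compute unit-length TF-IDF vectors for each candidate name."""
    counts = [Counter(char_ngrams(name)) for name in names]
    doc_freq = Counter(gram for c in counts for gram in c)
    n_docs = len(names)
    # Smoothed IDF (an assumed choice): rarer n-grams get larger weights.
    idf = {g: math.log((1 + n_docs) / (1 + df)) + 1 for g, df in doc_freq.items()}
    vectors = [normalise({g: tf * idf[g] for g, tf in c.items()}) for c in counts]
    return vectors, idf


def search(query, names, vectors, idf, top_k=3):
    """Rank candidate names by cosine similarity to the query."""
    q_counts = Counter(char_ngrams(query))
    # N-grams never seen in the index carry no weight and are dropped.
    q_vec = normalise({g: tf * idf[g] for g, tf in q_counts.items() if g in idf})
    scored = [
        (sum(w * vec.get(g, 0.0) for g, w in q_vec.items()), name)
        for vec, name in zip(vectors, names)
    ]
    return sorted(scored, key=lambda pair: -pair[0])[:top_k]


# Illustrative candidate names only; these are not drawn from the pilot dataset.
names = ["Colombo", "කොළඹ", "கொழும்பு", "Kandy", "මහනුවර", "கண்டி", "Galle"]
vectors, idf = build_index(names)

latin_hits = search("colombo city", names, vectors, idf)
tamil_hits = search("கொழும்பு", names, vectors, idf)
print(latin_hits[0][1])
print(tamil_hits[0][1])
```

Character n-grams need no tokenizer or stemmer, which is one reason they are a common first baseline for Sinhala and Tamil text. Note that this sketch only matches within a script: a Latin-script query will not retrieve a Sinhala or Tamil name unless the index also contains transliterations.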
This project is positioned as a unified Sri Lanka-focused Sinhala/Tamil benchmark suite with reproducible baselines. It does not claim to be the first Sinhala or Tamil benchmark, and the Wikidata pilot is not human-adjudicated gold data.