We present a collection of open, machine-readable document datasets covering parliamentary proceedings, legal judgments, government publications, news, and tourism statistics from Sri Lanka. The collection currently comprises 229,858 documents (57.1 GB) across 24 datasets in Sinhala, Tamil, and English. The datasets are updated daily and mirrored on GitHub and Hugging Face. These resources aim to support research in computational linguistics, legal analytics, socio-political studies, and multilingual natural language processing. We describe the data sources, collection pipeline, formats, and potential use cases, while discussing licensing and ethical considerations. This manuscript is at version v2025-10-15-1111.
Sri Lanka Document Datasets: A Large-Scale, Multilingual Resource for Law, News, and Policy
- Paper ID: 2510.04124
- Title: Sri Lanka Document Datasets: A Large-Scale, Multilingual Resource for Law, News, and Policy
- Author: Nuwan I. Senaratna (Independent Researcher)
- Classification: cs.CL (Computational Linguistics)
- Publication Date: arXiv preprint, v2025-10-16-0818
- Paper Link: https://arxiv.org/abs/2510.04124
This paper introduces a large-scale, open-source, machine-readable collection of Sri Lankan documents, encompassing parliamentary records, legal judgments, government publications, news articles, and tourism statistics. As of version v2025-10-16-0818, the collection contains 230,091 documents (57.7 GB) across 24 datasets in three languages: Sinhala, Tamil, and English. The datasets are updated daily and mirrored on GitHub and Hugging Face. These resources aim to support research in computational linguistics, legal analysis, sociopolitical studies, and multilingual natural language processing.
Digitized legal, policy, and media records in Sri Lanka are scattered across numerous government and private sources, with most information existing in PDF or webpage formats lacking machine-readable structure or consistent public archiving. This fragmentation limits citizens', journalists', and researchers' access to information about the country's governance, history, and socioeconomic trends.
- Data Scarcity: South Asia, and Sri Lanka in particular, lacks unified, machine-readable archives of public records
- Language Diversity: Need for NLP research supporting low-resource languages (Sinhala, Tamil)
- Transparency Requirements: Enhanced transparency and verifiability for citizen engagement and academic research
- Cross-Domain Applications: Support for legal analysis, policy research, media monitoring, and other fields
- Global large-scale corpora (e.g., Common Crawl, Wikipedia Dumps) are dominated by high-resource language data
- Regional initiatives are fragmented and typically focus on individual media outlets or institutions
- Previous datasets have limitations in scale, language coverage, or temporal continuity
- Construction of Large-Scale Multilingual Document Collection: 230,091 documents spanning 24 distinct datasets
- Establishment of Automated Data Collection Pipeline: Enabling continuous discovery, ingestion, parsing, validation, and version control
- Provision of Open-Access Data Infrastructure: Fully open datasets under MIT license
- Support for Multi-Domain Research Applications: Computational linguistics, legal analysis, sociopolitical research, and others
- Assurance of Data Quality and Reproducibility: Standardized formats, version control, and transparent data provenance
The paper provides detailed descriptions of 24 datasets, primarily categorized as follows:
- Hansard (Parliamentary Records): 1,665 documents, 17.9 GB, 2006-2025
- Court of Appeal Judgments: 10,164 documents, 10.5 GB, 2012-2025
- Supreme Court Judgments: 2,168 documents, 1.4 GB, 2009-2025
- Legislation: 3,934 documents, 6.9 GB, 1981-2025
- Bills: 4,080 documents, 1.9 GB, 2010-2025
- Gazette Extraordinary (2020s): 45,373 documents, 1.3 GB
- Gazette Extraordinary (2010s): 56,379 documents, 3.3 GB
- Cabinet Decisions: 10,385 documents, 136.4 MB
- Ministry of Finance Press Releases: 134 documents, 144.5 MB
- News Documents: 81,155 documents, 1.2 GB, 2021-2025
- Presidential Media Division Press Releases: 2,182 documents, 55.9 MB
- Tourism Statistics Reports: 161 documents, 405.7 MB
- Fisheries Statistics Reports: 417 documents, 101.4 MB
- Central Bank Annual Reports: 1,137 documents, 3.5 GB
- GitHub Actions Orchestration: Multiple runs per day, scheduled with cron jobs
- Matrix Strategy: Isolation of each data source allowing independent retries
- Incremental Updates: Detection of new or modified items through stable keys (URL + date) and content hashing
- Tools: Python + Selenium + headless Chrome browser
- Dynamic Content Handling: Explicit conditional waits for dynamic content loading
- Politeness Constraints: Compliance with robots.txt, request rate limiting, and randomized delays
- PDF Parsing: Text, metadata, and layout block extraction using PyMuPDF
- Quality Control: Schema validation, mandatory field enforcement, checksum protection
- Version Control: Preservation of original artifacts and parsed JSON representations
- Automated Pipeline: Fully automated data collection, processing, and update workflow
- Multi-Format Support: Simultaneous handling of HTML and PDF document formats
- Incremental Update Mechanism: Efficient change detection and version control
- Quality Assurance: Multi-layered data validation and error handling
- Transparency Design: Complete metadata recording and auditable data provenance
- Total Documents: 230,091
- Total Size: 57.7 GB
- Number of Datasets: 24
- Language Coverage: Sinhala, Tamil, English
- Temporal Span: 1950-2025 (varies by dataset)
- Completeness Checks: Mandatory field validation
- Consistency Validation: Format standardization
- Duplicate Detection: Content hash-based deduplication
- Temporal Validity: Date range verification
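The incremental-update and quality-control bullets above (stable keys built from URL and date, content hashing, mandatory-field checks, hash-based deduplication) can be illustrated with a minimal sketch. This is not the author's implementation: the field names in `REQUIRED_FIELDS`, the choice of SHA-256, and the in-memory `index` dictionary are all assumptions made for illustration.

```python
import hashlib
from collections import defaultdict

# Hypothetical mandatory fields; the paper does not list its actual schema.
REQUIRED_FIELDS = ("url", "date", "title")


def stable_key(record: dict) -> str:
    """Identify a document by URL and date (the 'stable key' idea)."""
    return f"{record['url']}|{record['date']}"


def content_hash(payload: bytes) -> str:
    """Hash the raw bytes; SHA-256 is an assumed choice."""
    return hashlib.sha256(payload).hexdigest()


def missing_fields(record: dict) -> list:
    """Return the mandatory fields that are absent or empty."""
    return [name for name in REQUIRED_FIELDS if not record.get(name)]


def classify(record: dict, payload: bytes, index: dict) -> str:
    """Return 'invalid', 'new', 'modified', or 'unchanged'; update index."""
    if missing_fields(record):
        return "invalid"
    key = stable_key(record)
    digest = content_hash(payload)
    previous = index.get(key)
    index[key] = digest
    if previous is None:
        return "new"
    return "unchanged" if previous == digest else "modified"


def duplicate_groups(index: dict) -> list:
    """Group keys whose content hashes are identical."""
    by_digest = defaultdict(list)
    for key, digest in index.items():
        by_digest[digest].append(key)
    return [sorted(keys) for keys in by_digest.values() if len(keys) > 1]


index = {}
doc = {"url": "https://example.org/a.pdf", "date": "2025-10-01", "title": "A"}
mirror = {"url": "https://example.org/b.pdf", "date": "2025-10-01", "title": "A"}
print(classify(doc, b"first version", index))
print(classify(doc, b"first version", index))
print(classify(doc, b"second version", index))
print(classify(mirror, b"second version", index))
print(duplicate_groups(index))
```

Keeping the key separate from the hash lets a re-published document at the same URL and date register as modified rather than new, while identical bytes under different keys surface as duplicates.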
| Category | Document Count | Data Size | Primary Language |
|---|---|---|---|
| Legal Documents | 62,314 | 36.7 GB | Primarily English |
| Government Publications | 112,473 | 5.0 GB | Multilingual |
| News Media | 83,337 | 1.3 GB | Multilingual |
| Statistical Reports | 5,742 | 14.7 GB | Primarily English |
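As a quick cross-check, the category rows in the table above can be summed and compared with the headline totals (230,091 documents, 57.7 GB). The sketch below only does arithmetic on the figures as printed; the variable names are illustrative.

```python
# Figures copied from the category table: (document count, size in GB).
categories = {
    "Legal Documents": (62_314, 36.7),
    "Government Publications": (112_473, 5.0),
    "News Media": (83_337, 1.3),
    "Statistical Reports": (5_742, 14.7),
}

# Headline totals reported for the whole collection.
REPORTED_DOCS = 230_091
REPORTED_GB = 57.7

total_docs = sum(docs for docs, _ in categories.values())
# Round to one decimal place to match the table's precision.
total_gb = round(sum(size for _, size in categories.values()), 1)

print(f"documents: {total_docs:,} (difference {total_docs - REPORTED_DOCS:+,})")
print(f"size: {total_gb} GB (matches reported: {total_gb == REPORTED_GB})")
```

The sizes add up to the reported 57.7 GB, while the document counts sum to 263,866, which is 33,775 more than the reported total. The summary does not explain the gap; it may reflect documents counted under more than one category or figures taken from different snapshots.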
- Historical Depth: Earliest documents traceable to 1950 (Central Bank annual reports)
- Update Frequency: Daily automatic updates
- Data Freshness: Most datasets cover through October 2025
- English: Primary language for official government documents and legal judgments
- Sinhala: Local news and certain government documents
- Tamil: Minority language documentation
- Common Crawl: General web crawling data
- Wikipedia Dumps: Wikipedia data dumps
- OpenWebText: Open web text corpus
- Indian Kanoon: Indian legal corpus
- OpenSubtitles: Multilingual subtitle dataset
- African News Corpus: African news corpus
- Existing efforts are fragmented and typically focus on individual media institutions
- Lack of comprehensive and machine-readable document records
- Limitations in scale, language coverage, or temporal continuity
- Successfully constructed Sri Lanka's largest-scale multilingual document dataset
- Established a sustainable automated data collection and update mechanism
- Provided valuable resources for computational linguistics and digital governance research
- Ensured data accessibility and reusability through open licensing
- Language Processing Accuracy: Parsing accuracy for Sinhala and Tamil requires improvement
- OCR Capability Constraints: Limited ability to handle scanned or unstructured PDFs
- Coverage Scope: Some government institutions and media sources are not yet included
- Data Quality Variance: Quality varies across sources
- Expanded Coverage: Addition of more government institutions, media sources, and historical archives
- Enhanced Language Processing: Improved tokenization, font handling, and multilingual embeddings for Sinhala and Tamil
- Integrated OCR Parsing: Experimentation with deep learning-based OCR pipelines combined with layout recognition and language modeling
- Data Scale and Quality: Large-scale dataset of 230,091 documents covering multiple important domains
- Excellent Technical Implementation: Fully automated data pipeline ensuring timeliness and consistency
- Openness and Transparency: Complete open access under MIT license, adhering to FAIR principles
- Multilingual Support: Valuable resources for low-resource language research
- High Practical Value: Addresses actual application needs across multiple research domains
- Lack of Evaluation: Absence of quantitative assessment and validation of data quality
- Insufficient Application Examples: No concrete use cases or benchmark results provided
- Uneven Language Distribution: English documents dominate with relatively limited coverage of other languages
- Insufficient Technical Detail: Some implementation details are described only briefly
- Academic Contribution: Establishes foundation for digital humanities and computational linguistics research in South Asia
- Social Value: Enhances government transparency and supports citizen engagement and oversight
- Technical Exemplar: Provides reference for similar data infrastructure development in other developing countries
- Sustainability: Establishes sustainable data collection and maintenance mechanisms
- Natural Language Processing: Multilingual model training and evaluation
- Legal Technology: Legal document analysis and case law research
- Policy Analysis: Government decision tracking and policy change monitoring
- Media Research: News trend analysis and sentiment analysis
- Digital Governance: E-government and transparency research
The paper cites important works across relevant fields, including:
- Best practices in MLOps and data pipeline construction
- Open data governance frameworks
- Ethical and technical standards for web crawling
- FAIR principles for scientific data management
- Literature on research reproducibility
Overall Assessment: This is a dataset paper of significant practical value that provides infrastructure for data-driven research on Sri Lanka and on South Asia more broadly. While relatively limited in technical innovation, its contributions in data scale, openness, and sustainability are noteworthy. The work sets a positive example for digital humanities research in low-resource languages and developing countries.