Sri Lanka Document Datasets: A Large-Scale, Multilingual Resource for Law, News, and Policy

Senaratna

We present a collection of open, machine-readable document datasets covering parliamentary proceedings, legal judgments, government publications, news, and tourism statistics from Sri Lanka. The collection currently comprises 229,858 documents (57.1 GB) across 24 datasets in Sinhala, Tamil, and English. The datasets are updated daily and mirrored on GitHub and Hugging Face. These resources aim to support research in computational linguistics, legal analytics, socio-political studies, and multilingual natural language processing. We describe the data sources, collection pipeline, formats, and potential use cases, and discuss licensing and ethical considerations. This manuscript is at version v2025-10-15-1111.

Basic Information

Paper ID: 2510.04124
Title: Sri Lanka Document Datasets: A Large-Scale, Multilingual Resource for Law, News, and Policy
Author: Nuwan I. Senaratna (Independent Researcher)
Classification: cs.CL (Computational Linguistics)
Publication Date: arXiv preprint, v2025-10-16-0818
Paper Link: https://arxiv.org/abs/2510.04124

Abstract

This paper introduces a large-scale, open-source, machine-readable collection of Sri Lankan documents, encompassing parliamentary records, legal judgments, government publications, news articles, and tourism statistics. The collection currently contains 230,091 documents (57.7 GB) spanning 24 datasets, supporting three languages: Sinhala, Tamil, and English. The datasets are updated daily and mirrored on GitHub and Hugging Face. These resources aim to support research in computational linguistics, legal analysis, sociopolitical studies, and multilingual natural language processing.

Research Background and Motivation

Problem Definition

Digitized legal, policy, and media records in Sri Lanka are scattered across numerous government and private sources. Most of this information exists only as PDFs or web pages, without machine-readable structure or consistent public archiving. This fragmentation limits the access of citizens, journalists, and researchers to information about the country's governance, history, and socioeconomic trends.

Significance

Data Scarcity: South Asia, and Sri Lanka in particular, lacks a unified, machine-readable archive of public records
Language Diversity: Need for NLP research supporting low-resource languages (Sinhala, Tamil)
Transparency Requirements: Enhanced transparency and verifiability for citizen engagement and academic research
Cross-Domain Applications: Support for legal analysis, policy research, media monitoring, and other fields

Existing Limitations

Global large-scale corpora (e.g., Common Crawl, Wikipedia Dumps) are dominated by high-resource language data
Regional initiatives are fragmented and typically focus on individual media outlets or institutions
Previous datasets have limitations in scale, language coverage, or temporal continuity

Core Contributions

Construction of Large-Scale Multilingual Document Collection: 230,091 documents spanning 24 distinct datasets
Establishment of Automated Data Collection Pipeline: Enabling continuous discovery, ingestion, parsing, validation, and version control
Provision of Open-Access Data Infrastructure: Fully open datasets under MIT license
Support for Multi-Domain Research Applications: Computational linguistics, legal analysis, sociopolitical research, and others
Assurance of Data Quality and Reproducibility: Standardized formats, version control, and transparent data provenance

Methodology in Detail

Dataset Composition

The paper describes 24 datasets in detail. The main ones, grouped by category, are:

1. Legal Documents

Hansard (Parliamentary Records): 1,665 documents, 17.9 GB, 2006-2025
Court of Appeal Judgments: 10,164 documents, 10.5 GB, 2012-2025
Supreme Court Judgments: 2,168 documents, 1.4 GB, 2009-2025
Legislation: 3,934 documents, 6.9 GB, 1981-2025
Bills: 4,080 documents, 1.9 GB, 2010-2025

2. Government Publications

Gazette Extraordinary (2020s): 45,373 documents, 1.3 GB
Gazette Extraordinary (2010s): 56,379 documents, 3.3 GB
Cabinet Decisions: 10,385 documents, 136.4 MB
Ministry of Finance Press Releases: 134 documents, 144.5 MB

3. News and Media

News Documents: 81,155 documents, 1.2 GB, 2021-2025
Presidential Media Division Press Releases: 2,182 documents, 55.9 MB

4. Statistics and Reports

Tourism Statistics Reports: 161 documents, 405.7 MB
Fisheries Statistics Reports: 417 documents, 101.4 MB
Central Bank Annual Reports: 1,137 documents, 3.5 GB

Data Collection Pipeline

Technical Architecture

GitHub Actions Orchestration: Multiple runs per day, scheduled via cron
Matrix Strategy: Isolation of each data source allowing independent retries
Incremental Updates: Detection of new or modified items through stable keys (URL + date) and content hashing
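The incremental update step listed above can be illustrated with a minimal sketch. It assumes an in-memory index that maps a stable key (URL + date) to a SHA-256 digest of the document bytes; the function names, the key layout, and the placeholder URL are hypothetical and are not taken from the paper's code.

```python
import hashlib


def stable_key(url: str, date: str) -> str:
    """Stable identifier for an item, built from its URL and date."""
    return f"{date}|{url}"


def content_hash(payload: bytes) -> str:
    """SHA-256 hex digest of the raw document bytes."""
    return hashlib.sha256(payload).hexdigest()


def classify(seen: dict[str, str], url: str, date: str, payload: bytes) -> str:
    """Return 'new', 'modified', or 'unchanged', and record the latest digest."""
    key = stable_key(url, date)
    digest = content_hash(payload)
    previous = seen.get(key)
    seen[key] = digest
    if previous is None:
        return "new"
    return "unchanged" if previous == digest else "modified"


index: dict[str, str] = {}
url = "https://example.org/doc.pdf"  # placeholder URL, not a real source
first = classify(index, url, "2025-10-01", b"version 1")
second = classify(index, url, "2025-10-01", b"version 1")
third = classify(index, url, "2025-10-01", b"version 2")
print(first, second, third)  # new unchanged modified
```

Keying on URL and date keeps an item's identity stable across runs, while the digest catches cases where a source republishes different content under the same address.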

Crawling Implementation

Tools: Python + Selenium + headless Chrome browser
Dynamic Content Handling: Explicit conditional waits for dynamic content loading
Politeness Constraints: Compliance with robots.txt, request rate limiting, and randomized delays
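As a rough illustration of the politeness constraints, the sketch below checks a URL against robots.txt rules and applies a randomized delay before each permitted request, using only the Python standard library. The user agent string, the delay values, and the sample rules are assumptions made for the example; the Selenium browser automation itself is not shown.

```python
import random
import time
from urllib.robotparser import RobotFileParser

USER_AGENT = "example-crawler"  # hypothetical user agent


def load_rules(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt content that has already been downloaded."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def polite_delay(base: float, jitter: float) -> float:
    """Base delay plus a random amount, to avoid a fixed request rhythm."""
    return base + random.uniform(0.0, jitter)


def fetch_allowed(parser, url, base=1.0, jitter=0.5, sleep=time.sleep) -> bool:
    """Return False if robots.txt forbids the URL; otherwise wait, then return True."""
    if not parser.can_fetch(USER_AGENT, url):
        return False
    sleep(polite_delay(base, jitter))
    return True


ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""

rules = load_rules(ROBOTS_TXT)
delays: list[float] = []  # the demo records delays instead of actually sleeping
allowed = fetch_allowed(rules, "https://example.org/docs/a.pdf", sleep=delays.append)
blocked = fetch_allowed(rules, "https://example.org/private/b.pdf", sleep=delays.append)
print(allowed, blocked, len(delays))  # True False 1
```

Passing the sleep function as a parameter keeps the delay logic testable without slowing down the example.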

Data Processing

PDF Parsing: Text, metadata, and layout block extraction using PyMuPDF
Quality Control: Schema validation, mandatory field enforcement, checksum protection
Version Control: Preservation of original artifacts and parsed JSON representations
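The quality-control step can be sketched as a check that every mandatory field is present and that the stored checksum matches the preserved original artifact. The field names below are hypothetical, since the summary does not give the actual schema, and PDF text extraction with PyMuPDF is left out to keep the example dependency-free.

```python
import hashlib

# Hypothetical mandatory fields; the real schema is not given in this summary.
REQUIRED_FIELDS = ("doc_id", "date", "url", "sha256")


def sha256_hex(data: bytes) -> str:
    """SHA-256 hex digest used as the artifact checksum."""
    return hashlib.sha256(data).hexdigest()


def validate(record: dict, artifact: bytes) -> list[str]:
    """Return a list of problems; an empty list means the record passes."""
    errors = [
        f"missing or empty field: {name}"
        for name in REQUIRED_FIELDS
        if not isinstance(record.get(name), str) or not record[name]
    ]
    recorded = record.get("sha256")
    if isinstance(recorded, str) and recorded and recorded != sha256_hex(artifact):
        errors.append("checksum mismatch")
    return errors


artifact = b"%PDF-1.7 placeholder bytes"  # stands in for a downloaded file
record = {
    "doc_id": "example-0001",
    "date": "2025-10-01",
    "url": "https://example.org/doc.pdf",
    "sha256": sha256_hex(artifact),
}
ok_errors = validate(record, artifact)

# Drop a mandatory field and alter the artifact to trigger both checks.
broken = {k: v for k, v in record.items() if k != "date"}
bad_errors = validate(broken, artifact + b"tampered")
print(ok_errors)   # []
print(bad_errors)  # ['missing or empty field: date', 'checksum mismatch']
```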

Technical Innovations

Automated Pipeline: Fully automated data collection, processing, and update workflow
Multi-Format Support: Simultaneous handling of HTML and PDF document formats
Incremental Update Mechanism: Efficient change detection and version control
Quality Assurance: Multi-layered data validation and error handling
Transparency Design: Complete metadata recording and auditable data provenance

Experimental Setup

Data Statistics

Total Documents: 230,091
Total Size: 57.7 GB
Number of Datasets: 24
Language Coverage: Sinhala, Tamil, English
Temporal Span: 1950-2025 (varies by dataset)

Data Quality Assessment

Completeness Checks: Mandatory field validation
Consistency Validation: Format standardization
Duplicate Detection: Content hash-based deduplication
Temporal Validity: Date range verification
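Two of these checks, duplicate detection and date range verification, might look like the sketch below. The record layout (`id`, `date`, `text`) is an assumption for illustration, and the 1950 to 2025 bounds come from the temporal span stated above.

```python
import hashlib
from datetime import date


def deduplicate(docs: list[dict]) -> list[dict]:
    """Keep the first document for each distinct content hash."""
    seen: set[str] = set()
    unique = []
    for doc in docs:
        digest = hashlib.sha256(doc["text"].encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(doc)
    return unique


def date_in_range(doc: dict, earliest: date, latest: date) -> bool:
    """True when the document's ISO date parses and lies within the bounds."""
    try:
        parsed = date.fromisoformat(doc["date"])
    except (KeyError, TypeError, ValueError):
        return False
    return earliest <= parsed <= latest


docs = [
    {"id": "a", "date": "2024-03-01", "text": "Cabinet decision text"},
    {"id": "b", "date": "2024-03-02", "text": "Cabinet decision text"},  # same content as "a"
    {"id": "c", "date": "2031-01-01", "text": "Dated in the future"},
]
unique = deduplicate(docs)
unique_ids = [d["id"] for d in unique]
valid_ids = [
    d["id"] for d in unique
    if date_in_range(d, date(1950, 1, 1), date(2025, 12, 31))
]
print(unique_ids, valid_ids)  # ['a', 'c'] ['a']
```

Exact hashing only catches byte-identical text; near-duplicates with small formatting differences would need a normalization step that is not covered here.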

Experimental Results

Dataset Scale Analysis

Category	Document Count	Data Size	Primary Language
Legal Documents	62,314	36.7 GB	Primarily English
Government Publications	112,473	5.0 GB	Multilingual
News Media	83,337	1.3 GB	Multilingual
Statistical Reports	5,742	14.7 GB	Primarily English
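A quick arithmetic check of the table, using only the figures as transcribed above: the category sizes sum to the 57.7 GB headline, while the document counts sum to 263,866 rather than the 230,091 total reported elsewhere in this summary. The summary does not explain the difference, so the counts are best read with that caveat.

```python
# Figures copied from the category table above: (category, documents, size in GB).
rows = [
    ("Legal Documents", 62_314, 36.7),
    ("Government Publications", 112_473, 5.0),
    ("News Media", 83_337, 1.3),
    ("Statistical Reports", 5_742, 14.7),
]

total_docs = sum(count for _, count, _ in rows)
total_gb = round(sum(size for _, _, size in rows), 1)

reported_docs = 230_091  # headline total stated in this summary
difference = total_docs - reported_docs

print(total_docs, total_gb, difference)  # 263866 57.7 33775
```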

Temporal Coverage Analysis

Historical Depth: Earliest documents traceable to 1950 (Central Bank annual reports)
Update Frequency: Daily automatic updates
Data Freshness: Most datasets cover through October 2025

Language Distribution

English: Primary language for official government documents and legal judgments
Sinhala: Local news and certain government documents
Tamil: Minority language documentation

Related Work

Global Large-Scale Corpora

Common Crawl: General web crawling data
Wikipedia Dumps: Periodic full exports of Wikipedia content
OpenWebText: Open web text corpus

Regional Initiatives

Indian Kanoon: Indian legal corpus
OpenSubtitles: Multilingual subtitle dataset
African News Corpus: African news corpus

South Asian Context

Existing efforts are fragmented and typically focus on individual media institutions
Lack of comprehensive and machine-readable document records
Limitations in scale, language coverage, or temporal continuity

Conclusions and Discussion

Main Conclusions

Successfully constructed Sri Lanka's largest-scale multilingual document dataset
Established a sustainable automated data collection and update mechanism
Provided valuable resources for computational linguistics and digital governance research
Ensured data accessibility and reusability through open licensing

Limitations

Language Processing Accuracy: Parsing accuracy for Sinhala and Tamil requires improvement
OCR Capability Constraints: Limited ability to handle scanned or unstructured PDFs
Coverage Scope: Some government institutions and media sources are not yet included
Data Quality Variance: Quality variations exist across different sources

Future Directions

Expanded Coverage: Addition of more government institutions, media sources, and historical archives
Enhanced Language Processing: Improved tokenization, font handling, and multilingual embeddings for Sinhala and Tamil
Integrated OCR Parsing: Experimentation with deep learning-based OCR pipelines combined with layout recognition and language modeling

In-Depth Evaluation

Strengths

Data Scale and Quality: Large-scale dataset of 230,091 documents covering multiple important domains
Excellent Technical Implementation: Fully automated data pipeline ensuring timeliness and consistency
Openness and Transparency: Complete open access under MIT license, adhering to FAIR principles
Multilingual Support: Valuable resources for low-resource language research
High Practical Value: Addresses actual application needs across multiple research domains

Weaknesses

Lack of Evaluation: Absence of quantitative assessment and validation of data quality
Insufficient Application Examples: No concrete use cases or benchmark results provided
Uneven Language Distribution: English documents dominate with relatively limited coverage of other languages
Insufficient Technical Detail: Some implementation details are described only briefly

Impact

Academic Contribution: Establishes foundation for digital humanities and computational linguistics research in South Asia
Social Value: Enhances government transparency and supports citizen engagement and oversight
Technical Exemplar: Provides reference for similar data infrastructure development in other developing countries
Sustainability: Establishes sustainable data collection and maintenance mechanisms

Applicable Scenarios

Natural Language Processing: Multilingual model training and evaluation
Legal Technology: Legal document analysis and case law research
Policy Analysis: Government decision tracking and policy change monitoring
Media Research: News trend analysis and sentiment analysis
Digital Governance: E-government and transparency research

References

The paper cites important works across relevant fields, including:

Best practices in MLOps and data pipeline construction
Open data governance frameworks
Ethical and technical standards for web crawling
FAIR principles for scientific data management
Literature on research reproducibility

Overall Assessment: This is a dataset paper with significant practical value, providing infrastructure for digital research on Sri Lanka and on South Asia more broadly. While relatively limited in technical innovation, its contributions in data scale, openness, and sustainability are noteworthy. This work sets a positive example for digital humanities research in low-resource languages and developing countries.