
Sri Lanka Document Datasets: A Large-Scale, Multilingual Resource for Law, News, and Policy

Senaratna
We present a collection of open, machine-readable document datasets covering parliamentary proceedings, legal judgments, government publications, news, and tourism statistics from Sri Lanka. The collection currently comprises 229,858 documents (57.1 GB) across 24 datasets in Sinhala, Tamil, and English. The datasets are updated daily and mirrored on GitHub and Hugging Face. These resources aim to support research in computational linguistics, legal analytics, socio-political studies, and multilingual natural language processing. We describe the data sources, collection pipeline, formats, and potential use cases, and discuss licensing and ethical considerations. This manuscript is at version v2025-10-15-1111.

Basic Information

  • Paper ID: 2510.04124
  • Title: Sri Lanka Document Datasets: A Large-Scale, Multilingual Resource for Law, News, and Policy
  • Author: Nuwan I. Senaratna (Independent Researcher)
  • Classification: cs.CL (Computational Linguistics)
  • Publication Date: arXiv preprint, v2025-10-16-0818
  • Paper Link: https://arxiv.org/abs/2510.04124

Abstract

This paper introduces a large-scale, open-source, machine-readable collection of Sri Lankan documents, encompassing parliamentary records, legal judgments, government publications, news articles, and tourism statistics. The collection currently contains 230,091 documents (57.7 GB) spanning 24 datasets, supporting three languages: Sinhala, Tamil, and English. The datasets are updated daily and mirrored on GitHub and Hugging Face. These resources aim to support research in computational linguistics, legal analysis, sociopolitical studies, and multilingual natural language processing.

Research Background and Motivation

Problem Definition

Digitized legal, policy, and media records in Sri Lanka are scattered across numerous government and private sources, with most information existing in PDF or webpage formats lacking machine-readable structure or consistent public archiving. This fragmentation limits citizens', journalists', and researchers' access to information about the country's governance, history, and socioeconomic trends.

Significance

  1. Data Scarcity: South Asia, and Sri Lanka in particular, lacks a unified, machine-readable archive of public records
  2. Language Diversity: Need for NLP research supporting low-resource languages (Sinhala, Tamil)
  3. Transparency Requirements: Enhanced transparency and verifiability for citizen engagement and academic research
  4. Cross-Domain Applications: Support for legal analysis, policy research, media monitoring, and other fields

Existing Limitations

  • Global large-scale corpora (e.g., Common Crawl, Wikipedia Dumps) are dominated by high-resource language data
  • Regional initiatives are fragmented and typically focus on individual media outlets or institutions
  • Previous datasets have limitations in scale, language coverage, or temporal continuity

Core Contributions

  1. Construction of Large-Scale Multilingual Document Collection: 230,091 documents spanning 24 distinct datasets
  2. Establishment of Automated Data Collection Pipeline: Enabling continuous discovery, ingestion, parsing, validation, and version control
  3. Provision of Open-Access Data Infrastructure: Fully open datasets under MIT license
  4. Support for Multi-Domain Research Applications: Computational linguistics, legal analysis, sociopolitical research, and others
  5. Assurance of Data Quality and Reproducibility: Standardized formats, version control, and transparent data provenance

Methodology in Detail

Dataset Composition

The paper describes 24 datasets in detail; the main ones are grouped by category below:

1. Legal Documents

  • Hansard (Parliamentary Records): 1,665 documents, 17.9 GB, 2006-2025
  • Court of Appeal Judgments: 10,164 documents, 10.5 GB, 2012-2025
  • Supreme Court Judgments: 2,168 documents, 1.4 GB, 2009-2025
  • Legislation: 3,934 documents, 6.9 GB, 1981-2025
  • Bills: 4,080 documents, 1.9 GB, 2010-2025

2. Government Publications

  • Gazette Extraordinary (2020s): 45,373 documents, 1.3 GB
  • Gazette Extraordinary (2010s): 56,379 documents, 3.3 GB
  • Cabinet Decisions: 10,385 documents, 136.4 MB
  • Ministry of Finance Press Releases: 134 documents, 144.5 MB

3. News and Media

  • News Documents: 81,155 documents, 1.2 GB, 2021-2025
  • Presidential Media Division Press Releases: 2,182 documents, 55.9 MB

4. Statistics and Reports

  • Tourism Statistics Reports: 161 documents, 405.7 MB
  • Fisheries Statistics Reports: 417 documents, 101.4 MB
  • Central Bank Annual Reports: 1,137 documents, 3.5 GB

Data Collection Pipeline

Technical Architecture

  1. GitHub Actions Orchestration: Multiple runs per day, triggered by cron schedules
  2. Matrix Strategy: Isolation of each data source allowing independent retries
  3. Incremental Updates: Detection of new or modified items through stable keys (URL + date) and content hashing
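
The incremental update step above can be sketched as follows. This is a minimal illustration, not the paper's actual code: the `Item` type, the `stable_key` and `classify` helpers, and the choice of SHA-256 are assumptions made for the example. The idea is that the stable key (URL + date) identifies an item across runs, while the content hash reveals whether a known item has changed.

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class Item:
    """Hypothetical crawled item; field names are assumptions."""
    url: str
    date: str       # ISO date string, e.g. "2025-10-15"
    content: bytes  # raw bytes of the fetched document


def stable_key(item: Item) -> str:
    """Identify an item by URL + date, independent of its content."""
    return f"{item.url}|{item.date}"


def content_hash(item: Item) -> str:
    """Hash the raw bytes so that any content change is detectable."""
    return hashlib.sha256(item.content).hexdigest()


def classify(items, seen):
    """Split items into new / modified / unchanged.

    `seen` maps stable key -> content hash recorded by the previous run.
    """
    result = {"new": [], "modified": [], "unchanged": []}
    for item in items:
        key, digest = stable_key(item), content_hash(item)
        if key not in seen:
            result["new"].append(key)
        elif seen[key] != digest:
            result["modified"].append(key)
        else:
            result["unchanged"].append(key)
    return result
```

Only the "new" and "modified" keys would need to be re-ingested on a given run, which keeps frequent scheduled runs cheap.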

Crawling Implementation

  • Tools: Python + Selenium + headless Chrome browser
  • Dynamic Content Handling: Explicit conditional waits for dynamic content loading
  • Politeness Constraints: Compliance with robots.txt, request rate limiting, and randomized delays
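
The politeness constraints above can be illustrated with a small sketch built on Python's standard `urllib.robotparser`. The `PoliteScheduler` class, the `crawl` helper, the delay values, and the user-agent string are all hypothetical; the paper's crawler uses Selenium with headless Chrome, which here is abstracted into a caller-supplied `fetch` function.

```python
import random
import time
from urllib.robotparser import RobotFileParser


def build_robots(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text that has already been downloaded."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


class PoliteScheduler:
    """Filters URLs by robots.txt and assigns a randomized delay to each."""

    def __init__(self, robots, user_agent, min_delay=2.0, max_jitter=3.0, rng=None):
        self.robots = robots
        self.user_agent = user_agent
        self.min_delay = min_delay      # fixed floor between requests (rate limit)
        self.max_jitter = max_jitter    # upper bound of the random extra delay
        self.rng = rng or random.Random()

    def plan(self, urls):
        """Return (url, delay_seconds) pairs for URLs that robots.txt permits."""
        return [
            (url, self.min_delay + self.rng.uniform(0, self.max_jitter))
            for url in urls
            if self.robots.can_fetch(self.user_agent, url)
        ]


def crawl(scheduler, urls, fetch, sleep=time.sleep):
    """Wait for each planned delay, then call `fetch(url)`."""
    results = {}
    for url, delay in scheduler.plan(urls):
        sleep(delay)
        results[url] = fetch(url)
    return results
```

Passing `sleep` and `fetch` as parameters keeps the waiting and browser logic replaceable, so the filtering and delay behavior can be checked without touching the network.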

Data Processing

  1. PDF Parsing: Text, metadata, and layout block extraction using PyMuPDF
  2. Quality Control: Schema validation, mandatory field enforcement, checksum protection
  3. Version Control: Preservation of original artifacts and parsed JSON representations
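
A minimal sketch of these processing steps, under stated assumptions: the field names in `REQUIRED_FIELDS`, the record layout, and the use of SHA-256 over canonical JSON are illustrative choices, not the paper's schema. The PyMuPDF calls (`fitz.open`, `doc.metadata`, `page.get_text("blocks")`) are real library APIs; the import is deferred so the validation helpers run without PyMuPDF installed.

```python
import hashlib
import json

# Hypothetical mandatory fields; the paper does not list its schema.
REQUIRED_FIELDS = ("doc_id", "source_url", "date", "text")


def extract_pdf(path):
    """Extract metadata and per-page text blocks with PyMuPDF."""
    import fitz  # PyMuPDF, imported lazily

    with fitz.open(path) as doc:
        pages = []
        for page in doc:
            # Each block is (x0, y0, x1, y1, text, block_no, block_type).
            blocks = [
                {"bbox": list(b[:4]), "text": b[4]}
                for b in page.get_text("blocks")
                if b[6] == 0  # 0 = text block, 1 = image block
            ]
            pages.append(blocks)
        return {"metadata": dict(doc.metadata or {}), "pages": pages}


def validate(record):
    """Return a list of problems; an empty list means the record passes."""
    return [f"missing field: {f}" for f in REQUIRED_FIELDS if not record.get(f)]


def with_checksum(record):
    """Attach a SHA-256 checksum computed over canonical (key-sorted) JSON."""
    payload = json.dumps(record, sort_keys=True, ensure_ascii=False).encode("utf-8")
    return {**record, "sha256": hashlib.sha256(payload).hexdigest()}


def verify_checksum(stored):
    """Recompute the checksum and compare it with the stored value."""
    body = {k: v for k, v in stored.items() if k != "sha256"}
    return with_checksum(body)["sha256"] == stored.get("sha256")
```

Sorting keys before hashing makes the checksum independent of dictionary ordering, so the same record always yields the same digest.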

Technical Innovations

  1. Automated Pipeline: Fully automated data collection, processing, and update workflow
  2. Multi-Format Support: Simultaneous handling of HTML and PDF document formats
  3. Incremental Update Mechanism: Efficient change detection and version control
  4. Quality Assurance: Multi-layered data validation and error handling
  5. Transparency Design: Complete metadata recording and auditable data provenance

Experimental Setup

Data Statistics

  • Total Documents: 230,091
  • Total Size: 57.7 GB
  • Number of Datasets: 24
  • Language Coverage: Sinhala, Tamil, English
  • Temporal Span: 1950-2025 (varies by dataset)

Data Quality Assessment

  • Completeness Checks: Mandatory field validation
  • Consistency Validation: Format standardization
  • Duplicate Detection: Content hash-based deduplication
  • Temporal Validity: Date range verification
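
Two of these checks, duplicate detection and date range verification, can be sketched as below. The record fields (`doc_id`, `date`, `text`) and the function names are assumptions for illustration; the paper does not publish its checking code.

```python
import hashlib
from datetime import date


def deduplicate(records):
    """Keep the first record seen for each distinct content hash."""
    seen, unique = set(), []
    for rec in records:
        digest = hashlib.sha256(rec["text"].encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(rec)
    return unique


def out_of_range(records, earliest, latest):
    """Return ids of records whose date is missing, malformed, or outside the range."""
    bad = []
    for rec in records:
        try:
            parsed = date.fromisoformat(rec["date"])
        except (KeyError, TypeError, ValueError):
            bad.append(rec.get("doc_id"))
            continue
        if not earliest <= parsed <= latest:
            bad.append(rec.get("doc_id"))
    return bad
```

For example, a dataset stated to span 2006-2025 could be checked with `out_of_range(records, date(2006, 1, 1), date(2025, 12, 31))`, and any returned ids flagged for review.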

Experimental Results

Dataset Scale Analysis

Category                  Document Count   Data Size   Primary Language
Legal Documents           62,314           36.7 GB     Primarily English
Government Publications   112,473          5.0 GB      Multilingual
News Media                83,337           1.3 GB      Multilingual
Statistical Reports       5,742            14.7 GB     Primarily English

Temporal Coverage Analysis

  • Historical Depth: Earliest documents traceable to 1950 (Central Bank annual reports)
  • Update Frequency: Daily automatic updates
  • Data Freshness: Most datasets cover through October 2025

Language Distribution

  • English: Primary language for official government documents and legal judgments
  • Sinhala: Local news and certain government documents
  • Tamil: Minority language documentation

Related Work

Global Large-Scale Corpora

  • Common Crawl: General-purpose web crawl data
  • Wikipedia Dumps: Periodic exports of Wikipedia content
  • OpenWebText: Open web text corpus

Regional Initiatives

  • Indian Kanoon: Indian legal corpus
  • OpenSubtitles: Multilingual subtitle dataset
  • African News Corpus: News text from African outlets

South Asian Context

  • Existing efforts are fragmented and typically focus on individual media institutions
  • Lack of comprehensive and machine-readable document records
  • Limitations in scale, language coverage, or temporal continuity

Conclusions and Discussion

Main Conclusions

  1. Successfully constructed Sri Lanka's largest-scale multilingual document dataset
  2. Established a sustainable automated data collection and update mechanism
  3. Provided valuable resources for computational linguistics and digital governance research
  4. Ensured data accessibility and reusability through open licensing

Limitations

  1. Language Processing Accuracy: Parsing accuracy for Sinhala and Tamil requires improvement
  2. OCR Capability Constraints: Limited ability to handle scanned or unstructured PDFs
  3. Coverage Scope: Certain government institutions and media sources are not yet included
  4. Data Quality Variance: Quality variations exist across different sources

Future Directions

  1. Expanded Coverage: Addition of more government institutions, media sources, and historical archives
  2. Enhanced Language Processing: Improved tokenization, font handling, and multilingual embeddings for Sinhala and Tamil
  3. Integrated OCR Parsing: Experimentation with deep learning-based OCR pipelines combined with layout recognition and language modeling

In-Depth Evaluation

Strengths

  1. Data Scale and Quality: Large-scale dataset of 230,091 documents covering multiple important domains
  2. Excellent Technical Implementation: Fully automated data pipeline ensuring timeliness and consistency
  3. Openness and Transparency: Complete open access under MIT license, adhering to FAIR principles
  4. Multilingual Support: Valuable resources for low-resource language research
  5. High Practical Value: Addresses actual application needs across multiple research domains

Weaknesses

  1. Lack of Evaluation: Absence of quantitative assessment and validation of data quality
  2. Insufficient Application Examples: No concrete use cases or benchmark results provided
  3. Uneven Language Distribution: English documents dominate with relatively limited coverage of other languages
  4. Insufficient Technical Detail: Some implementation details are described only at a high level

Impact

  1. Academic Contribution: Establishes foundation for digital humanities and computational linguistics research in South Asia
  2. Social Value: Enhances government transparency and supports citizen engagement and oversight
  3. Technical Exemplar: Provides reference for similar data infrastructure development in other developing countries
  4. Sustainability: Establishes sustainable data collection and maintenance mechanisms

Applicable Scenarios

  1. Natural Language Processing: Multilingual model training and evaluation
  2. Legal Technology: Legal document analysis and case law research
  3. Policy Analysis: Government decision tracking and policy change monitoring
  4. Media Research: News trend analysis and sentiment analysis
  5. Digital Governance: E-government and transparency research

References

The paper cites important works across relevant fields, including:

  • Best practices in MLOps and data pipeline construction
  • Open data governance frameworks
  • Ethical and technical standards for web crawling
  • FAIR principles for scientific data management
  • Literature on research reproducibility

Overall Assessment: This is a dataset paper of significant practical value, providing infrastructure for digital research on Sri Lanka and South Asia more broadly. While relatively limited in technical innovation, its contributions in data scale, openness, and sustainability are noteworthy. The work sets a positive example for digital humanities research in low-resource languages and developing countries.