2025-11-11T14:34:09.551839

VayuChat: An LLM-Powered Conversational Interface for Air Quality Data Analytics

Acharya, Pisharodi, Mondal et al.
Air pollution causes about 1.6 million premature deaths each year in India, yet decision makers struggle to turn dispersed data into decisions. Existing tools require expertise and provide static dashboards, leaving key policy questions unresolved. We present VayuChat, a conversational system that answers natural language questions on air quality, meteorology, and policy programs, and responds with both executable Python code and interactive visualizations. VayuChat integrates data from Central Pollution Control Board (CPCB) monitoring stations, state-level demographics, and National Clean Air Programme (NCAP) funding records into a unified interface powered by large language models. Our live demonstration will show how users can perform complex environmental analytics through simple conversations, making data science accessible to policymakers, researchers, and citizens. The platform is publicly deployed at https://huggingface.co/spaces/SustainabilityLabIITGN/VayuChat. For further information, see the video at https://www.youtube.com/watch?v=d6rklL05cs4.

Basic Information

Abstract

Approximately 1.6 million premature deaths occur annually in India due to air pollution, yet policymakers struggle to transform dispersed data into actionable insights. Existing tools require specialized expertise and offer only static dashboards, failing to address critical policy questions. This paper presents VayuChat, a conversational system that answers natural language questions about air quality, meteorology, and policy initiatives and returns executable Python code and interactive visualizations. VayuChat integrates Central Pollution Control Board (CPCB) monitoring station data, state-level demographic data, and National Clean Air Programme (NCAP) funding records through a unified interface powered by large language models. The platform enables policymakers, researchers, and citizens to conduct complex environmental analyses through simple conversation.

Research Background and Motivation

Problem Definition

  1. Severe Public Health Crisis: Air pollution in India causes 1.6 million premature deaths annually, with PM2.5 exposure reducing life expectancy by over 5 years
  2. Data Utilization Barriers: Despite continuous collection of nationwide pollutant measurements by CPCB, converting raw data into timely policy-relevant insights remains challenging
  3. High Technical Barriers: Existing tools require specialized knowledge, offer limited visualization capabilities, or address only narrow task scopes

Limitations of Existing Approaches

  • Require specialized technical skills for operation
  • Provide static dashboards lacking interactivity
  • Cannot handle complex cross-dataset analyses
  • Simple queries such as "How did Delhi's PM2.5 change last year?" remain difficult to answer
  • Policy questions like "Which cities reduced PM2.5 most relative to NCAP funding?" require integrating pollution, funding, and demographic data

Research Motivation

The work leverages the natural language understanding and code generation capabilities of large language models to build a system capable of:

  • Reducing technical barriers to environmental data analysis
  • Providing transparent and reproducible analytical results
  • Integrating heterogeneous multi-source data
  • Supporting complex policy-relevant queries

Core Contributions

  1. Developed the first LLM-driven conversational system for air quality analysis: VayuChat processes natural language queries and generates executable Python code and visualization results
  2. Integrated multi-source environmental data: Incorporates CPCB air quality and meteorological observations (2017-2024), state-level population and area data, and NCAP funding allocation records
  3. Provided transparent code generation mechanisms: Reduces hallucinations by generating Python code rather than direct outputs, ensuring result verifiability and reproducibility
  4. Supported multiple analysis types: direct queries, plot generation, correlation analysis, and policy impact assessment
  5. Validated through practical case studies: Demonstrates system utility through in-depth analysis of Delhi's December 2024 air pollution crisis

Methodology Details

Task Definition

Input: User natural language queries concerning air quality, meteorological data, or policy analysis

Output:

  • Executable Python code
  • Data analysis results (text, tables, or visualization charts)
  • Direct answers to queries

Constraints:

  • Code must be based on predefined dataset schemas
  • Results must be verifiable and reproducible
  • Support comparative evaluation across multiple LLM models

System Architecture

Frontend Interface Design

VayuChat provides a browser-based interface with four core functional modules:

  1. Model Selector: Supports multiple state-of-the-art models (GPT-OSS 20B/120B, Qwen3-32B, Llama series, DeepSeek-R1, Gemini, etc.)
  2. Quick Query Options: Predefined air quality-related question templates
  3. Custom Query Input: Supports arbitrary user natural language queries
  4. Code Display Area: Shows generated Python code, ensuring transparency

Backend Processing Pipeline

User Query → System Prompt Combination → LLM Code Generation → Sandbox Execution → Result Display
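The pipeline above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not VayuChat's implementation: the schema text, the `answer` variable convention, and the names `build_prompt`, `run_pipeline`, `fake_llm`, and `naive_execute` are all hypothetical, and the LLM call is replaced by a stub so the example runs offline.

```python
from typing import Any, Callable, Dict

# Hypothetical schema text; the actual VayuChat system prompt is not given in the source.
SCHEMA = "df columns: Timestamp (datetime), city (str), state (str), PM2.5 (float)"


def build_prompt(user_query: str, schema: str = SCHEMA) -> str:
    """Combine the system prompt (with the dataset schema) and the user query."""
    return (
        "Write pandas code that stores its final result in a variable named `answer`.\n"
        f"Dataset schema: {schema}\n"
        f"Question: {user_query}\n"
    )


def run_pipeline(
    user_query: str,
    llm: Callable[[str], str],
    execute: Callable[[str], Any],
) -> Dict[str, Any]:
    """User query -> prompt -> generated code -> execution -> displayable result."""
    prompt = build_prompt(user_query)
    code = llm(prompt)        # LLM code generation
    result = execute(code)    # sandbox execution (stubbed below)
    return {"code": code, "result": result}  # both are shown to the user


def fake_llm(prompt: str) -> str:
    """Stub standing in for a real model call; always returns the same code."""
    return "answer = 1 + 1"


def naive_execute(code: str) -> Any:
    """Stand-in executor. Emptying __builtins__ is NOT a security boundary."""
    namespace: Dict[str, Any] = {}
    exec(code, {"__builtins__": {}}, namespace)
    return namespace.get("answer")


output = run_pipeline("What is one plus one?", fake_llm, naive_execute)
```

Returning the generated code alongside the result mirrors the transparency goal described for the Code Display Area: the user can inspect exactly what was run.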

Dataset Integration

CPCB Dataset:

  • Time Range: 2017-2024
  • Pollutant Indicators: PM2.5, PM10, NO, NO2, NOx, NH3, SO2, CO, ozone (units: μg/m³, etc.)
  • Meteorological Variables: Temperature, relative humidity, wind speed, wind direction, precipitation, solar radiation, atmospheric pressure, vertical wind speed
  • Station Metadata: City, state, CPCB-assigned station ID

State-Level Demographic Data:

  • Covers 31 Indian regions
  • Includes 2011 census data
  • Area information (km²)
  • Union territory identifiers

NCAP Funding Data:

  • Time Range: 2019-2022
  • Records fund disbursement by fiscal year for each city
  • Funding utilization status as of June 2022
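To illustrate why integrating these sources matters, the sketch below joins pollution and funding tables to rank cities by PM2.5 reduction per unit of funding, the kind of cross-dataset question mentioned earlier. The column names (`city`, `year`, `pm25`, `funding`), the placeholder city labels, and all values are synthetic assumptions; the real dataset schemas are not given in the source.

```python
import pandas as pd

# Synthetic annual-mean PM2.5 values for two placeholder cities.
pm = pd.DataFrame({
    "city": ["CityA", "CityA", "CityB", "CityB"],
    "year": [2019, 2022, 2019, 2022],
    "pm25": [100.0, 80.0, 90.0, 81.0],
})

# Synthetic total funding per city, in arbitrary units.
funding = pd.DataFrame({"city": ["CityA", "CityB"], "funding": [10.0, 3.0]})


def reduction_per_funding(pm: pd.DataFrame, funding: pd.DataFrame,
                          start: int = 2019, end: int = 2022) -> pd.DataFrame:
    """Rank cities by PM2.5 reduction between two years per unit of funding."""
    # One row per city, one column per year.
    annual = pm.pivot_table(index="city", columns="year", values="pm25", aggfunc="mean")
    reduction = (annual[start] - annual[end]).rename("pm25_reduction")
    # An inner join keeps only cities present in both datasets.
    merged = funding.merge(reduction.reset_index(), on="city", how="inner")
    merged["reduction_per_unit"] = merged["pm25_reduction"] / merged["funding"]
    return merged.sort_values("reduction_per_unit", ascending=False).reset_index(drop=True)


ranking = reduction_per_funding(pm, funding)
```

A ranking like this describes association only; it does not by itself show that funding caused the reduction.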

Technical Innovations

1. Code Generation-Based Hallucination Reduction Mechanism

Traditional approaches directly providing raw tabular data to LLMs are prone to hallucinations. VayuChat employs the following strategies:

  • Provides dataset schema descriptions in system prompts
  • LLM generates Python code rather than direct answers
  • Computes results by executing the code against the actual data, so each number is traceable to a computation rather than to model recall
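One way to realize the first strategy is to render only column names and types into the prompt, so that no data values ever reach the model. The helper below is a hypothetical sketch (`describe_schema` and the toy columns are assumptions, not the paper's prompt format).

```python
import pandas as pd


def describe_schema(df: pd.DataFrame, name: str = "df") -> str:
    """Render column names and dtypes as prompt text, without any row values."""
    lines = [f"DataFrame `{name}` has {df.shape[1]} columns:"]
    for column, dtype in df.dtypes.items():
        lines.append(f"- {column}: {dtype}")
    return "\n".join(lines)


# Toy frame with made-up values; only its structure is described.
toy = pd.DataFrame({
    "station_id": [1, 2],
    "PM2.5": [41.5, 87.25],
    "wind_speed": [1.2, 0.6],
})
schema_text = describe_schema(toy)
```

Because the model sees structure but not values, any figure in the final answer has to come from running the generated code.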

2. Multi-Model Support Architecture

  • Integrates open-source models (via Groq Cloud API) and commercial models (via Gemini API)
  • Supports comparative model performance evaluation
  • Enables selection of optimal models for different query types
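Comparative evaluation across models can be sketched as running one query through several code generators and checking each executed result against a reference value. Everything here is hypothetical: the model names, the stub generators, and the toy evaluator are placeholders, and no real Groq or Gemini API call is made.

```python
from typing import Any, Callable, Dict


def compare_models(query: str,
                   models: Dict[str, Callable[[str], str]],
                   execute: Callable[[str], Any],
                   expected: Any) -> Dict[str, Dict[str, Any]]:
    """Run the same query through each model and record whether the result matches."""
    report: Dict[str, Dict[str, Any]] = {}
    for name, generate in models.items():
        try:
            result = execute(generate(query))
            report[name] = {"ok": result == expected, "result": result, "error": None}
        except Exception as exc:  # generated code may be invalid or fail at runtime
            report[name] = {"ok": False, "result": None, "error": repr(exc)}
    return report


# Stub "models": each maps a query to a Python expression (placeholders only).
stub_models = {
    "model_a": lambda query: "6 * 7",
    "model_b": lambda query: "6 +",   # deliberately invalid code
}


def toy_execute(code: str) -> Any:
    """Toy evaluator for the stubs; not a sandbox."""
    return eval(code, {"__builtins__": {}})


report = compare_models("What is six times seven?", stub_models, toy_execute, expected=42)
```

Recording failures instead of raising them keeps one broken generation from hiding how the other models performed.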

3. Secure Code Execution Environment

  • Executes generated code in sandboxed environments
  • Prevents potential system security risks
  • Automatically captures execution results and integrates them into responses
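The source does not describe how the sandbox is built. As a rough illustration only, the sketch below runs generated code in a separate Python process with a timeout and captures its output. The function name `run_in_subprocess` is hypothetical, and process isolation with a timeout is a weaker guarantee than a real sandbox (it does not restrict file or network access).

```python
import subprocess
import sys
import tempfile
from pathlib import Path


def run_in_subprocess(code: str, timeout_s: float = 5.0) -> dict:
    """Run code in a child interpreter and capture stdout, stderr, and success."""
    with tempfile.TemporaryDirectory() as workdir:
        script = Path(workdir) / "generated.py"
        script.write_text(code, encoding="utf-8")
        try:
            proc = subprocess.run(
                [sys.executable, "-I", str(script)],  # -I: isolated mode
                cwd=workdir,
                capture_output=True,
                text=True,
                timeout=timeout_s,
            )
        except subprocess.TimeoutExpired:
            # The child is killed by subprocess.run before the exception propagates.
            return {"ok": False, "stdout": "", "stderr": "timed out"}
    return {"ok": proc.returncode == 0, "stdout": proc.stdout, "stderr": proc.stderr}


good = run_in_subprocess("print(2 + 3)")
bad = run_in_subprocess("raise ValueError('boom')")
```

Capturing `stderr` as well as `stdout` lets the interface report execution failures to the user instead of failing silently.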

Experimental Setup

Dataset Details

CPCB Monitoring Network:

  • Covers 500+ monitoring stations nationwide
  • 37 monitoring stations in Delhi used for case study
  • Daily measurement frequency with quality control flags

Evaluation Benchmarks:

  • Constructed VayuBench evaluation benchmark (detailed content beyond paper scope)
  • Collaborated with air quality analysis experts for real-world scenario validation

System Capability Assessment

VayuChat supports three primary query categories:

Direct Queries:

  • "Which city had the highest PM2.5 in 2023?"
  • "Show Delhi's SO2 levels"

Plot Generation:

  • "Plot Mumbai's PM2.5 trend"
  • "Compare ozone levels between Punjab and Gujarat"

Analysis Queries:

  • "Analyze the correlation between wind speed and PM2.5"
  • "Assess NCAP's impact on air quality"

Experimental Results

Delhi Air Quality Crisis Case Study

The paper demonstrates VayuChat's practical value through a collaboration with air quality analysts investigating the causes of the severe pollution surge in Delhi in December 2024.

1. Most Severe Pollution Days Identification

Query: "Which days in December 2024 had the worst pollution in Delhi?"

Results:

Date          PM2.5 (μg/m³)
2024-12-18    344.59
2024-12-19    341.46
2024-12-17    330.25
2024-12-20    291.46
2024-12-22    285.98
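A ranking like the one above could be produced by averaging across stations per day and taking the largest values. The sketch below assumes hypothetical column names (`Timestamp`, `city`, `station`, `PM2.5`) and uses synthetic numbers, not the CPCB measurements.

```python
import pandas as pd

# Synthetic readings from two stations over three days.
df = pd.DataFrame({
    "Timestamp": pd.to_datetime(
        ["2024-12-01", "2024-12-02", "2024-12-03"] * 2
    ),
    "city": ["Delhi"] * 6,
    "station": ["S1"] * 3 + ["S2"] * 3,
    "PM2.5": [100.0, 300.0, 200.0, 120.0, 320.0, 180.0],
})


def worst_days(df: pd.DataFrame, city: str, year: int, month: int, n: int = 5) -> pd.Series:
    """Return the n days with the highest station-averaged PM2.5 for a city and month."""
    ts = df["Timestamp"]
    subset = df[(df["city"] == city) & (ts.dt.year == year) & (ts.dt.month == month)]
    # Average all stations for each calendar day, then keep the n largest.
    daily = subset.groupby(subset["Timestamp"].dt.normalize())["PM2.5"].mean()
    return daily.nlargest(n)


top2 = worst_days(df, "Delhi", 2024, 12, n=2)
```

Whether to average across stations or take the maximum is an analysis choice that changes the ranking, which is one reason the generated code is worth displaying.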

2. Wind Speed and Pollution Relationship Analysis

Query: "Use time series plots to compare pollution levels and wind speed during Delhi's most polluted week in December 2024 with the 15 days before and after"

Key Findings:

  • Clear negative correlation between wind speed and PM2.5
  • PM2.5 exceeds 300 μg/m³ when wind speed drops below 1.0 m/s
  • Even modest wind speed decreases (0.6 m/s) can rapidly degrade air quality from "very poor" to "severe"
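The "very poor" and "severe" labels refer to India's air quality index categories. The sketch below maps a 24-hour PM2.5 concentration to a category using breakpoints of 30, 60, 90, 120, and 250 μg/m³. These breakpoints are an assumption not stated in the source and should be checked against CPCB's published National AQI tables before use.

```python
import bisect

# Assumed upper bounds (μg/m³) for each PM2.5 category; verify against CPCB tables.
UPPER_BOUNDS = [30, 60, 90, 120, 250]
LABELS = ["good", "satisfactory", "moderate", "poor", "very poor", "severe"]


def pm25_category(concentration: float) -> str:
    """Map a 24-hour mean PM2.5 concentration to an assumed AQI category label."""
    if concentration < 0:
        raise ValueError("concentration must be non-negative")
    # bisect_left places a value equal to a bound in the lower category.
    return LABELS[bisect.bisect_left(UPPER_BOUNDS, concentration)]


peak_category = pm25_category(344.59)   # peak daily value from the case study
typical_category = pm25_category(200.0)
```

Under these assumed breakpoints, every value in the top-five table from the case study falls in the "severe" band.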

3. Five-Year Historical Comparison

Query: "Plot and compare Delhi's pollution during the polluted week in December 2024 with data from the previous five years"

Findings:

  • 2024 wind speeds showed slight improvement compared to previous years
  • 2019 and 2020 exhibited strong negative correlation between PM2.5 and wind speed
  • 2023 recorded the lowest wind speed (0.6 m/s)
  • 2021 had the highest PM2.5 levels (325 μg/m³)

4. Multi-Pollutant Correlation Analysis

Query: "Analyze the correlation between CO, NO2, and PM2.5 in Delhi during December since 2017"

Correlation Matrix:

Pollutant   CO      NO2     PM2.5
CO          1       0.3     0.47
NO2         0.3     1       0.34
PM2.5       0.47    0.34    1

Insights: PM2.5 shows its strongest correlation with CO (r=0.47), a moderate association that is consistent with common sources such as vehicular emissions, stubble burning, and industrial emissions driving synchronized pollution events.
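A correlation matrix of this kind is typically computed with pandas. The sketch below uses synthetic data with a shared driver to mimic a common source; the column names match the pollutants above, but the values are random and unrelated to the CPCB measurements. It also computes r² for the reported r = 0.47, which is about 0.22, meaning a linear fit on CO would account for roughly 22% of the variance in PM2.5.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 500

# A shared latent driver stands in for a common emission source (synthetic).
shared = rng.normal(size=n)
toy = pd.DataFrame({
    "CO": shared + rng.normal(size=n),
    "NO2": 0.5 * shared + rng.normal(size=n),
    "PM2.5": shared + rng.normal(size=n),
})

# Pairwise Pearson correlations; the matrix is symmetric with ones on the diagonal.
corr = toy[["CO", "NO2", "PM2.5"]].corr(method="pearson")

# Share of variance explained by a linear fit for the reported r = 0.47.
r_squared = 0.47 ** 2
```

Correlation of this size supports a shared-source hypothesis but does not establish it; confounders such as meteorology affect all three pollutants at once.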

System Performance

  • Successfully handled complex multi-step analytical queries
  • Generated accurate visualization charts
  • Provided verifiable Python code
  • Supported complete analytical workflows from crisis identification to mechanistic insights

Related Work

Environmental Data Analysis Tools

  • openair R Package: Professional air quality data analysis tool requiring R programming skills
  • CPCB Official Dashboard: Provides real-time data but with limited functionality and shallow analytical capabilities
  • Traditional BI Tools: Require specialized configuration expertise and struggle with natural language queries

LLM Code Generation

  • GPT-3/4 Code Capabilities: Excellent performance on general programming tasks but lacking domain-specific optimization
  • Instruction-Following Models: Show promise on tabular reasoning tasks but limited environmental domain applications
  • Zero-Shot Table Reasoning: Related techniques provide foundation for this paper's approach

Conversational Data Analysis

The paper positions VayuChat as the first LLM-driven conversational system designed specifically for environmental data analysis, filling a gap between general-purpose code generation and domain-specific air quality tools.

Conclusions and Discussion

Main Conclusions

  1. Technical Feasibility: LLMs can effectively handle complex environmental data analysis queries, with code generation mechanisms ensuring result accuracy
  2. Practical Value: System successfully supported in-depth analysis of Delhi's air pollution crisis, demonstrating real-world application potential
  3. Improved Accessibility: Significantly reduces technical barriers to environmental data analysis, enabling non-technical users to conduct complex analyses

Limitations

  1. Data Coverage Scope: Currently primarily based on Indian CPCB data with limited geographic coverage
  2. Insufficient Real-Time Capability: Has not yet integrated real-time data streams; analyses based on historical data
  3. Model Dependency: System performance depends on underlying LLM code generation capabilities
  4. Complex Query Processing: Automatic query decomposition and multi-step reasoning not yet implemented

Future Directions

  1. Real-Time Data Integration: Integrate real-time air quality data streams via APIs
  2. Data Expansion: Add ERA5 reanalysis data, satellite products, land use, and emission inventories
  3. Model Fine-Tuning: Specialized optimization for environmental domain
  4. Automatic Reasoning Workflows: Implement automatic decomposition and multi-step analysis for complex queries

In-Depth Evaluation

Strengths

  1. Strong Innovation: First LLM-driven conversational analysis system for environmental data with novel technical approach
  2. High Practical Value: Delhi pollution case study demonstrates real-world application value with important policy implications
  3. Sound Technical Design: Code generation approach to reduce hallucinations is scientifically feasible
  4. Complete System Integration: Forms an end-to-end pipeline from data integration and model selection to result presentation
  5. High Transparency: Provides generated code ensuring result verifiability and reproducibility

Weaknesses

  1. Insufficient Evaluation: VayuBench details not presented in paper; lacks quantitative performance metrics
  2. Limited Case Studies: Primarily based on single Delhi case; lacks broader validation
  3. Insufficient Technical Details: Key technical details like LLM fine-tuning and prompt engineering inadequately described
  4. Error Handling Mechanisms: Lacks detailed discussion of code generation error or execution failure handling strategies
  5. User Experience Assessment: Lacks real user feedback and satisfaction evaluation

Impact

  1. Academic Contribution: Provides important reference for LLM applications in environmental science
  2. Social Value: Improves environmental data utilization efficiency, supporting better policy decisions
  3. Technical Demonstration: Provides design insights for other specialized data analysis systems
  4. Openness: Public system deployment promotes technology dissemination and application

Applicable Scenarios

  1. Government Decision-Making: Environmental department policy formulation and project evaluation
  2. Academic Research: Environmental science and public health research
  3. News Media: Data-driven environmental reporting
  4. Public Education: Raising public awareness of air quality issues
  5. NGOs: Environmental monitoring and advocacy activities

References

The paper cites 15 relevant references covering LLM foundational technologies, environmental data analysis tools, and health impacts of air pollution, providing sufficient theoretical foundation and comparative references.


Overall Assessment: This is a strong paper that combines technical innovation with practical application and is an early example of LLM use in environmental science. The system design is sound, the case study analysis is thorough, and the work addresses real environmental data utilization challenges in developing countries such as India. While the evaluation and technical details leave room for improvement, the overall contribution is substantial and the system has good prospects for wider adoption.