VayuChat: An LLM-Powered Conversational Interface for Air Quality Data Analytics

Acharya, Pisharodi, Mondal et al.

Air pollution causes about 1.6 million premature deaths each year in India, yet decision makers struggle to turn dispersed data into decisions. Existing tools require expertise and provide static dashboards, leaving key policy questions unresolved. We present VayuChat, a conversational system that answers natural language questions on air quality, meteorology, and policy programs, and responds with both executable Python code and interactive visualizations. VayuChat integrates data from Central Pollution Control Board (CPCB) monitoring stations, state-level demographics, and National Clean Air Programme (NCAP) funding records into a unified interface powered by large language models. Our live demonstration will show how users can perform complex environmental analytics through simple conversations, making data science accessible to policymakers, researchers, and citizens. The platform is publicly deployed at https://huggingface.co/spaces/SustainabilityLabIITGN/VayuChat. For further information, see the demonstration video at https://www.youtube.com/watch?v=d6rklL05cs4.

Basic Information

Paper ID: 2511.01046
Title: VayuChat: An LLM-Powered Conversational Interface for Air Quality Data Analytics
Authors: Vedant Acharya, Abhay Pisharodi, Rishabh Mondal, Mohammad Rafiuddin, Nipun Batra
Classification: cs.CL (Computation and Language)
Publication Venue/Conference: CODS 2025 (13th International Conference on Data Science)
Paper Link: https://arxiv.org/abs/2511.01046
System Deployment: https://huggingface.co/spaces/SustainabilityLabIITGN/VayuChat

Abstract

Approximately 1.6 million premature deaths occur annually in India due to air pollution, yet policymakers struggle to transform dispersed data into actionable insights. Existing tools require specialized expertise and offer only static dashboards, failing to address critical policy questions. This paper presents VayuChat, a conversational system that answers natural language questions about air quality, meteorology, and policy initiatives and returns executable Python code and interactive visualizations. VayuChat integrates Central Pollution Control Board (CPCB) monitoring station data, state-level demographic data, and National Clean Air Programme (NCAP) funding records through a unified interface powered by large language models. The platform enables policymakers, researchers, and citizens to conduct complex environmental analyses through simple conversation.

Research Background and Motivation

Problem Definition

Severe Public Health Crisis: Air pollution in India causes about 1.6 million premature deaths annually, with PM2.5 exposure reducing life expectancy by over 5 years
Data Utilization Barriers: Despite continuous collection of nationwide pollutant measurements by CPCB, converting raw data into timely policy-relevant insights remains challenging
High Technical Barriers: Existing tools require specialized knowledge, offer limited visualization capabilities, or address only narrow task scopes

Limitations of Existing Approaches

Require specialized technical skills for operation
Provide static dashboards lacking interactivity
Cannot handle complex cross-dataset analyses
Simple queries such as "How did Delhi's PM2.5 change last year?" remain difficult to answer
Policy questions like "Which cities reduced PM2.5 most relative to NCAP funding?" require integrating pollution, funding, and demographic data

Research Motivation

Leveraging the natural language understanding and code generation capabilities of large language models to construct a system capable of:

Reducing technical barriers to environmental data analysis
Providing transparent and reproducible analytical results
Integrating heterogeneous multi-source data
Supporting complex policy-relevant queries

Core Contributions

Developed the first LLM-driven conversational system for air quality analysis: VayuChat processes natural language queries and generates executable Python code and visualization results
Integrated multi-source environmental data: Incorporates CPCB air quality and meteorological observations (2017-2024), state-level population and area data, and NCAP funding allocation records
Provided transparent code generation mechanisms: Reduces hallucinations by generating Python code rather than direct outputs, ensuring result verifiability and reproducibility
Supported multiple analysis types: Direct queries, plot generation, correlation analysis, and policy impact assessment
Validated through practical case studies: Demonstrates system utility through in-depth analysis of Delhi's December 2024 air pollution crisis

Methodology Details

Task Definition

Input: User natural language queries concerning air quality, meteorological data, or policy analysis

Output:

Executable Python code
Data analysis results (text, tables, or visualization charts)
Direct answers to queries

Constraints:

Code must be based on predefined dataset schemas
Results must be verifiable and reproducible
Support comparative evaluation across multiple LLM models

System Architecture

Frontend Interface Design

VayuChat provides a browser-based interface with four core functional modules:

Model Selector: Supports multiple state-of-the-art models (GPT-OSS 20B/120B, Qwen3-32B, Llama series, DeepSeek-R1, Gemini, etc.)
Quick Query Options: Predefined air quality-related question templates
Custom Query Input: Supports arbitrary user natural language queries
Code Display Area: Shows generated Python code, ensuring transparency

Backend Processing Pipeline

User Query → System Prompt Combination → LLM Code Generation → Sandbox Execution → Result Display
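
The pipeline above can be sketched end to end in a few lines. This is a minimal illustration under stated assumptions, not VayuChat's implementation: `fake_llm` stands in for a hosted model, the schema string and the `result` variable convention are hypothetical, and the in-process `exec` call is not a real sandbox.

```python
import contextlib
import io
from typing import Callable

# Hypothetical schema text; the real system prompt is not given in the source.
SCHEMA = "List `aq` of dicts with keys: date (YYYY-MM-DD), city (str), pm25 (float)"


def build_prompt(query: str, schema: str = SCHEMA) -> str:
    """Combine the system prompt (schema plus rules) with the user query."""
    return (
        "You write Python that answers questions about the data below.\n"
        f"{schema}\n"
        "Store the final answer in a variable named `result`.\n"
        f"Question: {query}"
    )


def fake_llm(prompt: str) -> str:
    """Stand-in for the code-generation step; a deployment would call a model."""
    return "result = max(aq, key=lambda row: row['pm25'])['city']"


def execute(code: str, data: dict) -> dict:
    """Run generated code in its own namespace and capture anything printed."""
    namespace = dict(data)
    buffer = io.StringIO()
    with contextlib.redirect_stdout(buffer):
        exec(code, namespace)  # illustration only: NOT a security boundary
    return {"result": namespace.get("result"), "stdout": buffer.getvalue()}


def answer(query: str, llm: Callable[[str], str], data: dict) -> dict:
    """Query -> prompt -> generated code -> execution -> displayable payload."""
    code = llm(build_prompt(query))
    return {"code": code, **execute(code, data)}


rows = [
    {"date": "2023-01-01", "city": "Delhi", "pm25": 210.0},
    {"date": "2023-01-01", "city": "Mumbai", "pm25": 95.0},
]
response = answer("Which city had the highest PM2.5?", fake_llm, {"aq": rows})
print(response["code"])
print(response["result"])
```

Returning the generated code alongside the result mirrors the Code Display Area: the user sees both the answer and how it was computed.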

Dataset Integration

CPCB Dataset:

Time Range: 2017-2024
Pollutant Indicators: PM2.5, PM10, NO, NO2, NOx, NH3, SO2, CO, ozone (units: μg/m³, etc.)
Meteorological Variables: Temperature, relative humidity, wind speed, wind direction, precipitation, solar radiation, atmospheric pressure, vertical wind speed
Station Metadata: City, state, CPCB-assigned station ID

State-Level Demographic Data:

Covers 31 Indian regions
Includes 2011 census data
Area information (km²)
Union territory identifiers

NCAP Funding Data:

Time Range: 2019-2022
Records fund disbursement by fiscal year for each city
Funding utilization status as of June 2022
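
A cross-dataset question such as "Which cities reduced PM2.5 most relative to NCAP funding?" amounts to joining the three sources on shared keys. The sketch below uses pandas with toy values and hypothetical column names (`pm25`, `funds_released`, `area_km2`, and so on); the actual schemas are not given in the source.

```python
import pandas as pd

# Toy rows only; none of these numbers come from the real datasets.
air = pd.DataFrame({
    "city": ["Delhi", "Delhi", "Pune", "Pune"],
    "state": ["Delhi", "Delhi", "Maharashtra", "Maharashtra"],
    "year": [2019, 2022, 2019, 2022],
    "pm25": [110.0, 99.0, 50.0, 40.0],
})
states = pd.DataFrame({
    "state": ["Delhi", "Maharashtra"],
    "population": [1000, 2000],
    "area_km2": [10.0, 40.0],
})
funding = pd.DataFrame({
    "city": ["Delhi", "Pune"],
    "funds_released": [10.0, 5.0],
})

# PM2.5 reduction per city between 2019 and 2022.
pivot = air.pivot_table(index=["city", "state"], columns="year", values="pm25")
change = (pivot[2019] - pivot[2022]).rename("pm25_reduction").reset_index()

# Join pollution change with funding (by city) and demographics (by state).
merged = change.merge(funding, on="city", how="left").merge(states, on="state", how="left")
merged["reduction_per_unit_funding"] = merged["pm25_reduction"] / merged["funds_released"]

ranked = merged.sort_values("reduction_per_unit_funding", ascending=False)
print(ranked[["city", "pm25_reduction", "funds_released", "reduction_per_unit_funding"]])
```

Left joins keep cities that have pollution data but no funding record, so gaps show up as missing values rather than silently dropped rows.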

Technical Innovations

1. Code Generation-Based Hallucination Reduction Mechanism

Traditional approaches directly providing raw tabular data to LLMs are prone to hallucinations. VayuChat employs the following strategies:

Provides dataset schema descriptions in system prompts
LLM generates Python code rather than direct answers
Ensures result accuracy through code execution
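
One way to picture the first strategy is a prompt that describes columns but contains no data rows, so the model has nothing to misquote and must compute every number in code. The column names, descriptions, and prompt wording below are assumptions made for illustration; the paper's actual prompt is not reproduced in this summary.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Column:
    name: str
    dtype: str
    description: str


# Hypothetical column names; the source lists the variables, not the schema.
AIR_QUALITY_SCHEMA = [
    Column("Timestamp", "datetime", "daily measurement date"),
    Column("station", "str", "CPCB-assigned station ID"),
    Column("city", "str", "city of the station"),
    Column("state", "str", "state of the station"),
    Column("PM2.5", "float", "fine particulate matter in ug/m3"),
    Column("WS", "float", "wind speed in m/s"),
]


def render_schema(table: str, columns: list[Column]) -> str:
    """Describe a table as text without including any of its rows."""
    lines = [f"DataFrame `{table}` has these columns:"]
    lines += [f"- {c.name} ({c.dtype}): {c.description}" for c in columns]
    return "\n".join(lines)


def build_system_prompt(table: str, columns: list[Column]) -> str:
    """Assemble instructions that push the model toward code, not answers."""
    return "\n".join([
        "You are a data analyst. Answer by writing Python code only.",
        "Never state numeric values yourself; compute them from the data.",
        render_schema(table, columns),
        "Use only the columns listed above.",
    ])


prompt = build_system_prompt("df", AIR_QUALITY_SCHEMA)
print(prompt)
```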

2. Multi-Model Support Architecture

Integrates open-source models (via Groq Cloud API) and commercial models (via Gemini API)
Supports comparative model performance evaluation
Enables selection of optimal models for different query types
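
Comparative evaluation can be as simple as sending one prompt to several backends and recording whether each generated program runs and returns the expected value. In this sketch `model_a` and `model_b` are hypothetical stand-ins for provider clients, and no real Groq or Gemini API calls are shown.

```python
from typing import Callable


def model_a(prompt: str) -> str:
    """Stand-in backend that returns working code."""
    return "result = sum(values) / len(values)"


def model_b(prompt: str) -> str:
    """Stand-in backend whose code fails: `mean` is never imported."""
    return "result = mean(values)"


MODELS: dict[str, Callable[[str], str]] = {"model-a": model_a, "model-b": model_b}


def run_generated(code: str, variables: dict) -> tuple[bool, object]:
    """Execute generated code; return (succeeded, result or error message)."""
    namespace = dict(variables)
    try:
        exec(code, namespace)  # illustration only, not a sandbox
    except Exception as exc:
        return False, f"{type(exc).__name__}: {exc}"
    return True, namespace.get("result")


def compare(prompt: str, variables: dict, expected: float) -> dict[str, dict]:
    """Run every registered model on the same prompt and score the outcome."""
    report = {}
    for name, generate in MODELS.items():
        ok, value = run_generated(generate(prompt), variables)
        report[name] = {"executed": ok, "correct": ok and value == expected, "value": value}
    return report


report = compare("Average PM2.5?", {"values": [100.0, 200.0, 300.0]}, expected=200.0)
for name, row in report.items():
    print(name, row)
```

Separating "executed" from "correct" distinguishes models that produce broken code from models that produce runnable code with the wrong answer.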

3. Secure Code Execution Environment

Executes generated code in sandboxed environments
Prevents potential system security risks
Automatically captures execution results and integrates them into responses
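
The source does not say how the sandbox is built. One common lightweight pattern, shown here as an assumption rather than as VayuChat's mechanism, is to run generated code in a child interpreter with a timeout and capture its output. A child process limits damage from crashes and infinite loops but is not a complete security boundary on its own.

```python
import subprocess
import sys


def run_isolated(code: str, timeout_s: float = 5.0) -> dict:
    """Run code in a separate Python process and capture stdout and errors."""
    try:
        proc = subprocess.run(
            # -I is isolated mode: ignores environment variables and user site-packages.
            [sys.executable, "-I", "-c", code],
            capture_output=True,
            text=True,
            timeout=timeout_s,
        )
    except subprocess.TimeoutExpired:
        return {"ok": False, "stdout": "", "error": f"timed out after {timeout_s}s"}
    if proc.returncode != 0:
        # Keep only the final traceback line as a compact error message.
        lines = proc.stderr.strip().splitlines()
        return {"ok": False, "stdout": proc.stdout, "error": lines[-1] if lines else "unknown error"}
    return {"ok": True, "stdout": proc.stdout, "error": None}


good = run_isolated("print(2 + 3)")
bad = run_isolated("raise ValueError('no such column')")
print(good)
print(bad)
```

The structured return value lets the caller show either the captured output or a short error message in the chat response.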

Experimental Setup

Dataset Details

CPCB Monitoring Network:

Covers 500+ monitoring stations nationwide
37 monitoring stations in Delhi used for case study
Daily measurement frequency with quality control flags

Evaluation Benchmarks:

Constructed the VayuBench evaluation benchmark (details are outside the scope of the paper)
Collaborated with air quality analysis experts for real-world scenario validation

System Capability Assessment

VayuChat supports three primary query categories:

Direct Queries:

"Which city had the highest PM2.5 in 2023?"
"Show Delhi's SO2 levels"

Plot Generation:

"Plot Mumbai's PM2.5 trend"
"Compare ozone levels between Punjab and Gujarat"

Analysis Queries:

"Analyze the correlation between wind speed and PM2.5"
"Assess NCAP's impact on air quality"

Experimental Results

Delhi Air Quality Crisis Case Study

The paper demonstrates VayuChat's practical value through a collaboration with air quality analysts investigating the causes of the severe pollution surge in Delhi in December 2024.

1. Most Severe Pollution Days Identification

Query: "Which days in December 2024 had the worst pollution in Delhi?"

Results:

Date	PM2.5 (μg/m³)
2024-12-18	344.59
2024-12-19	341.46
2024-12-17	330.25
2024-12-20	291.46
2024-12-22	285.98
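
A ranking like the one above can come from a short pandas program of the kind the system is described as generating. The sketch below is an assumption about what such code might look like: the column names are hypothetical, the readings are synthetic, and averaging across stations to get a city-level daily value is one plausible aggregation, not necessarily the one used in the paper.

```python
import pandas as pd

# Synthetic station readings; values are illustrative only.
readings = pd.DataFrame({
    "Timestamp": pd.to_datetime([
        "2024-12-17", "2024-12-17", "2024-12-18",
        "2024-12-18", "2024-12-19", "2024-11-30",
    ]),
    "city": ["Delhi"] * 6,
    "station": ["S1", "S2", "S1", "S2", "S1", "S1"],
    "PM2.5": [320.0, 340.0, 350.0, 338.0, 341.0, 400.0],
})


def worst_days(df: pd.DataFrame, city: str, year: int, month: int, n: int = 5) -> pd.Series:
    """Return the n days with the highest station-averaged PM2.5."""
    ts = df["Timestamp"]
    mask = (df["city"] == city) & (ts.dt.year == year) & (ts.dt.month == month)
    daily = df.loc[mask].groupby("Timestamp")["PM2.5"].mean()
    return daily.nlargest(n).round(2)


top = worst_days(readings, "Delhi", 2024, 12)
print(top)
```

The November reading is the highest value in the toy data but is excluded by the month filter, which is the kind of detail a reader can verify directly from the displayed code.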

2. Wind Speed and Pollution Relationship Analysis

Query: "Use time series plots to compare pollution levels and wind speed during Delhi's most polluted week in December 2024 with the 15 days before and after"

Key Findings:

Clear negative correlation between wind speed and PM2.5
PM2.5 exceeds 300 μg/m³ when wind speed drops below 1.0 m/s
Even modest wind speed decreases (0.6 m/s) can rapidly degrade air quality from "very poor" to "severe"
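
The window used in this query (the most polluted week plus 15 days on either side) and the sign of the relationship can be computed as follows. The series is synthetic and deliberately built so that PM2.5 falls as wind speed rises; the column names `WS` and `PM2.5` are assumptions.

```python
import numpy as np
import pandas as pd

# Synthetic daily data: PM2.5 is constructed to decrease with wind speed.
days = pd.date_range("2024-11-15", "2025-01-15", freq="D")
rng = np.random.default_rng(0)
wind = pd.Series(1.5 + 0.8 * np.cos(np.arange(len(days)) / 5.0), index=days)
pm25 = 450.0 - 150.0 * wind + rng.normal(0.0, 5.0, len(days))
daily = pd.DataFrame({"WS": wind, "PM2.5": pm25})

# Most polluted week: the 7-day rolling mean peaks on the week's last day.
rolling = daily["PM2.5"].rolling(7).mean()
week_end = rolling.idxmax()
week_start = week_end - pd.Timedelta(days=6)

# Extend the window by 15 days on each side (clipped to available data).
pad = pd.Timedelta(days=15)
window = daily.loc[week_start - pad: week_end + pad]

# Pearson correlation between wind speed and PM2.5 inside the window.
r = window["WS"].corr(window["PM2.5"])
print(week_start.date(), week_end.date(), round(r, 2))
```

The same `window` frame can be passed to a plotting library to draw the two time series on a shared date axis.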

3. Five-Year Historical Comparison

Query: "Plot and compare Delhi's pollution during the polluted week in December 2024 with data from the previous five years"

Findings:

2024 wind speeds showed slight improvement compared to previous years
2019 and 2020 exhibited strong negative correlation between PM2.5 and wind speed
2023 recorded the lowest wind speed (0.6 m/s)
2021 had the highest PM2.5 levels (325 μg/m³)
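
A year-over-year comparison of the same calendar week reduces to a date filter followed by a group-by on year. The helper name, column names, and toy values below are hypothetical and unrelated to the figures reported above.

```python
import pandas as pd


def same_week_by_year(daily: pd.DataFrame, month: int, start_day: int, end_day: int) -> pd.DataFrame:
    """Average PM2.5 and wind speed over the same calendar days in each year."""
    idx = daily.index
    mask = (idx.month == month) & (idx.day >= start_day) & (idx.day <= end_day)
    week = daily.loc[mask]
    return week.groupby(week.index.year)[["PM2.5", "WS"]].mean()


# Toy data: one constant value per year, plus one out-of-window day per year.
frames = []
for year, (pm, ws) in {2022: (250.0, 1.0), 2023: (300.0, 0.5), 2024: (280.0, 1.5)}.items():
    in_week = pd.date_range(f"{year}-12-16", f"{year}-12-22", freq="D")
    frames.append(pd.DataFrame({"PM2.5": pm, "WS": ws}, index=in_week))
    frames.append(pd.DataFrame({"PM2.5": [999.0], "WS": [9.9]}, index=[pd.Timestamp(f"{year}-12-01")]))
daily = pd.concat(frames).sort_index()

summary = same_week_by_year(daily, 12, 16, 22)
print(summary)
print("Highest PM2.5 year:", summary["PM2.5"].idxmax())
print("Lowest wind speed year:", summary["WS"].idxmin())
```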

4. Multi-Pollutant Correlation Analysis

Query: "Analyze the correlation between CO, NO2, and PM2.5 in Delhi during December since 2017"

Correlation Matrix:

Pollutant	CO	NO2	PM2.5
CO	1	0.3	0.47
NO2	0.3	1	0.34
PM2.5	0.47	0.34	1

Insights: PM2.5 correlates most strongly with CO (r=0.47), a moderate relationship suggesting that common sources such as vehicular emissions, stubble burning, and industrial emissions drive synchronized pollution events.
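
A matrix of this form is what pandas produces from `DataFrame.corr` after filtering to December rows. The sketch below uses synthetic data in which the three pollutants share a common driver, so its coefficients are illustrative and unrelated to the values in the table; the column names are assumptions.

```python
import numpy as np
import pandas as pd

# Synthetic daily data: each pollutant is a shared signal plus its own noise.
days = pd.date_range("2017-01-01", "2019-12-31", freq="D")
rng = np.random.default_rng(42)
common = rng.normal(size=len(days))
df = pd.DataFrame(
    {
        "CO": common + rng.normal(size=len(days)),
        "NO2": common + rng.normal(size=len(days)),
        "PM2.5": common + rng.normal(size=len(days)),
    },
    index=days,
)

# Keep only December observations, then compute pairwise Pearson correlations.
december = df[df.index.month == 12]
matrix = december.corr(method="pearson").round(2)
print(matrix)
```

Pearson correlation measures linear association only, so a moderate coefficient is consistent with shared sources but does not establish them.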

System Performance

Successfully handled complex multi-step analytical queries
Generated accurate visualization charts
Provided verifiable Python code
Supported complete analytical workflows from crisis identification to mechanistic insights

Related Work

Environmental Data Analysis Tools

openair R Package: Professional air quality data analysis tool requiring R programming skills
CPCB Official Dashboard: Provides real-time data but with limited functionality and shallow analytical capabilities
Traditional BI Tools: Require specialized configuration expertise and struggle with natural language queries

LLM Code Generation

GPT-3/4 Code Capabilities: Excellent performance on general programming tasks but lacking domain-specific optimization
Instruction-Following Models: Show promise on tabular reasoning tasks but limited environmental domain applications
Zero-Shot Table Reasoning: Related techniques provide foundation for this paper's approach

Conversational Data Analysis

This paper represents the first LLM-driven conversational system specifically designed for environmental data analysis, filling a gap in the field.

Conclusions and Discussion

Main Conclusions

Technical Feasibility: LLMs can effectively handle complex environmental data analysis queries, with code generation mechanisms ensuring result accuracy
Practical Value: System successfully supported in-depth analysis of Delhi's air pollution crisis, demonstrating real-world application potential
Improved Accessibility: Significantly reduces technical barriers to environmental data analysis, enabling non-technical users to conduct complex analyses

Limitations

Data Coverage Scope: Currently primarily based on Indian CPCB data with limited geographic coverage
Insufficient Real-Time Capability: Has not yet integrated real-time data streams; analyses based on historical data
Model Dependency: System performance depends on underlying LLM code generation capabilities
Complex Query Processing: Automatic query decomposition and multi-step reasoning not yet implemented

Future Directions

Real-Time Data Integration: Integrate real-time air quality data streams via APIs
Data Expansion: Add ERA5 reanalysis data, satellite products, land use, and emission inventories
Model Fine-Tuning: Specialized optimization for environmental domain
Automatic Reasoning Workflows: Implement automatic decomposition and multi-step analysis for complex queries

In-Depth Evaluation

Strengths

Strong Innovation: First LLM-driven conversational analysis system for environmental data with novel technical approach
High Practical Value: Delhi pollution case study demonstrates real-world application value with important policy implications
Sound Technical Design: Code generation approach to reduce hallucinations is scientifically feasible
Complete System Integration: Covers the full loop from data integration and model selection to result presentation
High Transparency: Provides generated code ensuring result verifiability and reproducibility

Weaknesses

Insufficient Evaluation: VayuBench details not presented in paper; lacks quantitative performance metrics
Limited Case Studies: Primarily based on single Delhi case; lacks broader validation
Insufficient Technical Details: Key technical details like LLM fine-tuning and prompt engineering inadequately described
Error Handling Mechanisms: Lacks detailed discussion of code generation error or execution failure handling strategies
User Experience Assessment: Lacks real user feedback and satisfaction evaluation

Impact

Academic Contribution: Provides important reference for LLM applications in environmental science
Social Value: Improves environmental data utilization efficiency, supporting better policy decisions
Technical Demonstration: Provides design insights for other specialized data analysis systems
Openness: Public system deployment promotes technology dissemination and application

Applicable Scenarios

Government Decision-Making: Environmental department policy formulation and project evaluation
Academic Research: Environmental science and public health research
News Media: Data-driven environmental reporting
Public Education: Raising public awareness of air quality issues
NGOs: Environmental monitoring and advocacy activities

References

The paper cites 15 relevant references covering LLM foundational technologies, environmental data analysis tools, and health impacts of air pollution, providing sufficient theoretical foundation and comparative references.

Overall Assessment: This is an excellent paper that combines technical innovation with practical application and is a pioneering example of LLM use in environmental science. The system design is sound, the case study is thorough, and the work is valuable for addressing environmental data utilization challenges in developing countries such as India. Although the evaluation and technical details leave room for improvement, the overall contribution is substantial, with strong prospects for wider adoption.