StreetLens: Enabling Human-Centered AI Agents for Neighborhood Assessment from Street View Imagery

Kim, Jang, Chiang et al.

Traditionally, neighborhood studies have used interviews, surveys, and manual image annotation guided by detailed protocols to identify environmental characteristics, including physical disorder, decay, street safety, and sociocultural symbols, and to examine their impact on developmental and health outcomes. Although these methods yield rich insights, they are time-consuming and require intensive expert intervention. Recent technological advances, including vision language models (VLMs), have begun to automate parts of this process; however, existing efforts are often ad hoc and lack adaptability across research designs and geographic contexts. In this paper, we present StreetLens, a user-configurable human-centered workflow that integrates relevant social science expertise into a VLM for scalable neighborhood environmental assessments. StreetLens mimics the process of trained human coders by focusing the analysis on questions derived from established interview protocols, retrieving relevant street view imagery (SVI), and generating a wide spectrum of semantic annotations from objective features (e.g., the number of cars) to subjective perceptions (e.g., the sense of disorder in an image). By enabling researchers to define the VLM's role through domain-informed prompting, StreetLens places domain knowledge at the core of the analysis process. It also supports the integration of prior survey data to enhance robustness and expand the range of characteristics assessed in diverse settings. StreetLens represents a shift toward flexible and agentic AI systems that work closely with researchers to accelerate and scale neighborhood studies. StreetLens is publicly available at https://knowledge-computing.github.io/projects/streetlens.

Basic Information

Paper ID: 2506.14670
Title: StreetLens: Enabling Human-Centered AI Agents for Neighborhood Assessment from Street View Imagery
Authors: Jina Kim, Leeje Jang, Yao-Yi Chiang, Guanyu Wang, Michelle C. Pasco (University of Minnesota)
Classification: cs.HC (Human-Computer Interaction), cs.AI (Artificial Intelligence)
Conference: The 1st ACM SIGSPATIAL International Workshop on Human-Centered Geospatial Computing (GeoHCC '25)
Paper Link: https://arxiv.org/abs/2506.14670
Project Link: https://knowledge-computing.github.io/projects/streetlens

Abstract

Traditional neighborhood research relies on interviews, surveys, and manual image annotation guided by detailed protocols to identify environmental characteristics, including physical disorder, decay, street safety, and sociocultural symbols, and to examine their impact on developmental and health outcomes. While these methods yield rich insights, they are time-consuming and require intensive expert intervention. The paper proposes StreetLens, a user-configurable, human-centered workflow that integrates relevant social science expertise into vision language models (VLMs) for scalable neighborhood environmental assessment.

Research Background and Motivation

Problem Definition

Neighborhood environmental assessment traditionally faces the following challenges:

Labor Intensity: Requires trained coders for systematic social observation (SSO), with multiple coders annotating the same image to ensure reliability
Scalability Limitations: Manual methods are difficult to scale to large geographic areas and diverse research contexts
Expert Dependency: Requires continuous involvement and supervision of domain experts
Standardization Difficulties: Lacks systematic approaches that adapt across research designs and geographic contexts

Research Significance

Assessing neighborhood environmental characteristics is crucial for understanding how environments influence:

Adolescent development
Mental health
Social cohesion
Public health outcomes

Limitations of Existing Methods

Traditional Approaches: While providing valuable insights, the process is cumbersome, expert-dependent, and difficult to scale
Existing VLM Applications: Mostly ad hoc applications that lack a structured framework and cannot systematically "train" VLMs to work like human coders
Lack of Feedback Mechanisms: Existing methods typically accept VLM results as-is, without returning explanations or feedback to the researcher

Core Contributions

Proposed StreetLens Workflow: The first end-to-end, researcher-centered systematic social observation workflow that simulates human coder training processes
Human-Machine Collaboration Framework: Incorporates domain knowledge as a core component of the analysis process through role prompting
Automated Prompt Tuning: Automatically generates domain-specific prompts based on relevant research literature and coding manuals
Enhanced Interpretability: Provides explanations of VLM decisions and feedback mechanisms
Open-Source Accessibility: Provides Google Colab notebooks to lower technical barriers

Methodology Details

Task Definition

Inputs:

Research area specifications
Coding manuals and protocols
Relevant academic papers
Example annotations
Street View Images (SVI)

Outputs:

Structured environmental feature assessments
Semantic annotations ranging from objective features (e.g., number of cars) to subjective perceptions (e.g., sense of disorder)
Assessment explanations and feedback

System Architecture

StreetLens comprises four core modules:

M1. Data Processor

Function: Collects and organizes input materials
Input Processing:
- Research area selection (based on U.S. Census TIGER road data, sampled at 5-meter intervals)
- Material upload (coding manuals, protocols, relevant papers, example annotations)
- Google Street View image retrieval
Output: Structured input dataset
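
The fixed-interval sampling described for M1 can be illustrated with a minimal sketch. It assumes the road centerline has already been projected to planar coordinates in meters (TIGER data is distributed in geographic coordinates, so a projection step would be needed first); the function name `sample_points` and the example road are hypothetical and are not taken from the StreetLens code.

```python
import math

def sample_points(polyline, interval_m=5.0):
    """Return points spaced interval_m apart along a polyline of (x, y) in meters."""
    if len(polyline) < 2:
        return list(polyline)
    points = [polyline[0]]
    next_at = interval_m  # distance along the line of the next sample
    travelled = 0.0       # length of the segments already processed
    for (x0, y0), (x1, y1) in zip(polyline, polyline[1:]):
        seg_len = math.hypot(x1 - x0, y1 - y0)
        # Emit every sample whose distance falls inside this segment.
        while seg_len > 0 and next_at <= travelled + seg_len:
            t = (next_at - travelled) / seg_len
            points.append((x0 + t * (x1 - x0), y0 + t * (y1 - y0)))
            next_at += interval_m
        travelled += seg_len
    return points

# Hypothetical L-shaped road, 21 m long in total.
road = [(0.0, 0.0), (12.0, 0.0), (12.0, 9.0)]
samples = sample_points(road)  # samples at 0, 5, 10, 15, and 20 m along the road
```

Each sampled point would then be converted back to latitude and longitude before requesting a street view image for that location.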

M2. Automated Prompt Tuning

Role Generation: Generates VLM professional role descriptions based on relevant paper abstracts

Prompt Template:
"You are an expert in the following fields and the author of the paper abstracts provided here: [paper abstracts]. Based on the expertise demonstrated, generate a general professional role description of yourself in one to two sentences, starting with 'You are' written in the second person."
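
As a rough illustration, the template above could be filled in as follows. This is a sketch only: the helper name `build_role_prompt`, the way abstracts are joined, and the sample abstracts are assumptions; only the template wording comes from the text above.

```python
ROLE_TEMPLATE = (
    "You are an expert in the following fields and the author of the paper "
    "abstracts provided here: {abstracts}. Based on the expertise demonstrated, "
    "generate a general professional role description of yourself in one to two "
    "sentences, starting with 'You are' written in the second person."
)

def build_role_prompt(abstracts):
    """Fill the role-generation template with the uploaded paper abstracts."""
    # Joining with blank lines is an assumption about how abstracts are combined.
    joined = "\n\n".join(a.strip() for a in abstracts if a.strip())
    if not joined:
        raise ValueError("at least one non-empty abstract is required")
    return ROLE_TEMPLATE.format(abstracts=joined)

# Placeholder abstracts for illustration only.
prompt = build_role_prompt([
    "We study how neighborhood context relates to adolescent development.",
    "We audit street segments using systematic social observation.",
])
```

The text returned by the language model for this prompt would then serve as the role description prepended to later assessment prompts.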

Task Classification: Distinguishes subjective perception tasks from objective detection tasks

Classification Prompt:
"You are a classifier of annotation tasks... If it asks to rate/assess overall condition or quality, label as perception. If it asks to detect, count, or verify specific objects, label as object_detection."
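
A classifier prompt like this returns free text, so the reply has to be mapped back to one of the two labels. The sketch below shows one way to do that; the function `parse_task_label` and its strictness (rejecting replies that name both labels or neither) are assumptions, not the paper's implementation.

```python
VALID_LABELS = ("object_detection", "perception")

def parse_task_label(response):
    """Map a free-text classifier reply onto exactly one known task label."""
    # Normalise case and separators so "Object detection" matches "object_detection".
    text = response.strip().lower().replace(" ", "_").replace("-", "_")
    found = [label for label in VALID_LABELS if label in text]
    if len(found) != 1:
        raise ValueError(f"ambiguous or missing label: {response!r}")
    return found[0]

label_a = parse_task_label("Perception")
label_b = parse_task_label("Label: object detection.")
```

Raising on ambiguous replies, rather than guessing, lets the caller retry the classification or fall back to manual assignment.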

Coding Manual Processing: Converts question-answer pairs into structured prompts
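
The conversion of a coding manual item into a structured prompt might look like the following sketch. The prompt layout, the function `manual_item_to_prompt`, and the example item (its identifier, question wording, and answer options) are all invented for illustration; the actual manual items are not reproduced here.

```python
def manual_item_to_prompt(item_id, question, options):
    """Render one coding manual question and its answer options as a prompt."""
    lines = [f"Question ({item_id}): {question}", "Answer options:"]
    # Sort by option code so the prompt is stable across runs.
    lines += [f"  {code}: {desc}" for code, desc in sorted(options.items())]
    lines.append("Respond with one option code, then a one-sentence explanation.")
    return "\n".join(lines)

# Hypothetical manual item, not taken from any real protocol.
question_prompt = manual_item_to_prompt(
    "Example item",
    "How would you rate the condition of the street surface?",
    {"1": "good condition", "2": "minor damage", "3": "severe damage"},
)
```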

M3. Vision-Language Model Processor

Model Selection: Uses InternVL3-2B, a lightweight open-source VLM
- Image Encoder: InternViT-300M-448px-V2_5
- Language Model: Qwen2.5-1.5B
Processing Pipeline:
1. Image encoding and embedding
2. Integration with prompts generated by M2
3. In-context learning from example image-answer pairs
4. Generation of environmental feature assessments
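
Steps 2 and 3 of the pipeline, combining the generated role and question prompt with example image-answer pairs, can be sketched without committing to a particular model API. In the sketch below, the `Example` class, the `assemble_request` function, the file names, and the use of an `<image>` placeholder per picture are assumptions; the resulting text and image list would be handed to whatever inference call the chosen VLM exposes.

```python
from dataclasses import dataclass

@dataclass
class Example:
    image_path: str  # expert-annotated example image
    answer: str      # the expert's answer for that image

def assemble_request(role, question_prompt, examples, target_image):
    """Build the prompt text plus the ordered list of images it refers to."""
    parts = [role.strip(), ""]
    images = []
    for i, ex in enumerate(examples, start=1):
        # One placeholder per image, in the same order as the image list.
        parts.append(f"Example {i}: <image>")
        parts.append(f"Answer: {ex.answer}")
        images.append(ex.image_path)
    parts += ["", "Now assess this image: <image>", question_prompt.strip()]
    images.append(target_image)
    return "\n".join(parts), images

# Hypothetical inputs for illustration.
text, images = assemble_request(
    "You are a researcher who studies neighborhood environments.",
    "Question: Is there visible litter on the street?",
    [Example("ex1.jpg", "yes"), Example("ex2.jpg", "no")],
    "target.jpg",
)
```

Keeping the placeholders and the image list in the same order is the main invariant: the model can only relate each example answer to the right picture if the two sequences line up.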

M4. Feedback Provider

Explanation Generation: Provides reasoning explanations for VLM assessments
Interpretability: Helps researchers understand the AI agent's decision-making process
Example: Explanation for 'Decay 1' measurement: "There are only slight cracks, and any potholes present have been fixed or covered"
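
To surface such explanations alongside the coded answer, the model's reply has to be split into the two parts. The following sketch assumes a reply format of an "Answer:" line followed by an "Explanation:" line; that format, the function `parse_assessment`, and the answer code in the sample reply are hypothetical. Only the explanation sentence is quoted from the example above.

```python
import re

_PATTERN = re.compile(
    r"answer\s*:\s*(?P<answer>[^\n]+?)\s*\n+\s*explanation\s*:\s*(?P<explanation>.+)",
    re.IGNORECASE | re.DOTALL,
)

def parse_assessment(response):
    """Split a VLM reply into an answer code and its free-text explanation."""
    match = _PATTERN.search(response)
    if match is None:
        # Keep the raw text so a researcher can still review it.
        return {"answer": None, "explanation": response.strip()}
    return {
        "answer": match["answer"].strip(),
        "explanation": match["explanation"].strip(),
    }

# The answer code "2" is invented for this example.
reply = (
    "Answer: 2\n"
    "Explanation: There are only slight cracks, and any potholes present "
    "have been fixed or covered"
)
result = parse_assessment(reply)
```

Replies that do not follow the expected format are returned with `answer` set to `None` instead of being discarded, so they can be flagged for manual review.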

Technical Innovations

Domain Knowledge Integration: Embeds social science expertise into VLMs through role prompting
Task Adaptation: Automatically identifies and adapts to different types of assessment tasks (perception vs. detection)
In-Context Learning: Leverages expert-annotated examples to enhance model performance
Human-Machine Collaboration Design: Simulates human coder training processes, including literature review, protocol study, and example examination

Case Study

Research Background

Based on the family social science research of Pasco and White (2020):

Research Objective: Assess the relationship between neighborhood environment and adolescent racial labeling behavior
Methodology: Relied on trained human coders following systematic social observation (SSO) protocols
Assessment Content: Physical decay levels, sociocultural symbols, etc.
Validation Method: Assessed inter-coder reliability using the intraclass correlation coefficient (ICC)
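
For readers unfamiliar with the ICC, the sketch below computes one common variant, ICC(2,1) in the Shrout and Fleiss notation (two-way random effects, absolute agreement, single rater). Which ICC variant the original study used is not stated here, so this choice is an assumption, as are the function name `icc2_1` and the ratings, which are made up. Each row is one street segment and each column is one coder; an automated coder could be added as an extra column.

```python
def icc2_1(ratings):
    """ICC(2,1) for a table with one row per target and one column per rater."""
    n = len(ratings)
    k = len(ratings[0]) if n else 0
    if n < 2 or k < 2 or any(len(row) != k for row in ratings):
        raise ValueError("need a rectangular table with >= 2 rows and >= 2 columns")
    grand = sum(map(sum, ratings)) / (n * k)
    row_means = [sum(row) / k for row in ratings]
    col_means = [sum(row[j] for row in ratings) / n for j in range(k)]
    # Two-way ANOVA sums of squares: targets (rows), raters (columns), residual.
    ss_rows = k * sum((m - grand) ** 2 for m in row_means)
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)
    ss_total = sum((x - grand) ** 2 for row in ratings for x in row)
    ss_err = ss_total - ss_rows - ss_cols
    msr = ss_rows / (n - 1)
    msc = ss_cols / (k - 1)
    mse = ss_err / ((n - 1) * (k - 1))
    return (msr - mse) / (msr + (k - 1) * mse + k * (msc - mse) / n)

# Made-up ratings: two human coders plus one automated coder (last column).
ratings = [
    [1, 1, 2],
    [3, 3, 3],
    [2, 3, 2],
    [4, 4, 4],
]
icc = icc2_1(ratings)
```

Values near 1 indicate that the coders, human or automated, rate the same segments almost identically; the function is undefined when all ratings are identical, because every mean square is then zero.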

StreetLens Application

Participates in the assessment process as an additional intelligent coder
Uses relevant research literature to define the VLM role
Processes specific questions from coding manuals (e.g., "Disorder 3")
Provides interpretable assessment results

Experimental Setup

Data Sources

Street View Images: Google Street View imagery
Geographic Data: U.S. Census TIGER road data
Sampling Strategy: Predefined point locations at 5-meter intervals
Case Data: Manual annotations from the original case study

Technical Implementation

Deployment Platform: Google Colab notebook
Server: University of Minnesota, connected via Cloudflare
User Interface: Modular button design supporting independent exploration of module functions

Related Work

Evolution of Traditional Methods

Early Research: Sampson and Raudenbush (1999) used videotaped observations to assess physical disorder across 23,000 street segments in Chicago
Virtual Audits: Subsequent research adopted Google Earth and Street View for remote assessment
Computer Vision Methods: Detection of urban greenery, sidewalk quality, and other physical features

Current VLM Applications

Walkability Assessment: Using VLMs to evaluate urban walkability
Structured Descriptions: Generating structured descriptions of urban environments
Object Detection: Detecting specific objects in audit categories

StreetLens Advantages

Compared to existing work, StreetLens provides:

End-to-end researcher-centered workflow
Systematic simulation of the human coder training process for the VLM
Adaptability across research designs and geographic contexts

Conclusions and Discussion

Main Conclusions

Workflow Effectiveness: StreetLens successfully simulates human coder training and assessment processes
Domain Knowledge Integration: Effectively integrates social science expertise through role prompting
Scalability Enhancement: Significantly improves the scalability of neighborhood environmental assessment
Human-Machine Collaboration: Achieves effective collaboration between AI and researchers

Limitations

Model Bias: VLMs may exhibit bias when interpreting sociocultural contexts in diverse neighborhoods
Assessment Validation: Requires more systematic evaluation methods (e.g., ICC) to validate the reliability of automated coding
Feedback Mechanisms: Current feedback loops are limited, requiring more interactive improvement features

Future Directions

Enhanced Human-Machine Interaction:
- Add feedback loops allowing researchers to explain and improve StreetLens decisions
- Explore different types of automated coders
- Develop automated methods more closely resembling human coding
Improved Evaluation Methods:
- Compute the intraclass correlation coefficient (ICC), treating the automated coder as an additional annotator alongside the human coders
- Provide feedback mechanisms to monitor whether outputs are reasonable and reliable
- Make reviewing and correcting results more convenient
Bias Mitigation:
- Evaluate potential bias sources
- Apply participatory design methods in collaboration with domain experts
- Ensure the tool remains responsible and human-centered

In-Depth Evaluation

Strengths

Strong Innovation: First to propose a VLM workflow that systematically simulates human coder training processes
High Practical Value: Addresses actual pain points in neighborhood research with broad application prospects
Reasonable Technical Solution: Clear four-module design with feasible technical approach
Open-Source Friendly: Provides Google Colab implementation, lowering usage barriers
Interdisciplinary Integration: Effectively combines AI technology with social science methodology

Weaknesses

Insufficient Evaluation: Lacks systematic comparative experiments with human coders
Bias Risk: Insufficient discussion of VLM bias in sociocultural interpretation
Unverified Generalization: Based on only one case study, lacking multi-scenario validation
Limited Technical Details: Limited analysis of specific prompt engineering strategies and effects

Impact

Academic Contribution: Provides a new paradigm for human-machine collaboration in geospatial computing
Practical Value: Can significantly improve efficiency and scale of neighborhood research
Cross-Disciplinary Impact: Applicable to urban planning, public health, sociology, and other fields
Methodological Innovation: Provides a reference framework for VLM applications in domain-specific tasks

Applicable Scenarios

Urban Research: Large-scale neighborhood environmental feature assessment
Public Health: Research on environmental factors' impact on health
Sociological Research: Analysis of relationships between community characteristics and social phenomena
Urban Planning: Visual feature-based urban environment assessment

Ethical Considerations

The paper explicitly acknowledges potential social bias in machine learning models, particularly when interpreting sociocultural contexts in diverse neighborhoods. The authors plan to evaluate potential bias sources in future work and collaborate with domain experts using participatory design methods to ensure StreetLens functions as a responsible, human-centered tool.

References

The paper cites important works in relevant fields, including:

Classical research on neighborhood environmental assessment (Sampson & Raudenbush, 1999)
Development of virtual audit methods (Odgers et al., 2012; Clarke et al., 2010)
Street view imagery in urban analytics (Biljecki & Ito, 2021)
Prompt engineering techniques (Schulhoff et al., 2025)

Summary: StreetLens represents an important advancement in the integration of AI with social science research methodology, achieving automation and scalability of neighborhood environmental assessment through systematic workflow design. While further refinement is needed in assessment validation and bias handling, its innovative human-machine collaboration concept and practical technical solution provide valuable tools and methodological references for related research fields.