StreetLens: Enabling Human-Centered AI Agents for Neighborhood Assessment from Street View Imagery

Kim, Jang, Chiang et al.
Traditionally, neighborhood studies have used interviews, surveys, and manual image annotation guided by detailed protocols to identify environmental characteristics, including physical disorder, decay, street safety, and sociocultural symbols, and to examine their impact on developmental and health outcomes. Although these methods yield rich insights, they are time-consuming and require intensive expert intervention. Recent technological advances, including vision language models (VLMs), have begun to automate parts of this process; however, existing efforts are often ad hoc and lack adaptability across research designs and geographic contexts. In this paper, we present StreetLens, a user-configurable human-centered workflow that integrates relevant social science expertise into a VLM for scalable neighborhood environmental assessments. StreetLens mimics the process of trained human coders by focusing the analysis on questions derived from established interview protocols, retrieving relevant street view imagery (SVI), and generating a wide spectrum of semantic annotations from objective features (e.g., the number of cars) to subjective perceptions (e.g., the sense of disorder in an image). By enabling researchers to define the VLM's role through domain-informed prompting, StreetLens places domain knowledge at the core of the analysis process. It also supports the integration of prior survey data to enhance robustness and expand the range of characteristics assessed in diverse settings. StreetLens represents a shift toward flexible and agentic AI systems that work closely with researchers to accelerate and scale neighborhood studies. StreetLens is publicly available at https://knowledge-computing.github.io/projects/streetlens.

Basic Information

  • Paper ID: 2506.14670
  • Title: StreetLens: Enabling Human-Centered AI Agents for Neighborhood Assessment from Street View Imagery
  • Authors: Jina Kim, Leeje Jang, Yao-Yi Chiang, Guanyu Wang, Michelle C. Pasco (University of Minnesota)
  • Classification: cs.HC (Human-Computer Interaction), cs.AI (Artificial Intelligence)
  • Conference: The 1st ACM SIGSPATIAL International Workshop on Human-Centered Geospatial Computing (GeoHCC '25)
  • Paper Link: https://arxiv.org/abs/2506.14670
  • Project Link: https://knowledge-computing.github.io/projects/streetlens

Abstract

Traditional neighborhood research relies on interviews, surveys, and protocol-guided manual image annotation to identify environmental characteristics, including physical disorder, decay, street safety, and sociocultural symbols, and to investigate their impact on developmental and health outcomes. While these methods generate rich insights, they are time-consuming and require intensive expert intervention. This paper proposes StreetLens, a user-configurable, human-centered workflow that integrates relevant social science expertise into vision-language models (VLMs) for scalable neighborhood environmental assessment.

Research Background and Motivation

Problem Definition

Neighborhood environmental assessment traditionally faces the following challenges:

  1. Labor Intensity: Requires trained coders for systematic social observation (SSO), with multiple coders annotating the same image to ensure reliability
  2. Scalability Limitations: Manual methods are difficult to scale to large geographic areas and diverse research contexts
  3. Expert Dependency: Requires continuous involvement and supervision of domain experts
  4. Standardization Difficulties: No systematic approach adapts readily across research designs and geographic contexts

Research Significance

Neighborhood environmental characteristic assessment is crucial for understanding how environments influence:

  • Adolescent development
  • Mental health
  • Social cohesion
  • Public health outcomes

Limitations of Existing Methods

  1. Traditional Approaches: While providing valuable insights, the process is cumbersome, expert-dependent, and difficult to scale
  2. Existing VLM Applications: Mostly ad hoc, with no structured framework for preparing a VLM the way human coders are trained
  3. Lack of Feedback Mechanisms: Existing methods typically accept VLM results as-is, giving researchers neither an explanation of the output nor a way to respond to it

Core Contributions

  1. Proposed StreetLens Workflow: The first end-to-end, researcher-centered systematic social observation workflow that simulates human coder training processes
  2. Human-Machine Collaboration Framework: Incorporates domain knowledge as a core component of the analysis process through role prompting
  3. Automated Prompt Tuning: Automatically generates domain-specific prompts based on relevant research literature and coding manuals
  4. Enhanced Interpretability: Provides explanations of VLM decisions and feedback mechanisms
  5. Open-Source Accessibility: Provides Google Colab notebooks to lower technical barriers

Methodology Details

Task Definition

Inputs:

  • Research area specifications
  • Coding manuals and protocols
  • Relevant academic papers
  • Example annotations
  • Street View Images (SVI)

Outputs:

  • Structured environmental feature assessments
  • Semantic annotations ranging from objective features (e.g., number of cars) to subjective perceptions (e.g., sense of disorder)
  • Assessment explanations and feedback

System Architecture

StreetLens comprises four core modules:

M1. Data Processor

  • Function: Collects and organizes input materials
  • Input Processing:
    • Research area selection (based on U.S. Census TIGER road data, sampled at 5-meter intervals)
    • Material upload (coding manuals, protocols, relevant papers, example annotations)
    • Google Street View image retrieval
  • Output: Structured input dataset
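
The fixed-interval sampling in M1 can be pictured as walking along each road centerline and dropping a point every 5 meters. The following is a minimal sketch of that idea, not the StreetLens implementation: it assumes the road geometry has already been reprojected to planar coordinates in meters, and the function name sample_points_along_line is hypothetical.

```python
import math
from typing import List, Tuple

Point = Tuple[float, float]  # (x, y) in meters; assumes a projected coordinate system


def sample_points_along_line(line: List[Point], spacing_m: float = 5.0) -> List[Point]:
    """Return points every `spacing_m` meters along a polyline, starting at its first vertex."""
    if spacing_m <= 0:
        raise ValueError("spacing_m must be positive")
    if not line:
        return []
    samples = [line[0]]
    dist_to_next = spacing_m  # distance still to walk before the next sample
    for (x0, y0), (x1, y1) in zip(line, line[1:]):
        seg_len = math.hypot(x1 - x0, y1 - y0)
        pos = 0.0  # distance already walked along this segment
        while seg_len - pos >= dist_to_next:
            pos += dist_to_next
            t = pos / seg_len
            samples.append((x0 + t * (x1 - x0), y0 + t * (y1 - y0)))
            dist_to_next = spacing_m
        # Carry the unused remainder of this segment over to the next one.
        dist_to_next -= seg_len - pos
    return samples
```

Each sampled point would then be converted back to latitude and longitude before requesting the street view image for that location; that step is omitted here.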

M2. Automated Prompt Tuning

  • Role Generation: Generates VLM professional role descriptions based on relevant paper abstracts
    Prompt Template:
    "You are an expert in the following fields and the author of the paper abstracts provided here: [paper abstracts]. Based on the expertise demonstrated, generate a general professional role description of yourself in one to two sentences, starting with 'You are' written in the second person."
    
  • Task Classification: Distinguishes between subjective perception tasks vs. objective detection tasks
    Classification Prompt:
    "You are a classifier of annotation tasks... If it asks to rate/assess overall condition or quality, label as perception. If it asks to detect, count, or verify specific objects, label as object_detection."
    
  • Coding Manual Processing: Converts question-answer pairs into structured prompts
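
The three M2 steps amount to filling text templates and validating short model replies. The sketch below illustrates this under stated assumptions: the role template is the one quoted above, while the function names, the layout of the per-question prompt, and the reply-cleaning rule are hypothetical and not taken from StreetLens.

```python
from typing import Dict, List

# Role-generation template as quoted in the summary; {abstracts} is the fill-in slot.
ROLE_TEMPLATE = (
    "You are an expert in the following fields and the author of the paper "
    "abstracts provided here: {abstracts}. Based on the expertise demonstrated, "
    "generate a general professional role description of yourself in one to two "
    "sentences, starting with 'You are' written in the second person."
)

# The two task labels named in the classification prompt.
TASK_LABELS = ("perception", "object_detection")


def build_role_prompt(abstracts: List[str]) -> str:
    """Fill the role template with the uploaded paper abstracts."""
    joined = "\n\n".join(a.strip() for a in abstracts if a.strip())
    if not joined:
        raise ValueError("at least one non-empty abstract is required")
    return ROLE_TEMPLATE.format(abstracts=joined)


def parse_task_label(reply: str) -> str:
    """Accept a classifier reply only if it is exactly one of the known labels."""
    label = reply.strip().strip(" .\"'`").lower()
    if label not in TASK_LABELS:
        raise ValueError(f"unexpected task label: {reply!r}")
    return label


def build_question_prompt(role: str, question: str, options: Dict[str, str]) -> str:
    """Turn one coding-manual question and its answer options into a prompt (assumed layout)."""
    lines = [role.strip(), "", f"Question: {question.strip()}", "Answer options:"]
    lines += [f"  {code}: {desc}" for code, desc in options.items()]
    lines.append("Reply with one option code and a one-sentence explanation.")
    return "\n".join(lines)
```

Rejecting any reply that is not exactly one of the two labels keeps a malformed classification from silently routing a question to the wrong task type.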

M3. Vision-Language Model Processor

  • Model Selection: Uses InternVL3-2B, a lightweight open-source VLM
    • Image Encoder: InternViT-300M-448px-V2_5
    • Language Model: Qwen2.5-1.5B
  • Processing Pipeline:
    1. Image encoding and embedding
    2. Integration with prompts generated by M2
    3. In-context learning from example image-answer pairs
    4. Generation of environmental feature assessments
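
Steps 2 and 3 of the pipeline can be illustrated by how a few-shot query might be assembled before it reaches the model. This is a sketch under assumptions: it supposes the VLM's chat interface takes one `<image>` placeholder per image, matched in order to a list of images, and the names Example and build_few_shot_query are hypothetical. No model is loaded or called.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Example:
    image_path: str  # expert-annotated example image
    answer: str      # the coder's answer for that image


def build_few_shot_query(
    task_prompt: str, examples: List[Example], target_image: str
) -> Tuple[str, List[str]]:
    """Interleave example image-answer pairs with the target image.

    Returns the query text and the image paths in the order their
    `<image>` placeholders appear (assumed placeholder convention).
    """
    parts = [task_prompt.strip(), ""]
    image_paths: List[str] = []
    for i, ex in enumerate(examples, start=1):
        parts.append(f"Example {i}: <image>")
        parts.append(f"Answer: {ex.answer}")
        image_paths.append(ex.image_path)
    parts.append("Target image: <image>")
    parts.append("Answer:")
    image_paths.append(target_image)
    return "\n".join(parts), image_paths
```

Keeping the placeholder count equal to the number of image paths is the invariant that matters; a mismatch would pair answers with the wrong images.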

M4. Feedback Provider

  • Explanation Generation: Provides reasoning explanations for VLM assessments
  • Interpretability: Helps researchers understand the AI agent's decision-making process
  • Example: Explanation for 'Decay 1' measurement: "There are only slight cracks, and any potholes present have been fixed or covered"
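
To make explanations reviewable, each reply has to be split into a coded answer and its reasoning. The sketch below assumes a reply format of "Answer: ..." followed by "Explanation: ...", which is an illustrative convention rather than the format StreetLens uses; Assessment and parse_reply are hypothetical names.

```python
import re
from dataclasses import dataclass
from typing import Optional


@dataclass
class Assessment:
    item: str                   # coding-manual item, e.g. "Decay 1"
    answer: str                 # coded value returned by the model
    explanation: Optional[str]  # model's stated reasoning, if present


def parse_reply(item: str, reply: str) -> Assessment:
    """Split a reply into its answer and explanation (assumed 'Answer:/Explanation:' layout)."""
    ans = re.search(r"^\s*answer\s*:\s*(.+)$", reply, re.IGNORECASE | re.MULTILINE)
    if ans is None:
        raise ValueError("reply has no 'Answer:' line")
    # DOTALL lets the explanation run across several lines to the end of the reply.
    expl = re.search(
        r"^\s*explanation\s*:\s*(.+)", reply, re.IGNORECASE | re.MULTILINE | re.DOTALL
    )
    return Assessment(
        item=item,
        answer=ans.group(1).strip(),
        explanation=expl.group(1).strip() if expl else None,
    )
```

A reply without an explanation still parses, with explanation set to None, so a reviewer can flag it instead of losing the answer.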

Technical Innovations

  1. Domain Knowledge Integration: Embeds social science expertise into VLMs through role prompting
  2. Task Adaptation: Automatically identifies and adapts to different types of assessment tasks (perception vs. detection)
  3. In-Context Learning: Leverages expert-annotated examples to enhance model performance
  4. Human-Machine Collaboration Design: Simulates human coder training processes, including literature review, protocol study, and example examination

Case Study

Research Background

The case study builds on family social science research by Pasco and White (2020):

  • Research Objective: Assess the relationship between neighborhood environment and adolescents' use of ethnic-racial labels
  • Methodology: Trained human coders using systematic social observation (SSO) protocols
  • Assessment Content: Physical decay levels, sociocultural symbols, etc.
  • Validation Method: Assessed inter-coder reliability through intraclass correlation coefficient (ICC)
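
For readers unfamiliar with the ICC, the sketch below computes one common form, ICC(2,1) (two-way random effects, absolute agreement, single rater), from a table with one row per image and one column per coder. The summary does not state which ICC form the original study used, so this choice is an assumption, and icc_2_1 is a hypothetical helper name.

```python
from typing import Sequence


def icc_2_1(ratings: Sequence[Sequence[float]]) -> float:
    """ICC(2,1) for an n-subjects by k-raters table with no missing values."""
    n = len(ratings)
    k = len(ratings[0]) if n else 0
    if n < 2 or k < 2 or any(len(row) != k for row in ratings):
        raise ValueError("need a rectangular table with at least 2 rows and 2 columns")

    grand = sum(sum(row) for row in ratings) / (n * k)
    row_means = [sum(row) / k for row in ratings]
    col_means = [sum(row[j] for row in ratings) / n for j in range(k)]

    # Two-way ANOVA sums of squares: subjects (rows), raters (columns), residual.
    ss_rows = k * sum((m - grand) ** 2 for m in row_means)
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)
    ss_err = sum(
        (ratings[i][j] - row_means[i] - col_means[j] + grand) ** 2
        for i in range(n)
        for j in range(k)
    )

    ms_rows = ss_rows / (n - 1)
    ms_cols = ss_cols / (k - 1)
    ms_err = ss_err / ((n - 1) * (k - 1))

    denom = ms_rows + (k - 1) * ms_err + k * (ms_cols - ms_err) / n
    if denom == 0:
        raise ValueError("ratings have no variance; ICC is undefined")
    return (ms_rows - ms_err) / denom
```

Adding the VLM's answers as one more column of the table is one way to treat it as an additional coder when checking agreement.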

StreetLens Application

  • Participates in the assessment process as an additional intelligent coder
  • Uses relevant research literature to define the VLM role
  • Processes specific questions from coding manuals (e.g., "Disorder 3")
  • Provides interpretable assessment results

Experimental Setup

Data Sources

  • Street View Images: Google Street View imagery
  • Geographic Data: U.S. Census TIGER road data
  • Sampling Strategy: Predefined point locations at 5-meter intervals
  • Case Data: Manual annotations from the original case study

Technical Implementation

  • Deployment Platform: Google Colab notebook
  • Server: University of Minnesota, connected via Cloudflare
  • User Interface: Modular button design supporting independent exploration of module functions

Related Work

Evolution of Traditional Methods

  1. Early Research: Sampson and Raudenbush (1999) used video to assess physical disorder in 23,000 street segments in Chicago
  2. Virtual Audits: Subsequent research adopted Google Earth and Street View for remote assessment
  3. Computer Vision Methods: Detection of urban greenery, sidewalk quality, and other physical features

Current VLM Applications

  • Walkability Assessment: Using VLMs to evaluate urban walkability
  • Structured Descriptions: Generating structured descriptions of urban environments
  • Object Detection: Detecting specific objects in audit categories

StreetLens Advantages

Compared to existing work, StreetLens provides:

  • End-to-end researcher-centered workflow
  • Systematic preparation of the VLM that simulates how human coders are trained
  • Adaptability across research designs and geographic contexts

Conclusions and Discussion

Main Conclusions

  1. Workflow Effectiveness: StreetLens successfully simulates human coder training and assessment processes
  2. Domain Knowledge Integration: Effectively integrates social science expertise through role prompting
  3. Scalability Enhancement: Replaces per-image manual coding with automated VLM assessment, removing the main bottleneck to scaling neighborhood environmental assessment
  4. Human-Machine Collaboration: Achieves effective collaboration between AI and researchers

Limitations

  1. Model Bias: VLMs may exhibit bias when interpreting sociocultural contexts in diverse neighborhoods
  2. Assessment Validation: Requires more systematic evaluation methods (e.g., ICC) to validate the reliability of automated coding
  3. Feedback Mechanisms: Current feedback loops are limited, requiring more interactive improvement features

Future Directions

  1. Enhanced Human-Machine Interaction:
    • Add feedback loops allowing researchers to explain and improve StreetLens decisions
    • Explore different types of automated coders
    • Develop automated methods more closely resembling human coding
  2. Improved Evaluation Methods:
    • Compute the intraclass correlation coefficient (ICC) with the automated coder treated as an additional human annotator
    • Provide feedback mechanisms to monitor output reasonableness and reliability
    • Enhance convenience of result review and improvement
  3. Bias Mitigation:
    • Evaluate potential bias sources
    • Apply participatory design methods in collaboration with domain experts
    • Ensure responsible and human-centered characteristics of the tool

In-Depth Evaluation

Strengths

  1. Strong Innovation: First to propose a VLM workflow that systematically simulates human coder training processes
  2. High Practical Value: Addresses actual pain points in neighborhood research with broad application prospects
  3. Reasonable Technical Solution: Clear four-module design with feasible technical approach
  4. Open-Source Friendly: Provides Google Colab implementation, lowering usage barriers
  5. Interdisciplinary Integration: Effectively combines AI technology with social science methodology

Weaknesses

  1. Insufficient Evaluation: Lacks systematic comparative experiments with human coders
  2. Bias Risk: Insufficient discussion of VLM bias in sociocultural interpretation
  3. Unverified Generalization: Based on only one case study, lacking multi-scenario validation
  4. Limited Technical Details: Limited analysis of specific prompt engineering strategies and effects

Impact

  1. Academic Contribution: Provides a new paradigm for human-machine collaboration in geospatial computing
  2. Practical Value: Can significantly improve efficiency and scale of neighborhood research
  3. Cross-Disciplinary Impact: Applicable to urban planning, public health, sociology, and other fields
  4. Methodological Innovation: Provides a reference framework for VLM applications in domain-specific tasks

Applicable Scenarios

  1. Urban Research: Large-scale neighborhood environmental feature assessment
  2. Public Health: Research on environmental factors' impact on health
  3. Sociological Research: Analysis of relationships between community characteristics and social phenomena
  4. Urban Planning: Visual feature-based urban environment assessment

Ethical Considerations

The paper explicitly acknowledges potential social bias in machine learning models, particularly when interpreting sociocultural contexts in diverse neighborhoods. The authors plan to evaluate potential bias sources in future work and collaborate with domain experts using participatory design methods to ensure StreetLens functions as a responsible, human-centered tool.

References

The paper cites important works in relevant fields, including:

  • Classical research on neighborhood environmental assessment (Sampson & Raudenbush, 1999)
  • Development of virtual audit methods (Odgers et al., 2012; Clarke et al., 2010)
  • Street view imagery in urban analytics (Biljecki & Ito, 2021)
  • Prompt engineering techniques (Schulhoff et al., 2025)

Summary: StreetLens represents an important advancement in the integration of AI with social science research methodology, achieving automation and scalability of neighborhood environmental assessment through systematic workflow design. While further refinement is needed in assessment validation and bias handling, its innovative human-machine collaboration concept and practical technical solution provide valuable tools and methodological references for related research fields.