Characterizing Web Search in The Age of Generative AI

Kirsten, Perdekamp, Upadhyay et al.
The advent of LLMs has given rise to a new type of web search: Generative search, where LLMs retrieve web pages related to a query and generate a single, coherent text as a response. This output modality stands in stark contrast to traditional web search, where results are returned as a ranked list of independent web pages. In this paper, we ask: Along what dimensions do generative search outputs differ from traditional web search? We compare Google, a traditional web search engine, with four generative search engines from two providers (Google and OpenAI) across queries from four domains. Our analysis reveals intriguing differences. Most generative search engines cover a wider range of sources compared to web search. Generative search engines vary in the degree to which they rely on internal knowledge contained within the model parameters vs. external knowledge retrieved from the web. Generative search engines surface varying sets of concepts, creating new opportunities for enhancing search diversity and serendipity. Our results also highlight the need for revisiting evaluation criteria for web search in the age of Generative AI.

Basic Information

  • Paper ID: 2510.11560
  • Title: Characterizing Web Search in The Age of Generative AI
  • Authors: Elisabeth Kirsten, Jost Grosse Perdekamp, Mihir Upadhyay, Krishna P. Gummadi, Muhammad Bilal Zafar
  • Institutions: Ruhr University Bochum, UAR RC Trust, MPI-SWS
  • Classification: cs.IR cs.AI
  • Publication Date: October 13, 2025
  • Paper Link: https://arxiv.org/abs/2510.11560

Abstract

The emergence of large language models (LLMs) has catalyzed a new form of web search: generative search, wherein LLMs retrieve web pages relevant to a query and generate a single, coherent text as a response. This output paradigm contrasts sharply with traditional web search, which returns a ranked list of independent web pages. This paper investigates the dimensions along which generative search outputs differ from traditional web search. The research compares Google's traditional search engine with four generative search engines from Google and OpenAI, covering queries across four domains. The analysis reveals striking differences: most generative search engines cover a broader range of information sources than traditional web search; generative search engines differ in their reliance on model parametric knowledge versus external knowledge retrieved from the web; and generative search engines present different sets of concepts, creating new opportunities for enhanced search diversity and serendipitous discovery.

Research Background and Motivation

Problem Definition

With the rise of generative AI, web search is evolving toward greater reliance on large language models. Traditional search engines return a ranked list of approximately ten search results, while generative search systems provide natural language answers through LLM chatbots. This paradigm shift introduces three key differences:

  1. Different Output Format: Traditional search returns independent web pages; generative search produces a single coherent block of text
  2. Broader Coverage: Generative search may synthesize content from far more than ten sources
  3. Mixed Knowledge Sources: Generative search combines externally retrieved information with the LLM's internal knowledge

Research Significance

Understanding these differences is crucial for evaluating search quality, information diversity, and user experience. Existing search evaluation metrics are primarily designed for ranked lists and cannot be directly applied to the synthesized outputs of generative search.

Existing Limitations

  • Lack of systematic comparative research between generative and traditional search systems
  • Existing evaluation frameworks are unsuitable for generative search outputs
  • Insufficient in-depth analysis of information source selection and concept coverage in generative search

Core Contributions

  1. First Systematic Comparison: Comprehensive source and content analysis comparing traditional and generative search
  2. Multi-dimensional Analysis Framework: Evaluation of search systems across three dimensions: information source diversity, internal/external knowledge dependency, and concept coverage
  3. Large-scale Empirical Study: Comprehensive experiments spanning 6 datasets and 4,606 queries
  4. Timeliness Analysis: Assessment of how different search systems handle time-sensitive queries
  5. Innovative Evaluation Methods: Proposal of new evaluation standards and methods applicable to generative search

Methodology Details

Research Questions

This study aims to answer three core research questions:

  • RQ1: To what extent do generative AI models leverage their ability to process more search results to access more diverse information sources?
  • RQ2: To what extent do generative search engines rely on external web knowledge versus internal LLM knowledge?
  • RQ3: Does reliance on more diverse information sources and the use of internal knowledge enable generative AI models to produce more diverse outputs?

Experimental Architecture

Search Engine Selection

  • Traditional Search: Google Organic search results
  • Generative Search:
    • Google AI Overview (AIO)
    • Gemini-2.5-Flash with Google Search
    • GPT-4o Search (GPT-Search)
    • GPT-4o with Search Tool (GPT-Tool)

Analysis Dimensions

  1. Source Analysis:
    • Link count statistics
    • Website popularity ranking (based on Tranco list)
    • Information source type classification (using Google content categories and custom classifications)
    • Overlap analysis with traditional search results
  2. Content Analysis:
    • Response length and structure analysis
    • Concept coverage assessment (using LLooM framework)
    • Concept density calculation
    • Cross-engine concept overlap analysis

Technical Innovations

  1. Concept Induction Method: Employs LLooM (LLM-powered topic inference framework) for concept discovery and classification
  2. Multi-level Overlap Analysis: Overlap calculation from URL level to domain level
  3. Temporal Dimension Evaluation: Assessment of timeliness through trend queries and temporal stability analysis
  4. Cross-geographic Verification: Experimental validation across two geographic locations: the United States and Germany
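
The URL-level and domain-level overlap mentioned above can be illustrated with a minimal Python sketch. The normalization rules (dropping scheme, query string, fragment, trailing slash, and a leading "www.") and all function names are assumptions made for illustration; the summary does not state how the paper normalizes links or which set is used as the denominator.

```python
from urllib.parse import urlsplit


def host_of(url: str) -> str:
    """Return the lowercase host without a leading 'www.' (a crude domain proxy)."""
    host = (urlsplit(url.strip()).hostname or "").lower()
    return host[4:] if host.startswith("www.") else host


def normalize_url(url: str) -> str:
    """Reduce a URL to host plus path, ignoring scheme, query, and fragment."""
    path = urlsplit(url.strip()).path.rstrip("/")
    return f"{host_of(url)}{path}"


def overlap_fraction(cited: set[str], reference: set[str]) -> float:
    """Fraction of cited items that also appear in the reference set."""
    if not cited:
        return 0.0
    return len(cited & reference) / len(cited)


def url_and_domain_overlap(generative_links, organic_links):
    """Return (URL-level overlap, domain-level overlap) for one query."""
    gen_urls = {normalize_url(u) for u in generative_links}
    org_urls = {normalize_url(u) for u in organic_links}
    gen_domains = {host_of(u) for u in generative_links}
    org_domains = {host_of(u) for u in organic_links}
    return (overlap_fraction(gen_urls, org_urls),
            overlap_fraction(gen_domains, org_domains))


# Hypothetical links for a single query.
generative = ["https://www.example.com/a", "https://example.org/x?ref=1"]
organic = ["http://example.com/a/", "https://example.org/y"]
print(url_and_domain_overlap(generative, organic))  # (0.5, 1.0)
```

Domain-level overlap is always at least as high as URL-level overlap under this definition, which is why reporting both levels is informative: two engines may cite different pages from the same sites.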

Experimental Setup

Datasets

The study employs 6 datasets totaling 4,606 queries:

  1. MS Marco (1,000 queries): Open-domain retrieval dataset from real Bing search queries
  2. WildChat (1,750 queries): Information-seeking queries filtered from ChatGPT user interactions
  3. AllSides (332 queries): Queries generated based on political topics
  4. Regulatory Actions (649 queries): Time-sensitive queries about Trump administration executive orders
  5. Science Queries (453 queries): Scientific topic queries based on ACM Computing Classification System
  6. Products (422 queries): Shopping queries based on top 2023 Amazon products

Evaluation Metrics

  1. Source Metrics:
    • Number of links per query
    • Website popularity ranking
    • Information source type distribution
    • URL/domain overlap rate
  2. Content Metrics:
    • Response length (character count)
    • Concept coverage rate
    • Concept density (number of concepts/text length)
    • Concept overlap (Jaccard similarity)
  3. Timeliness Metrics:
    • Trend query handling success rate
    • Temporal stability (consistency across time points)
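
As a rough illustration of the content metrics listed above, the following Python sketch computes concept coverage, concept density, and Jaccard overlap from sets of concept labels. The function names are hypothetical, and the coverage definition used here (share of all concepts induced for a query that one engine's output mentions) is an assumption; the summary only gives density as number of concepts divided by text length and names Jaccard similarity for overlap.

```python
def concept_coverage(engine_concepts: set[str], all_concepts: set[str]) -> float:
    """Assumed definition: share of the query's concepts found in one engine's output."""
    if not all_concepts:
        return 0.0
    return len(engine_concepts & all_concepts) / len(all_concepts)


def concept_density(engine_concepts: set[str], text: str) -> float:
    """Number of concepts divided by response length in characters."""
    return len(engine_concepts) / len(text) if text else 0.0


def jaccard(a: set[str], b: set[str]) -> float:
    """Jaccard similarity |A ∩ B| / |A ∪ B|; two empty sets count as identical."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


# Hypothetical concept sets for one query.
all_concepts = {"pricing", "battery life", "warranty", "reviews"}
engine_a = {"pricing", "battery life", "warranty"}
engine_b = {"battery life", "warranty", "reviews"}

print(concept_coverage(engine_a, all_concepts))  # 0.75
print(jaccard(engine_a, engine_b))               # 0.5
print(concept_density(engine_a, "x" * 1000))     # 0.003
```

Dividing by character count makes density comparable across engines whose responses differ greatly in length.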

Implementation Details

  • All queries conducted in English
  • Execution across two geographic locations: United States and Germany
  • Generative model temperature parameter set to 0 (if supported)
  • Maximum new tokens set to 1,000
  • Experimental period: July-September 2025

Experimental Results

Main Findings

Significant Source Differences

  1. External Knowledge Dependency Variations:
    • GPT-Tool cites only 0.4 web pages per query on average
    • AIO, Gemini, and GPT-Search cite 8.6, 8.5, and 4.1 web pages respectively
    • Traditional search consistently returns 10 results
  2. Information Source Popularity:
    • Traditional search: 89% of cited websites appear in the Tranco 1M list
    • Generative search: 81%-86% appear in the list
    • Websites cited by GPT-Tool rank higher (median rank 1,124 vs. 2,352 for traditional search)
  3. Low Information Source Overlap:
    • AIO sources overlap with the top 10 traditional search results by less than 50%
    • Overlap with the top 100 results does not exceed 60%
    • On the Products dataset, overlap is only 30%
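
The popularity figures above (share of cited websites in the Tranco list and median rank) could be computed along the following lines. This is a sketch under stated assumptions: the ranking is supplied as "rank,domain" rows, the helper names are hypothetical, and the median is taken only over domains that appear in the list.

```python
import csv
from statistics import median


def load_ranks(csv_lines):
    """Parse 'rank,domain' rows into a domain -> rank dictionary."""
    return {row[1]: int(row[0]) for row in csv.reader(csv_lines) if row}


def popularity_summary(domains, rank_of):
    """Return (share of domains found in the ranking, median rank of those found)."""
    ranks = [rank_of[d] for d in domains if d in rank_of]
    share = len(ranks) / len(domains) if domains else 0.0
    return share, (median(ranks) if ranks else None)


# Hypothetical ranking rows and cited domains.
rank_of = load_ranks(["1,a.com", "100,b.org", "5000,c.net"])
cited = ["a.com", "b.org", "c.net", "unknown.xyz"]
print(popularity_summary(cited, rank_of))  # (0.75, 100)
```

A lower median rank means more popular sites, so the lower median reported for GPT-Tool indicates that its few citations point to more popular websites.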

Content Analysis Findings

  1. Response Length Variations:
    • Gemini longest (average 2505±552 characters)
    • GPT-Tool shortest (average 1018±219 characters)
    • AIO medium length but with more links
  2. Similar Concept Coverage:
    • Traditional search (all results): 78%±14%
    • GPT-Search: 78%±16%
    • Gemini: 77%±14%
    • AIO: 74%±16%
    • GPT-Tool: 71%±16%
  3. Ambiguous Query Handling:
    • Traditional search performs best on low-coverage queries (67% median coverage)
    • AIO: 55%
    • GPT-Tool: 48%

Timeliness Analysis

  1. Trend Query Handling:
    • AIO triggered in only 3% of trend queries
    • GPT-Search achieves highest concept coverage (72%)
    • GPT-Tool performs poorly on time-sensitive queries (51%)
  2. Temporal Stability:
    • Traditional search most stable (45% overlap rate)
    • Gemini second (40%)
    • AIO most variable (18% overlap rate)
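
One plausible way to quantify temporal stability is the mean pairwise overlap between the item sets an engine returns for the same query at different time points. The sketch below uses Jaccard similarity for each pair; this is an assumption, since the summary does not specify how the paper's overlap rate is defined, and the function name is hypothetical.

```python
from itertools import combinations


def temporal_stability(snapshots: list[set[str]]) -> float:
    """Mean pairwise Jaccard overlap across snapshots of the same query."""
    pairs = list(combinations(snapshots, 2))
    if not pairs:
        return 1.0  # a single snapshot is trivially stable

    def jaccard(a: set[str], b: set[str]) -> float:
        union = a | b
        return len(a & b) / len(union) if union else 1.0

    return sum(jaccard(a, b) for a, b in pairs) / len(pairs)


# Hypothetical items (for example cited URLs) collected at three time points.
snapshots = [{"a", "b"}, {"a", "b"}, {"a", "c"}]
print(round(temporal_stability(snapshots), 3))  # 0.556
```

Averaging over all pairs rather than only consecutive snapshots keeps the score from depending on the order in which the time points were collected.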

Ablation Studies

The study examines the impact of different search context sizes (low/medium/high) for GPT models:

  • Search context size has no significant impact on information source selection
  • No apparent difference in content generation quality
  • Concept coverage rates remain essentially consistent

Related Work

Traditional Search Evaluation

  • Traditional metrics including relevance, diversity, freshness, and coverage
  • Ranking evaluation methods such as nDCG and α-nDCG
  • Diversity research on political bias, geographic bias, and commercial bias

Large Language Model Evaluation

  • Capability assessment in question answering, summarization, factual grounding, and tool use
  • Retrieval-augmented generation (RAG) techniques
  • Query understanding and ranking applications

Generative Search Research

  • Verifiability, credibility, and accuracy assessment
  • Robustness to adversarial factual questions
  • Bias and fairness issues
  • New evaluation principles and benchmarks

Conclusions and Discussion

Main Conclusions

  1. Information Source Diversity: Generative search engines access a broader range of information sources, but this does not necessarily increase concept coverage
  2. Internal/External Knowledge Balance: Generative search engines differ dramatically in their reliance on internal versus external knowledge
  3. Comparable Concept Coverage: Despite different information sources, overall concept coverage is similar to traditional search
  4. Ambiguous Query Challenges: Traditional search maintains advantages in handling ambiguous queries
  5. Timeliness Differences: Models relying on internal knowledge perform poorly on time-sensitive queries

Limitations

  1. Query Scope Limitations: Covers only selected query workloads; does not consider multi-turn conversational search
  2. Language and Geographic Limitations: Uses only English queries; tested in only two countries
  3. Content Analysis Depth: Analyzes only the top 10 traditional search results; assumes users rarely click through the links
  4. Time Window Limitations: Limited evaluation time window; requires longer-term longitudinal studies
  5. Output Determinism: Uses only one output per query; does not measure output variability

Future Directions

  1. New Evaluation Methods: Develop evaluation methods that simultaneously consider information source diversity, concept coverage, and synthesis behavior
  2. Multilingual Extension: Extend to multilingual queries and multi-turn interactions
  3. Deep Content Analysis: Compare summarization analysis with full-page content assessment
  4. Longitudinal Studies: Capture temporal drift from model updates and emerging events
  5. Fact-checking Integration: Combine coverage metrics with fact-checking and credibility assessment

In-depth Evaluation

Strengths

  1. Comprehensive Research Design: Systematic comparison across multiple search engines, datasets, and geographic locations
  2. Methodological Innovation: First application of concept induction methods to search engine comparison
  3. High Practical Value: Provides important insights for search engine design and evaluation
  4. Timeliness Focus: Particular attention to handling of time-sensitive queries
  5. Objective and Balanced: Presents both advantages and limitations of generative search

Weaknesses

  1. LLM-dependent Concept Analysis: Use of LLM for concept induction may introduce bias
  2. Strong Assumptions: Assumes users do not click links, do not go beyond top 10 results, etc.
  3. Limited Evaluation Metrics: Primarily focuses on concept coverage; lacks accuracy and credibility assessment
  4. Short Time Span: Only two months of temporal stability analysis may be insufficient

Impact

  1. Academic Contribution: Provides new theoretical framework and methods for generative search evaluation
  2. Practical Value: Offers important reference for search engine developers and users
  3. Policy Implications: Provides scientific evidence for search engine regulation and standard-setting
  4. Future Research: Establishes foundation for subsequent related research

Applicable Scenarios

  1. Search Engine Evaluation: Applicable to comparative evaluation of traditional and generative search engines
  2. Product Development: Provides guidance for search product design and optimization
  3. Academic Research: Offers methods and data for information retrieval and AI field research
  4. User Education: Helps users understand the characteristics and applicable scenarios of different search tools

References

The paper cites 41 relevant references covering important works in traditional search evaluation, large language model evaluation, generative search, and other related research domains, providing a solid theoretical foundation for the research.


This research makes important contributions to understanding the characteristics of web search in the age of generative AI, not only revealing key differences between traditional and generative search, but also providing new perspectives and methods for the design and evaluation of future search systems.