Knowing Unknowns in an Age of Information Overload

Khanna
The technological revolution of the Internet has digitized the social, economic, political, and cultural activities of billions of humans. While researchers have been paying due attention to concerns of misinformation and bias, these obscure a much less researched and equally insidious problem - that of uncritically consuming incomplete information. The problem of incomplete information consumption stems from the very nature of explicitly ranked information on digital platforms, where our limited mental capacities leave us with little choice but to consume the tip of a pre-ranked information iceberg. This study makes two chief contributions. First, we leverage the context of internet search to propose an innovative metric that quantifies information completeness. For a given search query, this refers to the extent of the information spectrum that is observed during web browsing. We then validate this metric using 6.5 trillion search results extracted from daily search trends across 48 nations for one year. Second, we find causal evidence that awareness of information completeness while browsing the Internet reduces resistance to factual information, hence paving the way towards an open-minded and tolerant mindset.

Basic Information

  • Paper ID: 2510.10413
  • Title: Knowing Unknowns in an Age of Information Overload
  • Author: Saurabh Khanna (Amsterdam School of Communication Research, University of Amsterdam & Pembroke College, University of Oxford)
  • Classification: cs.CY (Computers and Society)
  • Publication Date: October 12, 2025 (arXiv preprint)
  • Paper Link: https://arxiv.org/abs/2510.10413

Abstract

The technological revolution of the internet has digitized the social, economic, political, and cultural activities of billions of humans. While researchers have focused on misinformation and bias, these concerns obscure a less-studied but equally insidious problem: the uncritical consumption of incomplete information. This problem stems from the explicit ranking of information on digital platforms; our limited cognitive capacity leaves us little choice but to consume only the tip of a pre-ranked information iceberg. This study makes two major contributions. First, it proposes an innovative metric that quantifies "information completeness" in the context of internet search and validates it on 6.5 trillion search results. Second, it provides causal evidence that awareness of information completeness while browsing the internet reduces resistance to factual information.

Research Background and Motivation

Core Research Question

The core question this research addresses is: In an age of information overload, how can people know what they don't know? Specifically, when we browse the internet, how much of the available information spectrum do we actually see?

Importance of the Problem

  1. Information Explosion: The global datasphere is projected to grow from 33 zettabytes in 2018 to 175 zettabytes in 2025, with a reported compound annual growth rate of approximately 61% (the two endpoints alone imply roughly 27% per year over seven years)
  2. Cognitive Limitations: Human cognitive capacity is finite and cannot process exponentially growing information flows
  3. Algorithmic Ordering: Internet information is inherently ordered, with users tending to view only top-ranked results
  4. Social Impact: Incomplete information consumption may lead to bias reinforcement and social fragmentation

Limitations of Existing Research

Existing research primarily focuses on two aspects:

  1. Misinformation Propagation: Studying the divergence between information and objective truth
  2. Algorithmic Fairness: Examining algorithmic bias and its impact on marginalized groups

However, these studies all rely on the existence of verifiable objective truth, whereas subjectivity and diversity of perspectives on the internet make objective truth more the exception than the rule.

Research Motivation

The author argues that we have overlooked an equally important issue: how to quantify and raise awareness of information completeness in the context of information overload and uncritical consumption of incomplete information.

Core Contributions

  1. Innovative Metric: Proposes a dynamic measurement metric for "information completeness" based on text embeddings and information retrieval techniques
  2. Large-Scale Validation: Validates the metric using 6.5 trillion search results (covering 48 countries over one year)
  3. Causal Evidence: Provides randomized controlled experimental evidence that awareness of information completeness reduces resistance to factual information
  4. Open-Source Platform: Develops an experimental open-source web search platform called Sonder that dynamically reports information completeness scores

Methodology

Task Definition

For a given search query q that returns a total of N search results, how representative of all N results are the first n results (n < N) that a user views? The question is not whether these n results contain misinformation or bias; rather, it concerns how completely they cover the available information.

Information Completeness Metric Design

Core Concept

Traditional approaches focus on the relevance between a query and individual search results:

$$I_{\text{relevance},i} = \cos(\vec{q}, \vec{r}_i) = \frac{\vec{q} \cdot \vec{r}_i}{\lVert \vec{q} \rVert \, \lVert \vec{r}_i \rVert}$$

The proposed information completeness metric focuses on semantic similarity between search results and the entire corpus of results:

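$$I_{\text{completeness},i} = \cos(\vec{C}, \vec{r}_i) = \frac{\vec{C} \cdot \vec{r}_i}{\lVert \vec{C} \rVert \, \lVert \vec{r}_i \rVert}$$

Where: $\vec{C} = \sum_{i=1}^{N} w_i \vec{r}_i$ ($w_i$ is a weight, potentially based on credibility metrics such as page rank)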
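The per-result score above can be illustrated with a minimal sketch in plain Python. It is an illustration under stated assumptions rather than the paper's implementation: the function names `cosine`, `corpus_vector`, and `completeness_per_result` are hypothetical, the 2-D vectors are toy stand-ins for real embeddings, and uniform weights are assumed when none are supplied.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # convention chosen for this sketch; zero vectors have no direction
    return dot / (norm_u * norm_v)


def corpus_vector(results, weights=None):
    """Weighted sum C = sum_i w_i * r_i over all N result embeddings."""
    if weights is None:
        weights = [1.0] * len(results)  # assumption: uniform weights
    dim = len(results[0])
    return [sum(w * r[k] for w, r in zip(weights, results)) for k in range(dim)]


def completeness_per_result(results, weights=None):
    """Cosine similarity of each result embedding with the corpus vector C."""
    c = corpus_vector(results, weights)
    return [cosine(c, r) for r in results]


# Toy 2-D "embeddings": two results on one theme, one on another.
results = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
scores = completeness_per_result(results)
print([round(s, 3) for s in scores])
```

In this toy case the corpus vector leans toward the majority theme, so the two results sharing that theme score higher than the lone outlier.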

Cumulative Information Completeness

Considering the cumulative nature of information consumption, cumulative information completeness is defined as:

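$$I_{\text{completeness},n} = \cos\left(\vec{C}, \sum_{i=1}^{n} \vec{r}_i\right) = \frac{\vec{C} \cdot \sum_{i=1}^{n} \vec{r}_i}{\lVert \vec{C} \rVert \, \left\lVert \sum_{i=1}^{n} \vec{r}_i \right\rVert}$$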
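A minimal sketch of the cumulative definition follows, again in plain Python with toy 2-D vectors. The function name `cumulative_completeness` is hypothetical and uniform weights are assumed by default; it is meant only to show how the score evolves as more ranked results are viewed.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # convention chosen for this sketch
    return dot / (norm_u * norm_v)


def cumulative_completeness(results, weights=None):
    """Return [I_1, ..., I_N], where I_n = cos(C, r_1 + ... + r_n)."""
    if weights is None:
        weights = [1.0] * len(results)  # assumption: uniform weights
    dim = len(results[0])
    corpus = [sum(w * r[k] for w, r in zip(weights, results)) for k in range(dim)]
    running = [0.0] * dim  # unweighted running sum of the viewed results
    curve = []
    for r in results:
        running = [a + b for a, b in zip(running, r)]
        curve.append(cosine(corpus, running))
    return curve


# Results in ranked order: the top two cover one theme, the rest another.
ranked = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]
curve = cumulative_completeness(ranked)
print([round(x, 3) for x in curve])
```

With uniform weights the score reaches 1 once all N results have been viewed. With non-uniform weights it may stay below 1, because the viewed results are summed without weights in the formula as written here.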

Balancing Relevance and Completeness

Provides a user-controllable balancing mechanism:

$$S_i = \lambda I_{\text{completeness},i} + (1-\lambda) I_{\text{relevance},i}$$

Where $\lambda \in [0,1]$ controls the weight of completeness and relevance.
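The blend can be sketched as below. The summary does not say how the combined score is used downstream, so re-ranking results by it is an assumption made for illustration; `blended_scores` and `rerank` are hypothetical names, and the score values are invented toy inputs.

```python
def blended_scores(completeness, relevance, lam):
    """S_i = lam * completeness_i + (1 - lam) * relevance_i for each result."""
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lam must lie in [0, 1]")
    if len(completeness) != len(relevance):
        raise ValueError("score lists must have the same length")
    return [lam * c + (1.0 - lam) * r for c, r in zip(completeness, relevance)]


def rerank(completeness, relevance, lam):
    """Return result indices sorted by blended score, highest first."""
    scores = blended_scores(completeness, relevance, lam)
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)


# Toy per-result scores (illustrative values, not taken from the paper).
completeness = [0.9, 0.4, 0.7]
relevance = [0.2, 0.95, 0.6]

order_relevance = rerank(completeness, relevance, 0.0)     # relevance only
order_balanced = rerank(completeness, relevance, 0.5)      # equal weight
order_completeness = rerank(completeness, relevance, 1.0)  # completeness only
print(order_relevance, order_balanced, order_completeness)
```

Setting lam to 0 recovers a purely relevance-based ordering and setting it to 1 a purely completeness-based one, which is what makes the parameter a user-facing control.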

Technical Implementation

  1. Text Embedding: Uses Transformer-based sentence-level embeddings (e.g., Sentence-BERT)
  2. Semantic Similarity: Calculates semantic distance between vectors through cosine similarity
  3. Information Completeness Curve: Plots cumulative completeness as a function of the proportion of results viewed
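The three steps above can be sketched end to end as follows. To keep the example self-contained, a bag-of-words count vector (`toy_embed`, a hypothetical helper) stands in for a Transformer-based sentence embedding such as Sentence-BERT; this substitution is an assumption of the sketch and loses the semantic information a real embedding model would capture. The snippets are invented for illustration.

```python
import math
from collections import Counter


def toy_embed(text, vocab):
    """Bag-of-words counts; a crude stand-in for a sentence embedding model."""
    counts = Counter(text.lower().split())
    return [float(counts[word]) for word in vocab]


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # convention chosen for this sketch
    return dot / (norm_u * norm_v)


def completeness_curve(snippets):
    """Return (proportion viewed, cumulative completeness) pairs in rank order."""
    vocab = sorted({word for s in snippets for word in s.lower().split()})
    vectors = [toy_embed(s, vocab) for s in snippets]
    dim = len(vocab)
    corpus = [sum(v[k] for v in vectors) for k in range(dim)]  # uniform weights
    running = [0.0] * dim
    points = []
    for n, v in enumerate(vectors, start=1):
        running = [a + b for a, b in zip(running, v)]
        points.append((n / len(vectors), cosine(corpus, running)))
    return points


# Hypothetical ranked result snippets for a single query.
snippets = [
    "gun laws debate",
    "gun laws debate",
    "gun ownership statistics",
    "hunting traditions history",
]
points = completeness_curve(snippets)
for proportion, score in points:
    print(f"{proportion:.2f} viewed -> completeness {score:.3f}")
```

Plotting these pairs gives the information completeness curve; a curve that rises slowly indicates that the top-ranked results cover only a narrow slice of what was retrieved.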

Experimental Setup

Large-Scale Data Validation

Dataset Scale

  • Time Span: November 16, 2021 to November 15, 2022 (one year)
  • Geographic Coverage: 48 countries across 6 continents
  • Data Volume: 6.5 trillion raw search results
  • Daily Average: 57.6 million searches, 18 billion data points
  • Result Depth: Median of 320 search results per query
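The reported figures can be cross-checked with simple arithmetic. The sketch below is only a rough order-of-magnitude check: it treats the median of 320 results per query as if it were the mean, which is an assumption, and the variable names are illustrative.

```python
DAYS_IN_STUDY = 365                # November 16, 2021 to November 15, 2022
daily_searches = 57.6e6            # reported daily average number of searches
median_results_per_query = 320     # reported median; used here as a proxy for the mean
daily_data_points = 18e9           # reported daily average number of data points

# Searches per day times results per search should approximate data points per day.
estimated_daily_points = daily_searches * median_results_per_query

# Data points per day times days in the study should approximate the total volume.
estimated_total_points = daily_data_points * DAYS_IN_STUDY

print(f"estimated daily data points: {estimated_daily_points:.3e}")
print(f"estimated yearly total:      {estimated_total_points:.3e}")
```

Both estimates land close to the reported values of 18 billion data points per day and 6.5 trillion results in total, so the scale figures are mutually consistent.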

Validation Method

The metric is validated by comparing country-level information completeness with media freedom indices (using Reporters Without Borders data).

Randomized Controlled Experiment

Experimental Design

  • Platform: Self-developed Sonder search platform
  • Participants: 876 U.S. adults (recruited via Prolific)
  • Duration: 40 minutes (5-minute pretest + 30-minute interaction + 5-minute posttest)
  • Groups: Treatment group of 434 (displaying information completeness scores), control group of 442 (normal search)

Search Topics

Five broad topics assessing openness of mind:

  1. Patriotism in our country today
  2. Openness toward immigration
  3. Abortion and its legal status
  4. Traditional values in today's society
  5. Laws regarding gun ownership

Experimental Results

Information Completeness Metric Validation

Geographic Variation Analysis

  • Lowest Completeness: Middle East and North Africa region (approximately 25% completeness on first page)
  • Highest Completeness: North America (approximately 62% completeness on first page)
  • Statistical Relationship: For each unit increase in media restriction score, information completeness decreases by 0.28 percentage points (p < 0.001)

Regional Fixed Effects

After adding regional fixed effects, the effect size decreases to 0.17 percentage points (p < 0.001), indicating that the association holds among countries within the same region and is not driven solely by differences between regions.

Behavioral Experiment Results

Openness of Mind Improvement (Result O1)

  • Overall Effect: Treatment group's openness of mind improved by 0.076 standard deviation units (p = 0.207, not significant)
  • Fact Resistance: Significantly reduced by 0.212 standard deviation units (p = 0.003, statistically significant)
  • Dogmatism: Reduced by 0.048 standard deviation units (p = 0.432, not significant)
  • Belief Personalization: Reduced by 0.012 standard deviation units (p = 0.777, not significant)
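  • Liberal Thinking: Reduced by 0.032 standard deviation units (not significant; the reported value of p = 1.302 exceeds the maximum possible p-value of 1 and is presumably a transcription error)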

Browsing Behavior Changes (Result O2)

  • Search Depth: Treatment group's lowest-ranked viewed results extended an average of 6.14 positions further down (p < 0.001)
  • Click Count: Treatment group clicked on an average of 2.182 additional results (p = 0.312, not significant)
  • Completeness Improvement: Treatment group's clicked results had information completeness scores 7.6 percentage points higher (p = 0.001)

Related Work

Search Engine Evolution

  1. Early Solutions (1990s): Archie, Gopher, WAIS, and other keyword-based systems
  2. Google's Rise (1998): The PageRank algorithm revolutionized ranking by using link structure to assess page importance
  3. Modern Solutions: AI and machine learning-driven personalized search

Information Quality Research

  • Misinformation Detection: Focuses on divergence between information and objective truth
  • Algorithmic Fairness: Studies algorithmic bias and its impact on marginalized groups
  • Filter Bubbles: Information echo chambers resulting from personalized recommendations

Conclusions and Discussion

Main Conclusions

  1. Metric Validity: The information completeness metric tracks media freedom levels across countries and regions, with lower completeness where media restriction scores are higher
  2. Cognitive Impact: Information completeness awareness primarily improves knowledge-related dimensions (reducing fact resistance), with limited impact on interpersonal dimensions
  3. Behavioral Change: Users actively explore deeper, more complete search results

Limitations

  1. Technical Dependency: Metric quality depends on the quality of text embeddings, which may be affected by training data bias
  2. Cultural Limitations: The concept of openness of mind, measured here as Actively Open-minded Thinking (AOT), originates from Western psychology, with limited cross-cultural applicability
  3. Understanding Threshold: Participants' understanding of the information completeness concept affects treatment effectiveness

Future Directions

  1. Magnitude Effects: Study how variations in information completeness scores affect openness of mind
  2. Social Media Extension: Extend research to social media platforms with personalized information sources
  3. Educational Interventions: Develop educational programs to raise public awareness of information completeness

In-Depth Evaluation

Strengths

  1. Problem Innovation: Identifies and quantifies information incompleteness, an overlooked yet important problem
  2. Methodological Rigor: Combines large-scale observational data with randomized controlled experiments, providing substantial empirical evidence
  3. Practical Value: Develops an open-source search platform with real-world application potential
  4. Interdisciplinary Integration: Integrates theories and methods from information retrieval, psychology, political science, and other fields

Limitations

  1. Causal Inference Constraints: Country-level analysis is primarily correlational, lacking strong causal identification
  2. Sample Representativeness: Experiments limited to U.S. adults; generalizability of results remains to be verified
  3. Long-Term Effects Unknown: Experiments observe only short-term effects; long-term impacts remain unclear
  4. Algorithm Transparency: The "black box" nature of text embedding algorithms may affect metric interpretability

Impact

  1. Academic Contribution: Provides a new theoretical framework and measurement tool for information quality assessment
  2. Policy Significance: Offers objective metrics for evaluating national information environment quality
  3. Technological Application: Provides directions for improving search engines and information platforms
  4. Social Value: Helps enhance public information literacy and critical thinking

Application Scenarios

  1. Search Engine Optimization: Helps users better evaluate the completeness of search results
  2. Media Regulation: Provides tools for governments and organizations to assess information environment quality
  3. Education and Training: Used to cultivate students' and the public's information literacy
  4. Academic Research: Provides new measurement tools and theoretical frameworks for related field research

References

This paper cites rich interdisciplinary literature, covering:

  • Information retrieval and natural language processing (Vaswani et al., 2017; Devlin et al., 2018)
  • Psychology and cognitive science (Baron, 2000; Stanovich & West, 2007)
  • Political science and communication studies (Dahlberg, 2001; Lazer et al., 2020)
  • Computational social science (Hofman et al., 2021; Vosoughi et al., 2018)

This research presents an important and innovative perspective in the age of information overload. Through rigorous methodology and large-scale empirical research, it makes significant contributions to understanding and improving our interaction with digital information. Despite certain limitations, its theoretical value and practical significance merit attention and further development.