
Preprint: Poster: Did I Just Browse A Website Written by LLMs?

He, Govindan, Madhyastha

Increasingly, web content is automatically generated by large language models (LLMs) with little human input. We call this "LLM-dominant" content. Since LLMs plagiarize and hallucinate, LLM-dominant content can be unreliable and unethical. Yet, websites rarely disclose such content, and human readers struggle to distinguish it. Thus, we must develop reliable detectors for LLM-dominant content. However, state-of-the-art LLM detectors are inaccurate on web content, because web content has low positive rates, complex markup, and diverse genres, instead of clean, prose-like benchmark data SoTA detectors are optimized for. We propose a highly reliable, scalable pipeline that classifies entire websites. Instead of naively classifying text extracted from each page, we classify each site based on an LLM text detector's outputs of multiple prose-like pages to boost accuracies. We train and evaluate our detector by collecting 2 distinct ground truth datasets totaling 120 sites, and obtain 100% accuracies testing across them. In the wild, we detect a sizable portion of sites as LLM-dominant among 10k sites in search engine results and 10k in Common Crawl archives. We find LLM-dominant sites are growing in prevalence and rank highly in search results, raising questions about their impact on end users and the overall Web ecosystem.

Basic Information

Paper ID: 2507.13933
Title: Poster: Did I Just Browse A Website Written by LLMs?
Authors: Sichang Steven He, Ramesh Govindan, Harsha V. Madhyastha (University of Southern California)
Classification: cs.NI cs.AI cs.CL cs.IR
Publication Venue/Date: IMC '25 (2025 ACM Internet Measurement Conference), October 28–31, 2025, Madison, WI, USA
Paper Link: https://doi.org/10.1145/3730567.3768603

Abstract

With the rise of large language models (LLMs), a growing share of web content is generated automatically by LLMs with minimal human input. The authors call this "LLM-dominant" content. Because LLMs plagiarize and hallucinate, LLM-dominant content can be unreliable and unethical. Websites rarely disclose such content, however, and human readers struggle to distinguish it, so reliable detectors are needed. Existing state-of-the-art LLM detectors are inaccurate on web content because it has low positive rates, complex markup, and diverse genres, unlike the clean, prose-like benchmark data on which those detectors are optimized.

This paper proposes a highly reliable, scalable pipeline that classifies entire websites. Rather than naively classifying the text extracted from each page, the method classifies each site from an LLM text detector's outputs on multiple prose-like pages, which boosts accuracy. The authors train and evaluate the detector on two distinct ground truth datasets (120 sites in total) and obtain 100% accuracy when testing across them. In the wild, the pipeline flags a sizable portion of sites as LLM-dominant among 10,000 sites from search engine results and 10,000 from Common Crawl archives. LLM-dominant sites are growing in prevalence and rank highly in search results.

Research Background and Motivation

Problem Definition

Core Problem: How to reliably detect "LLM-dominant" websites, whose content is generated by large language models with little human input
Problem Significance:
- LLM-generated content suffers from plagiarism and hallucination issues that may mislead users
- The EU AI Act requires disclosure of AI usage, yet websites rarely comply
- Humans struggle to distinguish LLM-generated content

Limitations of Existing Methods

The authors identify three key challenges:

Inaccuracy of Text Detectors: Existing state-of-the-art detectors are not accurate enough at the low false positive rates that real-world use requires
Web Content Noise: Detectors are designed for clean prose and perform poorly on other kinds of web content (e.g., link lists, privacy statements)
Lack of Real-World Labels: Many benchmark datasets exist for detecting text fragments, but labeled datasets of real webpages and websites are lacking

Research Motivation

AI services enable anyone to generate web content in bulk at low cost
Users have begun complaining about encountering LLM-dominant articles online
There is a need to develop reliable detection methods to protect user experience and web ecosystem integrity

Core Contributions

Proposes a website-level LLM content detection pipeline: Improves accuracy by aggregating detection results across multiple pages
Constructs two ground truth datasets from different sources: 120 websites total for training and evaluation
Achieves 100% cross-dataset accuracy: Demonstrates excellent performance on strict out-of-distribution testing
Provides large-scale empirical study: Analyzes about 20,000 real websites and reveals growth trends of LLM-dominant sites
Uncovers important web ecosystem insights: LLM-dominant websites rank highly in search results and their prevalence is increasing over time

Methodology Details

Task Definition

Input: Website URL
Output: Binary classification result (LLM-dominant vs. human-authored)
Constraint: Requires at least 15 pages per website that pass filtering

Model Architecture

1. Text Acquisition

Randomly samples pages from website sitemaps or Wayback Machine content indices
Accesses and renders HTML pages using Chromium
Extracts main text content using the Trafilatura library
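
The sampling step can be sketched as follows. This is a minimal illustration rather than the authors' code: the function name `sample_sitemap_urls`, the `seed` parameter, and the example URLs are hypothetical, and rendering with Chromium and extraction with Trafilatura are left out so that the sketch needs only the standard library.

```python
import random
import xml.etree.ElementTree as ET

# XML namespace defined by the sitemaps.org protocol.
SITEMAP_NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"

# Hypothetical sitemap used only for illustration.
EXAMPLE_SITEMAP = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/about</loc></url>
  <url><loc>https://example.com/blog/post-1</loc></url>
  <url><loc>https://example.com/blog/post-2</loc></url>
  <url><loc>https://example.com/contact</loc></url>
</urlset>"""


def sample_sitemap_urls(sitemap_xml: str, k: int, seed: int = 0) -> list[str]:
    """Return up to k page URLs drawn uniformly at random from a sitemap.

    Nested sitemap index files are not followed in this sketch.
    """
    root = ET.fromstring(sitemap_xml)
    urls = [loc.text.strip() for loc in root.iter(f"{SITEMAP_NS}loc") if loc.text]
    rng = random.Random(seed)  # fixed seed keeps the sample reproducible
    return rng.sample(urls, min(k, len(urls)))


sampled = sample_sitemap_urls(EXAMPLE_SITEMAP, k=2)
```

Each sampled URL would then be rendered and passed to the text extractor before scoring.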

2. Scoring and Filtering

Employs the Binoculars detector for LLM text detection
Applies strict filtering rules:
- Filters short text
- Filters content with high proportions of lists, tables, and links
- Filters duplicate text within the site
Ensures that most text passing the filters is prose
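
The three filtering rules above could look roughly like the sketch below. The summary does not give the paper's thresholds or how non-prose content is measured, so `MIN_WORDS`, `MAX_NON_PROSE_RATIO`, and the `Page` fields `link_chars` and `list_table_chars` are all hypothetical placeholders.

```python
from dataclasses import dataclass


@dataclass
class Page:
    url: str
    text: str                  # main text, e.g. as returned by a text extractor
    link_chars: int = 0        # hypothetical: characters of text inside links
    list_table_chars: int = 0  # hypothetical: characters inside lists or tables


MIN_WORDS = 100            # hypothetical minimum length
MAX_NON_PROSE_RATIO = 0.3  # hypothetical cap on list/table/link content


def filter_prose_pages(pages: list[Page]) -> list[Page]:
    """Keep pages that are long enough, mostly prose, and not duplicates."""
    seen_texts: set[str] = set()
    kept: list[Page] = []
    for page in pages:
        words = page.text.split()
        if len(words) < MIN_WORDS:
            continue  # rule 1: drop short text
        non_prose = page.link_chars + page.list_table_chars
        if non_prose / len(page.text) > MAX_NON_PROSE_RATIO:
            continue  # rule 2: drop pages dominated by lists, tables, links
        key = " ".join(words).lower()  # whitespace- and case-normalized text
        if key in seen_texts:
            continue  # rule 3: drop duplicate text within the site
        seen_texts.add(key)
        kept.append(page)
    return kept
```

Only exact duplicates (after normalization) are caught here; near-duplicate detection would need a different technique.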

3. Aggregate Analysis

Samples 15-20 pages per website
Computes Binoculars scores for each page
Uses the 9 deciles of each website's page scores as its feature vector
Trains a linear support vector machine (SVM) for website classification
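
As an illustration of the feature step, the sketch below turns one site's page scores into a 9-value decile vector using Python's `statistics.quantiles`. The function name `decile_features`, the example scores, and the choice of the "inclusive" interpolation method are assumptions; the summary does not say how the authors compute deciles.

```python
import statistics


def decile_features(page_scores: list[float]) -> list[float]:
    """Summarize a site's page-level detector scores as 9 decile cut points."""
    if len(page_scores) < 2:
        raise ValueError("need at least two page scores to compute deciles")
    # n=10 splits the data into 10 groups, which yields 9 cut points.
    return statistics.quantiles(page_scores, n=10, method="inclusive")


# Hypothetical detector scores for 15 pages of one site.
example_scores = [0.71, 0.74, 0.76, 0.78, 0.79, 0.80, 0.81, 0.83,
                  0.84, 0.86, 0.88, 0.90, 0.93, 0.97, 1.02]
features = decile_features(example_scores)
```

A vector like `features`, one per site, is what a linear SVM would be trained on; any standard implementation could fill that role.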

Technical Innovations

Aggregation Strategy: Rather than relying on individual page classification results, improves robustness by analyzing the distribution of scores across multiple pages
Intelligent Filtering: Designs specialized filtering strategies tailored to web content diversity
Distribution Features: Uses deciles to capture distributional characteristics of website content scores
Website-Level Detection: Elevates detection from page-level to website-level, better aligning with practical application requirements

Experimental Setup

Datasets

Baseline Dataset (120 websites, 2,630 filtered pages)

Company Dataset:
- 30 human-authored company websites (from Russell 2000 stock index)
- 30 corresponding LLM-generated websites (using Wix.com's AI website builder)
Personal Dataset:
- 30 personal websites (from IndieWeb Blogs)
- 30 corresponding LLM-generated websites (using B12.io)

Wild Dataset

Search Engine Results: 17,036 websites (10,232 valid websites after filtering)
Common Crawl: 10,479 randomly sampled websites (2020-2025)

Evaluation Metrics

Accuracy
False Positive Rate (FPR)
Out-of-distribution generalization performance

Baseline Methods

Binoculars detector (page-level)
Comparative testing with 11 other text detectors

Implementation Details

Uses Binoculars as the base detector
Linear SVM for final classification
Samples 15-20 pages per website
Uses 9 deciles as features

Experimental Results

Main Results

Baseline Dataset Performance

Cross-dataset Accuracy: 100% (Company training → Personal testing, and vice versa)
Binoculars Page-Level Accuracy: Maximum 93%
SVM Website-Level Accuracy: 100% (complete separation of LLM and human websites)
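
The cross-dataset protocol (train on one dataset, test on the other, then swap) can be sketched as below. To stay dependency-free, the sketch swaps in a simple threshold on each site's median score as a stand-in classifier; it is not the authors' decile-based SVM. The toy scores are invented, and the sketch assumes that lower detector scores indicate LLM-written text.

```python
from statistics import median

# A site is (page scores, label); label 1 = LLM-dominant, 0 = human-authored.
Site = tuple[list[float], int]


def fit_threshold(train: list[Site]) -> float:
    """Stand-in classifier: midpoint between the two classes' mean site medians."""
    llm = [median(scores) for scores, label in train if label == 1]
    human = [median(scores) for scores, label in train if label == 0]
    return (sum(llm) / len(llm) + sum(human) / len(human)) / 2


def predict(threshold: float, scores: list[float]) -> int:
    # Assumption: lower scores mean the text looks more LLM-generated.
    return 1 if median(scores) < threshold else 0


def accuracy(threshold: float, test: list[Site]) -> float:
    correct = sum(predict(threshold, scores) == label for scores, label in test)
    return correct / len(test)


def cross_dataset_accuracy(a: list[Site], b: list[Site]) -> tuple[float, float]:
    """Return (train on a, test on b) and (train on b, test on a) accuracy."""
    return accuracy(fit_threshold(a), b), accuracy(fit_threshold(b), a)


# Invented toy data standing in for the Company and Personal datasets.
company = [([0.70, 0.72, 0.75], 1), ([0.95, 1.00, 1.05], 0)]
personal = [([0.68, 0.70, 0.74], 1), ([0.90, 0.98, 1.10], 0)]
result = cross_dataset_accuracy(company, personal)
```

Because training and test sites come from different sources, a high score in both directions is evidence of out-of-distribution generalization rather than memorization of one dataset.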

Wild Detection Results

Search Engine Results:
- Detected 1,019 LLM-dominant websites (9.96% of the 10,232 valid websites)
- LLM-dominant websites show no significant disadvantage in search rankings
- Found borderline cases: websites where only part of the content is LLM-generated
Common Crawl Analysis:
- Overall detection rate: 4.30% (451/10,479)
- Websites post-ChatGPT release: 7.25% (358/4,938)
- New websites in 2024-2025: 10.08% (77/764)
- False positive rate: 1.22% (16/1,315, pre-ChatGPT websites)
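
The percentages above follow directly from the reported counts. The short check below recomputes them; the dictionary keys are descriptive labels chosen for this sketch, and the counts are the ones quoted in this summary.

```python
# (sites flagged as LLM-dominant, sites examined), as quoted in this summary.
REPORTED = {
    "search engine results": (1019, 10232),
    "Common Crawl, overall": (451, 10479),
    "Common Crawl, post-ChatGPT": (358, 4938),
    "Common Crawl, new in 2024-2025": (77, 764),
    "Common Crawl, pre-ChatGPT (false positives)": (16, 1315),
}


def rate_percent(flagged: int, total: int) -> float:
    """Share of flagged sites as a percentage, rounded to two decimals."""
    return round(100 * flagged / total, 2)


RATES = {name: rate_percent(f, t) for name, (f, t) in REPORTED.items()}
```

The pre-ChatGPT figure works as a false positive estimate because sites archived before ChatGPT's release are assumed to be human-authored.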

Key Findings

Growth Trend: The proportion of LLM-dominant websites increases significantly over time
Search Engine Bias: LLM-dominant websites make up a larger share of search results (9.96%) than of the random Common Crawl sample (4.30%)
Ranking Impact: Search engines have not effectively penalized LLM-dominant content
Content Characteristics: LLM websites are typically generic blogs with heavy advertising and false author information

Ablation Studies

Effectiveness of aggregation analysis: Although the page-level detector reaches at most 93% accuracy, website-level detection achieves 100%
Importance of filtering strategy: Significantly reduces noise impact on detection performance
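
A back-of-the-envelope calculation gives some intuition for why aggregating pages helps. The sketch below assumes a simple majority vote over pages with independent page-level errors. Both are simplifying assumptions made for illustration: the pipeline described here uses decile features with an SVM, not a majority vote, and errors on pages of the same site are unlikely to be fully independent.

```python
from math import comb


def majority_vote_error(page_accuracy: float, n_pages: int) -> float:
    """Probability that more than half of n_pages are misclassified,
    assuming independent errors with the same per-page accuracy.
    For an even n_pages, a tie is counted as a correct site-level decision.
    """
    p_wrong = 1 - page_accuracy
    k_min = n_pages // 2 + 1  # smallest number of wrong pages that flips the vote
    return sum(
        comb(n_pages, k) * p_wrong**k * page_accuracy ** (n_pages - k)
        for k in range(k_min, n_pages + 1)
    )


# Site-level error under these assumptions, with 93% page accuracy and 15 pages.
site_error = majority_vote_error(0.93, 15)
```

Under these assumptions the site-level error is several orders of magnitude smaller than the 7% page-level error, which is consistent with the jump from 93% to 100% reported above.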

Related Work

Text Detection Field

Existing work primarily focuses on text fragment-level detection
Detectors like Binoculars perform well under various attacks
However, accuracy is insufficient in real web environments

Web Content Analysis

Lacks detection methods tailored to webpage content characteristics
Existing methods do not account for web content diversity and noise

AI-Generated Content Detection

Primarily concentrated in the text domain
Lacks research on ecosystem-wide impacts of AI-generated content

Conclusions and Discussion

Main Conclusions

The proposed aggregation pipeline separates LLM-dominant from human-authored websites with 100% accuracy on the 120-site dataset
LLM-dominant websites are growing in prevalence on the web and are especially common in search results
Existing search engines fail to effectively identify and downrank LLM-dominant content
The web ecosystem faces significant impacts from AI-generated content

Limitations

False Positive Issues: A 1.22% false positive rate remains on pre-ChatGPT websites (16 of 1,315)
Boundary Ambiguity: Some websites contain mixed content, making accurate classification difficult
Dataset Scale: The baseline dataset is relatively small (120 websites)
Detector Dependency: Performance is influenced by the quality of the underlying text detector

Future Directions

Investigate motivations and methods of LLM content generators
Extend to detection of AI-generated images and other AI-generated content
Quantify the impact of AI-generated content on web ecosystem
Improve detection methods to handle mixed-content websites

In-Depth Evaluation

Strengths

Problem-Driven Approach: Addresses an important contemporary issue in the web environment
Methodological Innovation: Aggregation method that elevates detection from page-level to website-level
Rigorous Experimentation: Cross-dataset validation ensures method generalizability
Large-Scale Validation: Testing on 20,000 real websites is convincing
Important Findings: Reveals growth trends of LLM content on the web

Weaknesses

Baseline Dataset Limitations: Only 120 websites may lack sufficient representativeness
Detector Selection: Over-reliance on Binoculars performance
Boundary Handling: Incomplete strategies for handling mixed-content websites
Dynamic Adaptability: Does not account for rapid LLM technology evolution affecting detection

Impact

Academic Contribution: First systematic study of website-level LLM content detection
Practical Value: Provides effective tools for search engines and content platforms
Social Significance: Helps maintain web content quality and user experience
Reproducibility: Clear method descriptions facilitate reproduction and improvement

Application Scenarios

Search Ranking: Identify and downrank low-quality AI-generated content
Content Platform Governance: Large-scale detection of AI-generated content on platforms
Academic Research: Analyze AI's impact on web ecosystem
Regulatory Compliance: Assist in enforcing AI content disclosure requirements

References

Barbaresi, A. (2021). Trafilatura: A Web Scraping Library and Command-Line Tool for Text Discovery and Extraction. In ACL.
Dugan, L. et al. (2024). RAID: A Shared Benchmark for Robust Evaluation of Machine-Generated Text Detectors. In ACL.
Hans, A. et al. (2024). Spotting LLMs with Binoculars: Zero-Shot Detection of Machine-Generated Text. In ICML.

This paper makes a notable contribution to AI-generated content detection. It proposes an effective technical solution and, through large-scale measurement, reveals challenges facing the current web ecosystem. Its aggregation strategy and website-level analysis offer useful starting points for subsequent research.