Increasingly, web content is automatically generated by large language models (LLMs) with little human input. We call this "LLM-dominant" content. Since LLMs plagiarize and hallucinate, LLM-dominant content can be unreliable and unethical. Yet, websites rarely disclose such content, and human readers struggle to distinguish it. Thus, we must develop reliable detectors for LLM-dominant content. However, state-of-the-art LLM detectors are inaccurate on web content, because web content has low positive rates, complex markup, and diverse genres, instead of clean, prose-like benchmark data SoTA detectors are optimized for.
We propose a highly reliable, scalable pipeline that classifies entire websites. Instead of naively classifying text extracted from each page, we classify each site based on an LLM text detector's outputs of multiple prose-like pages to boost accuracies. We train and evaluate our detector by collecting 2 distinct ground truth datasets totaling 120 sites, and obtain 100% accuracies testing across them. In the wild, we detect a sizable portion of sites as LLM-dominant among 10k sites in search engine results and 10k in Common Crawl archives. We find LLM-dominant sites are growing in prevalence and rank highly in search results, raising questions about their impact on end users and the overall Web ecosystem.
- Paper ID: 2507.13933
- Title: Poster: Did I Just Browse A Website Written by LLMs?
- Authors: Sichang Steven He, Ramesh Govindan, Harsha V. Madhyastha (University of Southern California)
- Classification: cs.NI cs.AI cs.CL cs.IR
- Publication Venue/Date: IMC '25 (2025 ACM Internet Measurement Conference), October 28–31, 2025, Madison, WI, USA
- Paper Link: https://doi.org/10.1145/3730567.3768603
With the rise of large language models (LLMs), a growing share of web content is generated automatically with minimal human input. The authors call this "LLM-dominant" content (rendered as "LLM-dominated" in the rest of this summary). Because LLMs plagiarize and hallucinate, such content can be unreliable and unethical. Yet websites rarely disclose it, and human readers struggle to recognize it, so reliable detectors are needed. Existing state-of-the-art LLM detectors are inaccurate on web content because of its low positive rate, complex markup, and diverse genres, which differ from the clean, prose-like benchmark data those detectors are optimized for.
This paper proposes a highly reliable and scalable pipeline for classifying entire websites. Rather than naively classifying text extracted from each page, the method classifies each site from an LLM text detector's outputs on multiple prose-like pages, which improves accuracy. The authors collect two distinct ground-truth datasets (120 sites in total) for training and evaluation and obtain 100% accuracy when testing across them. In the wild, the pipeline flags a sizable portion of sites as LLM-dominated among roughly 10,000 sites from search engine results and 10,000 from Common Crawl archives; these sites are growing in prevalence and rank highly in search results.
- Core Problem: How to reliably detect "LLM-dominated" website content generated by large language models on the web
- Problem Significance:
- LLM-generated content suffers from plagiarism and hallucination issues that may mislead users
- The EU AI Act requires disclosure of AI usage, yet websites rarely comply
- Humans struggle to distinguish LLM-generated content
The authors identify three key challenges:
- Inaccuracy of Text Detectors: Existing state-of-the-art detectors perform poorly in real-world settings with low false positive rate requirements
- Web Content Noise: Detectors are designed for clean prose and perform poorly on diverse web content types (e.g., link lists, privacy statements)
- Lack of Real-World Labels: While many benchmark datasets exist for text fragment detection, webpage-level datasets are lacking
- AI services enable anyone to generate web content in bulk at low cost
- Users have begun complaining about encountering LLM-dominated articles online
- There is a need to develop reliable detection methods to protect user experience and web ecosystem integrity
- Proposes a website-level LLM content detection pipeline: Improves accuracy by aggregating detection results across multiple pages
- Constructs two real-world datasets from different sources: 120 websites total for training and evaluation
- Achieves 100% cross-dataset accuracy: Demonstrates excellent performance on strict out-of-distribution testing
- Provides large-scale empirical study: Analyzes 20,000 real websites and reveals growth trends of LLM-dominated sites
- Uncovers important web ecosystem insights: LLM-dominated websites rank highly in search results and their prevalence is continuously increasing
- Input: Website URL
- Output: Binary classification result (LLM-dominated vs. human-authored)
- Constraint: Requires at least 15 pages per website that pass the filtering rules
- Randomly samples pages from website sitemaps or Wayback Machine content indices
- Accesses and renders HTML pages using Chromium
- Extracts main text content using the Trafilatura library
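The page-sampling step can be sketched as follows. This is a minimal illustration, not the authors' code: the function name `sample_sitemap_urls`, the seed parameter, and the choice to read only `<url><loc>` entries of a standard sitemap are assumptions. Rendering with Chromium and extraction with Trafilatura are left out because they require a browser and network access.

```python
import random
import xml.etree.ElementTree as ET

# Namespace defined by the sitemaps.org protocol.
SITEMAP_NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"


def sample_sitemap_urls(sitemap_xml: str, k: int, seed: int = 0) -> list[str]:
    """Return up to k distinct page URLs drawn at random from a sitemap document.

    Hypothetical helper: only <url><loc> entries are read; nested sitemap
    index files would need an extra fetch-and-recurse step.
    """
    root = ET.fromstring(sitemap_xml)
    locs = root.findall(f"{SITEMAP_NS}url/{SITEMAP_NS}loc")
    # Drop empty entries and duplicates while keeping document order.
    urls = list(dict.fromkeys(loc.text.strip() for loc in locs if loc.text))
    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    return rng.sample(urls, min(k, len(urls)))
```

Sampling more pages than the 15 to 20 needed leaves headroom for pages that are later dropped by filtering.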
- Employs the Binoculars detector for LLM text detection
- Applies strict filtering rules:
- Filters short text
- Filters content with high proportions of lists, tables, and links
- Filters duplicate text within the site
- Ensures the text that survives filtering is mostly prose
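The filtering rules above might look roughly like this in code. The thresholds (`MIN_WORDS`, `MAX_NONPROSE_RATIO`), the line-based heuristics for lists, tables, and links, and the hash-based duplicate check are all assumptions made for illustration; the summary does not give the paper's actual values or implementation.

```python
import hashlib
import re

MIN_WORDS = 100           # assumed threshold, not from the paper
MAX_NONPROSE_RATIO = 0.3  # assumed threshold, not from the paper

LIST_LINE = re.compile(r"^\s*(?:[-*•]|\d+[.)])\s+")  # bullet or numbered item
TABLE_LINE = re.compile(r"\|.*\|")                   # pipe-delimited table row
LINK = re.compile(r"https?://\S+")                   # bare URL


def nonprose_ratio(text: str) -> float:
    """Fraction of non-empty lines that look like list items, table rows, or links."""
    lines = [ln for ln in text.splitlines() if ln.strip()]
    if not lines:
        return 1.0
    flagged = sum(
        1 for ln in lines
        if LIST_LINE.match(ln) or TABLE_LINE.search(ln) or LINK.search(ln)
    )
    return flagged / len(lines)


def filter_pages(pages: list[str]) -> list[str]:
    """Keep pages that are long enough, mostly prose, and not duplicated on the site."""
    seen = set()
    kept = []
    for text in pages:
        if len(text.split()) < MIN_WORDS:
            continue  # too short to score reliably
        if nonprose_ratio(text) > MAX_NONPROSE_RATIO:
            continue  # dominated by lists, tables, or links
        # Normalize whitespace and case so trivially different copies collide.
        digest = hashlib.sha256(" ".join(text.split()).lower().encode("utf-8")).hexdigest()
        if digest in seen:
            continue  # duplicate text within the site
        seen.add(digest)
        kept.append(text)
    return kept
```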
- Samples 15-20 pages per website
- Computes Binoculars scores for each page
- Uses the 9 deciles of scores as feature vectors
- Trains a linear support vector machine (SVM) for website classification
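The classification step can be illustrated with a standard-library sketch. `statistics.quantiles(..., n=10)` yields the 9 decile cut points used as features. The trainer below is a simple stand-in (full-batch subgradient descent on the L2-regularized hinge loss, with mean-centered features), not the authors' SVM implementation, and the hyperparameters, function names, and synthetic scores are assumptions. The synthetic data also assumes that lower detector scores indicate LLM text.

```python
import statistics


def decile_features(page_scores):
    """Summarize one site's per-page detector scores by their 9 deciles."""
    return statistics.quantiles(page_scores, n=10)


def train_linear_svm(X, y, lam=0.01, lr=0.1, epochs=200):
    """Fit a linear SVM by subgradient descent on the regularized hinge loss.

    X: list of feature vectors; y: labels in {+1, -1}.
    Features are mean-centered first (an assumed preprocessing step).
    """
    n, dim = len(X), len(X[0])
    mean = [sum(x[j] for x in X) / n for j in range(dim)]
    Xc = [[x[j] - mean[j] for j in range(dim)] for x in X]
    w, b = [0.0] * dim, 0.0
    for _ in range(epochs):
        grad_w = [lam * wj for wj in w]  # gradient of (lam/2) * ||w||^2
        grad_b = 0.0
        for xi, yi in zip(Xc, y):
            margin = yi * (sum(wj * xj for wj, xj in zip(w, xi)) + b)
            if margin < 1:  # only margin violators contribute to the hinge term
                for j in range(dim):
                    grad_w[j] -= yi * xi[j] / n
                grad_b -= yi / n
        w = [wj - lr * gj for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b
    return {"mean": mean, "w": w, "b": b}


def classify_site(model, page_scores):
    """Label a site from the deciles of its page scores."""
    x = decile_features(page_scores)
    z = sum(wj * (xj - mj) for wj, xj, mj in zip(model["w"], x, model["mean"]))
    return "llm" if z + model["b"] >= 0 else "human"


# Synthetic, hypothetical scores: 10 sites per class, 18 pages per site.
llm_sites = [[0.70 + 0.002 * i + 0.005 * p for p in range(18)] for i in range(10)]
human_sites = [[1.00 + 0.002 * i + 0.005 * p for p in range(18)] for i in range(10)]
X = [decile_features(s) for s in llm_sites + human_sites]
y = [1] * 10 + [-1] * 10  # +1 = LLM-dominated, -1 = human-authored
model = train_linear_svm(X, y)
```

Using deciles rather than a single mean or a per-page vote lets the classifier see the shape of the score distribution, so a handful of misjudged pages shifts only a few features.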
- Aggregation Strategy: Rather than relying on individual page classification results, improves robustness by analyzing the distribution of scores across multiple pages
- Targeted Filtering: Drops short, duplicate, and list-, table-, or link-heavy text so that the detector mostly sees prose
- Distribution Features: Uses deciles to capture distributional characteristics of website content scores
- Website-Level Detection: Elevates detection from page-level to website-level, better aligning with practical application requirements
- Company Dataset:
- 30 human-authored company websites (from Russell 2000 stock index)
- 30 corresponding LLM-generated websites (using Wix.com's AI website builder)
- Personal Dataset:
- 30 personal websites (from IndieWeb Blogs)
- 30 corresponding LLM-generated websites (using B12.io)
- Search Engine Results: 17,036 websites (10,232 valid websites after filtering)
- Common Crawl: 10,479 randomly sampled websites (2020-2025)
- Accuracy
- False Positive Rate (FPR)
- Out-of-distribution generalization performance
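For reference, the first two metrics can be computed as below. The function names and the string labels `"llm"` and `"human"` are illustrative choices, not taken from the paper.

```python
def accuracy(y_true, y_pred):
    """Fraction of sites whose predicted label matches the true label."""
    if not y_true:
        raise ValueError("no samples")
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def false_positive_rate(y_true, y_pred, positive="llm"):
    """Fraction of truly negative (human-authored) sites flagged as positive."""
    preds_on_negatives = [p for t, p in zip(y_true, y_pred) if t != positive]
    if not preds_on_negatives:
        raise ValueError("no negative samples")
    return sum(p == positive for p in preds_on_negatives) / len(preds_on_negatives)
```

The false positive rate matters more than raw accuracy here: when few sites are truly LLM-dominated, even a small FPR can account for a large share of the flagged sites.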
- Binoculars detector (page-level)
- Comparative testing with 11 other text detectors
- Uses Binoculars as the base detector
- Linear SVM for final classification
- Samples 15-20 pages per website
- Uses 9 deciles as features
- Cross-dataset Accuracy: 100% (Company training → Personal testing, and vice versa)
- Binoculars Page-Level Accuracy: At most 93%
- SVM Website-Level Accuracy: 100% (complete separation of LLM and human websites)
- Search Engine Results:
- Detected 1,019 LLM-dominated websites (9.96%)
- LLM websites show no significant disadvantage in search rankings
- Found borderline cases that are hard to label: websites with only partial LLM content
- Common Crawl Analysis:
- Overall detection rate: 4.30% (451/10,479)
- Websites post-ChatGPT release: 7.25% (358/4,938)
- New websites in 2024-2025: 10.08% (77/764)
- False positive rate: 1.22% (16/1,315, pre-ChatGPT websites)
- Growth Trend: The proportion of LLM-dominated websites increases significantly over time
- Search Engine Bias: LLM-dominated sites make up 9.96% of the search-result sample versus 4.30% of the random Common Crawl sample, more than twice the share
- Ranking Impact: Search engines have not effectively penalized LLM-dominated content
- Content Characteristics: LLM websites are typically generic blogs with heavy advertising and false author information
- Effectiveness of aggregation analysis: Even with single-page detector accuracy of only 93%, website-level detection achieves 100%
- Importance of filtering strategy: Significantly reduces noise impact on detection performance
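A back-of-the-envelope calculation suggests why aggregating pages helps so much. The sketch below is not the paper's method (which feeds score deciles to an SVM); it models a plain majority vote and assumes each page is classified correctly with probability 0.93, independently of the others. Real pages from one site are correlated, so this is optimistic.

```python
from math import comb


def majority_vote_error(p_correct: float, n_pages: int) -> float:
    """Probability that a majority vote over n_pages independent pages is wrong.

    Assumes an odd n_pages so there are no ties. The vote fails when at most
    n_pages // 2 pages are classified correctly.
    """
    return sum(
        comb(n_pages, k) * p_correct**k * (1 - p_correct) ** (n_pages - k)
        for k in range(n_pages // 2 + 1)
    )


# 15 pages at 93% per-page accuracy: the site-level error is on the order of 1e-6.
site_error = majority_vote_error(0.93, 15)
```

Under these idealized assumptions, a 7% per-page error rate shrinks by several orders of magnitude at the site level, which is consistent with the jump from 93% to 100% reported above.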
- Existing work primarily focuses on text fragment-level detection
- Detectors like Binoculars perform well under various attacks
- However, accuracy is insufficient in real web environments
- Lacks detection methods tailored to webpage content characteristics
- Existing methods do not account for web content diversity and noise
- Primarily concentrated in the text domain
- Lacks research on ecosystem-wide impacts of AI-generated content
- The proposed aggregation pipeline separates LLM-dominated from human-authored websites with 100% cross-dataset accuracy on the 120-site ground-truth data
- LLM-dominated websites are growing rapidly on the web, particularly in search results
- Existing search engines fail to effectively identify and downrank LLM content
- The web ecosystem faces significant impacts from AI-generated content
- False Positive Issues: A 1.22% false positive rate remains, measured on pre-ChatGPT Common Crawl websites (16/1,315)
- Boundary Ambiguity: Some websites contain mixed content, making accurate classification difficult
- Dataset Scale: The ground-truth dataset is relatively small (120 websites)
- Detector Dependency: Performance is influenced by the quality of the underlying text detector
- Investigate motivations and methods of LLM content generators
- Extend to detection of AI-generated images and other AI-generated content
- Quantify the impact of AI-generated content on web ecosystem
- Improve detection methods to handle mixed-content websites
- Problem-Driven Approach: Addresses an important contemporary issue in the web environment
- Methodological Innovation: Aggregation method that elevates detection from page-level to website-level
- Rigorous Experimentation: Cross-dataset validation ensures method generalizability
- Large-Scale Validation: Testing on 20,000 real websites is convincing
- Important Findings: Reveals growth trends of LLM content on the web
- Ground-Truth Dataset Limitations: With only 120 websites, the data may not be representative
- Detector Selection: Over-reliance on Binoculars performance
- Boundary Handling: Incomplete strategies for handling mixed-content websites
- Dynamic Adaptability: Does not account for rapid LLM technology evolution affecting detection
- Academic Contribution: First systematic study of website-level LLM content detection
- Practical Value: Provides effective tools for search engines and content platforms
- Social Significance: Helps maintain web content quality and user experience
- Reproducibility: Clear method descriptions facilitate reproduction and improvement
- Search Engine Ranking: Identify and downrank low-quality AI-generated content
- Content Platform Governance: Large-scale detection of AI-generated content on platforms
- Academic Research: Analyze AI's impact on web ecosystem
- Regulatory Compliance: Assist in enforcing AI content disclosure requirements
- Barbaresi, A. (2021). Trafilatura: A Web Scraping Library and Command-Line Tool for Text Discovery and Extraction. In ACL.
- Dugan, L. et al. (2024). RAID: A Shared Benchmark for Robust Evaluation of Machine-Generated Text Detectors. In ACL.
- Hans, A. et al. (2024). Spotting LLMs with Binoculars: Zero-Shot Detection of Machine-Generated Text. In ICML.
This paper is a notable contribution to AI-generated content detection. It proposes an effective technical solution and, through large-scale empirical measurement, exposes challenges facing the current web ecosystem. Its aggregation strategy and website-level analysis offer a useful starting point for follow-up research.