2025-11-12T00:34:29.273016

Shifting AI Efficiency From Model-Centric to Data-Centric Compression

Liu, Wen, Wang et al.
The advancement of large language models (LLMs) and multi-modal LLMs (MLLMs) has historically relied on scaling model parameters. However, as hardware limits constrain further model growth, the primary computational bottleneck has shifted to the quadratic cost of self-attention over increasingly long sequences, driven by ultra-long text contexts, high-resolution images, and extended videos. In this position paper, we argue that the focus of research for efficient artificial intelligence (AI) is shifting from model-centric compression to data-centric compression. We position data-centric compression as the emerging paradigm, which improves AI efficiency by directly compressing the volume of data processed during model training or inference. To formalize this shift, we establish a unified framework for existing efficiency strategies and demonstrate why it constitutes a crucial paradigm change for long-context AI. We then systematically review the landscape of data-centric compression methods, analyzing their benefits across diverse scenarios. Finally, we outline key challenges and promising future research directions. Our work aims to provide a novel perspective on AI efficiency, synthesize existing efforts, and catalyze innovation to address the challenges posed by ever-increasing context lengths.
academic

Basic Information

  • Paper ID: 2505.19147
  • Title: Shifting AI Efficiency From Model-Centric to Data-Centric Compression
  • Authors: Xuyang Liu, Zichen Wen, Shaobo Wang, Junjie Chen, Zhishan Tao, Yubo Wang, Tailai Chen, Xiangqi Jin, Chang Zou, Yiyu Wang, Chenfei Liao, Xu Zheng, Honggang Chen, Weijia Li, Xuming Hu, Conghui He, Linfeng Zhang
  • Classification: cs.CL, cs.AI, cs.CV
  • Publication Date/Venue: arXiv preprint (May 2025)
  • Paper Link: https://arxiv.org/abs/2505.19147

Abstract

With the development of Large Language Models (LLMs) and Multimodal Large Language Models (MLLMs), traditional approaches that rely on expanding model parameters to improve performance face hardware constraints. The primary computational bottleneck has shifted from model scale to the quadratic complexity overhead of self-attention mechanisms when processing ultra-long text contexts, high-resolution images, and extended videos. This paper proposes that the focus of AI efficiency research should shift from model-centric compression to data-centric compression. Data-centric compression improves AI efficiency by directly compressing the volume of data processed during training or inference. The paper establishes a unified efficiency strategy framework, systematically reviews the landscape of data-centric compression methods, analyzes their advantages across different scenarios, and outlines key challenges and future research directions.

Research Background and Motivation

Problem Definition

The core problem addressed in this paper is: as the context length processed by AI models grows dramatically, how can we effectively address the resulting computational efficiency challenges?

Importance Analysis

  1. Technological Trend Shifts: From 2022 to 2024, AI performance improvements relied primarily on scaling model size, but by 2024 model size had plateaued (at approximately 1T parameters), while context length continues to grow exponentially
  2. Computational Bottleneck Migration: The dominant cost has shifted from computation that scales linearly with parameter count to the quadratic complexity O(n²) of self-attention in the sequence length n
  3. Cross-domain Requirements: Language models require processing longer reasoning chains, vision models need to handle higher-resolution images and longer videos, and generative models need to create higher-quality content

Limitations of Existing Methods

Traditional model-centric compression methods (quantization, pruning, distillation, low-rank decomposition) primarily optimize the model parameters W and cannot effectively address the challenges posed by growing context lengths. Because they still process the complete input X, the quadratic cost of self-attention over long sequences remains unchanged.

Research Motivation

Based on in-depth analysis of AI development trends, the authors propose data-centric compression as an emerging paradigm that addresses long-context challenges by directly reducing the volume of processed data, offering superior generality, efficiency, and compatibility.

Core Contributions

  1. Paradigm Shift Analysis: Analyzes the critical transition in AI efficiency research from parameter-centric to context-centric computational bottlenecks, arguing for the necessity of efficiency optimization paradigm transformation
  2. Unified Theoretical Framework: Establishes a unified mathematical formulation framework encompassing architectural design, model-centric compression, and data-centric compression
  3. Systematic Survey: Conducts comprehensive investigation of data-centric compression methods, constructs a unified classification framework, and analyzes advantages across different scenarios
  4. Challenges and Directions: Provides in-depth analysis of current challenges and proposes promising future research directions, aiming to catalyze innovation in this field

Methodology Details

Task Definition

Data-centric compression transforms the original input sequence X into a compressed representation X' through a compression operation Φ, such that |X'| < |X|, while preserving model performance as far as possible.

Unified Framework

Given input data X and network parameters W, the output of neural network F is:

Y = F(W, X)

Efficiency optimization can be approached from three perspectives:

  1. Efficient Computational Architecture (F): Design architectures with linear or sub-quadratic complexity
  2. Model-Centric Compression (W): W' = Γ(W), |W'| < |W|
  3. Data-Centric Compression (X): X' = Φ(X), |X'| < |X|
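
As a rough illustration of the three perspectives, the sketch below treats F as a toy linear layer, Γ as dropping output rows of W, and Φ as keeping every second token. The functions `forward`, `compress_model`, and `compress_data` are hypothetical stand-ins chosen for this example, not methods from the paper.

```python
from typing import List

Vector = List[float]


def forward(weights: List[Vector], tokens: List[Vector]) -> List[Vector]:
    """Toy F(W, X): apply one linear map (rows of W) to every token in X."""
    return [
        [sum(w * x for w, x in zip(row, tok)) for row in weights]
        for tok in tokens
    ]


def compress_model(weights: List[Vector], keep_rows: int) -> List[Vector]:
    """Toy Gamma (assumption): keep only the first `keep_rows` rows of W."""
    return weights[:keep_rows]


def compress_data(tokens: List[Vector], stride: int) -> List[Vector]:
    """Toy Phi (assumption): keep every `stride`-th token of X."""
    return tokens[::stride]


W = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]          # 3 output features
X = [[float(i), 1.0] for i in range(8)]           # 8 tokens

Y_full = forward(W, X)                            # Y = F(W, X)
Y_model = forward(compress_model(W, 2), X)        # Y = F(W', X), |W'| < |W|
Y_data = forward(W, compress_data(X, 2))          # Y = F(W, X'), |X'| < |X|

print(len(Y_full), len(Y_full[0]))    # 8 tokens, 3 features
print(len(Y_model), len(Y_model[0]))  # 8 tokens, 2 features
print(len(Y_data), len(Y_data[0]))    # 4 tokens, 3 features
```

Model-centric compression shrinks the work done per token while the sequence length stays fixed; data-centric compression shrinks the sequence itself, which is the quantity that self-attention cost depends on quadratically.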

Data-Centric Compression Architecture

Compression Criteria (E)

Parametric Methods:

  • Training-aware methods: Learn scoring functions through optimizing additional parameters Δθ during training
  • Training-agnostic methods: Directly use pre-trained networks as scoring functions

Non-parametric Methods:

  • Intrinsic computation methods: Utilize internal model computations (e.g., attention weights) for token scoring
  • External computation methods: Design additional metrics to assess token relationships
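
One possible instance of an intrinsic, non-parametric criterion is sketched below: each token is scored by the average attention it receives across queries. The function names and the choice to average over all queries are assumptions for illustration; published methods differ in which layers, heads, and queries they aggregate.

```python
import math
from typing import List

Vector = List[float]


def softmax(logits: List[float]) -> List[float]:
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def attention_scores(queries: List[Vector], keys: List[Vector]) -> List[float]:
    """Score each key token by the mean attention weight it receives."""
    dim = len(keys[0])
    scores = [0.0] * len(keys)
    for q in queries:
        # Scaled dot-product logits between this query and every key.
        logits = [
            sum(a * b for a, b in zip(q, k)) / math.sqrt(dim) for k in keys
        ]
        for j, weight in enumerate(softmax(logits)):
            scores[j] += weight / len(queries)
    return scores


keys = [[1.0, 0.0], [0.0, 1.0], [3.0, 0.0]]
queries = [[1.0, 0.0], [2.0, 0.0]]
scores = attention_scores(queries, keys)
print(scores)  # the third token receives the largest share of attention
```

Because the scores come from computations the model already performs, no extra parameters are trained; the trade-off is that any bias in the attention pattern carries over into the scores.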

Compression Strategies (P)

Token Pruning: Directly discard tokens with low importance

X' = X \ {x_t | s_t < τ}

where s_t is the importance score of token x_t and τ is the pruning threshold.
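
A minimal sketch of this pruning rule, assuming the importance scores are already available as a list aligned with the tokens. The helper name `prune_tokens` and the example scores are hypothetical.

```python
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def prune_tokens(tokens: Sequence[T], scores: Sequence[float], tau: float) -> List[T]:
    """Drop every token whose score is below tau, keeping the original order."""
    if len(tokens) != len(scores):
        raise ValueError("tokens and scores must have the same length")
    return [tok for tok, s in zip(tokens, scores) if s >= tau]


tokens = ["The", "cat", "sat", "on", "the", "mat"]
scores = [0.05, 0.40, 0.30, 0.02, 0.03, 0.20]  # made-up importance scores

kept = prune_tokens(tokens, scores, tau=0.1)
print(kept)  # ['cat', 'sat', 'mat']
```

A fixed threshold yields a variable number of kept tokens per input; keeping the top-k scores instead is a common alternative when a fixed token budget is required.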

Token Merging: Merge tokens through semantic similarity

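x'_m = Σ_{t: π(t)=m} w_t x_t,   w_t = s_t / Σ_{t': π(t')=m} s_{t'}

where π(t) assigns token x_t to merged group m and the weights w_t are the scores s_t normalized within each group.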
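
The merging formula can be sketched as a score-weighted average within each group. This example assumes the group assignment π is given as a list of group indices 0..M-1 and that every group has a positive total score; how tokens are grouped (for example by similarity matching) is outside the sketch, and `merge_tokens` is a hypothetical helper name.

```python
from typing import List

Vector = List[float]


def merge_tokens(
    tokens: List[Vector], scores: List[float], assignment: List[int]
) -> List[Vector]:
    """Merge tokens per group as a score-weighted average of their vectors."""
    n_groups = max(assignment) + 1
    dim = len(tokens[0])

    # Denominator of w_t: total score of the group each token belongs to.
    group_total = [0.0] * n_groups
    for s, m in zip(scores, assignment):
        group_total[m] += s

    merged = [[0.0] * dim for _ in range(n_groups)]
    for x, s, m in zip(tokens, scores, assignment):
        w = s / group_total[m]  # normalized weight within group m
        for i in range(dim):
            merged[m][i] += w * x[i]
    return merged


tokens = [[1.0, 0.0], [3.0, 0.0], [0.0, 2.0]]
scores = [1.0, 3.0, 2.0]
assignment = [0, 0, 1]  # first two tokens merge; the third stays alone

merged = merge_tokens(tokens, scores, assignment)
print(merged)  # [[2.5, 0.0], [0.0, 2.0]]
```

Unlike pruning, merging keeps a contribution from every token, so information from low-scoring tokens is blended rather than discarded.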

Technical Innovations

  1. Dual-stage Efficiency: Accelerates both training and inference phases simultaneously
  2. Architectural Compatibility: Orthogonal to existing compression methods, enabling seamless integration
  3. Quadratic Gains: Because self-attention costs O(n²), reducing the token count from n to m cuts attention computation by roughly a factor of (n/m)²
  4. Universal Applicability: Consistent token redundancy across modalities and tasks
  5. Low Implementation Cost: Modern architectures support variable-length inputs without requiring retraining

Experimental Setup

Datasets and Evaluation

The paper validates the effectiveness of data-centric compression methods through experiments across multiple domains:

Complex Reasoning Tasks:

  • MATH-500, AIME24, GSM8K
  • Model: DeepSeek-R1-Distill-Llama-8B
  • KV cache budget: 1024 tokens

Image Understanding Tasks:

  • GQA, MMB, MMB-CN
  • Model: LLaVA-1.5-7B
  • Retain 25% visual tokens

Video Understanding Tasks:

  • MVBench, MLVU, VideoMME
  • Model: LLaVA-OneVision-7B
  • Retain 15% visual tokens

Image Generation Tasks:

  • Model: FLUX.1-dev (DiT-based)
  • Cache cycle N=4, ratio R=90%

Comparison Methods

  • KV Cache Methods: H2O, SnapKV, KNorm
  • Visual Compression Methods: FastV, SparseVLM, PDrop
  • Baseline Methods: Random dropping, Pooling
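
The two baselines are simple enough to sketch directly. The version below is an assumption about their general form (uniform random selection that preserves order, and non-overlapping average pooling over consecutive tokens); the exact configuration used in the paper's experiments is not stated in this summary.

```python
import random
from typing import List

Vector = List[float]


def random_drop(tokens: List[Vector], keep_ratio: float, seed: int = 0) -> List[Vector]:
    """Keep a uniformly random subset of tokens, preserving original order."""
    n_keep = max(1, round(len(tokens) * keep_ratio))
    rng = random.Random(seed)
    kept_idx = sorted(rng.sample(range(len(tokens)), n_keep))
    return [tokens[i] for i in kept_idx]


def average_pool(tokens: List[Vector], window: int) -> List[Vector]:
    """Average each run of `window` consecutive token vectors."""
    pooled = []
    for start in range(0, len(tokens), window):
        chunk = tokens[start:start + window]
        dim = len(chunk[0])
        pooled.append([sum(v[i] for v in chunk) / len(chunk) for i in range(dim)])
    return pooled


tokens = [[float(i), float(2 * i)] for i in range(8)]

dropped = random_drop(tokens, keep_ratio=0.25)  # 2 of 8 tokens survive
pooled = average_pool(tokens, window=4)         # 8 tokens become 2

print(len(dropped))
print(pooled)  # [[1.5, 3.0], [5.5, 11.0]]
```

Neither baseline looks at token content, so both select or aggregate positions without favoring any region of the sequence.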

Experimental Results

Main Findings

The experiments reveal a counterintuitive phenomenon: carefully designed compression methods underperform random dropping across multiple scenarios.

Complex Reasoning Tasks

  • On AIME24, random dropping achieves 10% higher accuracy than SnapKV
  • H2O, SnapKV, and KNorm consistently underperform random dropping

Image Understanding Tasks

  • Random dropping and pooling operations outperform some designed methods
  • Spatial uniformity mitigates position bias in attention-based methods

Video Understanding Tasks

  • Even retaining only 15% of tokens, random dropping outperforms designed methods
  • Uniform spatiotemporal token distribution is crucial for video representation

Image Generation Tasks

  • All feature-based strategies score lower than random selection
  • Similar token clustering results in the worst generation quality

Performance Analysis

Data-centric compression yields significant computational and memory benefits:

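  • Computational Complexity: Ω(X')/Ω(X) = O(m²/n²)
  • Memory Usage: M(X')/M(X) ≈ m/n
  • KV Cache Optimization: M_KV(X')/M_KV(X) = m/n

Here n = |X| is the original token count and m = |X'| is the token count after compression.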
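
As a worked example of these ratios, the sketch below plugs in n original tokens and m retained tokens. The helper `compression_ratios` is hypothetical, the token count of 576 is an illustrative value, and the compute ratio covers only the quadratic attention term, not the parts of the model that scale linearly with sequence length.

```python
from typing import Dict


def compression_ratios(n: int, m: int) -> Dict[str, float]:
    """Cost of processing m tokens relative to n tokens, per the ratios above."""
    if not 0 < m <= n:
        raise ValueError("expected 0 < m <= n")
    return {
        "attention_compute": (m / n) ** 2,  # quadratic in sequence length
        "memory": m / n,                    # linear in sequence length
        "kv_cache": m / n,                  # one key/value pair per token
    }


n = 576          # illustrative number of visual tokens (assumption)
m = n // 4       # retain 25% of tokens, i.e. 144

ratios = compression_ratios(n, m)
print(ratios)  # {'attention_compute': 0.0625, 'memory': 0.25, 'kv_cache': 0.25}
```

Retaining a quarter of the tokens therefore leaves about one sixteenth of the attention computation but a quarter of the memory, which is why the compute savings grow faster than the memory savings as compression becomes more aggressive.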

Classification of Efficiency Optimization Methods

  1. Efficient Architectures: Linear Attention, RWKV, State Space Models (Mamba)
  2. Model Compression: Pruning, quantization, distillation, low-rank decomposition
  3. Data Compression: Dataset compression, token compression

Positioning of This Work's Contributions

  • First systematic positioning of data-centric compression as a new paradigm for AI efficiency
  • Establishes a unified theoretical framework integrating various efficiency strategies
  • Provides comprehensive cross-domain analysis and evaluation

Conclusions and Discussion

Main Conclusions

  1. Paradigm Shift: The focus of AI efficiency research should shift from model-centric to data-centric compression
  2. Method Limitations: Current attention-based compression methods suffer from fundamental position bias issues
  3. Design Principles: Spatial and temporal uniformity are key design principles for effective compression

Current Challenges

Performance Degradation Issues

  • Methodological Bottlenecks: Position bias in attention scores affects compression effectiveness
  • Inherent Limitations: Certain tasks (e.g., visual localization, OCR parsing) are sensitive to compression

Suboptimal Data Representation

  • Neither redundancy-based nor importance-based methods guarantee that the retained tokens form an optimal representation for downstream modeling
  • Lack of consideration for sequence structure and semantic pattern stability

Evaluation Fairness

  • FLOPs and compression ratios do not accurately reflect actual acceleration effects
  • Lack of specialized benchmarks for compression evaluation

Future Directions

Data-Model Collaborative Compression

  • Staged Integration: Model compression followed by data compression
  • Mutual Enhancement: Utilize gradient information to guide token selection, employ token evolution to guide layer pruning

Specialized Evaluation Benchmarks

  • Cross-domain task coverage (NLP, CV, multimodal)
  • Compression-sensitive tasks (OCR, ASR)
  • Joint performance-latency evaluation

In-Depth Evaluation

Strengths

  1. Forward-Looking Insights: Accurately identifies critical trend shifts in AI development and proposes forward-thinking research paradigms
  2. Theoretical Contributions: Establishes a unified mathematical framework providing theoretical foundations for different efficiency strategies
  3. Comprehensive Analysis: Conducts systematic method classification and analysis across multiple domains and tasks
  4. Empirical Findings: Through extensive experiments, reveals fundamental problems in current methods, providing important insights for field development
  5. Writing Quality: Clear logic, accurate expression, rich figures and tables, easy to understand

Limitations

  1. Theoretical Depth: While providing a unified framework, theoretical analysis of data-centric compression lacks sufficient depth
  2. Method Innovation: Primarily a survey work, lacking specific novel method proposals
  3. Experimental Scope: Experiments mainly focus on verifying problems with existing methods, lacking exploration of solutions
  4. Quantitative Analysis: Theoretical complexity analysis of different compression methods lacks sufficient detail

Impact

  1. Field Contribution: Provides new perspectives and directions for AI efficiency research, potentially leading to paradigm shifts in research focus
  2. Practical Value: Analysis results provide important guidance for practical deployment, particularly in resource-constrained environments
  3. Reproducibility: Provides detailed experimental settings and GitHub projects, facilitating subsequent research
  4. Inspirational Value: Revealed problems and proposed directions provide clear roadmaps for future research

Applicable Scenarios

  1. Long-Context Applications: Particularly suitable for scenarios requiring processing long text, high-resolution images, or extended videos
  2. Resource-Constrained Environments: Holds significant value in scenarios with limited computational resources such as mobile devices and edge computing
  3. Real-Time Interactive Systems: UI agents, autonomous driving, embodied AI and other systems requiring efficient continuous input processing
  4. Large-Scale Deployment: Efficiency optimization for cloud service providers in large-scale model deployment

References

The paper cites extensive related work, primarily including:

  • Transformer architectures and variants (Vaswani et al., 2017)
  • Large language model series (OpenAI GPT, Meta LLaMA, Qwen, etc.)
  • Multimodal models (LLaVA, InternVL, etc.)
  • Efficiency optimization methods (classical works on quantization, pruning, distillation, etc.)
  • Representative works on data-centric compression

This paper provides an important theoretical framework and practical guidance for AI efficiency research, possessing significant academic value and practical significance.