
Shifting AI Efficiency From Model-Centric to Data-Centric Compression

Liu, Wen, Wang et al.

The advancement of large language models (LLMs) and multi-modal LLMs (MLLMs) has historically relied on scaling model parameters. However, as hardware limits constrain further model growth, the primary computational bottleneck has shifted to the quadratic cost of self-attention over the increasingly long sequences produced by ultra-long text contexts, high-resolution images, and extended videos. In this position paper, we argue that the focus of research for efficient artificial intelligence (AI) is shifting from model-centric compression to data-centric compression. We position data-centric compression as the emerging paradigm, which improves AI efficiency by directly compressing the volume of data processed during model training or inference. To formalize this shift, we establish a unified framework for existing efficiency strategies and demonstrate why it constitutes a crucial paradigm change for long-context AI. We then systematically review the landscape of data-centric compression methods, analyzing their benefits across diverse scenarios. Finally, we outline key challenges and promising future research directions. Our work aims to provide a novel perspective on AI efficiency, synthesize existing efforts, and catalyze innovation to address the challenges posed by ever-increasing context lengths.

Basic Information

Paper ID: 2505.19147
Title: Shifting AI Efficiency From Model-Centric to Data-Centric Compression
Authors: Xuyang Liu, Zichen Wen, Shaobo Wang, Junjie Chen, Zhishan Tao, Yubo Wang, Tailai Chen, Xiangqi Jin, Chang Zou, Yiyu Wang, Chenfei Liao, Xu Zheng, Honggang Chen, Weijia Li, Xuming Hu, Conghui He, Linfeng Zhang
Classification: cs.CL, cs.AI, cs.CV
Publication Date/Venue: arXiv preprint (May 2025)
Paper Link: https://arxiv.org/abs/2505.19147

Abstract

With the development of Large Language Models (LLMs) and Multimodal Large Language Models (MLLMs), traditional approaches that rely on expanding model parameters to improve performance face hardware constraints. The primary computational bottleneck has shifted from model scale to the quadratic complexity overhead of self-attention mechanisms when processing ultra-long text contexts, high-resolution images, and extended videos. This paper proposes that the focus of AI efficiency research should shift from model-centric compression to data-centric compression. Data-centric compression improves AI efficiency by directly compressing the volume of data processed during training or inference. The paper establishes a unified efficiency strategy framework, systematically reviews the landscape of data-centric compression methods, analyzes their advantages across different scenarios, and outlines key challenges and future research directions.

Research Background and Motivation

Problem Definition

The core problem addressed in this paper is: as the context length processed by AI models grows dramatically, how can we effectively address the resulting computational efficiency challenges?

Importance Analysis

Technological Trend Shifts: From 2022 to 2024, AI performance improvements relied primarily on model scale expansion. By 2024, model scale growth had plateaued (at approximately 1T parameters), while context length continues to grow exponentially
Computational Bottleneck Migration: The primary computational overhead has shifted from a cost that grows linearly with parameter count to the quadratic O(n²) cost of self-attention in the sequence length n
Cross-domain Requirements: Language models require processing longer reasoning chains, vision models need to handle higher-resolution images and longer videos, and generative models need to create higher-quality content

Limitations of Existing Methods

Traditional model-centric compression methods (quantization, pruning, distillation, low-rank decomposition) primarily optimize model parameters W, but cannot effectively address challenges posed by growing context lengths. These methods still require processing complete input data X when facing long sequences, failing to fundamentally solve the quadratic complexity problem.

Research Motivation

Based on in-depth analysis of AI development trends, the authors propose data-centric compression as an emerging paradigm that addresses long-context challenges by directly reducing the volume of processed data, offering superior generality, efficiency, and compatibility.

Core Contributions

Paradigm Shift Analysis: Analyzes the critical transition in AI efficiency research from parameter-centric to context-centric computational bottlenecks, arguing for the necessity of efficiency optimization paradigm transformation
Unified Theoretical Framework: Establishes a unified mathematical formulation framework encompassing architectural design, model-centric compression, and data-centric compression
Systematic Survey: Conducts comprehensive investigation of data-centric compression methods, constructs a unified classification framework, and analyzes advantages across different scenarios
Challenges and Directions: Provides in-depth analysis of current challenges and proposes promising future research directions, aiming to catalyze innovation in this field

Methodology Details

Task Definition

Data-centric compression aims to transform the original input sequence X into a compressed representation X' through compression operation Φ, satisfying |X'| < |X|, while maintaining model performance as much as possible.

Unified Framework

Given input data X and network parameters W, the output of neural network F is:

Y = F(W, X)

Efficiency optimization can be approached from three perspectives:

Efficient Computational Architecture (F): Design architectures with linear or sub-quadratic complexity
Model-Centric Compression (W): W' = Γ(W), |W'| < |W|
Data-Centric Compression (X): X' = Φ(X), |X'| < |X|

Data-Centric Compression Architecture

Compression Criteria (E)

Parametric Methods:

Training-aware methods: Learn scoring functions through optimizing additional parameters Δθ during training
Training-agnostic methods: Directly use pre-trained networks as scoring functions

Non-parametric Methods:

Intrinsic computation methods: Utilize internal model computations (e.g., attention weights) for token scoring
External computation methods: Design additional metrics to assess token relationships
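
As a rough illustration of the intrinsic computation category, the sketch below scores each token by the average attention it receives. The function name attention_received_scores and the toy attention matrix are assumptions made for this example; they are not taken from the paper or from any specific method it surveys.

```python
from typing import List


def attention_received_scores(attn: List[List[float]]) -> List[float]:
    """Score each key token by the mean attention it receives.

    attn[q][k] is the (softmax-normalized) attention from query q to key k.
    Assumption: a single head and a single layer, for illustration only.
    """
    n_queries = len(attn)
    n_keys = len(attn[0])
    return [sum(row[k] for row in attn) / n_queries for k in range(n_keys)]


# Hypothetical 3x3 attention matrix; each row sums to 1.
attn = [
    [0.5, 0.25, 0.25],
    [0.5, 0.5, 0.0],
    [0.5, 0.25, 0.25],
]
scores = attention_received_scores(attn)
print(scores)
```

Scores of this kind can then feed a pruning or merging strategy. Real methods typically aggregate over heads and layers, which this sketch leaves out.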

Compression Strategies (P)

Token Pruning: Directly discard tokens with low importance

X' = X \ {x_t | s_t < τ}, where s_t is the importance score of token x_t and τ is the pruning threshold
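
The pruning rule can be sketched in a few lines. This is a minimal example under stated assumptions: prune_tokens, the toy tokens, and the scores are hypothetical, and tokens with a score exactly equal to the threshold are kept because only s_t < τ is discarded.

```python
from typing import List, Sequence, Tuple


def prune_tokens(
    tokens: Sequence, scores: Sequence[float], tau: float
) -> Tuple[list, List[int]]:
    """Drop every token whose score is strictly below tau.

    Returns the surviving tokens (original order preserved) and their indices.
    """
    kept_idx = [t for t, s in enumerate(scores) if s >= tau]
    return [tokens[t] for t in kept_idx], kept_idx


# Hypothetical tokens and importance scores.
tokens = ["a", "b", "c", "d"]
scores = [0.9, 0.1, 0.5, 0.3]
kept, kept_idx = prune_tokens(tokens, scores, tau=0.3)
print(kept, kept_idx)
```

Returning the kept indices alongside the tokens is a design choice of this sketch: downstream code usually needs them to keep positional information aligned.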

Token Merging: Merge tokens through semantic similarity

x'_m = Σ_{t: π(t)=m} w_t x_t, w_t = s_t / Σ_{t': π(t')=m} s_t', where π(t) = m assigns token t to merged group m
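
The merging formula is a score-weighted average within each group. The sketch below follows it term by term; merge_tokens and the toy inputs are hypothetical, the group assignment is passed in directly rather than derived from a similarity measure, and the scores in each group are assumed to sum to a positive value.

```python
from typing import Dict, List


def merge_tokens(
    tokens: List[List[float]], scores: List[float], assign: List[int]
) -> List[List[float]]:
    """Merge tokens group by group using score-normalized weights.

    assign[t] plays the role of pi(t): the index of the group token t joins.
    Assumption: the scores within each group sum to a positive number.
    """
    groups: Dict[int, List[int]] = {}
    for t, m in enumerate(assign):
        groups.setdefault(m, []).append(t)

    merged = []
    for m in sorted(groups):
        idx = groups[m]
        total = sum(scores[t] for t in idx)
        dim = len(tokens[idx[0]])
        merged.append(
            [sum(scores[t] / total * tokens[t][d] for t in idx) for d in range(dim)]
        )
    return merged


# Hypothetical 2-D token embeddings: tokens 0 and 1 merge, token 2 stays alone.
tokens = [[1.0, 0.0], [3.0, 0.0], [0.0, 2.0]]
scores = [1.0, 3.0, 2.0]
assign = [0, 0, 1]
merged = merge_tokens(tokens, scores, assign)
print(merged)
```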

Technical Innovations

Dual-stage Efficiency: Accelerates both training and inference phases simultaneously
Architectural Compatibility: Orthogonal to existing compression methods, enabling seamless integration
Quadratic Gains: Because self-attention costs O(n²) in the sequence length n, reducing the number of tokens yields quadratic savings in attention computation
Universal Applicability: Consistent token redundancy across modalities and tasks
Low Implementation Cost: Modern architectures support variable-length inputs without requiring retraining

Experimental Setup

Datasets and Evaluation

The paper validates the effectiveness of data-centric compression methods through experiments across multiple domains:

Complex Reasoning Tasks:

MATH-500, AIME24, GSM8K
Model: DeepSeek-R1-Distill-Llama-8B
KV cache budget: 1024 tokens

Image Understanding Tasks:

GQA, MMB, MMB-CN
Model: LLaVA-1.5-7B
Retain 25% visual tokens

Video Understanding Tasks:

MVBench, MLVU, VideoMME
Model: LLaVA-OneVision-7B
Retain 15% visual tokens

Image Generation Tasks:

Model: FLUX.1-dev (DiT-based)
Cache cycle N=4, ratio R=90%

Comparison Methods

KV Cache Methods: H2O, SnapKV, KNorm
Visual Compression Methods: FastV, SparseVLM, PDrop
Baseline Methods: Random dropping, Pooling
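
For orientation, the two baselines can be approximated as follows. This is a sketch, not the paper's implementation: random_drop and average_pool are hypothetical names, the seed handling is an assumption, and pooling is done over a 1-D token sequence, whereas image tokens would more plausibly be pooled over a 2-D grid.

```python
import random
from typing import List


def random_drop(tokens: list, keep_ratio: float, seed: int = 0) -> list:
    """Keep a uniformly random subset of tokens, preserving their order."""
    rng = random.Random(seed)
    n_keep = max(1, round(len(tokens) * keep_ratio))
    kept_idx = sorted(rng.sample(range(len(tokens)), n_keep))
    return [tokens[i] for i in kept_idx]


def average_pool(tokens: List[List[float]], window: int) -> List[List[float]]:
    """Average every `window` consecutive tokens into one token."""
    pooled = []
    for start in range(0, len(tokens), window):
        chunk = tokens[start:start + window]
        dim = len(chunk[0])
        pooled.append([sum(v[d] for v in chunk) / len(chunk) for d in range(dim)])
    return pooled


# Hypothetical sequence of eight 1-D tokens.
tokens = [[float(i)] for i in range(8)]
dropped = random_drop(tokens, keep_ratio=0.25)
pooled = average_pool(tokens, window=4)
print(dropped, pooled)
```

Neither baseline looks at token content, which is why both spread the retained tokens without the position bias that attention-based scores can introduce.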

Experimental Results

Main Findings

The experiments reveal a counterintuitive phenomenon: carefully designed compression methods underperform random dropping across multiple scenarios.

Complex Reasoning Tasks

On AIME24, random dropping achieves 10% higher accuracy than SnapKV
H2O, SnapKV, and KNorm consistently underperform random dropping

Image Understanding Tasks

Random dropping and pooling operations outperform some designed methods
Spatial uniformity mitigates position bias in attention-based methods

Video Understanding Tasks

Even retaining only 15% of tokens, random dropping outperforms designed methods
Uniform spatiotemporal token distribution is crucial for video representation

Image Generation Tasks

All feature-based strategies score lower than random selection
Similar token clustering results in the worst generation quality

Performance Analysis

Data-centric compression yields significant computational and memory benefits:

Computational Complexity: Ω(X')/Ω(X) = O(m²/n²)
Memory Usage: M(X')/M(X) ≈ m/n
KV Cache Optimization: M_KV(X')/M_KV(X) = m/n

Here n = |X| is the original token count and m = |X'| is the token count after compression.
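
To make these ratios concrete, the sketch below evaluates them for a given original token count n and compressed token count m. The function compression_ratios and the token counts are hypothetical, and the compute ratio models only the quadratic self-attention term.

```python
def compression_ratios(n: int, m: int) -> dict:
    """Idealized savings when n tokens are compressed to m tokens.

    Assumption: compute is dominated by self-attention (quadratic in tokens),
    while activation memory and KV cache scale linearly with token count.
    """
    if not 0 < m <= n:
        raise ValueError("expected 0 < m <= n")
    keep = m / n
    return {
        "attention_compute": keep ** 2,
        "memory": keep,
        "kv_cache": keep,
    }


# Hypothetical sequence of 1000 tokens, keeping 25% and then 15% of them.
ratios_25 = compression_ratios(n=1000, m=250)
ratios_15 = compression_ratios(n=1000, m=150)
print(ratios_25)
print(ratios_15)
```

Under this idealized model, keeping 25% of the tokens leaves roughly 6% of the attention cost. End-to-end savings are generally smaller, since the layers outside attention scale linearly with token count.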

Classification of Efficiency Optimization Methods

Efficient Architectures: Linear Attention, RWKV, State Space Models (Mamba)
Model Compression: Pruning, quantization, distillation, low-rank decomposition
Data Compression: Dataset compression, token compression

Positioning of This Work's Contributions

First systematic positioning of data-centric compression as a new paradigm for AI efficiency
Establishes a unified theoretical framework integrating various efficiency strategies
Provides comprehensive cross-domain analysis and evaluation

Conclusions and Discussion

Main Conclusions

Paradigm Shift: The focus of AI efficiency research should shift from model-centric to data-centric compression
Method Limitations: Current attention-based compression methods suffer from fundamental position bias issues
Design Principles: Spatial and temporal uniformity are key design principles for effective compression
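
One simple way to obtain the uniformity mentioned above is to keep evenly spaced positions instead of the highest-scoring ones. The sketch below is only an illustration of that idea; uniform_keep_indices is a hypothetical helper and the paper does not prescribe this particular scheme.

```python
from typing import List


def uniform_keep_indices(n: int, m: int) -> List[int]:
    """Return m indices spread evenly over range(n).

    Each kept index is the midpoint of one of m equal-width bins, so no region
    of the sequence (or flattened frame grid) is favoured over another.
    """
    if not 0 < m <= n:
        raise ValueError("expected 0 < m <= n")
    return [(2 * i + 1) * n // (2 * m) for i in range(m)]


# Hypothetical example: keep 5 of 10 positions.
kept = uniform_keep_indices(n=10, m=5)
print(kept)
```

For a video, the same helper could be applied along the frame axis and again within each frame, under the assumption that tokens are stored in raster order.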

Current Challenges

Performance Degradation Issues

Methodological Bottlenecks: Position bias in attention scores affects compression effectiveness
Inherent Limitations: Certain tasks (e.g., visual localization, OCR parsing) are sensitive to compression

Suboptimal Data Representation

Both redundancy-based and importance-based methods cannot guarantee optimal downstream modeling representations
Lack of consideration for sequence structure and semantic pattern stability

Evaluation Fairness

FLOPs and compression ratios do not accurately reflect actual acceleration effects
Lack of specialized benchmarks for compression evaluation
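
Since FLOPs and compression ratios are only proxies, a fair comparison would also measure wall-clock latency. The harness below is a minimal sketch of that idea: median_latency and toy_quadratic_workload are hypothetical, and the nested loop merely stands in for an attention-like O(n²) computation rather than a real model.

```python
import time
from typing import Callable


def median_latency(fn: Callable[[], object], warmup: int = 2, repeats: int = 5) -> float:
    """Median wall-clock time of fn() in seconds, after a few warm-up calls."""
    for _ in range(warmup):
        fn()
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        samples.append(time.perf_counter() - start)
    samples.sort()
    return samples[len(samples) // 2]


def toy_quadratic_workload(n: int) -> int:
    """Stand-in for an O(n^2) computation over n tokens (not a real model)."""
    total = 0
    for i in range(n):
        for j in range(n):
            total += i ^ j
    return total


# Hypothetical comparison: 400 tokens versus 100 tokens after compression.
full = median_latency(lambda: toy_quadratic_workload(400))
compressed = median_latency(lambda: toy_quadratic_workload(100))
print(f"full: {full:.6f}s, compressed: {compressed:.6f}s")
```

In a real evaluation, the time spent computing scores and rearranging tokens belongs inside the timed region, because that overhead is exactly what FLOP counts tend to hide.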

Future Directions

Data-Model Collaborative Compression

Staged Integration: Model compression followed by data compression
Mutual Enhancement: Utilize gradient information to guide token selection, employ token evolution to guide layer pruning

Specialized Evaluation Benchmarks

Cross-domain task coverage (NLP, CV, multimodal)
Compression-sensitive tasks (OCR, ASR)
Joint performance-latency evaluation

In-Depth Evaluation

Strengths

Forward-Looking Insights: Accurately identifies critical trend shifts in AI development and proposes forward-thinking research paradigms
Theoretical Contributions: Establishes a unified mathematical framework providing theoretical foundations for different efficiency strategies
Comprehensive Analysis: Conducts systematic method classification and analysis across multiple domains and tasks
Empirical Findings: Through extensive experiments, reveals fundamental problems in current methods, providing important insights for field development
Writing Quality: Clear logic, accurate expression, rich figures and tables, easy to understand

Limitations

Theoretical Depth: While providing a unified framework, theoretical analysis of data-centric compression lacks sufficient depth
Method Innovation: Primarily a survey work, lacking specific novel method proposals
Experimental Scope: Experiments mainly focus on verifying problems with existing methods, lacking exploration of solutions
Quantitative Analysis: Theoretical complexity analysis of different compression methods lacks sufficient detail

Impact

Field Contribution: Provides new perspectives and directions for AI efficiency research, potentially leading to paradigm shifts in research focus
Practical Value: Analysis results provide important guidance for practical deployment, particularly in resource-constrained environments
Reproducibility: Provides detailed experimental settings and GitHub projects, facilitating subsequent research
Inspirational Value: Revealed problems and proposed directions provide clear roadmaps for future research

Applicable Scenarios

Long-Context Applications: Particularly suitable for scenarios requiring processing long text, high-resolution images, or extended videos
Resource-Constrained Environments: Holds significant value in scenarios with limited computational resources such as mobile devices and edge computing
Real-Time Interactive Systems: UI agents, autonomous driving, embodied AI and other systems requiring efficient continuous input processing
Large-Scale Deployment: Efficiency optimization for cloud service providers in large-scale model deployment

References

The paper cites extensive related work, primarily including:

Transformer architectures and variants (Vaswani et al., 2017)
Large language model series (OpenAI GPT, Meta LLaMA, Qwen, etc.)
Multimodal models (LLaVA, InternVL, etc.)
Efficiency optimization methods (classical works on quantization, pruning, distillation, etc.)
Representative works on data-centric compression

This paper provides an important theoretical framework and practical guidance for AI efficiency research, possessing significant academic value and practical significance.