Existing deepfake analysis methods are primarily based on discriminative models, which significantly limits their application scenarios. This paper aims to explore interactive deepfake analysis by performing instruction tuning on multi-modal large language models (MLLMs). This direction faces challenges such as the lack of datasets and benchmarks, and low training efficiency. To address these issues, we introduce (1) a GPT-assisted data construction process resulting in an instruction-following dataset called DFA-Instruct, (2) a benchmark named DFA-Bench, designed to comprehensively evaluate the capabilities of MLLMs in deepfake detection, deepfake classification, and artifact description, and (3) an interactive deepfake analysis system called DFA-GPT, built with the Low-Rank Adaptation (LoRA) module, as a strong baseline for the community. The dataset and code will be made available at https://github.com/lxq1000/DFA-Instruct to facilitate further research.
Existing deepfake analysis methods are primarily based on discriminative models, which significantly limits their application scenarios. This paper aims to explore interactive deepfake analysis through instruction tuning of multimodal large language models (MLLMs). The research faces challenges including dataset scarcity, benchmark deficiency, and low training efficiency. To address these issues, the authors propose: (1) a GPT-assisted data construction process that produces the DFA-Instruct instruction-following dataset; (2) the DFA-Bench benchmark for comprehensive evaluation of MLLMs' capabilities in deepfake detection, classification, and artifact description; (3) the DFA-GPT interactive deepfake analysis system, employing Low-Rank Adaptation (LoRA) modules as a strong baseline for the community.
With the rapid development of AI-generated content (AIGC), the boundary between fiction and reality has become blurred. Unauthorized deepfake images or videos may be used for malicious purposes such as opinion manipulation, cyberbullying, extortion, and evidence fabrication. Deepfake analysis (DFA) is crucial for regulating and mitigating the potential negative impacts of deepfake technology.
Current deepfake analysis methods primarily rely on discriminative models for deepfake detection and classification, which restricts their application scope. Traditional approaches can only provide simple binary classification results (authentic/forged) or technical categories, without offering detailed artifact descriptions or enabling interactive dialogue.
In critical domains such as social security, personal privacy protection, and forensic investigation, interactive deepfake analysis systems can provide human experts with clues requiring further manual examination, significantly improving work efficiency. Multimodal large language models have achieved remarkable success in describing and reasoning about fine-grained, complex visual cues, which makes them a suitable foundation for instruction-tuned interactive deepfake analysis systems.
First proposal of the interactive deepfake analysis concept: Defines four core capabilities, namely deepfake detection (DF-D), deepfake classification (DF-C), artifact description (AD), and free-form conversation (FC)
Construction of the large-scale instruction-following dataset DFA-Instruct: Contains 127.3K aligned facial images and 891.6K question-answer pairs, built with a GPT-assisted data construction pipeline
Establishment of the comprehensive evaluation benchmark DFA-Bench: The first to provide an evaluation framework for artifact description tasks in deepfake analysis
Development of the DFA-GPT system: Employs a LoRA-based efficient training strategy to construct an interactive deepfake analysis system with limited computational resources
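To make the LoRA-based training strategy named above more concrete, the following is a minimal, dependency-free sketch of a single LoRA-adapted linear layer. It illustrates the general LoRA technique only and is not the DFA-GPT implementation: the class name `LoRALinear`, the helper functions, and the `rank`, `alpha`, and initialization values are all assumptions chosen for illustration. The pretrained weight stays frozen, and only the two small matrices `A` and `B` would be trained.

```python
import random


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(m):
    """Return the transpose of a matrix stored as a list of rows."""
    return [list(col) for col in zip(*m)]


class LoRALinear:
    """Hypothetical linear layer computing x W^T + (alpha / rank) * x A^T B^T."""

    def __init__(self, weight, rank=4, alpha=8.0, seed=0):
        rng = random.Random(seed)
        out_features, in_features = len(weight), len(weight[0])
        self.weight = weight  # frozen pretrained weight, shape (out, in)
        # Common LoRA initialization: A is small random, B is zero, so the
        # adapted layer initially reproduces the frozen layer exactly.
        self.A = [[rng.gauss(0.0, 0.01) for _ in range(in_features)]
                  for _ in range(rank)]
        self.B = [[0.0] * rank for _ in range(out_features)]
        self.scale = alpha / rank  # assumed scaling convention

    def forward(self, x):
        """Apply the layer to a batch x of shape (n, in)."""
        base = matmul(x, transpose(self.weight))
        low_rank = matmul(matmul(x, transpose(self.A)), transpose(self.B))
        return [[b + self.scale * d for b, d in zip(b_row, d_row)]
                for b_row, d_row in zip(base, low_rank)]

    def trainable_params(self):
        """Count the entries of A and B, the only parameters that are trained."""
        return len(self.A) * len(self.A[0]) + len(self.B) * len(self.B[0])


# Toy usage: a 2x3 frozen weight with a rank-1 adapter.
layer = LoRALinear([[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]], rank=1)
output = layer.forward([[1.0, 1.0, 1.0]])
```

For a weight of shape (out, in), the adapter adds rank × (in + out) trainable values instead of out × in. In a toy case like this one the difference is small, but for the large projection matrices in an MLLM a small rank reduces the trainable parameter count substantially, which is consistent with the limited-compute setting described above.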
Effectiveness of Language Supervision: Introducing an LLM and natural language supervision significantly enhances the robustness of the deepfake analysis system
Benefits of Multi-task Learning: Additional supervision signals contribute to building more robust deepfake analysis systems
Insufficiency of General MLLMs: Existing advanced MLLMs exhibit significant deficiencies in deepfake understanding
The paper cites 48 relevant references covering important works in deepfake technology, detection methods, multimodal large language models, and instruction tuning, providing a solid theoretical foundation for the research.
Overall Assessment: This is a high-quality, pioneering paper that systematically explores interactive deepfake analysis for the first time. It performs well in technical innovation, experimental design, and practical value, and makes important contributions to the development of the deepfake analysis field. Despite certain limitations, its pioneering research approach and systematic solutions give it significant academic and practical value.