2025-11-18T08:22:12.824474

Can Prompts Rewind Time for LLMs? Evaluating the Effectiveness of Prompted Knowledge Cutoffs

Gao, Zhang, Du et al.

Large Language Models (LLMs) are widely used for temporal prediction, but their reliance on pretraining data raises contamination concerns, as accurate predictions on pre-cutoff test data may reflect memorization rather than reasoning, leading to an overestimation of their generalization capability. With the recent emergence of prompting-based unlearning techniques, a natural question arises: Can LLMs be prompted to simulate an earlier knowledge cutoff? In this work, we investigate the capability of prompting to simulate earlier knowledge cutoff in LLMs. We construct three evaluation datasets to assess the extent to which LLMs can forget (1) direct factual knowledge, (2) semantic shifts, and (3) causally related knowledge. Results demonstrate that while prompt-based simulated knowledge cutoffs show effectiveness when directly queried with the information after that date, they struggle to induce forgetting when the forgotten content is not directly asked but causally related to the query. These findings highlight the need for more rigorous evaluation settings when applying LLMs for temporal prediction tasks. The full dataset and evaluation code are available at https://github.com/gxx27/time_unlearn.

academic

Can Prompts Rewind Time for LLMs? Evaluating the Effectiveness of Prompted Knowledge Cutoffs

Basic Information

Paper ID: 2510.02340
Title: Can Prompts Rewind Time for LLMs? Evaluating the Effectiveness of Prompted Knowledge Cutoffs
Authors: Xin Gao, Ruiyi Zhang, Daniel Du, Saurabh Mahindre, Sai Ashish Somayajula, Pengtao Xie
Institutions: UC San Diego, SUNY Buffalo
Classification: cs.CL cs.LG
Publication Date: October 15, 2025 (arXiv v2)
Paper Link: https://arxiv.org/abs/2510.02340

Abstract

Large Language Models (LLMs) are widely applied to temporal prediction tasks, yet their reliance on pretraining data raises concerns about data contamination. Accurate predictions on test data preceding the pretraining cutoff date may reflect memorization rather than reasoning, leading to overestimation of generalization capabilities. With the emergence of prompt-based forgetting techniques, a natural question arises: can LLMs be prompted to simulate earlier knowledge cutoff dates? This study investigates the capability of prompts to simulate earlier knowledge cutoffs by constructing three evaluation datasets to assess the extent to which LLMs can forget (1) direct factual knowledge, (2) semantic shifts, and (3) causally-related knowledge. Results demonstrate that while prompt-based simulated knowledge cutoffs are effective when directly querying information after the specified date, they struggle to induce forgetting when the content to be forgotten is not directly queried but is causally related to the query.

Research Background and Motivation

Core Problem

The core problem addressed by this research is the impact of data contamination on the evaluation of LLMs in temporal prediction tasks. When LLMs are applied to time-series tasks such as stock prediction and event forecasting, models may have already encountered test data during the pretraining phase, leading to inflated prediction performance.

Problem Significance

Evaluation Credibility: Traditional machine learning models (e.g., random forests) are trained from scratch, making test performance trustworthy; however, LLMs may have already seen data from the test period, leading to unfair comparisons
Generalization Capability Misjudgment: Memory-based "predictions" overestimate the model's true generalization ability
Practical Application Risk: When deployed in real-world scenarios, model performance may fall significantly short of expectations

Limitations of Existing Methods

Traditional machine unlearning methods require retraining or fine-tuning, incurring high computational costs
Lack of effective methods to simulate LLMs' historical knowledge states
Existing evaluation frameworks insufficiently address temporal contamination issues

Research Motivation

Inspired by recently emerging in-context forgetting techniques, the authors propose adjusting LLMs' knowledge cutoff dates through prompt engineering to make them "forget" all information after a specified date, thereby enabling fairer temporal prediction evaluation.

Core Contributions

First Systematic Study: First systematic evaluation of the effectiveness of simulating LLMs' knowledge cutoff dates through prompting
Multi-dimensional Evaluation Framework: Construction of three datasets across different dimensions to comprehensively assess forgetting capabilities:
- Factual subset: Direct factual knowledge forgetting
- Semantic subset: Semantic shift forgetting
- Counterfactual subset: Causally-related knowledge forgetting
Important Findings: Reveals limitations of prompt-based forgetting—significant performance degradation in causal reasoning scenarios
Evaluation Benchmark: Provides high-quality datasets and evaluation code, establishing a foundation for future research
Practical Guidance: Provides methodological guidance for rigorous evaluation of LLMs in temporal prediction tasks

Methodology Details

Task Definition

Input: User query + system prompt specifying simulated knowledge cutoff date Output: Response conforming to the knowledge state of the specified cutoff date Constraint: Model should "forget" all information after the cutoff date

Prompt Design Strategies

Prompt P1: Knowledge Filtering Type

You must answer this question using only knowledge that was publicly available before the year {unlearn_year}. 
Specifically, your memory ends on December 31, {unlearn_year_minus_1}, and you have no access to anything that occurred in {unlearn_year} or afterward.

Prompt P2: Reasoning Constraint Type

You may think step by step internally, but your memory ends on December 31, {unlearn_year_minus_1}. 
You are strictly forbidden from referencing or reasoning about any information, event, or trend that emerged in {unlearn_year} or later.

Evaluation Method Design

Forgetting Success Rate Calculation

For Factual and Counterfactual subsets, multiple-choice format is used, with forgetting success defined as the model changing its original answer.

For the Semantic subset, semantic similarity is employed: $\text{Success} = \frac{\cos(o_a, y_a)}{\cos(o_a, y_a) + \cos(o_a, y_b)} > \frac{\cos(o_b, y_a)}{\cos(o_b, y_a) + \cos(o_b, y_b)}$

where $o_a, o_b$ represent outputs before and after forgetting, and $y_a, y_b$ represent ground truth answers before and after the cutoff date.

Experimental Setup

Dataset Construction

Factual Subset (675 samples)

Objective: Evaluate direct factual knowledge forgetting
Construction Method: Use GPT-4o to generate major historical events and corresponding Q&A pairs since 1960
Time Span: 1960-2024
Example: Querying the U.S. President at a specific time point should return the incumbent at that time rather than the current president

Semantic Subset (303 samples)

Objective: Evaluate semantic shift forgetting
Construction Method: Collect words with semantic changes, such as "TikTok" evolving from an onomatopoeia to a social media platform
Time Span: 2000-2024
Evaluation: Use MPNet model to compute semantic similarity

Counterfactual Subset (689 samples)

Objective: Evaluate causally-related knowledge forgetting
Construction Method: Construct counterfactual prediction scenarios based on major events
Time Span: 2000-2024
Example: Predict the year of the Tokyo Olympics with a 2018 cutoff (should answer 2020 rather than the actual 2021)

Experimental Models

DeepSeek-V3: Latest open-source model
LLaMA-3.1-405B: Meta's large-scale model
GPT-4o: OpenAI's multimodal model
DeepSeek-R1 & OpenAI o3: Reasoning-enhanced models (comparative experiments)

Evaluation Metrics

Primary Metric: Unlearn Success Rate
Calculation Method: Number of successfully forgotten samples / Total number of samples

Experimental Results

Main Results

Model	Factual	Semantic	Counterfactual
DeepSeek-V3	79.0%	57.5%	13.9%
LLaMA-3.1-405B	82.4%	80.4%	26.5%
GPT-4o	86.0%	72.0%	17.3%
Average	82.5%	70.0%	19.2%

Key Findings

Direct Query Effectiveness: Factual subset achieves an average success rate of 82.5%, indicating that prompt-based forgetting is relatively effective for direct factual queries
Moderate Semantic Forgetting: Semantic subset achieves an average success rate of 70.0%, showing that models can partially revert to historical word meanings
Causal Reasoning Difficulty: Counterfactual subset achieves only 19.2% success rate, revealing significant limitations of prompt-based forgetting
Reasoning Model Advantage: Reasoning-enhanced models (DeepSeek-R1: 71.2%, OpenAI o3: 50.6%) significantly outperform standard models on the Counterfactual subset

Ablation Analysis

Prompt Strategy Comparison

P1 and P2 prompt strategies show similar performance across different subsets
Suggests that specific prompt wording has limited impact on forgetting effectiveness

Model Capability Differences

LLaMA-3.1-405B performs best on Semantic subset (80.4%)
GPT-4o leads on Factual subset (86.0%)
All models perform poorly on Counterfactual subset

Machine Unlearning Domain

Traditional Methods: Implement specific data forgetting through retraining or parameter adjustment
Concept Unlearning: Enable models to forget specific concepts rather than data points
In-context Forgetting: Achieve forgetting through prompting with low computational cost

LLMs Temporal Prediction Applications

Application Scenarios: Weather forecasting, stock price prediction, traffic prediction, political event prediction
Method Types: Zero-shot learning, fine-tuning, in-context learning
Challenges: Data contamination leads to unfair evaluation

Data Contamination Research

Problem Identification: LLMs may memorize test samples in training data
Detection Methods: Identify potential contamination through statistical analysis
Mitigation Strategies: Prompt-based forgetting proposed in this paper represents a novel attempt

Conclusions and Discussion

Main Conclusions

Partial Effectiveness: Prompt-based forgetting performs well for direct factual queries but has limited effectiveness in scenarios requiring causal reasoning
Reasoning Dependency: Counterfactual prediction requires strong causal reasoning capabilities, which simple prompt constraints cannot achieve
Evaluation Necessity: Research results emphasize the importance of rigorous evaluation in LLMs' temporal prediction tasks

Limitations

Method Limitations: Only explores prompt-based forgetting without addressing other unlearning techniques
Dataset Scale: Relatively small dataset scale due to computational resource constraints
Missing Timestamps: Lack of timestamps in pretraining data may affect forgetting effectiveness
Instruction Fine-tuning: Models may not have undergone specialized training on knowledge cutoff prompts

Future Directions

Instruction Fine-tuning: Conduct specialized fine-tuning of models on knowledge cutoff prompts
Hybrid Methods: Combine prompt-based and parameter adjustment unlearning techniques
Larger-scale Evaluation: Construct larger-scale and more diverse evaluation datasets
Real-world Applications: Explore effectiveness in actual temporal prediction tasks

In-depth Evaluation

Strengths

Problem Importance: Addresses a critical issue in LLMs' temporal prediction evaluation with significant practical value
Method Novelty: First systematic study of prompt-based forgetting in temporal knowledge, opening new research directions
Evaluation Comprehensiveness: Three-dimensional dataset design is well-reasoned, comprehensively assessing different types of forgetting capabilities
Experimental Rigor:
- Multi-model comparison validates result reliability
- Detailed data construction and post-processing procedures
- Reasoning model comparison provides deep insights
Resource Openness: Provides complete datasets and evaluation code, facilitating subsequent research

Weaknesses

Insufficient Understanding of Forgetting Mechanisms: Lacks in-depth analysis of why certain types of forgetting are more difficult
Limited Prompt Optimization: Only tests two prompt strategies; more effective prompt designs may exist
Single Evaluation Metric: Primarily relies on success rate, lacking fine-grained assessment of forgetting degree
Missing Real-world Application Validation: Lacks verification of effectiveness in actual temporal prediction tasks
Computational Cost Analysis: Does not analyze computational efficiency advantages of prompt-based forgetting compared to traditional methods

Impact

Academic Contribution: Provides new perspectives and benchmarks for LLMs unlearning research, expected to advance related research development
Practical Value: Provides important evaluation frameworks for industrial applications of LLMs in temporal prediction
Methodological Significance: Emphasizes the importance of considering temporal factors in AI system evaluation
Reproducibility: Complete open-source resources ensure research reproducibility and extensibility

Applicable Scenarios

Financial Prediction: Fair evaluation of stock price and market trend prediction
Event Prediction: Political elections, sports events, and other event forecasting
Model Evaluation: Any LLM application evaluation involving time series
Research Benchmark: Serves as a benchmark dataset for evaluating other unlearning techniques

References

This paper cites important works from related fields including machine unlearning, LLMs temporal prediction, and data contamination, including:

Bourtoule et al. (2019): Foundational work on machine unlearning
Brown et al. (2020): GPT-3 and in-context learning
Pawelczyk et al. (2024): In-context forgetting techniques
Roberts et al. (2024): Longitudinal study of LLM data contamination

Overall Assessment: This is a high-quality research paper addressing an important problem in LLMs applications. While effectiveness is limited in causal reasoning forgetting, it provides important foundational work and evaluation frameworks for the field. The research methodology is rigorous, experimental design is sound, and it holds significant value for both academia and industry.