2025-11-18T08:22:12.824474

Can Prompts Rewind Time for LLMs? Evaluating the Effectiveness of Prompted Knowledge Cutoffs

Gao, Zhang, Du et al.
Large Language Models (LLMs) are widely used for temporal prediction, but their reliance on pretraining data raises contamination concerns, as accurate predictions on pre-cutoff test data may reflect memorization rather than reasoning, leading to an overestimation of their generalization capability. With the recent emergence of prompting-based unlearning techniques, a natural question arises: Can LLMs be prompted to simulate an earlier knowledge cutoff? In this work, we investigate the capability of prompting to simulate earlier knowledge cutoff in LLMs. We construct three evaluation datasets to assess the extent to which LLMs can forget (1) direct factual knowledge, (2) semantic shifts, and (3) causally related knowledge. Results demonstrate that while prompt-based simulated knowledge cutoffs show effectiveness when directly queried with the information after that date, they struggle to induce forgetting when the forgotten content is not directly asked but causally related to the query. These findings highlight the need for more rigorous evaluation settings when applying LLMs for temporal prediction tasks. The full dataset and evaluation code are available at https://github.com/gxx27/time_unlearn.
academic

Can Prompts Rewind Time for LLMs? Evaluating the Effectiveness of Prompted Knowledge Cutoffs

Basic Information

  • Paper ID: 2510.02340
  • Title: Can Prompts Rewind Time for LLMs? Evaluating the Effectiveness of Prompted Knowledge Cutoffs
  • Authors: Xin Gao, Ruiyi Zhang, Daniel Du, Saurabh Mahindre, Sai Ashish Somayajula, Pengtao Xie
  • Institutions: UC San Diego, SUNY Buffalo
  • Classification: cs.CL cs.LG
  • Publication Date: October 15, 2025 (arXiv v2)
  • Paper Link: https://arxiv.org/abs/2510.02340

Abstract

Large Language Models (LLMs) are widely applied to temporal prediction tasks, yet their reliance on pretraining data raises concerns about data contamination. Accurate predictions on test data preceding the pretraining cutoff date may reflect memorization rather than reasoning, leading to overestimation of generalization capabilities. With the emergence of prompt-based forgetting techniques, a natural question arises: can LLMs be prompted to simulate earlier knowledge cutoff dates? This study investigates the capability of prompts to simulate earlier knowledge cutoffs by constructing three evaluation datasets to assess the extent to which LLMs can forget (1) direct factual knowledge, (2) semantic shifts, and (3) causally-related knowledge. Results demonstrate that while prompt-based simulated knowledge cutoffs are effective when directly querying information after the specified date, they struggle to induce forgetting when the content to be forgotten is not directly queried but is causally related to the query.

Research Background and Motivation

Core Problem

The core problem addressed by this research is the impact of data contamination on the evaluation of LLMs in temporal prediction tasks. When LLMs are applied to time-series tasks such as stock prediction and event forecasting, models may have already encountered test data during the pretraining phase, leading to inflated prediction performance.

Problem Significance

  1. Evaluation Credibility: Traditional machine learning models (e.g., random forests) are trained from scratch, making test performance trustworthy; however, LLMs may have already seen data from the test period, leading to unfair comparisons
  2. Generalization Capability Misjudgment: Memory-based "predictions" overestimate the model's true generalization ability
  3. Practical Application Risk: When deployed in real-world scenarios, model performance may fall significantly short of expectations

Limitations of Existing Methods

  • Traditional machine unlearning methods require retraining or fine-tuning, incurring high computational costs
  • Lack of effective methods to simulate LLMs' historical knowledge states
  • Existing evaluation frameworks insufficiently address temporal contamination issues

Research Motivation

Inspired by recently emerging in-context forgetting techniques, the authors propose adjusting LLMs' knowledge cutoff dates through prompt engineering to make them "forget" all information after a specified date, thereby enabling fairer temporal prediction evaluation.

Core Contributions

  1. First Systematic Study: First systematic evaluation of the effectiveness of simulating LLMs' knowledge cutoff dates through prompting
  2. Multi-dimensional Evaluation Framework: Construction of three datasets across different dimensions to comprehensively assess forgetting capabilities:
    • Factual subset: Direct factual knowledge forgetting
    • Semantic subset: Semantic shift forgetting
    • Counterfactual subset: Causally-related knowledge forgetting
  3. Important Findings: Reveals limitations of prompt-based forgetting—significant performance degradation in causal reasoning scenarios
  4. Evaluation Benchmark: Provides high-quality datasets and evaluation code, establishing a foundation for future research
  5. Practical Guidance: Provides methodological guidance for rigorous evaluation of LLMs in temporal prediction tasks

Methodology Details

Task Definition

Input: User query + system prompt specifying simulated knowledge cutoff date Output: Response conforming to the knowledge state of the specified cutoff date Constraint: Model should "forget" all information after the cutoff date

Prompt Design Strategies

Prompt P1: Knowledge Filtering Type

You must answer this question using only knowledge that was publicly available before the year {unlearn_year}. 
Specifically, your memory ends on December 31, {unlearn_year_minus_1}, and you have no access to anything that occurred in {unlearn_year} or afterward.

Prompt P2: Reasoning Constraint Type

You may think step by step internally, but your memory ends on December 31, {unlearn_year_minus_1}. 
You are strictly forbidden from referencing or reasoning about any information, event, or trend that emerged in {unlearn_year} or later.

Evaluation Method Design

Forgetting Success Rate Calculation

For Factual and Counterfactual subsets, multiple-choice format is used, with forgetting success defined as the model changing its original answer.

For the Semantic subset, semantic similarity is employed: Success=cos(oa,ya)cos(oa,ya)+cos(oa,yb)>cos(ob,ya)cos(ob,ya)+cos(ob,yb)\text{Success} = \frac{\cos(o_a, y_a)}{\cos(o_a, y_a) + \cos(o_a, y_b)} > \frac{\cos(o_b, y_a)}{\cos(o_b, y_a) + \cos(o_b, y_b)}

where oa,obo_a, o_b represent outputs before and after forgetting, and ya,yby_a, y_b represent ground truth answers before and after the cutoff date.

Experimental Setup

Dataset Construction

Factual Subset (675 samples)

  • Objective: Evaluate direct factual knowledge forgetting
  • Construction Method: Use GPT-4o to generate major historical events and corresponding Q&A pairs since 1960
  • Time Span: 1960-2024
  • Example: Querying the U.S. President at a specific time point should return the incumbent at that time rather than the current president

Semantic Subset (303 samples)

  • Objective: Evaluate semantic shift forgetting
  • Construction Method: Collect words with semantic changes, such as "TikTok" evolving from an onomatopoeia to a social media platform
  • Time Span: 2000-2024
  • Evaluation: Use MPNet model to compute semantic similarity

Counterfactual Subset (689 samples)

  • Objective: Evaluate causally-related knowledge forgetting
  • Construction Method: Construct counterfactual prediction scenarios based on major events
  • Time Span: 2000-2024
  • Example: Predict the year of the Tokyo Olympics with a 2018 cutoff (should answer 2020 rather than the actual 2021)

Experimental Models

  • DeepSeek-V3: Latest open-source model
  • LLaMA-3.1-405B: Meta's large-scale model
  • GPT-4o: OpenAI's multimodal model
  • DeepSeek-R1 & OpenAI o3: Reasoning-enhanced models (comparative experiments)

Evaluation Metrics

  • Primary Metric: Unlearn Success Rate
  • Calculation Method: Number of successfully forgotten samples / Total number of samples

Experimental Results

Main Results

ModelFactualSemanticCounterfactual
DeepSeek-V379.0%57.5%13.9%
LLaMA-3.1-405B82.4%80.4%26.5%
GPT-4o86.0%72.0%17.3%
Average82.5%70.0%19.2%

Key Findings

  1. Direct Query Effectiveness: Factual subset achieves an average success rate of 82.5%, indicating that prompt-based forgetting is relatively effective for direct factual queries
  2. Moderate Semantic Forgetting: Semantic subset achieves an average success rate of 70.0%, showing that models can partially revert to historical word meanings
  3. Causal Reasoning Difficulty: Counterfactual subset achieves only 19.2% success rate, revealing significant limitations of prompt-based forgetting
  4. Reasoning Model Advantage: Reasoning-enhanced models (DeepSeek-R1: 71.2%, OpenAI o3: 50.6%) significantly outperform standard models on the Counterfactual subset

Ablation Analysis

Prompt Strategy Comparison

  • P1 and P2 prompt strategies show similar performance across different subsets
  • Suggests that specific prompt wording has limited impact on forgetting effectiveness

Model Capability Differences

  • LLaMA-3.1-405B performs best on Semantic subset (80.4%)
  • GPT-4o leads on Factual subset (86.0%)
  • All models perform poorly on Counterfactual subset

Machine Unlearning Domain

  • Traditional Methods: Implement specific data forgetting through retraining or parameter adjustment
  • Concept Unlearning: Enable models to forget specific concepts rather than data points
  • In-context Forgetting: Achieve forgetting through prompting with low computational cost

LLMs Temporal Prediction Applications

  • Application Scenarios: Weather forecasting, stock price prediction, traffic prediction, political event prediction
  • Method Types: Zero-shot learning, fine-tuning, in-context learning
  • Challenges: Data contamination leads to unfair evaluation

Data Contamination Research

  • Problem Identification: LLMs may memorize test samples in training data
  • Detection Methods: Identify potential contamination through statistical analysis
  • Mitigation Strategies: Prompt-based forgetting proposed in this paper represents a novel attempt

Conclusions and Discussion

Main Conclusions

  1. Partial Effectiveness: Prompt-based forgetting performs well for direct factual queries but has limited effectiveness in scenarios requiring causal reasoning
  2. Reasoning Dependency: Counterfactual prediction requires strong causal reasoning capabilities, which simple prompt constraints cannot achieve
  3. Evaluation Necessity: Research results emphasize the importance of rigorous evaluation in LLMs' temporal prediction tasks

Limitations

  1. Method Limitations: Only explores prompt-based forgetting without addressing other unlearning techniques
  2. Dataset Scale: Relatively small dataset scale due to computational resource constraints
  3. Missing Timestamps: Lack of timestamps in pretraining data may affect forgetting effectiveness
  4. Instruction Fine-tuning: Models may not have undergone specialized training on knowledge cutoff prompts

Future Directions

  1. Instruction Fine-tuning: Conduct specialized fine-tuning of models on knowledge cutoff prompts
  2. Hybrid Methods: Combine prompt-based and parameter adjustment unlearning techniques
  3. Larger-scale Evaluation: Construct larger-scale and more diverse evaluation datasets
  4. Real-world Applications: Explore effectiveness in actual temporal prediction tasks

In-depth Evaluation

Strengths

  1. Problem Importance: Addresses a critical issue in LLMs' temporal prediction evaluation with significant practical value
  2. Method Novelty: First systematic study of prompt-based forgetting in temporal knowledge, opening new research directions
  3. Evaluation Comprehensiveness: Three-dimensional dataset design is well-reasoned, comprehensively assessing different types of forgetting capabilities
  4. Experimental Rigor:
    • Multi-model comparison validates result reliability
    • Detailed data construction and post-processing procedures
    • Reasoning model comparison provides deep insights
  5. Resource Openness: Provides complete datasets and evaluation code, facilitating subsequent research

Weaknesses

  1. Insufficient Understanding of Forgetting Mechanisms: Lacks in-depth analysis of why certain types of forgetting are more difficult
  2. Limited Prompt Optimization: Only tests two prompt strategies; more effective prompt designs may exist
  3. Single Evaluation Metric: Primarily relies on success rate, lacking fine-grained assessment of forgetting degree
  4. Missing Real-world Application Validation: Lacks verification of effectiveness in actual temporal prediction tasks
  5. Computational Cost Analysis: Does not analyze computational efficiency advantages of prompt-based forgetting compared to traditional methods

Impact

  1. Academic Contribution: Provides new perspectives and benchmarks for LLMs unlearning research, expected to advance related research development
  2. Practical Value: Provides important evaluation frameworks for industrial applications of LLMs in temporal prediction
  3. Methodological Significance: Emphasizes the importance of considering temporal factors in AI system evaluation
  4. Reproducibility: Complete open-source resources ensure research reproducibility and extensibility

Applicable Scenarios

  1. Financial Prediction: Fair evaluation of stock price and market trend prediction
  2. Event Prediction: Political elections, sports events, and other event forecasting
  3. Model Evaluation: Any LLM application evaluation involving time series
  4. Research Benchmark: Serves as a benchmark dataset for evaluating other unlearning techniques

References

This paper cites important works from related fields including machine unlearning, LLMs temporal prediction, and data contamination, including:

  • Bourtoule et al. (2019): Foundational work on machine unlearning
  • Brown et al. (2020): GPT-3 and in-context learning
  • Pawelczyk et al. (2024): In-context forgetting techniques
  • Roberts et al. (2024): Longitudinal study of LLM data contamination

Overall Assessment: This is a high-quality research paper addressing an important problem in LLMs applications. While effectiveness is limited in causal reasoning forgetting, it provides important foundational work and evaluation frameworks for the field. The research methodology is rigorous, experimental design is sound, and it holds significant value for both academia and industry.