Can Prompts Rewind Time for LLMs? Evaluating the Effectiveness of Prompted Knowledge Cutoffs
Gao, Zhang, Du et al.
Large Language Models (LLMs) are widely used for temporal prediction, but their reliance on pretraining data raises contamination concerns, as accurate predictions on pre-cutoff test data may reflect memorization rather than reasoning, leading to an overestimation of their generalization capability. With the recent emergence of prompting-based unlearning techniques, a natural question arises: Can LLMs be prompted to simulate an earlier knowledge cutoff? In this work, we investigate the capability of prompting to simulate earlier knowledge cutoff in LLMs. We construct three evaluation datasets to assess the extent to which LLMs can forget (1) direct factual knowledge, (2) semantic shifts, and (3) causally related knowledge. Results demonstrate that while prompt-based simulated knowledge cutoffs show effectiveness when directly queried with the information after that date, they struggle to induce forgetting when the forgotten content is not directly asked but causally related to the query. These findings highlight the need for more rigorous evaluation settings when applying LLMs for temporal prediction tasks. The full dataset and evaluation code are available at https://github.com/gxx27/time_unlearn.
academic
Can Prompts Rewind Time for LLMs? Evaluating the Effectiveness of Prompted Knowledge Cutoffs
Large Language Models (LLMs) are widely applied to temporal prediction tasks, yet their reliance on pretraining data raises concerns about data contamination. Accurate predictions on test data preceding the pretraining cutoff date may reflect memorization rather than reasoning, leading to overestimation of generalization capabilities. With the emergence of prompt-based forgetting techniques, a natural question arises: can LLMs be prompted to simulate earlier knowledge cutoff dates? This study investigates the capability of prompts to simulate earlier knowledge cutoffs by constructing three evaluation datasets to assess the extent to which LLMs can forget (1) direct factual knowledge, (2) semantic shifts, and (3) causally-related knowledge. Results demonstrate that while prompt-based simulated knowledge cutoffs are effective when directly querying information after the specified date, they struggle to induce forgetting when the content to be forgotten is not directly queried but is causally related to the query.
The core problem addressed by this research is the impact of data contamination on the evaluation of LLMs in temporal prediction tasks. When LLMs are applied to time-series tasks such as stock prediction and event forecasting, models may have already encountered test data during the pretraining phase, leading to inflated prediction performance.
Evaluation Credibility: Traditional machine learning models (e.g., random forests) are trained from scratch, making test performance trustworthy; however, LLMs may have already seen data from the test period, leading to unfair comparisons
Inspired by recently emerging in-context forgetting techniques, the authors propose adjusting LLMs' knowledge cutoff dates through prompt engineering to make them "forget" all information after a specified date, thereby enabling fairer temporal prediction evaluation.
Input: User query + system prompt specifying simulated knowledge cutoff date
Output: Response conforming to the knowledge state of the specified cutoff date
Constraint: Model should "forget" all information after the cutoff date
You must answer this question using only knowledge that was publicly available before the year {unlearn_year}.
Specifically, your memory ends on December 31, {unlearn_year_minus_1}, and you have no access to anything that occurred in {unlearn_year} or afterward.
You may think step by step internally, but your memory ends on December 31, {unlearn_year_minus_1}.
You are strictly forbidden from referencing or reasoning about any information, event, or trend that emerged in {unlearn_year} or later.
Direct Query Effectiveness: Factual subset achieves an average success rate of 82.5%, indicating that prompt-based forgetting is relatively effective for direct factual queries
Moderate Semantic Forgetting: Semantic subset achieves an average success rate of 70.0%, showing that models can partially revert to historical word meanings
Causal Reasoning Difficulty: Counterfactual subset achieves only 19.2% success rate, revealing significant limitations of prompt-based forgetting
Reasoning Model Advantage: Reasoning-enhanced models (DeepSeek-R1: 71.2%, OpenAI o3: 50.6%) significantly outperform standard models on the Counterfactual subset
Partial Effectiveness: Prompt-based forgetting performs well for direct factual queries but has limited effectiveness in scenarios requiring causal reasoning
This paper cites important works from related fields including machine unlearning, LLMs temporal prediction, and data contamination, including:
Bourtoule et al. (2019): Foundational work on machine unlearning
Brown et al. (2020): GPT-3 and in-context learning
Pawelczyk et al. (2024): In-context forgetting techniques
Roberts et al. (2024): Longitudinal study of LLM data contamination
Overall Assessment: This is a high-quality research paper addressing an important problem in LLMs applications. While effectiveness is limited in causal reasoning forgetting, it provides important foundational work and evaluation frameworks for the field. The research methodology is rigorous, experimental design is sound, and it holds significant value for both academia and industry.