KnowRL: Teaching Language Models to Know What They Know
Kale, Dhami
Truly reliable AI requires more than simply scaling up knowledge; it demands the ability to know what it knows and when it does not. Yet recent research shows that even the best LLMs misjudge their own competence in more than one in five cases, making any response born of such internal uncertainty impossible to fully trust. Inspired by self-improvement reinforcement learning techniques that require minimal data, we present a simple but powerful framework KnowRL that strengthens a model's internal understanding of its own feasibility boundaries, enabling safer and more responsible behaviour. Our framework combines two components: (i) introspection, where the model generates and classifies tasks it judges feasible or infeasible, and (ii) consensus-based rewarding, where stability of self-knowledge assessment is reinforced through internal agreement. By using internally generated data, this design strengthens consistency in self-knowledge and entirely avoids costly external supervision. In experiments on LLaMA-3.1-8B and Qwen-2.5-7B, KnowRL steadily improved self-knowledge, validated by both intrinsic self-consistency and extrinsic benchmarking. With nothing more than a small seed set and no external supervision, our method drove gains as high as 28% in accuracy and 12% in F1, outperforming baselines in just a few iterations. Our framework essentially unlocks the untapped capacity of LLMs to self-improve their knowledge awareness, opening the door to reliable, more accountable AI and safer deployment in critical applications. Owing to its simplicity and independence from external effort, we encourage applying this reliability-enhancing process to all future models.
Truly reliable AI requires more than a larger store of knowledge; it must also know what it knows and when it does not. Research shows that even the most advanced large language models (LLMs) misjudge their own capabilities in more than one-fifth of cases, which makes responses produced under such internal uncertainty hard to trust. Inspired by self-improvement reinforcement learning techniques that require minimal data, the paper proposes the KnowRL framework, which aims for safer and more responsible behavior by strengthening a model's intrinsic understanding of its own feasibility boundaries. The framework combines two components: (i) an introspection mechanism, in which the model generates and classifies tasks it deems feasible or infeasible; and (ii) a consensus-based reward mechanism, which reinforces the stability of self-knowledge assessments through internal agreement. Because it relies on internally generated data, the approach avoids expensive external supervision entirely. Experiments on LLaMA-3.1-8B and Qwen-2.5-7B show that KnowRL steadily improves self-knowledge, with gains of up to 28% in accuracy and up to 12% in F1.
The core problem addressed in this research is the lack of self-knowledge in large language models (LLMs): models cannot accurately identify the boundaries of their own capabilities or clearly distinguish feasible tasks from infeasible ones.
Safety Concerns: Research shows that even leading LLMs misjudge their own capabilities in over 20% of cases, leading to serious trust and safety issues
Deployment Risks: In critical domains such as healthcare, law, and finance, both overconfidence and underconfidence in models can have severe consequences
Reliability Requirements: Truly reliable AI systems require metacognitive abilities to recognize the limitations of their own knowledge
External databases and scaffolding techniques are unsuitable for addressing such intrinsic deficiencies
Confidence calibration, while indicating potential answer errors, cannot guarantee that models maintain consistency about what they truly know and don't know
Lack of systematic methods to reinforce models' self-knowledge boundaries
The authors argue that LLMs inherently possess introspective capabilities that need to be guided and reinforced through reinforcement learning, enabling models to better understand and articulate their knowledge boundaries.
Proposes the KnowRL Framework: A reinforcement learning-based self-knowledge enhancement framework that improves LLMs' awareness of feasibility boundaries with limited initial data and without external supervision
Innovative Dual-Component Design:
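Introspection Mechanism: The LLM generates tasks it considers feasible or infeasible
Consensus-Based Rewarding: The stability of self-knowledge assessments is reinforced through internal agreement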
Significant Performance Improvements: Achieves gains of up to 28% in accuracy and up to 12% in F1 within just a few iterations, demonstrating scalable self-improvement capabilities
Practicality and Scalability: The method is simple and independent of external resources, and the authors encourage applying it as a reliability-enhancing step to future models
The self-knowledge task is defined as the model's ability to clearly distinguish between feasible and infeasible tasks based on its understanding of its own capabilities and knowledge boundaries. Input consists of task descriptions, with output being a binary classification of "Feasible" or "Infeasible," constrained by the requirement that judgments should be based on the model's true capability boundaries.
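The binary output format described above can be illustrated with a small parsing helper. This is only a sketch: the function name `parse_feasibility` and the rule of taking the last verdict mentioned in a free-text response are assumptions made for illustration, not details given in the paper.

```python
import re
from typing import Optional

# Whole-word match, so "feasible" is not picked up inside "infeasible".
_LABEL = re.compile(r"\b(infeasible|feasible)\b", re.IGNORECASE)


def parse_feasibility(output: str) -> Optional[str]:
    """Map a free-text model response to "Feasible" or "Infeasible".

    Assumption (not from the paper): the last verdict mentioned in the
    response is treated as the final answer. Returns None if no verdict
    is found.
    """
    matches = _LABEL.findall(output)
    if not matches:
        return None
    # Normalize casing to the two labels used in the task definition.
    return matches[-1].capitalize()


if __name__ == "__main__":
    print(parse_feasibility("Answer: INFEASIBLE"))
    print(parse_feasibility("This task is feasible."))
```

Returning `None` for responses without a verdict keeps malformed outputs separate from genuine classifications, so they can be discarded or counted on their own.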
Function: The model autonomously generates tasks it considers feasible or infeasible
Implementation: Uses a small number of seed examples for guidance; each introspection run comprises 10-15 iterations and yields approximately 50-60 candidate tasks
Evolution Strategy: As training progresses, the model gradually refines and stabilizes its understanding of feasibility boundaries by combining the initial dataset with high-consensus samples from earlier stages
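The evolution strategy above, which combines the initial dataset with high-consensus samples from earlier stages, might look roughly like the following. The `Candidate` record, the `next_training_pool` helper, and the threshold of 0.875 (7 of 8 samples agreeing) are hypothetical; the paper summary does not state how "high-consensus" is defined.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass(frozen=True)
class Candidate:
    task: str      # task description produced during introspection
    label: str     # majority label: "Feasible" or "Infeasible"
    reward: float  # consensus reward in [0, 1]


def next_training_pool(
    seed: Sequence[Candidate],
    candidates: Sequence[Candidate],
    min_reward: float = 0.875,  # hypothetical cutoff, not from the paper
) -> List[Candidate]:
    """Return the seed examples plus high-consensus, non-duplicate candidates."""
    seen = {c.task for c in seed}
    pool = list(seed)
    for c in candidates:
        # Keep only candidates the model labels consistently, skipping repeats.
        if c.reward >= min_reward and c.task not in seen:
            pool.append(c)
            seen.add(c.task)
    return pool
```

Calling this once per stage, with the previous pool passed in as `seed`, would let the example set grow only with tasks the model already classifies consistently.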
Objective: Quantify and reinforce consistency in self-knowledge
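Method: For each candidate task x, extract k = 8 independent self-analysis outputs {y_1, ..., y_k}, where each y_i ∈ {Feasible, Infeasible}
Reward Calculation:
$$r(x) = \frac{1}{k} \sum_{i=1}^{k} \mathbb{1}\left[\, y_i = \mathrm{Majority}\{y_1, \dots, y_k\} \,\right]$$
Here \mathbb{1}[\cdot] equals 1 when the condition holds and 0 otherwise.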
The reward represents the proportion of outputs consistent with the majority label, directly measuring the internal consistency of feasibility assessments
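As a minimal sketch of the reward described above, the function below computes the share of sampled labels that match the majority label. The function name `consensus_reward` is an assumption for illustration. Tie handling is not specified in the source, but it does not change the value: in a tie, either label gives the same proportion.

```python
from collections import Counter
from typing import Sequence


def consensus_reward(labels: Sequence[str]) -> float:
    """Proportion of sampled labels that agree with the majority label."""
    if not labels:
        raise ValueError("need at least one sampled label")
    counts = Counter(labels)
    # most_common(1) yields the most frequent label and its count.
    _, majority_count = counts.most_common(1)[0]
    return majority_count / len(labels)


if __name__ == "__main__":
    # k = 8 samples for one candidate task: 6 say Feasible, 2 say Infeasible.
    samples = ["Feasible"] * 6 + ["Infeasible"] * 2
    print(consensus_reward(samples))  # 0.75
```

With two possible labels, the reward ranges from 0.5 (an even split) to 1.0 (unanimous agreement), so higher values correspond to more stable self-assessments.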
Steady, Near-Monotonic Improvement: Both models improve at nearly every checkpoint, reflecting steady intrinsic growth in their understanding of their own feasibility boundaries
Rapid Convergence: The largest gains occur in the first few training cycles, indicating that self-knowledge improvement can be cost-effective, predictable, and efficient
Improvement Plateau: Around iterations 25-30, progress begins to level off, suggesting natural limitations to intrinsic self-improvement
Feasible Task: Translate the English sentence "The cat sat on the mat" into French, maintaining identical meaning, tone, verb tense, and significance
Infeasible Task: Determine the exact cause of the Permian-Triassic extinction event, providing definitive conclusions supported by irrefutable evidence
These examples show the model placing a simple translation task within its capabilities and an open scientific question, one that demands definitive conclusions and irrefutable evidence, beyond what it can know with certainty.
Limited Baseline Comparisons: Because the field lacks directly comparable methods, the paper compares primarily against the base models and offers no broader comparison with other approaches
Restricted Evaluation Scope: Tested on only two medium-scale models; lacks validation on large-scale models
Unknown Long-term Effects: Relatively short training period; long-term improvement potential remains uncertain
Unverified Generalization: Tested only in English; cross-lingual generalization remains unknown
The paper cites abundant relevant literature, primarily including:
Self-knowledge and metacognition research [1-7]
Reinforcement learning applications in LLMs [14, 22-24]
Self-improvement and self-play methods [15, 30-32, 44-49]
AI safety and reliability research [11-12, 16-17]
Overall Assessment: This is a high-quality research paper that proposes innovative and practical solutions to the important problem of self-knowledge in LLMs. Despite certain limitations, its contributions are significant, the methodology is novel, experimental results are convincing, and it holds important implications for the AI safety field.