KnowRL: Teaching Language Models to Know What They Know

Kale, Dhami

Truly reliable AI requires more than simply scaling up knowledge; it demands the ability to know what it knows and when it does not. Yet recent research shows that even the best LLMs misjudge their own competence in more than one in five cases, making any response born of such internal uncertainty impossible to fully trust. Inspired by self-improvement reinforcement learning techniques that require minimal data, we present a simple but powerful framework KnowRL that strengthens a model's internal understanding of its own feasibility boundaries, enabling safer and more responsible behaviour. Our framework combines two components: (i) introspection, where the model generates and classifies tasks it judges feasible or infeasible, and (ii) consensus-based rewarding, where stability of self-knowledge assessment is reinforced through internal agreement. By using internally generated data, this design strengthens consistency in self-knowledge and entirely avoids costly external supervision. In experiments on LLaMA-3.1-8B and Qwen-2.5-7B, KnowRL steadily improved self-knowledge, validated by both intrinsic self-consistency and extrinsic benchmarking. With nothing more than a small seed set and no external supervision, our method drove gains as high as 28% in accuracy and 12% in F1, outperforming baselines in just a few iterations. Our framework essentially unlocks the untapped capacity of LLMs to self-improve their knowledge awareness, opening the door to reliable, more accountable AI and safer deployment in critical applications. Owing to its simplicity and independence from external effort, we encourage applying this reliability-enhancing process to all future models.

Basic Information

Paper ID: 2510.11407
Title: KnowRL: Teaching Language Models to Know What They Know
Authors: Sahil Kale (KnowledgeVerse AI), Devendra Singh Dhami (TU Eindhoven)
Classification: cs.CL cs.AI
Publication Date: October 13, 2025 (arXiv preprint)
Paper Link: https://arxiv.org/abs/2510.11407

Abstract

Truly reliable AI requires not only scaling up knowledge but also the ability to "know what it knows and when it doesn't know." Research shows that even the most advanced large language models (LLMs) misjudge their own capabilities in more than one in five cases, which makes responses produced under such internal uncertainty hard to trust. Inspired by self-improvement reinforcement learning techniques that require minimal data, the paper proposes the KnowRL framework, which promotes safer and more responsible behavior by strengthening a model's internal understanding of its own feasibility boundaries. The framework combines two components: (i) an introspection mechanism, in which the model generates and classifies tasks it deems feasible or infeasible; and (ii) a consensus-based reward mechanism that reinforces the stability of self-knowledge assessments through internal agreement. Because it uses internally generated data, the approach avoids costly external supervision entirely. Experiments on LLaMA-3.1-8B and Qwen-2.5-7B show that KnowRL steadily improves self-knowledge, with gains of up to 28% in accuracy and 12% in F1.

Research Background and Motivation

Core Problem

The core problem addressed in this research is the lack of self-knowledge in large language models (LLMs)—the inability of models to accurately identify the boundaries of their own capabilities and to clearly distinguish which tasks are feasible versus infeasible.

Problem Significance

Safety Concerns: Research shows that even leading LLMs misjudge their own capabilities in over 20% of cases, leading to serious trust and safety issues
Deployment Risks: In critical domains such as healthcare, law, and finance, both overconfidence and underconfidence in models can have severe consequences
Reliability Requirements: Truly reliable AI systems require metacognitive abilities to recognize the limitations of their own knowledge

Limitations of Existing Approaches

External databases and scaffolding techniques are unsuitable for addressing such intrinsic deficiencies
Confidence calibration, while indicating potential answer errors, cannot guarantee that models maintain consistency about what they truly know and don't know
Lack of systematic methods to reinforce models' self-knowledge boundaries

Research Motivation

The authors argue that LLMs already possess introspective capabilities, and that reinforcement learning can guide and strengthen them so that models better understand and articulate their knowledge boundaries.

Core Contributions

Proposes the KnowRL Framework: A reinforcement learning-based self-knowledge enhancement framework that improves LLMs' awareness of feasibility boundaries with limited initial data and without external supervision
Innovative Dual-Component Design:
- Introspection Mechanism: LLM generates problems it considers feasible or infeasible
- Consensus-based Reward Mechanism: Generates stable, trustworthy reward signals through internal consistency
Significant Performance Improvements: Achieves accuracy improvements up to 28% and F1 score improvements of 12% within just a few iterations, demonstrating scalable self-improvement capabilities
Practicality and Scalability: The method is simple and independent of external resources, applicable to reliability enhancement for all future models

Methodology Details

Task Definition

The self-knowledge task is defined as the model's ability to clearly distinguish between feasible and infeasible tasks based on its understanding of its own capabilities and knowledge boundaries. The input is a task description and the output is a binary label, "Feasible" or "Infeasible"; each judgment is meant to reflect the model's true capability boundaries.

Model Architecture

Overall Framework

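The KnowRL framework employs an iterative reinforcement learning training loop built around two core components, introspection and consensus-based rewarding, complemented by a filter that guards against reward hacking: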

[Figure 2: KnowRL framework]

1. Introspection Mechanism

Function: The model autonomously generates tasks it considers feasible or infeasible
Implementation: Uses a small number of seed examples for guidance, with each introspection run producing 10-15 iterations and approximately 50-60 candidate tasks
Evolution Strategy: As training progresses, the model gradually refines and stabilizes its understanding of feasibility boundaries by combining the initial dataset with high-consensus samples from earlier stages
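
The evolution strategy can be pictured as a pool of few-shot examples that starts from the seed set and absorbs high-consensus tasks from earlier iterations. The following is a minimal sketch under stated assumptions: the names `Task`, `update_pool`, and `sample_few_shot` are hypothetical, and the consensus threshold of 0.875 is an illustrative value, since this summary does not say how high-consensus samples are selected or sampled.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class Task:
    text: str
    label: str        # "Feasible" or "Infeasible"
    consensus: float  # agreement score in [0, 1]; seed examples use 1.0


def update_pool(pool, candidates, min_consensus=0.875):
    """Return the pool extended with sufficiently high-consensus candidates.

    min_consensus is an illustrative assumption, not a value from the paper.
    """
    seen = {t.text for t in pool}
    kept = []
    for c in candidates:
        if c.consensus >= min_consensus and c.text not in seen:
            kept.append(c)
            seen.add(c.text)
    return pool + kept


def sample_few_shot(pool, n_per_label, rng):
    """Draw up to n_per_label examples of each label for the next prompt."""
    shots = []
    for label in ("Feasible", "Infeasible"):
        matching = [t for t in pool if t.label == label]
        shots.extend(rng.sample(matching, min(n_per_label, len(matching))))
    return shots


seed = [
    Task("Translate 'hello' into French.", "Feasible", 1.0),
    Task("Generate a photorealistic image of a cat.", "Infeasible", 1.0),
]
candidates = [
    Task("Summarize a short paragraph.", "Feasible", 1.0),
    Task("Predict next week's lottery numbers.", "Infeasible", 0.625),
]
pool = update_pool(seed, candidates)
shots = sample_few_shot(pool, n_per_label=2, rng=random.Random(0))
```

Here only the first candidate joins the pool; the second falls below the assumed threshold and is dropped.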

2. Consensus-based Reward Mechanism

Objective: Quantify and reinforce consistency in self-knowledge
Method: For each candidate task x, extract k=8 independent self-analysis outputs {yi}, where yi ∈ {Feasible, Infeasible}
Reward Calculation:
```
r(x) = (1/k) * Σ_{i=1}^{k} 1[yi = Majority{y1, ..., yk}]
```
The reward represents the proportion of outputs consistent with the majority label, directly measuring the internal consistency of feasibility assessments
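
The reward formula above can be sketched in a few lines of Python. The function name `consensus_reward` is hypothetical, and the tie-breaking behaviour (first-seen label wins) is an assumption of this sketch; the summary does not say how ties among the k=8 outputs are resolved.

```python
from collections import Counter


def consensus_reward(labels):
    """Return the majority label and the fraction of labels that agree with it."""
    if not labels:
        raise ValueError("need at least one label")
    counts = Counter(labels)
    # most_common(1) returns the most frequent label; on a tie, Counter keeps
    # first-seen order, so the earliest label wins (an assumption of this sketch).
    majority_label, majority_count = counts.most_common(1)[0]
    return majority_label, majority_count / len(labels)


samples = ["Feasible"] * 6 + ["Infeasible"] * 2  # k = 8 self-analysis outputs
label, reward = consensus_reward(samples)
print(label, reward)  # Feasible 0.75
```

With only two possible labels the majority always covers at least half of the outputs, so the reward lies between 0.5 (an even split) and 1.0 (full agreement).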

3. Reward Hacking Filter

To prevent models from gaming the consensus reward by generating overly simple or complex tasks, the following filtering strategies are employed:

Semantic Redundancy Filtering: Uses ROUGE-L score thresholds to filter semantically similar instructions
Keyword Filtering: Filters instructions containing keywords for image generation, model training, and other clearly out-of-scope capabilities
Perplexity Filtering: Uses negative log-likelihood from the base model to discard candidates with excessive perplexity
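
The three filters can be combined into a single acceptance check. This is a rough sketch, not the authors' implementation: the keyword list, the ROUGE-L threshold of 0.7, and the negative log-likelihood cutoff of 4.0 are all illustrative assumptions, ROUGE-L is computed here as a plain LCS-based F-measure over whitespace tokens, and the mean NLL is passed in as a number because obtaining it requires the base model.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(candidate, reference):
    """ROUGE-L F-measure over lowercased whitespace tokens."""
    c, r = candidate.lower().split(), reference.lower().split()
    lcs = lcs_length(c, r)
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(c), lcs / len(r)
    return 2 * precision * recall / (precision + recall)


# Hypothetical keyword list for clearly out-of-scope capabilities.
BLOCKED_KEYWORDS = ("generate an image", "draw a picture", "train a model")


def passes_filters(task, accepted, mean_nll, max_rouge_l=0.7, max_nll=4.0):
    """Apply keyword, perplexity, and redundancy filters (thresholds are assumptions)."""
    lowered = task.lower()
    if any(k in lowered for k in BLOCKED_KEYWORDS):
        return False  # keyword filter
    if mean_nll > max_nll:
        return False  # perplexity filter: perplexity = exp(mean NLL)
    if any(rouge_l_f1(task, prev) > max_rouge_l for prev in accepted):
        return False  # semantic redundancy filter
    return True


accepted = ["Translate the sentence 'The cat sat on the mat' into French."]
candidates = [
    ("Translate the sentence 'The cat sat on the mat' into German.", 2.1),
    ("Generate an image of a sunset.", 1.8),
    ("Explain why the sky is blue.", 2.5),
    ("zxq vbn qqq lkj", 9.3),
]
kept = []
for text, nll in candidates:
    if passes_filters(text, accepted + kept, nll):
        kept.append(text)
print(kept)  # ["Explain why the sky is blue."]
```

The first candidate is rejected as a near-duplicate, the second by keyword, and the last by its high NLL.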

Technical Innovations

Self-Generated Data Strategy: Relies entirely on internally generated data, avoiding expensive manual annotation
Consensus Mechanism: Uses consistency across multiple samples as reward signals, providing stable and trustworthy learning signals
Self-Improvement Loop: Combines self-play reinforcement learning, enabling models to self-guide improvements in self-knowledge boundaries
Minimized External Dependency: Requires only small-scale seed datasets without external supervision

Experimental Setup

Datasets

Seed Dataset: 100 validated examples (50 feasible tasks, 50 infeasible tasks), generated by the model itself and verified by experts
Intrinsic Evaluation: Uses self-generated data for generation-verification consistency assessment
Extrinsic Evaluation: SelfAware dataset containing answerable and unanswerable questions with explanations

Evaluation Metrics

Intrinsic Evaluation: Accuracy—measures consistency of the generation-verification process
Extrinsic Evaluation: F1 Score—balances precision and recall on the SelfAware dataset
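
Both metrics are simple to compute once labels are available. The sketch below reads intrinsic accuracy as the share of self-generated tasks whose later verification label matches the label intended at generation time; that reading, the function names, and the choice of "unanswerable" as the positive class for F1 are assumptions, not details confirmed by this summary.

```python
def intrinsic_accuracy(intended, verified):
    """Share of generated tasks whose verification label matches the intended label."""
    if len(intended) != len(verified) or not intended:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(i == v for i, v in zip(intended, verified)) / len(intended)


def f1_score(gold, predicted, positive="unanswerable"):
    """Binary F1: harmonic mean of precision and recall for the positive class."""
    pairs = list(zip(gold, predicted))
    tp = sum(g == positive and p == positive for g, p in pairs)
    fp = sum(g != positive and p == positive for g, p in pairs)
    fn = sum(g == positive and p != positive for g, p in pairs)
    if tp == 0:
        return 0.0
    precision, recall = tp / (tp + fp), tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


acc = intrinsic_accuracy(
    ["Feasible", "Feasible", "Infeasible", "Infeasible"],
    ["Feasible", "Infeasible", "Infeasible", "Infeasible"],
)
f1 = f1_score(
    ["unanswerable", "unanswerable", "unanswerable", "answerable", "answerable"],
    ["unanswerable", "unanswerable", "answerable", "unanswerable", "answerable"],
)
print(acc)  # 0.75
```

In the F1 example there are two true positives, one false positive, and one false negative, so precision and recall are both 2/3 and F1 is 2/3 as well.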

Baseline Methods

Given the lack of established methods for improving intrinsic self-knowledge, the base model performance serves as the evaluation baseline.

Implementation Details

Models: LLaMA-3.1-8B-Instruct and Qwen-2.5-7B-Instruct
RL Algorithm: Reinforce++ algorithm using the OpenRLHF framework
Training Parameters:
- Sampling count: k=8
- Introspection temperature: 1.0, self-analysis temperature: 0.0
- Learning rates: Actor 5×10⁻⁷, Critic 9×10⁻⁶
- Total iterations: 30, with evaluation every 5 iterations
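
For reference, the reported settings can be collected into a single configuration object. This is only a convenience sketch: the class and field names are hypothetical and do not correspond to OpenRLHF command-line flags or configuration keys.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class KnowRLConfig:
    # Values as reported above; field names are illustrative only.
    models: tuple = ("LLaMA-3.1-8B-Instruct", "Qwen-2.5-7B-Instruct")
    num_samples: int = 8                   # k self-analysis outputs per task
    introspection_temperature: float = 1.0
    self_analysis_temperature: float = 0.0
    actor_lr: float = 5e-7
    critic_lr: float = 9e-6
    total_iterations: int = 30
    eval_every: int = 5

    def eval_iterations(self):
        """Iterations at which a checkpoint is evaluated."""
        return list(range(self.eval_every, self.total_iterations + 1, self.eval_every))


cfg = KnowRLConfig()
print(cfg.eval_iterations())  # [5, 10, 15, 20, 25, 30]
```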

Experimental Results

Main Results

Intrinsic Evaluation Results

Model	Iteration	Accuracy (%)	Improvement (percentage points)
LLaMA-3.1-8B	Base	33.56	-
LLaMA-3.1-8B	Iteration 30	42.99	+9.43
Qwen-2.5-7B	Base	39.22	-
Qwen-2.5-7B	Iteration 30	48.29	+9.07

Extrinsic Evaluation Results (SelfAware Dataset)

Model	Iteration	F1 Score (%)	Improvement (percentage points)
LLaMA-3.1-8B	Base	56.12	-
LLaMA-3.1-8B	Iteration 30	63.10	+6.98
Qwen-2.5-7B	Base	62.17	-
Qwen-2.5-7B	Iteration 30	68.29	+6.12
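
The headline gains of 28% in accuracy and 12% in F1 are larger than the differences listed in the tables, which are absolute changes in score. They appear to be relative gains over the base score. This is an inference from the reported numbers rather than something the summary states, and the short calculation below checks it.

```python
# (base score, score at iteration 30), copied from the two tables above.
results = {
    ("LLaMA-3.1-8B", "accuracy"): (33.56, 42.99),
    ("Qwen-2.5-7B", "accuracy"): (39.22, 48.29),
    ("LLaMA-3.1-8B", "F1"): (56.12, 63.10),
    ("Qwen-2.5-7B", "F1"): (62.17, 68.29),
}


def gains(base, final):
    """Return (absolute gain in points, relative gain in percent of the base score)."""
    absolute = final - base
    relative = 100 * absolute / base
    return round(absolute, 2), round(relative, 1)


for (model, metric), (base, final) in results.items():
    absolute, relative = gains(base, final)
    print(f"{model} {metric}: +{absolute} points, +{relative}% relative")
```

For LLaMA-3.1-8B this gives relative gains of about 28.1% in accuracy and 12.4% in F1, matching the headline figures; for Qwen-2.5-7B the relative gains are about 23.1% and 9.8%.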

Key Findings

Steady Improvement: Both models improve at nearly every checkpoint, reflecting steady intrinsic growth in understanding their own feasibility boundaries
Rapid Convergence: Maximum improvements occur in the first few training cycles, indicating that self-knowledge improvement can be cost-effective, predictable, and efficient
Improvement Plateau: Around iterations 25-30, progress begins to level off, suggesting natural limitations to intrinsic self-improvement

Case Analysis

LLaMA-3.1-8B Generation Examples at Iteration 25:

Feasible Task: Translate the English sentence "The cat sat on the mat" into French, maintaining identical meaning, tone, verb tense, and significance
Infeasible Task: Determine the exact cause of the Permian-Triassic extinction event, providing definitive conclusions supported by irrefutable evidence

These examples show the model placing a simple translation task within its capabilities while flagging as infeasible a scientific question that demands a definitive answer it cannot provide.

Related Work

Self-Knowledge Research in LLMs

Problem Identification: Multiple studies identify inconsistency and instability in LLMs' self-knowledge
Evaluation Methods:
- Dataset-based binary classification of answerability
- Intrinsic evaluation based on internal consistency
- Self-awareness research
Improvement Methods: Self-Reflect, uncertainty-aware instruction tuning, and others

Self-Improvement in LLMs

Self-Refinement Methods: Self-Refine enables LLMs to generate initial answers followed by self-criticism and iterative improvement
Synthetic Data Methods: Self-Taught Evaluator, K2, and others use self-generated reasoning task datasets for training
Reinforcement Learning Methods: RLRF, R-Zero, SeRL, and others use post-hoc reinforcement or reward signals

Conclusions and Discussion

Main Conclusions

Effectiveness Validation: The KnowRL framework significantly improves LLMs' self-knowledge capabilities, achieving stable improvements across both models
Efficiency Advantages: Using only small-scale seed datasets and no external supervision, maximum improvements are achieved within a few iterations
Practical Value: Provides a concrete pathway for safe AI system deployment in critical domains

Limitations

Monolingual Limitation: All experiments are conducted only in English; effectiveness in multilingual and low-resource environments remains unknown
Training Scope Constraints: Due to computational constraints, performance beyond 30 iterations could not be explored
Scale Uncertainty: Evaluation is limited to models of up to 8B parameters; scalability to larger models remains unknown

Future Directions

Multilingual Extension: Test framework effectiveness across different languages and cultural contexts
Extended Training: Explore performance and improvement potential under longer training periods
Large-Scale Validation: Verify method scalability on models with larger parameter counts
Domain Specialization: Self-knowledge improvement tailored to specific domains (e.g., healthcare, law)

In-Depth Evaluation

Strengths

Strong Novelty: First systematic application of reinforcement learning to address LLMs' self-knowledge problem; the method is innovative and effective
High Practicality: Entirely based on internal data without requiring external supervision; easy to deploy and scale
Comprehensive Experiments: Uses both intrinsic and extrinsic evaluation approaches; results are consistent and convincing
Solid Theoretical Foundation: Based on self-play reinforcement learning theory with well-designed framework

Weaknesses

Limited Baseline Comparisons: Due to lack of direct comparison methods in the field, primarily compares against base models; lacks more comprehensive method comparisons
Restricted Evaluation Scope: Tested on only two models in the 7B-8B range; lacks validation on larger models
Unknown Long-term Effects: Relatively short training period; long-term improvement potential remains uncertain
Unverified Generalization: Tested only in English environment; cross-lingual generalization ability remains unknown

Impact

Academic Contribution: Provides new research directions and methodological frameworks for the AI safety field
Practical Value: Offers feasible solutions for deploying more reliable AI systems in practice
Reproducibility: Authors commit to releasing code and data, facilitating community follow-up research
Inspirational Significance: Demonstrates LLMs' self-improvement potential, likely to inspire further related research

Applicable Scenarios

High-Risk Applications: Medical diagnosis, legal consultation, financial decision-making, and other domains requiring high reliability
Educational Systems: Teaching applications requiring models to honestly express knowledge boundaries
Research Assistants: Research support tools requiring distinction between known and unknown knowledge boundaries
General AI Systems: Any AI applications requiring improved trustworthiness and safety

References

The paper cites abundant relevant literature, primarily including:

Self-knowledge and metacognition research [1-7]
Reinforcement learning applications in LLMs [14, 22-24]
Self-improvement and self-play methods [15, 30-32, 44-49]
AI safety and reliability research [11-12, 16-17]

Overall Assessment: This is a high-quality research paper that proposes innovative and practical solutions to the important problem of self-knowledge in LLMs. Despite certain limitations, its contributions are significant, the methodology is novel, experimental results are convincing, and it holds important implications for the AI safety field.