KnowRL: Teaching Language Models to Know What They Know

Kale, Dhami
Truly reliable AI requires more than simply scaling up knowledge; it demands the ability to know what it knows and when it does not. Yet recent research shows that even the best LLMs misjudge their own competence in more than one in five cases, making any response born of such internal uncertainty impossible to fully trust. Inspired by self-improvement reinforcement learning techniques that require minimal data, we present a simple but powerful framework KnowRL that strengthens a model's internal understanding of its own feasibility boundaries, enabling safer and more responsible behaviour. Our framework combines two components: (i) introspection, where the model generates and classifies tasks it judges feasible or infeasible, and (ii) consensus-based rewarding, where stability of self-knowledge assessment is reinforced through internal agreement. By using internally generated data, this design strengthens consistency in self-knowledge and entirely avoids costly external supervision. In experiments on LLaMA-3.1-8B and Qwen-2.5-7B, KnowRL steadily improved self-knowledge, validated by both intrinsic self-consistency and extrinsic benchmarking. With nothing more than a small seed set and no external supervision, our method drove gains as high as 28% in accuracy and 12% in F1, outperforming baselines in just a few iterations. Our framework essentially unlocks the untapped capacity of LLMs to self-improve their knowledge awareness, opening the door to reliable, more accountable AI and safer deployment in critical applications. Owing to its simplicity and independence from external effort, we encourage applying this reliability-enhancing process to all future models.

Basic Information

  • Paper ID: 2510.11407
  • Title: KnowRL: Teaching Language Models to Know What They Know
  • Authors: Sahil Kale (KnowledgeVerse AI), Devendra Singh Dhami (TU Eindhoven)
  • Classification: cs.CL cs.AI
  • Publication Date: October 13, 2025 (arXiv preprint)
  • Paper Link: https://arxiv.org/abs/2510.11407

Abstract

Truly reliable AI requires not only expanding knowledge scale but also possessing the capability to "know what it knows and when it doesn't know." Research demonstrates that even the most advanced large language models (LLMs) misjudge their own capabilities in over one-fifth of cases, rendering responses based on intrinsic uncertainty untrustworthy. Inspired by self-improvement reinforcement learning techniques requiring minimal data, this paper proposes the KnowRL framework, which achieves safer and more responsible behavior by strengthening models' intrinsic understanding of their own feasibility boundaries. The framework combines two components: (i) an introspection mechanism where the model generates and classifies tasks it deems feasible or infeasible; (ii) a consensus-based reward mechanism that reinforces the stability of self-knowledge assessments through internal consistency. By utilizing internally generated data, the approach entirely avoids expensive external supervision. Experiments on LLaMA-3.1-8B and Qwen-2.5-7B demonstrate that KnowRL steadily improves self-knowledge capabilities, with accuracy improvements up to 28% and F1 score improvements of 12%.

Research Background and Motivation

Core Problem

The core problem addressed in this research is the lack of self-knowledge in large language models (LLMs)—the inability of models to accurately identify the boundaries of their own capabilities and to clearly distinguish which tasks are feasible versus infeasible.

Problem Significance

  1. Safety Concerns: Research shows that even leading LLMs misjudge their own capabilities in over 20% of cases, leading to serious trust and safety issues
  2. Deployment Risks: In critical domains such as healthcare, law, and finance, both overconfidence and underconfidence in models can have severe consequences
  3. Reliability Requirements: Truly reliable AI systems require metacognitive abilities to recognize the limitations of their own knowledge

Limitations of Existing Approaches

  1. External databases and scaffolding techniques are unsuitable for addressing such intrinsic deficiencies
  2. Confidence calibration, while indicating potential answer errors, cannot guarantee that models maintain consistency about what they truly know and don't know
  3. Lack of systematic methods to reinforce models' self-knowledge boundaries

Research Motivation

The authors argue that LLMs inherently possess introspective capabilities that need to be guided and reinforced through reinforcement learning, enabling models to better understand and articulate their knowledge boundaries.

Core Contributions

  1. Proposes the KnowRL Framework: A reinforcement learning-based self-knowledge enhancement framework that improves LLMs' awareness of feasibility boundaries with limited initial data and without external supervision
  2. Innovative Dual-Component Design:
    • Introspection Mechanism: LLM generates problems it considers feasible or infeasible
    • Consensus-based Reward Mechanism: Generates stable, trustworthy reward signals through internal consistency
  3. Significant Performance Improvements: Achieves relative accuracy gains of up to 28% and relative F1 gains of up to 12% over the base model (roughly 9 and 7 absolute percentage points, respectively) within just a few iterations, demonstrating scalable self-improvement capabilities
  4. Practicality and Scalability: The method is simple and independent of external resources, applicable to reliability enhancement for all future models

Methodology Details

Task Definition

The self-knowledge task is defined as the model's ability to clearly distinguish between feasible and infeasible tasks based on its understanding of its own capabilities and knowledge boundaries. Input consists of task descriptions, with output being a binary classification of "Feasible" or "Infeasible," constrained by the requirement that judgments should be based on the model's true capability boundaries.

Model Architecture

Overall Framework

The KnowRL framework employs an iterative reinforcement learning training loop containing two core components:

[Figure 2: Overview of the KnowRL framework]

1. Introspection Mechanism

  • Function: The model autonomously generates tasks it considers feasible or infeasible
  • Implementation: Uses a small number of seed examples for guidance, with each introspection run producing 10-15 iterations and approximately 50-60 candidate tasks
  • Evolution Strategy: As training progresses, the model gradually refines and stabilizes its understanding of feasibility boundaries by combining the initial dataset with high-consensus samples from earlier stages
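
The evolution strategy can be pictured as a growing pool of guidance examples. The sketch below is illustrative only: the `Candidate` type, the `update_example_pool` helper, and the `min_reward` threshold of 0.875 (7 of 8 self-assessments agreeing) are assumptions made for this example, not details reported in the paper.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Candidate:
    """A self-generated task with its majority label and agreement score (hypothetical type)."""
    task: str
    label: str      # "Feasible" or "Infeasible"
    reward: float   # share of repeated self-assessments that agree with the majority


def update_example_pool(
    seed: List[Tuple[str, str]],
    candidates: List[Candidate],
    min_reward: float = 0.875,  # assumed threshold, not taken from the paper
) -> List[Tuple[str, str]]:
    """Return the seed examples plus high-consensus candidates, skipping duplicate tasks."""
    pool = list(seed)
    seen = {task for task, _ in seed}
    for cand in candidates:
        if cand.reward >= min_reward and cand.task not in seen:
            pool.append((cand.task, cand.label))
            seen.add(cand.task)
    return pool
```

Low-consensus candidates are simply left out of the pool here; how the paper treats them during later introspection rounds is not specified in this summary.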

2. Consensus-based Reward Mechanism

  • Objective: Quantify and reinforce consistency in self-knowledge
  • Method: For each candidate task x, sample k = 8 independent self-analysis outputs {y_1, ..., y_k}, where y_i ∈ {Feasible, Infeasible}
  • Reward Calculation:
    r(x) = (1/k) · Σ_{i=1..k} 1[y_i = Majority{y_1, ..., y_k}]

    Here 1[·] is the indicator function. The reward is the proportion of outputs that agree with the majority label, which directly measures the internal consistency of the feasibility assessment
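
As a minimal sketch of the reward formula above, the function below computes the share of sampled labels that match the majority label. The function name `consensus_reward` is an assumption for illustration and is not taken from the paper's code.

```python
from collections import Counter
from typing import Sequence


def consensus_reward(labels: Sequence[str]) -> float:
    """Fraction of sampled labels that agree with the majority label."""
    if not labels:
        raise ValueError("need at least one sampled label")
    # most_common(1) returns [(label, count)] for the most frequent label.
    majority_count = Counter(labels).most_common(1)[0][1]
    return majority_count / len(labels)


samples = ["Feasible"] * 6 + ["Infeasible"] * 2
print(consensus_reward(samples))  # 0.75
```

With two possible labels the reward can never fall below 0.5: an even 4-4 split yields 0.5 whichever label is treated as the majority, and unanimous agreement yields 1.0.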

3. Reward Hacking Filter

To prevent models from gaming the consensus reward by generating overly simple or complex tasks, the following filtering strategies are employed:

  • Semantic Redundancy Filtering: Uses ROUGE-L score thresholds to filter semantically similar instructions
  • Keyword Filtering: Filters instructions containing keywords for image generation, model training, and other clearly out-of-scope capabilities
  • Perplexity Filtering: Uses negative log-likelihood from the base model to discard candidates with excessive perplexity
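
The three filters could be combined roughly as follows. This is a sketch under stated assumptions: the keyword list, the ROUGE-L threshold of 0.7, and the perplexity cap of 50 are placeholder values, ROUGE-L is computed here on lowercased whitespace tokens, and the per-token log-probabilities are assumed to be supplied by the base model.

```python
import math
from typing import List, Sequence


def lcs_length(a: Sequence[str], b: Sequence[str]) -> int:
    """Length of the longest common subsequence (row-by-row dynamic programming)."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(candidate: str, reference: str) -> float:
    """ROUGE-L F-measure over lowercased whitespace tokens."""
    c, r = candidate.lower().split(), reference.lower().split()
    lcs = lcs_length(c, r)
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(c), lcs / len(r)
    return 2 * precision * recall / (precision + recall)


def perplexity(token_logprobs: Sequence[float]) -> float:
    """exp of the mean negative log-likelihood per token."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


# All three values below are hypothetical placeholders, not the paper's settings.
BLOCKED_KEYWORDS = ("generate an image", "train a model")
MAX_ROUGE_L = 0.7
MAX_PERPLEXITY = 50.0


def keep_candidate(task: str, token_logprobs: Sequence[float], accepted: List[str]) -> bool:
    """Apply the keyword, redundancy, and perplexity filters in that order."""
    text = task.lower()
    if any(keyword in text for keyword in BLOCKED_KEYWORDS):
        return False
    if any(rouge_l_f1(task, prev) > MAX_ROUGE_L for prev in accepted):
        return False
    return perplexity(token_logprobs) <= MAX_PERPLEXITY
```

Running the cheap string checks before the perplexity check is a design choice of this sketch; the summary does not state the order in which the filters are applied.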

Technical Innovations

  1. Self-Generated Data Strategy: Relies entirely on internally generated data, avoiding expensive manual annotation
  2. Consensus Mechanism: Uses consistency across multiple samples as reward signals, providing stable and trustworthy learning signals
  3. Self-Improvement Loop: Builds on self-play reinforcement learning, so the model guides its own improvement in recognizing its self-knowledge boundaries
  4. Minimized External Dependency: Requires only small-scale seed datasets without external supervision
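
To make the data flow of one self-improvement round concrete, here is a toy sketch. The callables `generate_tasks` and `self_assess` stand in for model calls and are hypothetical; the policy update that Reinforce++ would perform with these rewards is deliberately left out.

```python
from collections import Counter
from typing import Callable, List, Tuple


def run_iteration(
    generate_tasks: Callable[[], List[str]],
    self_assess: Callable[[str], str],
    k: int = 8,
) -> List[Tuple[str, str, float]]:
    """Score each self-generated task by agreement among k self-assessments."""
    scored = []
    for task in generate_tasks():
        labels = [self_assess(task) for _ in range(k)]
        majority_label, majority_count = Counter(labels).most_common(1)[0]
        scored.append((task, majority_label, majority_count / k))
    return scored


# Toy stand-ins for the model, used only to show the shape of the output.
def toy_generate() -> List[str]:
    return ["translate a sentence", "predict tomorrow's lottery numbers"]


def toy_assess(task: str) -> str:
    return "Feasible" if "translate" in task else "Infeasible"


for row in run_iteration(toy_generate, toy_assess):
    print(row)
```

In a real run the `(task, label, reward)` triples would feed the reinforcement learning step, and the stubs would be replaced by sampled model generations.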

Experimental Setup

Datasets

  1. Seed Dataset: 100 validated examples (50 feasible tasks, 50 infeasible tasks), generated by the model itself and verified by experts
  2. Intrinsic Evaluation: Uses self-generated data for generation-verification consistency assessment
  3. Extrinsic Evaluation: SelfAware dataset containing answerable and unanswerable questions with explanations

Evaluation Metrics

  1. Intrinsic Evaluation: Accuracy—measures consistency of the generation-verification process
  2. Extrinsic Evaluation: F1 Score—balances precision and recall on the SelfAware dataset
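
Both metrics are simple to compute once labels are available. The sketch below assumes that intrinsic accuracy means agreement between the label a task was generated under and the label the model assigns on verification, and it leaves the choice of positive class for F1 as a parameter because this summary does not state it.

```python
from typing import Sequence


def agreement_accuracy(intended: Sequence[str], verified: Sequence[str]) -> float:
    """Share of tasks whose verification label matches the label intended at generation."""
    if not intended or len(intended) != len(verified):
        raise ValueError("label sequences must be non-empty and of equal length")
    return sum(i == v for i, v in zip(intended, verified)) / len(intended)


def f1_score(gold: Sequence[str], predicted: Sequence[str], positive: str) -> float:
    """Harmonic mean of precision and recall for the chosen positive class."""
    pairs = list(zip(gold, predicted))
    tp = sum(g == positive and p == positive for g, p in pairs)
    fp = sum(g != positive and p == positive for g, p in pairs)
    fn = sum(g == positive and p != positive for g, p in pairs)
    if tp == 0:
        return 0.0
    precision, recall = tp / (tp + fp), tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


gold = ["Unanswerable", "Unanswerable", "Answerable", "Answerable"]
pred = ["Unanswerable", "Answerable", "Unanswerable", "Answerable"]
print(f1_score(gold, pred, positive="Unanswerable"))  # 0.5
```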

Baseline Methods

Given the lack of established methods for improving intrinsic self-knowledge, the base model performance serves as the evaluation baseline.

Implementation Details

  • Models: LLaMA-3.1-8B-Instruct and Qwen-2.5-7B-Instruct
  • RL Algorithm: Reinforce++ algorithm using the OpenRLHF framework
  • Training Parameters:
    • Sampling count: k=8
    • Introspection temperature: 1.0, self-analysis temperature: 0.0
    • Learning rates: Actor 5×10⁻⁷, Critic 9×10⁻⁶
    • Total iterations: 30, with evaluation every 5 iterations
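
For reference, the reported settings can be collected in one place. The class and field names below are hypothetical and do not correspond to OpenRLHF configuration flags; only the values come from the list above.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class KnowRLConfig:
    """Reported hyperparameters gathered under illustrative field names."""
    num_samples: int = 8                     # k self-analysis outputs per task
    introspection_temperature: float = 1.0
    self_analysis_temperature: float = 0.0
    actor_lr: float = 5e-7
    critic_lr: float = 9e-6
    total_iterations: int = 30
    eval_every: int = 5

    def eval_iterations(self) -> List[int]:
        """Iterations at which a checkpoint is evaluated."""
        return list(range(self.eval_every, self.total_iterations + 1, self.eval_every))


print(KnowRLConfig().eval_iterations())  # [5, 10, 15, 20, 25, 30]
```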

Experimental Results

Main Results

Intrinsic Evaluation Results

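| Model        | Iteration    | Accuracy (%) | Improvement (points) |
|--------------|--------------|--------------|----------------------|
| LLaMA-3.1-8B | Base         | 33.56        | -                    |
| LLaMA-3.1-8B | Iteration 30 | 42.99        | +9.43                |
| Qwen-2.5-7B  | Base         | 39.22        | -                    |
| Qwen-2.5-7B  | Iteration 30 | 48.29        | +9.07                |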

Extrinsic Evaluation Results (SelfAware Dataset)

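| Model        | Iteration    | F1 Score (%) | Improvement (points) |
|--------------|--------------|--------------|----------------------|
| LLaMA-3.1-8B | Base         | 56.12        | -                    |
| LLaMA-3.1-8B | Iteration 30 | 63.10        | +6.98                |
| Qwen-2.5-7B  | Base         | 62.17        | -                    |
| Qwen-2.5-7B  | Iteration 30 | 68.29        | +6.12                |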
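
The improvements listed in these results are absolute differences in percentage points, while the headline figures of 28% and 12% appear to match the relative gains for LLaMA-3.1-8B. The short calculation below, using only the numbers reported above, shows both views; the helper name `gains` is an assumption for illustration.

```python
from typing import Tuple


def gains(base: float, final: float) -> Tuple[float, float]:
    """Return (absolute gain in points, relative gain in percent of the base score)."""
    absolute = final - base
    relative = 100.0 * absolute / base
    return round(absolute, 2), round(relative, 1)


reported = {
    ("LLaMA-3.1-8B", "accuracy"): (33.56, 42.99),
    ("Qwen-2.5-7B", "accuracy"): (39.22, 48.29),
    ("LLaMA-3.1-8B", "F1"): (56.12, 63.10),
    ("Qwen-2.5-7B", "F1"): (62.17, 68.29),
}

for (model, metric), (base, final) in reported.items():
    print(model, metric, gains(base, final))
# LLaMA-3.1-8B accuracy (9.43, 28.1)
# Qwen-2.5-7B accuracy (9.07, 23.1)
# LLaMA-3.1-8B F1 (6.98, 12.4)
# Qwen-2.5-7B F1 (6.12, 9.8)
```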

Key Findings

  1. Stable, Near-Monotonic Improvement: Both models improve at nearly every checkpoint, reflecting steady intrinsic growth in understanding their own feasibility boundaries
  2. Rapid Convergence: The largest gains occur in the first few training cycles, indicating that self-knowledge improvement can be cost-effective, predictable, and efficient
  3. Improvement Plateau: Around iterations 25-30, progress begins to level off, suggesting natural limitations to intrinsic self-improvement

Case Analysis

LLaMA-3.1-8B Generation Examples at Iteration 25:

  • Feasible Task: Translate the English sentence "The cat sat on the mat" into French, maintaining identical meaning, tone, verb tense, and significance
  • Infeasible Task: Determine the exact cause of the Permian-Triassic extinction event, providing definitive conclusions supported by irrefutable evidence

These examples show the model separating a task that lies within its translation abilities from a complex scientific question that cannot be answered definitively and therefore falls outside its knowledge boundaries.

Related Work

Self-Knowledge Research in LLMs

  1. Problem Identification: Multiple studies identify inconsistency and instability in LLMs' self-knowledge
  2. Evaluation Methods:
    • Dataset-based binary classification of answerability
    • Intrinsic evaluation based on internal consistency
    • Self-awareness research
  3. Improvement Methods: Self-Reflect, uncertainty-aware instruction tuning, and others

Self-Improvement in LLMs

  1. Self-Refinement Methods: Self-Refine enables LLMs to generate initial answers followed by self-criticism and iterative improvement
  2. Synthetic Data Methods: Self-Taught Evaluator, K2, and others use self-generated reasoning task datasets for training
  3. Reinforcement Learning Methods: RLRF, R-Zero, SeRL, and others use post-hoc reinforcement or reward signals

Conclusions and Discussion

Main Conclusions

  1. Effectiveness Validation: The KnowRL framework significantly improves LLMs' self-knowledge capabilities, achieving stable improvements across both models
  2. Efficiency Advantages: Using only small-scale seed datasets and no external supervision, maximum improvements are achieved within a few iterations
  3. Practical Value: Provides a concrete pathway for safe AI system deployment in critical domains

Limitations

  1. Monolingual Limitation: All experiments are conducted only in English; effectiveness in multilingual and low-resource environments remains unknown
  2. Training Scope Constraints: Due to computational constraints, performance beyond 30 iterations could not be explored
  3. Scale Uncertainty: Evaluation is limited to models with at most 8B parameters (7B and 8B); scalability to larger models remains unknown

Future Directions

  1. Multilingual Extension: Test framework effectiveness across different languages and cultural contexts
  2. Extended Training: Explore performance and improvement potential under longer training periods
  3. Large-Scale Validation: Verify method scalability on models with larger parameter counts
  4. Domain Specialization: Self-knowledge improvement tailored to specific domains (e.g., healthcare, law)

In-Depth Evaluation

Strengths

  1. Strong Novelty: First systematic application of reinforcement learning to address LLMs' self-knowledge problem; the method is innovative and effective
  2. High Practicality: Entirely based on internal data without requiring external supervision; easy to deploy and scale
  3. Comprehensive Experiments: Uses both intrinsic and extrinsic evaluation approaches; results are consistent and convincing
  4. Solid Theoretical Foundation: Based on self-play reinforcement learning theory with well-designed framework

Weaknesses

  1. Limited Baseline Comparisons: Because the field lacks directly comparable methods, the paper compares only against the base models and offers no broader comparison with alternative approaches
  2. Restricted Evaluation Scope: Tested on only two medium-scale models; lacks validation on large-scale models
  3. Unknown Long-term Effects: Relatively short training period; long-term improvement potential remains uncertain
  4. Unverified Generalization: Tested only in English environment; cross-lingual generalization ability remains unknown

Impact

  1. Academic Contribution: Provides new research directions and methodological frameworks for the AI safety field
  2. Practical Value: Offers feasible solutions for deploying more reliable AI systems in practice
  3. Reproducibility: Authors commit to releasing code and data, facilitating community follow-up research
  4. Inspirational Significance: Demonstrates LLMs' self-improvement potential, likely to inspire further related research

Applicable Scenarios

  1. High-Risk Applications: Medical diagnosis, legal consultation, financial decision-making, and other domains requiring high reliability
  2. Educational Systems: Teaching applications requiring models to honestly express knowledge boundaries
  3. Research Assistants: Research support tools requiring distinction between known and unknown knowledge boundaries
  4. General AI Systems: Any AI applications requiring improved trustworthiness and safety

References

The paper cites abundant relevant literature, primarily including:

  1. Self-knowledge and metacognition research [1-7]
  2. Reinforcement learning applications in LLMs [14, 22-24]
  3. Self-improvement and self-play methods [15, 30-32, 44-49]
  4. AI safety and reliability research [11-12, 16-17]

Overall Assessment: This is a high-quality research paper that proposes innovative and practical solutions to the important problem of self-knowledge in LLMs. Despite certain limitations, its contributions are significant, the methodology is novel, experimental results are convincing, and it holds important implications for the AI safety field.