TextBandit: Evaluating Probabilistic Reasoning in LLMs Through Language-Only Decision Tasks

Lim, Damerla, Jiang et al.
Large language models (LLMs) have been shown to be increasingly capable of performing reasoning tasks, but their ability to make sequential decisions under uncertainty using only natural language remains underexplored. We introduce a novel benchmark in which LLMs interact with multi-armed bandit environments using purely textual feedback, such as "you earned a token", without access to numerical cues or explicit probabilities, so the model must infer latent reward structures from linguistic cues alone and adapt accordingly. We evaluate the performance of four open-source LLMs and compare it to standard decision-making algorithms such as Thompson Sampling, Epsilon Greedy, Upper Confidence Bound (UCB), and random choice. While most of the LLMs underperformed the baselines, Qwen3-4B achieved a best-arm selection rate of 89.2%, significantly outperforming both the larger LLMs and the traditional methods. Our findings suggest that probabilistic reasoning can emerge from language alone, and we present this benchmark as a step towards evaluating decision-making capabilities in naturalistic, non-numeric contexts.

Basic Information

  • Paper ID: 2510.13878
  • Title: TextBandit: Evaluating Probabilistic Reasoning in LLMs Through Language-Only Decision Tasks
  • Authors: Jimin Lim (UC Merced), Arjun Damerla (UC Berkeley), Arthur Jiang (Algoverse), Nam Le (Algoverse)
  • Classification: cs.CL (Computational Linguistics)
  • Publication Date: October 13, 2025 (arXiv preprint)
  • Paper Link: https://arxiv.org/abs/2510.13878

Abstract

Large language models (LLMs) demonstrate increasingly strong capabilities in reasoning tasks, yet their ability to perform sequential decision-making under uncertainty using only natural language remains underexplored. This paper introduces a novel benchmark in which LLMs interact with multi-armed bandit environments using pure text feedback ("you received a token"), without access to numerical cues or explicit probabilities. The model must infer the underlying reward structure based solely on linguistic signals and adjust accordingly. The study evaluates the performance of four open-source LLMs and compares them against standard decision-making algorithms including Thompson sampling, epsilon-greedy, upper confidence bound (UCB), and random selection. While most LLMs underperform the baseline methods, Qwen3-4B achieves a best-arm selection rate of 89.2%, significantly outperforming both the larger LLMs and the traditional approaches.

Research Background and Motivation

Problem Definition

The core research question is: Can large language models perform effective probabilistic reasoning and decision-making in uncertain environments using only natural language feedback?

Significance

  1. Theoretical Value: Explores whether LLMs possess inherent Bayesian reasoning capabilities, which is crucial for understanding the cognitive mechanisms of AI systems
  2. Practical Value: In real-world scenarios, many decision contexts lack precise numerical data and rely solely on linguistic descriptions for judgment
  3. Technical Challenge: Traditional uncertainty-based decision methods depend on complex mathematical computations, while language-based approaches may offer more flexible and accessible solutions

Limitations of Existing Methods

  1. Numerical Dependency: Traditional Bayesian inference and reinforcement learning methods require explicit numerical inputs and probability information
  2. Evaluation Gap: Lack of specialized benchmarks for assessing LLMs' probabilistic reasoning capabilities in purely linguistic environments
  3. Complexity Constraints: Existing research primarily focuses on simple constrained tasks, insufficiently exploring multi-step decision scenarios

Research Motivation

The authors argue that if LLMs can perform effective probabilistic reasoning through language feedback alone, this would open new possibilities for natural, non-numerical decision-making, particularly in real-world applications lacking structured data.

Core Contributions

  1. Proposes TextBandit Benchmark: The first benchmark specifically designed to evaluate LLMs' probabilistic reasoning capabilities in pure linguistic environments using a multi-armed bandit framework
  2. Reports Counterintuitive Scale Effects: Shows that larger model size does not guarantee better decision-making performance, with the smaller Qwen3-4B significantly outperforming the larger Qwen3-8B and Llama-3.1-8B
  3. Demonstrates Emergent Probabilistic Reasoning from Language: Provides evidence that probabilistic reasoning capabilities can emerge from pure linguistic interactions without numerical cues
  4. Provides Comprehensive Comparative Analysis: Systematically compares LLMs with classical decision algorithms, offering important insights into the strengths and weaknesses of different approaches

Methodology Details

Task Definition

  • Input: Natural language descriptions of historical choices and outcomes (e.g., "slot machine 1 won," "slot machine 2 lost")
  • Output: Arm selection for the next round (numeric ID, e.g., "1" or "2")
  • Constraints: No numerical cues, no explicit probabilities, no intermediate reasoning processes

Experimental Architecture

Multi-Armed Bandit Environment

  • Number of Arms: 2-5 arms, each with fixed but unknown success probability
  • Reward Structure: In the two-arm configuration, one arm has a 65% success rate and the other a 30% success rate
  • Feedback Mechanism:
    • Success: "You received a token" (reward = 1)
    • Failure: "You did not receive a token" (reward = 0)
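
The environment described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the class name `TextBandit`, the 1-indexed arms, and the `seed` parameter are assumptions introduced here, while the two feedback strings and the 65%/30% probabilities come from the description above.

```python
import random


class TextBandit:
    """Bernoulli bandit that reports outcomes as text (hypothetical helper)."""

    SUCCESS = "You received a token"
    FAILURE = "You did not receive a token"

    def __init__(self, success_probs, seed=None):
        # One fixed success probability per arm; the agent never sees these.
        self.success_probs = list(success_probs)
        self.rng = random.Random(seed)

    def pull(self, arm):
        """Pull a 1-indexed arm and return (reward, feedback_text)."""
        p = self.success_probs[arm - 1]
        reward = 1 if self.rng.random() < p else 0
        return reward, self.SUCCESS if reward else self.FAILURE


# Two-arm configuration from the summary: 65% vs. 30% success rate.
env = TextBandit([0.65, 0.30], seed=0)
reward, feedback = env.pull(1)
print(reward, feedback)
```

Only the feedback string would be shown to the model; the numeric reward is kept for scoring.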

Prompt Protocol

Each LLM uses a consistent prompt structure:

  1. Task Description: Natural language instructions framing the task in a decision context
  2. History Record: Pure linguistic descriptions of all previous choices and outcomes
  3. Action Request: Requesting the model to output the numeric ID corresponding to an arm
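
The three-part prompt structure can be assembled as follows. The exact wording the authors used is not given in this summary, so every sentence in the template below is a hypothetical placeholder, as is the function name `build_prompt`; only the overall structure (task description, text-only history, request for a numeric arm ID) follows the protocol above.

```python
def build_prompt(num_arms, history):
    """Build a text-only prompt.

    history is a list of (arm, won) tuples with 1-indexed arms.
    All wording here is illustrative, not the paper's actual prompt.
    """
    lines = [
        # 1. Task description
        f"You are playing {num_arms} slot machines. "
        "Each pull either gives you a token or gives you nothing.",
        "",
        # 2. History record, purely in words
        "History:",
    ]
    if not history:
        lines.append("(no pulls yet)")
    for arm, won in history:
        outcome = "won" if won else "lost"
        lines.append(f"Slot machine {arm} {outcome}.")
    # 3. Action request
    lines.append("")
    lines.append(
        "Which slot machine do you choose next? "
        f"Answer with a single number from 1 to {num_arms}."
    )
    return "\n".join(lines)


print(build_prompt(2, [(1, True), (2, False), (1, True)]))
```

Note that the history carries no counts or rates; the model would have to aggregate wins and losses itself.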

Evaluated Models

The study selected four open-source LLMs with different architectures and parameter scales:

Model          Parameters   Architecture               Characteristics
Qwen3-4B       4B           Decoder-only Transformer   Multilingual support, strong reasoning
Qwen3-8B       8B           Decoder-only Transformer   Larger version of Qwen3-4B, enhanced tool-use
Llama-3.1-8B   8B           Decoder-only Transformer   Optimized instruction-following and multilingual
Phi-2          2.7B         Transformer                Small, efficient model

Baseline Methods

The LLMs are compared against four baseline policies, three classical multi-armed bandit algorithms and a random policy:

  1. Thompson Sampling: Uses Bayesian inference to sample from probability distributions
  2. Upper Confidence Bound (UCB): Deterministic strategy balancing exploitation and exploration
  3. Epsilon-Greedy: Selects best action with probability 1-ε, otherwise random selection
  4. Random Selection: Purely random baseline method
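
For reference, the four baselines can be sketched as below. The summary does not state the hyperparameters the authors used, so the uniform Beta(1, 1) prior, the UCB1 variant with exploration term sqrt(2 ln t / n), and ε = 0.1 are assumptions, as are all function names. Arms are 0-indexed in this sketch.

```python
import math
import random


def thompson(successes, failures, rng):
    """Draw from each arm's Beta(1 + s, 1 + f) posterior; pick the largest draw."""
    draws = [rng.betavariate(1 + s, 1 + f) for s, f in zip(successes, failures)]
    return draws.index(max(draws))


def ucb1(successes, failures, rng):
    """Pull each arm once, then maximize empirical mean + sqrt(2 ln t / n)."""
    pulls = [s + f for s, f in zip(successes, failures)]
    for arm, n in enumerate(pulls):
        if n == 0:
            return arm
    t = sum(pulls)
    scores = [s / n + math.sqrt(2 * math.log(t) / n)
              for s, n in zip(successes, pulls)]
    return scores.index(max(scores))


def epsilon_greedy(successes, failures, rng, epsilon=0.1):
    """Explore uniformly with probability epsilon; otherwise exploit."""
    if rng.random() < epsilon:
        return rng.randrange(len(successes))
    means = [s / (s + f) if s + f else 0.0
             for s, f in zip(successes, failures)]
    return means.index(max(means))


def uniform_random(successes, failures, rng):
    """Ignore all feedback and pick an arm uniformly at random."""
    return rng.randrange(len(successes))


def run(policy, probs, rounds=25, seed=0):
    """Play one run and return the list of chosen (0-indexed) arms."""
    rng = random.Random(seed)
    successes = [0] * len(probs)
    failures = [0] * len(probs)
    choices = []
    for _ in range(rounds):
        arm = policy(successes, failures, rng)
        if rng.random() < probs[arm]:
            successes[arm] += 1
        else:
            failures[arm] += 1
        choices.append(arm)
    return choices


for policy in (thompson, ucb1, epsilon_greedy, uniform_random):
    print(policy.__name__, run(policy, [0.65, 0.30]))
```

Unlike the LLMs, these policies receive explicit success and failure counts, which is the numerical information the benchmark withholds from the models.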

Experimental Setup

Experimental Configuration

  • Number of Trials: 500 independent runs per model
  • Decision Rounds: 25 decision rounds per run
  • Arm Configurations: Testing different configurations with 2-5 arms
  • Evaluation Environment: GPU instances hosted on RunPod, based on Hugging Face Transformers library
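
The run structure above (500 independent runs of 25 rounds each) amounts to a nested loop with the history reset between runs. The sketch below assumes a generic `choose_arm(history, num_arms)` callable in place of the LLM call; that interface and the toy `stay_or_switch` policy are illustrative stand-ins, not part of the paper.

```python
import random


def run_experiment(choose_arm, success_probs, runs=500, rounds=25, seed=0):
    """Return one log per run; each log is a list of (arm, won) tuples.

    choose_arm(history, num_arms) must return a 1-indexed arm. In the paper
    this role is played by an LLM; here it is any callable (an assumption).
    """
    rng = random.Random(seed)
    logs = []
    for _ in range(runs):
        history = []  # reset between runs so that runs stay independent
        for _ in range(rounds):
            arm = choose_arm(history, len(success_probs))
            won = rng.random() < success_probs[arm - 1]
            history.append((arm, won))
        logs.append(history)
    return logs


def stay_or_switch(history, num_arms):
    """Toy stand-in policy: repeat a winning arm, otherwise try the next one."""
    if not history:
        return 1
    arm, won = history[-1]
    return arm if won else arm % num_arms + 1


logs = run_experiment(stay_or_switch, [0.65, 0.30])
print(len(logs), len(logs[0]))  # 500 25
```

With these settings each model or algorithm makes 500 × 25 = 12,500 decisions per arm configuration.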

Evaluation Metrics

  1. Cumulative Reward: Total tokens obtained across 25 decision rounds
  2. Best Arm Selection Rate: Percentage of decisions in which the optimal arm is selected (the arm with the 65% success rate in the two-arm configuration)
  3. Cumulative Regret: Opportunity cost of not selecting the optimal arm
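
The three metrics can be computed from logged decisions as sketched below. The summary does not give the paper's exact regret formula, so this sketch assumes the standard expected (pseudo-)regret, i.e. the sum over decisions of the gap between the best arm's success probability and the chosen arm's. The function name `summarize` and the log format are hypothetical.

```python
def summarize(logs, success_probs):
    """Aggregate metrics over runs.

    logs is a list of runs; each run is a list of (arm, reward) pairs with
    1-indexed arms and rewards in {0, 1}.
    """
    best_p = max(success_probs)
    best_arm = success_probs.index(best_p) + 1
    decisions = [step for run in logs for step in run]

    cumulative_reward = sum(reward for _, reward in decisions)
    best_arm_rate = sum(1 for arm, _ in decisions if arm == best_arm) / len(decisions)
    # Assumed definition: expected regret, summed over all decisions.
    cumulative_regret = sum(best_p - success_probs[arm - 1] for arm, _ in decisions)

    return {
        "cumulative_reward": cumulative_reward,
        "best_arm_rate": best_arm_rate,
        "cumulative_regret": cumulative_regret,
    }


example_logs = [[(1, 1), (2, 0), (1, 0), (1, 1)]]
print(summarize(example_logs, [0.65, 0.30]))
```

In the example, the optimal arm is chosen in three of four decisions, and the single pull of arm 2 contributes the entire regret of 0.65 - 0.30 = 0.35.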

Experimental Controls

  • Disabled Chain-of-Thought reasoning so that each output is a single, unambiguous arm choice
  • Used identical prompt format and structure
  • Single completion per decision step, no intermediate reasoning
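
Because each step yields a single completion that should contain only an arm ID, the harness needs some rule for mapping raw text to an arm. The summary does not describe how the authors handled this, so the following is only one plausible approach: take the first integer in the completion that falls within the valid range. The function name `parse_arm` is hypothetical.

```python
import re


def parse_arm(completion, num_arms):
    """Return the first valid 1-indexed arm ID in the text, or None."""
    for token in re.findall(r"\d+", completion):
        arm = int(token)
        if 1 <= arm <= num_arms:
            return arm
    return None  # the caller decides how to treat an unparseable answer


print(parse_arm("2", 2))
print(parse_arm("I choose slot machine 1.", 2))
print(parse_arm("no idea", 2))
```

How unparseable answers are treated (retry, random fallback, or counting as a miss) can affect the reported selection rates, so that choice is worth documenting in any replication.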

Experimental Results

Main Results

Best Arm Selection Rate Comparison

Model/Algorithm     Best Arm Selection Rate   Cumulative Reward
Qwen3-4B            89.2%                     11,150
Thompson Sampling   51.1%                     8,297
UCB                 47.6%                     4,696
Epsilon-Greedy      38.1%                     6,029
Qwen3-8B            37.5%                     4,686
Random Selection    31.8%                     5,783
Llama-3.1-8B        31.6%                     3,946
Phi-2               25.4%                     3,181

Key Findings

1. Counterintuitive Scale Effects

  • Qwen3-4B (4B parameters) significantly outperforms Qwen3-8B (8B parameters)
  • Larger models appear to "overthink," which coincides with degraded decision performance
  • The smallest model, Phi-2 (2.7B), performs worst, suggesting that performance is not monotonic in model size

2. Impact of Number of Arms on Performance

Performance of all models significantly decreases as the number of arms increases:

  • Llama-3.1-8B: Drops from 31.56% (2 arms) to 7.37% (5 arms)
  • Qwen3-4B: Drops from 89.22% (2 arms) to 6.53% (5 arms)
  • Phi-2: Drops from 25.45% (2 arms) to 17.78% (5 arms)
  • Qwen3-8B: Drops from 37.49% (2 arms) to 17.09% (5 arms)

3. Cumulative Regret Analysis

  • Qwen3-4B exhibits rapid regret reduction in binary-arm configurations
  • Larger models maintain higher cumulative regret across all configurations
  • The 4-arm configuration unexpectedly produces the lowest cumulative regret among all models

Qualitative Analysis

  1. Exploration-Exploitation Strategy: LLMs exhibit behavior patterns similar to Thompson sampling
  2. Early Fixation: Models tend to prematurely determine "optimal" choices based on limited feedback
  3. Reasoning Overhead: Qwen3-8B exhibits unusually long processing times due to continuous reasoning attempts

Related Work

Probabilistic Reasoning in LLMs

  • Xie et al. (2022): Frames in-context learning as implicit Bayesian inference
  • Gupta et al. (2025): Demonstrates that LLMs perform belief updates consistent with Bayesian posterior updates
  • Sun et al. (2025): Proposes hybrid approaches combining classical bandit strategies with LLM reward prediction

Uncertainty-Aware Decision Making

  • Felicioni et al. (2024): Explores benefits of explicitly considering epistemic uncertainty in sequential decision-making
  • Research shows uncertainty can serve as a valuable signal for guiding model behavior

Exploration-Exploitation in Bandit Environments

  • Zhang et al. (2025): Compares exploration-exploitation strategies between LLMs and humans in multi-armed bandits
  • Finds that Chain-of-Thought significantly enhances reasoning, making LLM behavior more similar to human approaches

Conclusions and Discussion

Main Conclusions

  1. Emergent Probabilistic Reasoning from Language: Demonstrates that effective probabilistic reasoning can emerge from language feedback alone
  2. Complex Relationship Between Scale and Performance: Model size does not always correlate positively with decision-making performance
  3. Importance of Architecture Optimization: Lightweight, efficient model architectures may have advantages in rapid feedback environments

Limitations

  1. Limited Model Range: Only tested open-source models with 2.7B-8B parameters, excluding larger-scale models
  2. Task Complexity: Static, simple reward structures without non-stationary environments or delayed feedback
  3. Prompting Strategy: Avoiding Chain-of-Thought may underestimate LLMs' true capabilities
  4. Computational Resource Constraints: Unable to test large commercial models like GPT-4

Future Directions

  1. Dynamic Environment Testing: Evaluate in non-stationary or delayed-reward bandit environments
  2. Guided Prompting: Combine Chain-of-Thought to study scaffolding effects on exploration-exploitation balance
  3. Scale Effect Research: Systematically investigate performance of larger-scale models and fine-tuned variants
  4. Multi-Step Planning: Extend to complex decision tasks requiring multi-step reasoning

In-Depth Evaluation

Strengths

  1. High Innovation: First framework for evaluating probabilistic reasoning in purely linguistic environments
  2. Important Findings: Reveals counterintuitive relationship between model size and decision-making performance
  3. Rigorous Experiments: 500 independent runs per model support the statistical reliability of the results
  4. Comprehensive Baselines: Systematic comparison with classical algorithms provides valuable reference
  5. Good Reproducibility: Provides complete code and detailed implementation descriptions

Weaknesses

  1. Insufficient Theoretical Explanation: Mechanism behind Qwen3-4B's superior performance lacks deep explanation
  2. Limited Model Selection: Lacks testing of larger-scale models
  3. Task Homogeneity: Focuses solely on bandit problems; generalizability remains to be verified
  4. Shallow Analysis: Lacks deeper mechanistic analysis of the "overthinking" phenomenon

Impact

  1. Academic Value: Provides new evaluation framework for understanding LLMs' probabilistic reasoning capabilities
  2. Practical Significance: Offers important reference for developing language-based decision systems
  3. Methodological Contribution: TextBandit benchmark may become a standard evaluation tool in the field
  4. Interdisciplinary Impact: Connects natural language processing, decision theory, and cognitive science

Applicable Scenarios

  1. Educational Assessment: Evaluating AI systems' decision-making capabilities in educational contexts
  2. Human-Computer Interaction: Designing more natural decision support systems
  3. Resource Allocation: Optimizing resource distribution in environments lacking precise data
  4. Game AI: Developing game agents based on linguistic feedback

References

This paper cites important works in probabilistic reasoning, uncertainty-aware decision-making, and multi-armed bandit domains, including:

  • Xie et al. (2022): Bayesian inference framework for in-context learning
  • Gupta et al. (2025): Bayesian belief update capabilities of LLMs
  • Zhang et al. (2025): Comparison of exploration-exploitation strategies between LLMs and humans
  • Felicioni et al. (2024): Uncertainty-aware sequential decision-making

Overall Assessment: This is a paper of significant innovative value that provides new perspectives on understanding LLMs' probabilistic reasoning capabilities through the TextBandit benchmark. Despite certain limitations, its findings regarding counterintuitive scale effects and emergent probabilistic reasoning from language hold important theoretical and practical significance for the field.