TextBandit: Evaluating Probabilistic Reasoning in LLMs Through Language-Only Decision Tasks

Lim, Damerla, Jiang et al.

Large language models (LLMs) have been shown to be increasingly capable of performing reasoning tasks, but their ability to make sequential decisions under uncertainty using only natural language remains underexplored. We introduce a novel benchmark in which LLMs interact with multi-armed bandit environments using purely textual feedback, such as "you earned a token", without access to numerical cues or explicit probabilities, so that the model must infer latent reward structures from linguistic cues alone and adapt accordingly. We evaluate the performance of four open-source LLMs and compare it to standard decision-making algorithms such as Thompson Sampling, Epsilon Greedy, Upper Confidence Bound (UCB), and random choice. While most of the LLMs underperformed the baselines, Qwen3-4B achieved a best-arm selection rate of 89.2%, which significantly outperformed both the larger LLMs and traditional methods. Our findings suggest that probabilistic reasoning can emerge from language alone, and we present this benchmark as a step towards evaluating decision-making capabilities in naturalistic, non-numeric contexts.

Basic Information

Paper ID: 2510.13878
Title: TextBandit: Evaluating Probabilistic Reasoning in LLMs Through Language-Only Decision Tasks
Authors: Jimin Lim (UC Merced), Arjun Damerla (UC Berkeley), Arthur Jiang (Algoverse), Nam Le (Algoverse)
Classification: cs.CL (Computation and Language)
Publication Date: October 13, 2025 (arXiv preprint)
Paper Link: https://arxiv.org/abs/2510.13878

Abstract

Large language models (LLMs) demonstrate increasingly strong capabilities in reasoning tasks, yet their ability to perform sequential decision-making under uncertainty using only natural language remains underexplored. This paper introduces a novel benchmark in which LLMs interact with multi-armed bandit environments using pure text feedback ("you received a token"), without access to numerical cues or explicit probabilities. The model must infer the underlying reward structure based solely on linguistic signals and adjust accordingly. The study evaluates the performance of four open-source LLMs and compares them against standard decision-making algorithms including Thompson sampling, epsilon-greedy, upper confidence bound (UCB), and random selection. While most LLMs underperform the baseline methods, Qwen3-4B achieves a best-arm selection rate of 89.2%, significantly outperforming both the larger LLMs and the traditional approaches.

Research Background and Motivation

Problem Definition

The core research question is: Can large language models perform effective probabilistic reasoning and decision-making in uncertain environments using only natural language feedback?

Significance

Theoretical Value: Explores whether LLMs possess inherent Bayesian reasoning capabilities, which is crucial for understanding the cognitive mechanisms of AI systems
Practical Value: In real-world scenarios, many decision contexts lack precise numerical data and rely solely on linguistic descriptions for judgment
Technical Challenge: Traditional uncertainty-based decision methods depend on complex mathematical computations, while language-based approaches may offer more flexible and accessible solutions

Limitations of Existing Methods

Numerical Dependency: Traditional Bayesian inference and reinforcement learning methods require explicit numerical inputs and probability information
Evaluation Gap: Lack of specialized benchmarks for assessing LLMs' probabilistic reasoning capabilities in purely linguistic environments
Complexity Constraints: Existing research primarily focuses on simple constrained tasks, insufficiently exploring multi-step decision scenarios

Research Motivation

The authors argue that if LLMs can perform effective probabilistic reasoning through language feedback alone, this would open new possibilities for natural, non-numerical decision-making, particularly in real-world applications lacking structured data.

Core Contributions

Proposes TextBandit Benchmark: The first benchmark specifically designed to evaluate LLMs' probabilistic reasoning capabilities in pure linguistic environments using a multi-armed bandit framework
Discovers Counterintuitive Scale Effects: Shows that larger models do not necessarily decide better; the 4B-parameter Qwen3-4B significantly outperforms the 8B models, although the smallest model (Phi-2, 2.7B) performs worst
Demonstrates Emergent Probabilistic Reasoning from Language: Provides evidence that probabilistic reasoning capabilities can emerge from pure linguistic interactions without numerical cues
Provides Comprehensive Comparative Analysis: Systematically compares LLMs with classical decision algorithms, offering important insights into the strengths and weaknesses of different approaches

Methodology Details

Task Definition

Input: Natural language descriptions of historical choices and outcomes (e.g., "slot machine 1 won," "slot machine 2 lost")
Output: Arm selection for the next round (numeric ID, e.g., "1" or "2")
Constraints: No numerical cues, no explicit probabilities, no intermediate reasoning processes

Experimental Architecture

Multi-Armed Bandit Environment

Number of Arms: 2-5 arms, each with fixed but unknown success probability
Reward Structure: In the two-arm configuration, one arm has a 65% success rate and the other 30%
Feedback Mechanism:
- Success: "You received a token" (reward = 1)
- Failure: "You did not receive a token" (reward = 0)
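
The environment described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the class name `TextBandit`, the 1-indexed arm IDs, and the seeding scheme are all hypothetical. Only the two feedback strings and the 65%/30% success rates come from the summary above.

```python
import random


class TextBandit:
    """Bernoulli bandit that reports each outcome as text (hypothetical sketch)."""

    SUCCESS = "You received a token"
    FAILURE = "You did not receive a token"

    def __init__(self, success_probs, seed=None):
        # One fixed, hidden success probability per arm.
        self.success_probs = list(success_probs)
        self.rng = random.Random(seed)

    def pull(self, arm):
        """Pull a 1-indexed arm and return (reward, feedback_text)."""
        p = self.success_probs[arm - 1]
        reward = 1 if self.rng.random() < p else 0
        return reward, (self.SUCCESS if reward else self.FAILURE)


# Two-arm configuration from the summary: 65% vs. 30% success rate.
env = TextBandit([0.65, 0.30], seed=0)
reward, feedback = env.pull(1)
print(reward, feedback)
```

The numeric reward is kept only for scoring; in a setup like this, the model would see the feedback string and never the probability or the reward value.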

Prompt Protocol

Each LLM uses a consistent prompt structure:

Task Description: Natural language instructions framing the task in a decision context
History Record: Pure linguistic descriptions of all previous choices and outcomes
Action Request: Requesting the model to output the numeric ID corresponding to an arm
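
The three-part prompt structure above might be assembled as follows. The exact wording used in the paper is not given in this summary, so every string in this sketch is an assumption, as is the function name `build_prompt`; only the overall structure (task description, history, action request) and the "slot machine N won/lost" phrasing come from the text above.

```python
def build_prompt(num_arms, history):
    """Build a text-only prompt (hypothetical wording).

    history is a list of (arm, won) pairs, where arm is a 1-indexed ID
    and won is True or False.
    """
    # 1. Task description: frames the task without stating probabilities.
    lines = [
        f"You are playing a game with {num_arms} slot machines.",
        "Each time you play a machine, you either receive a token or you do not.",
        "Your goal is to collect as many tokens as possible.",
        "",
    ]
    # 2. History record: purely linguistic, no counts or rates.
    if history:
        lines.append("What has happened so far:")
        for arm, won in history:
            outcome = "won" if won else "lost"
            lines.append(f"Slot machine {arm} {outcome}.")
        lines.append("")
    # 3. Action request: ask for a bare numeric arm ID.
    lines.append(
        f"Which slot machine do you play next? Answer with one number from 1 to {num_arms}."
    )
    return "\n".join(lines)


prompt = build_prompt(2, [(1, True), (2, False), (1, True)])
print(prompt)
```

The history is replayed in full on every round, so the prompt grows linearly with the number of rounds played.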

Evaluated Models

The study selected four open-source LLMs with different architectures and parameter scales:

Model	Parameters	Architecture	Characteristics
Qwen3-4B	4B	Decoder-only Transformer	Multilingual support, strong reasoning
Qwen3-8B	8B	Decoder-only Transformer	Larger version of Qwen3-4B, enhanced tool-use
Llama-3.1-8B	8B	Decoder-only Transformer	Optimized instruction-following and multilingual
Phi-2	2.7B	Transformer	Small, efficient model

Baseline Methods

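The LLMs are compared against three classical multi-armed bandit algorithms and a random baseline:

Thompson Sampling: Maintains a posterior distribution over each arm's success probability, draws one sample per arm, and selects the arm with the highest sample
Upper Confidence Bound (UCB): Deterministic strategy that selects the arm with the highest optimistic estimate (empirical mean plus an exploration bonus that shrinks as the arm is pulled more often)
Epsilon-Greedy: Selects the arm with the highest empirical mean with probability 1-ε, otherwise selects an arm at random
Random Selection: Selects an arm uniformly at random, ignoring all feedback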
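
For reference, the four baselines can be sketched as below. The summary does not state which variants or hyperparameters the paper used, so the choices here are assumptions: a uniform Beta(1, 1) prior for Thompson Sampling, the UCB1 exploration bonus, and ε = 0.1. The helper `run` and all function names are hypothetical.

```python
import math
import random


def _argmax(values):
    # Index of the largest value; ties go to the lowest index.
    return max(range(len(values)), key=values.__getitem__)


def thompson(successes, failures, rng):
    # Assumed Beta(1, 1) prior: sample each arm's posterior, play the best sample.
    return _argmax([rng.betavariate(s + 1, f + 1) for s, f in zip(successes, failures)])


def ucb1(successes, failures, rng):
    pulls = [s + f for s, f in zip(successes, failures)]
    for arm, n in enumerate(pulls):
        if n == 0:  # play every arm once before using the bound
            return arm
    total = sum(pulls)
    return _argmax([s / n + math.sqrt(2 * math.log(total) / n)
                    for s, n in zip(successes, pulls)])


def epsilon_greedy(successes, failures, rng, epsilon=0.1):
    if rng.random() < epsilon:  # explore
        return rng.randrange(len(successes))
    # Exploit: highest empirical mean (unpulled arms count as 0.0).
    return _argmax([s / (s + f) if s + f else 0.0 for s, f in zip(successes, failures)])


def random_choice(successes, failures, rng):
    return rng.randrange(len(successes))


def run(policy, probs, rounds=25, seed=0):
    """Play one run and return the fraction of rounds on the best arm."""
    rng = random.Random(seed)
    successes, failures = [0] * len(probs), [0] * len(probs)
    best, best_picks = _argmax(probs), 0
    for _ in range(rounds):
        arm = policy(successes, failures, rng)  # 0-indexed arm
        if rng.random() < probs[arm]:
            successes[arm] += 1
        else:
            failures[arm] += 1
        best_picks += arm == best
    return best_picks / rounds


for policy in (thompson, ucb1, epsilon_greedy, random_choice):
    rates = [run(policy, [0.65, 0.30], seed=s) for s in range(500)]
    print(policy.__name__, round(sum(rates) / len(rates), 3))
```

The loop at the end mirrors the 500-run, 25-round setup on the two-arm configuration; its output is illustrative only and is not expected to reproduce the figures reported in the paper.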

Experimental Setup

Experimental Configuration

Number of Trials: 500 independent runs per model
Decision Rounds: 25 decision rounds per run
Arm Configurations: Testing different configurations with 2-5 arms
Evaluation Environment: GPU instances hosted on RunPod, based on Hugging Face Transformers library

Evaluation Metrics

Cumulative Reward: Total tokens obtained across 25 decision rounds
Best Arm Selection Rate: Percentage of decisions in which the optimal arm is selected (the arm with the 65% success rate in the two-arm configuration)
Cumulative Regret: Opportunity cost of not selecting the optimal arm
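
The three metrics can be computed from a log of choices and outcomes as sketched below. The summary does not give the paper's exact regret formula, so this sketch assumes the common expected-regret definition: for each round, the gap between the best arm's success probability and that of the chosen arm, summed over rounds. The function name `evaluate` and the 0-indexed arm convention are hypothetical.

```python
def evaluate(choices, rewards, success_probs):
    """Return (cumulative_reward, best_arm_rate, cumulative_regret).

    choices: 0-indexed arm chosen in each round
    rewards: 0/1 outcome observed in each round
    success_probs: true (hidden) success probability of each arm
    """
    best_p = max(success_probs)
    best_arm = success_probs.index(best_p)
    cumulative_reward = sum(rewards)
    best_arm_rate = sum(c == best_arm for c in choices) / len(choices)
    # Assumed definition: expected regret, summed per-round probability gaps.
    cumulative_regret = sum(best_p - success_probs[c] for c in choices)
    return cumulative_reward, best_arm_rate, cumulative_regret


# Four rounds on the two-arm configuration, with one suboptimal pull.
reward, rate, regret = evaluate([0, 1, 0, 0], [1, 0, 1, 0], [0.65, 0.30])
print(reward, rate, round(regret, 2))
```

Under this definition, each pull of the 30% arm in the two-arm configuration adds 0.65 - 0.30 = 0.35 to the cumulative regret, and pulls of the best arm add nothing.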

Experimental Controls

Chain-of-Thought reasoning was disabled so that each output is a bare arm choice
All models received the identical prompt format and structure
Each decision step used a single completion with no intermediate reasoning
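
Because each step yields a single free-text completion, the arm ID has to be extracted from it. The summary does not say how the paper parsed completions or handled invalid answers, so the following is only one plausible approach: take the first integer in the valid range and return `None` otherwise. The function name `parse_arm_choice` is hypothetical.

```python
import re


def parse_arm_choice(completion, num_arms):
    """Return the first valid 1-indexed arm ID in the completion, else None."""
    for match in re.finditer(r"\d+", completion):
        arm = int(match.group())
        if 1 <= arm <= num_arms:
            return arm
    return None  # the caller decides how to treat an unparseable answer


print(parse_arm_choice("2", num_arms=2))
print(parse_arm_choice("I choose machine 1.", num_arms=2))
print(parse_arm_choice("machine seven", num_arms=2))
```

How unparseable answers are treated (retry, random fallback, or counting the round as a miss) affects the reported selection rates, so that choice should be fixed before models are compared.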

Experimental Results

Main Results

Best Arm Selection Rate Comparison

Model/Algorithm	Best Arm Selection Rate	Cumulative Reward
Qwen3-4B	89.2%	11,150
Thompson Sampling	51.1%	8,297
UCB	47.6%	4,696
Epsilon-Greedy	38.1%	6,029
Qwen3-8B	37.5%	4,686
Random Selection	31.8%	5,783
Llama-3.1-8B	31.6%	3,946
Phi-2	25.4%	3,181

Key Findings

1. Counterintuitive Scale Effects

Qwen3-4B (4B parameters) significantly outperforms Qwen3-8B (8B parameters)
Larger models tend to "overthink," resulting in degraded decision performance
The smallest model, Phi-2 (2.7B), performs worst, so performance is not monotonic in parameter count

2. Impact of Number of Arms on Performance

Performance of all models significantly decreases as the number of arms increases:

Llama-3.1-8B: Drops from 31.56% (2 arms) to 7.37% (5 arms)
Qwen3-4B: Drops from 89.22% (2 arms) to 6.53% (5 arms)
Phi-2: Drops from 25.45% (2 arms) to 17.78% (5 arms)
Qwen3-8B: Drops from 37.49% (2 arms) to 17.09% (5 arms)
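
One way to read these figures is against the chance level: a policy that picks uniformly at random among k arms selects the best arm with probability 1/k, assuming each configuration has exactly one optimal arm. The sketch below compares the rates listed above with that reference. The rates are copied from this summary; the chance-level comparison is an added illustration, not an analysis from the paper.

```python
# Best-arm selection rates (%) from the list above: (2 arms, 5 arms).
rates = {
    "Llama-3.1-8B": (31.56, 7.37),
    "Qwen3-4B": (89.22, 6.53),
    "Phi-2": (25.45, 17.78),
    "Qwen3-8B": (37.49, 17.09),
}


def chance_rate(num_arms):
    """Best-arm rate (%) of uniform random choice, assuming one optimal arm."""
    return 100.0 / num_arms


for model, (two_arm, five_arm) in rates.items():
    print(
        f"{model}: 2 arms {two_arm - chance_rate(2):+.2f} pts vs. chance, "
        f"5 arms {five_arm - chance_rate(5):+.2f} pts vs. chance"
    )
```

Under that assumption, every listed 5-arm rate falls below the 20% chance level, and Qwen3-4B is the only model above the 50% chance level with 2 arms.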

3. Cumulative Regret Analysis

Qwen3-4B exhibits rapid regret reduction in binary-arm configurations
Larger models maintain higher cumulative regret across all configurations
The 4-arm configuration unexpectedly produces the lowest cumulative regret across all models

Qualitative Analysis

Exploration-Exploitation Strategy: LLMs exhibit behavior patterns similar to Thompson sampling
Early Fixation: Models tend to prematurely determine "optimal" choices based on limited feedback
Reasoning Overhead: Qwen3-8B exhibits unusually long processing times due to continuous reasoning attempts

Related Work

Probabilistic Reasoning in LLMs

Xie et al. (2022): Frames in-context learning as implicit Bayesian inference
Gupta et al. (2025): Demonstrates that LLMs perform belief updates consistent with Bayesian posterior updates
Sun et al. (2025): Proposes hybrid approaches combining classical bandit strategies with LLM reward prediction

Uncertainty-Aware Decision Making

Felicioni et al. (2024): Explores benefits of explicitly considering epistemic uncertainty in sequential decision-making
Research shows uncertainty can serve as a valuable signal for guiding model behavior

Exploration-Exploitation in Bandit Environments

Zhang et al. (2025): Compares exploration-exploitation strategies between LLMs and humans in multi-armed bandits
Finds that Chain-of-Thought significantly enhances reasoning, making LLM behavior more similar to human approaches

Conclusions and Discussion

Main Conclusions

Emergent Probabilistic Reasoning from Language: Demonstrates that effective probabilistic reasoning can emerge from language feedback alone
Complex Relationship Between Scale and Performance: Model size does not always correlate positively with decision-making performance
Importance of Architecture Optimization: Lightweight, efficient model architectures may have advantages in rapid feedback environments

Limitations

Limited Model Range: Only tested open-source models with 2.7B-8B parameters, excluding larger-scale models
Task Complexity: Static, simple reward structures without non-stationary environments or delayed feedback
Prompting Strategy: Avoiding Chain-of-Thought may underestimate LLMs' true capabilities
Computational Resource Constraints: Unable to test large commercial models like GPT-4

Future Directions

Dynamic Environment Testing: Evaluate in non-stationary or delayed-reward bandit environments
Guided Prompting: Combine Chain-of-Thought to study scaffolding effects on exploration-exploitation balance
Scale Effect Research: Systematically investigate performance of larger-scale models and fine-tuned variants
Multi-Step Planning: Extend to complex decision tasks requiring multi-step reasoning

In-Depth Evaluation

Strengths

High Innovation: First framework for evaluating probabilistic reasoning in purely linguistic environments
Important Findings: Reveals counterintuitive relationship between model size and decision-making performance
Rigorous Experiments: 500 independent runs ensure statistical reliability of results
Comprehensive Baselines: Systematic comparison with classical algorithms provides valuable reference
Good Reproducibility: Provides complete code and detailed implementation descriptions

Weaknesses

Insufficient Theoretical Explanation: Mechanism behind Qwen3-4B's superior performance lacks deep explanation
Limited Model Selection: Lacks testing of larger-scale models
Task Homogeneity: Focuses solely on bandit problems; generalizability remains to be verified
Shallow Analysis: Lacks deeper mechanistic analysis of the "overthinking" phenomenon

Impact

Academic Value: Provides new evaluation framework for understanding LLMs' probabilistic reasoning capabilities
Practical Significance: Offers important reference for developing language-based decision systems
Methodological Contribution: TextBandit benchmark may become a standard evaluation tool in the field
Interdisciplinary Impact: Connects natural language processing, decision theory, and cognitive science

Applicable Scenarios

Educational Assessment: Evaluating AI systems' decision-making capabilities in educational contexts
Human-Computer Interaction: Designing more natural decision support systems
Resource Allocation: Optimizing resource distribution in environments lacking precise data
Game AI: Developing game agents based on linguistic feedback

References

This paper cites important works in probabilistic reasoning, uncertainty-aware decision-making, and multi-armed bandit domains, including:

Xie et al. (2022): Bayesian inference framework for in-context learning
Gupta et al. (2025): Bayesian belief update capabilities of LLMs
Zhang et al. (2025): Comparison of exploration-exploitation strategies between LLMs and humans
Felicioni et al. (2024): Uncertainty-aware sequential decision-making

Overall Assessment: This is a paper of significant innovative value that provides new perspectives on understanding LLMs' probabilistic reasoning capabilities through the TextBandit benchmark. Despite certain limitations, its findings regarding counterintuitive scale effects and emergent probabilistic reasoning from language hold important theoretical and practical significance for the field.