Evaluating Human-LLM Representation Alignment: A Case Study on Affective Sentence Generation for Augmentative and Alternative Communication

Choudhury, Kumar, Martin

Gaps arise between a language model's use of concepts and people's expectations. This gap is critical when LLMs generate text to help people communicate via Augmentative and Alternative Communication (AAC) tools. In this work, we introduce the evaluation task of Representation Alignment for measuring this gap via human judgment. In our study, we expand keywords and emotion representations into full sentences. We select four emotion representations: Words, Valence-Arousal-Dominance (VAD) dimensions expressed in both Lexical and Numeric forms, and Emojis. In addition to Representation Alignment, we also measure people's judgments of the accuracy and realism of the generated sentences. While representations like VAD break emotions into easy-to-compute components, our findings show that people agree more with how LLMs generate when conditioned on English words (e.g., "angry") rather than VAD scales. This difference is especially visible when comparing Numeric VAD to words. Furthermore, we found that the perception of how much a generated sentence conveys an emotion is dependent on both the representation type and which emotion it is.

Basic Information

Paper ID: 2503.11881
Title: Evaluating Human-LLM Representation Alignment: A Case Study on Affective Sentence Generation for Augmentative and Alternative Communication
Authors: Shadab Choudhury, Asha Kumar, Lara J. Martin (University of Maryland, Baltimore County)
Classification: cs.CL (Computation and Language)
Publication Date: 2025
Paper Link: https://arxiv.org/abs/2503.11881

Abstract

This study addresses the gap between how Large Language Models (LLMs) use concepts and what people expect, particularly in the context of Augmentative and Alternative Communication (AAC) tools. It introduces "Representation Alignment" as an evaluation task that measures this gap through human judgment. Four emotion representations are examined: English words, Lexical VAD (Valence-Arousal-Dominance) dimensions, Numeric VAD dimensions, and emojis. Participants also rate the accuracy and realism (called authenticity in this summary) of the generated sentences. Results show that people agree more with LLM outputs conditioned on English words than with outputs conditioned on VAD scales, and the difference is most pronounced between Numeric VAD and words.

Research Background and Motivation

Problem Definition

Core Issue: LLMs exhibit gaps in conceptual usage relative to human expectations, which is particularly critical in AAC tool applications
Application Context: AAC tools assist individuals with speech impairments in communication, with communication speed being a primary pain point
Technical Challenge: Ensuring that LLM-generated text accurately reflects users' affective intentions and expression preferences

Research Significance

AAC users frequently experience communication delays, which lead to them being overlooked or interrupted
Current NLP technologies offer potential to enhance AAC tool communication speed
User concerns regarding LLM controllability, accuracy, and contextual appropriateness persist

Limitations of Existing Approaches

Lack of systematic evaluation of LLM-human alignment in conceptual understanding
Insufficient empirical evidence for selecting affective representation modalities
Inadequate consideration of different representation modalities' impact on user experience

Core Contributions

Proposes Representation Alignment Evaluation Paradigm: Introduces an evaluation methodology measuring the alignment between LLM conceptual usage and human mental models through human judgment
Systematically Compares Four Affective Representations: Comprehensively evaluates the effectiveness of Words, Lexical VAD, Numeric VAD, and Emojis representations
Identifies Optimal Representation Modality: Demonstrates that English lexical terms and lexicalized VAD perform best in representation alignment, accuracy, and authenticity
Provides AAC Application Guidance: Furnishes empirical evidence for selecting affective representations in future AAC applications

Methodology Details

Task Definition

Input: Three keywords + one affective representation
Output: Complete sentence containing keywords and expressing specified affect
Constraints: Generated sentences should be natural, accurately convey emotion, and avoid direct use of affective vocabulary

Affective Representation Modalities

1. Words Representation

Direct use of English affective vocabulary (e.g., "angry," "happy")

2. Lexical VAD Representation

Five-level lexical descriptions of VAD dimensions:

Valence: Very High/High/Moderate/Low/Very Low
Arousal: Degree of emotional activation
Dominance: Level of control over emotion

3. Numeric VAD Representation

Numeric scale from -5.0 to +5.0 representing VAD dimensions
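
The relationship between the Numeric and Lexical VAD forms can be illustrated with a small sketch that buckets a score on the -5.0 to +5.0 scale into one of the five lexical levels. The equal-width bin edges and the function names `to_lexical` and `vad_to_lexical` are assumptions made for illustration; the summary does not state how the paper maps numbers to levels.

```python
# Hypothetical mapping from a Numeric VAD score to a five-level Lexical VAD label.
# The equal-width bins are an assumption, not taken from the paper.
LEVELS = ["Very Low", "Low", "Moderate", "High", "Very High"]


def to_lexical(score: float, lo: float = -5.0, hi: float = 5.0) -> str:
    """Bucket a score in [lo, hi] into one of five equal-width levels."""
    if not lo <= score <= hi:
        raise ValueError(f"score {score} is outside [{lo}, {hi}]")
    width = (hi - lo) / len(LEVELS)
    # The top edge (score == hi) would index past the list, so clamp it.
    index = min(int((score - lo) / width), len(LEVELS) - 1)
    return LEVELS[index]


def vad_to_lexical(valence: float, arousal: float, dominance: float) -> dict:
    """Convert a numeric VAD triple into its lexical form."""
    return {
        "Valence": to_lexical(valence),
        "Arousal": to_lexical(arousal),
        "Dominance": to_lexical(dominance),
    }


if __name__ == "__main__":
    # Illustrative values only; not the VAD scores used in the paper.
    print(vad_to_lexical(-3.5, 2.9, 0.0))
```

Under these assumed bins, each level covers a 2.0-wide interval, so many distinct numeric triples collapse into the same lexical description.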

4. Emojis Representation

Unicode emoji symbols representing emotions

Model Architecture and Generation Strategy

Models Used

GPT-4-Turbo-2024-04-09: Commercial API access
LLaMA-3.3-70B: 8-bit quantized version, locally deployed

Prompting Strategies

Words/Emojis: Few-shot prompting
VAD Representations: Step-back chain-of-thought prompting
Constraints: Prohibition of direct affective vocabulary use, requirement to "show rather than tell"
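
As a rough illustration of how the task inputs and prompting strategies fit together, the sketch below assembles a prompt from three keywords and one affective representation. The prompt wording, the representation type names, and the `build_prompt` function are hypothetical; the paper's actual prompts are not reproduced in this summary, and no model API is called.

```python
from typing import Sequence

# Representation types grouped by the prompting strategy described above.
# The string identifiers are assumptions made for this sketch.
FEW_SHOT_TYPES = {"words", "emojis"}
STEP_BACK_TYPES = {"lexical_vad", "numeric_vad"}


def build_prompt(keywords: Sequence[str], rep_type: str, rep_value: str) -> str:
    """Assemble a hypothetical generation prompt for one keyword set and emotion."""
    if len(keywords) != 3:
        raise ValueError("the task uses exactly three keywords")
    if rep_type not in FEW_SHOT_TYPES | STEP_BACK_TYPES:
        raise ValueError(f"unknown representation type: {rep_type}")

    if rep_type in STEP_BACK_TYPES:
        # Step-back style: ask for a higher-level reading of the VAD values first.
        preamble = (
            "First, describe what emotion these Valence-Arousal-Dominance "
            "values suggest. Then write the sentence."
        )
    else:
        # Few-shot style: worked examples would be placed here.
        preamble = "Examples:\n<few-shot examples go here>"

    return "\n".join([
        preamble,
        "Write one natural sentence that uses all of these keywords: "
        + ", ".join(keywords) + ".",
        f"The sentence should express this emotion: {rep_value}.",
        "Show the emotion rather than telling it; do not name the emotion directly.",
    ])


if __name__ == "__main__":
    print(build_prompt(["Finals", "Semester", "Math"], "words", "angry"))
```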

Data Generation

Total of 360 sentences per model (90 per representation modality)
Coverage of 18 distinct emotions from Demszky et al. (2020) classification
Two sentences per emotion randomly selected for evaluation

Experimental Setup

Dataset Construction

Emotion Selection: Based on Demszky et al. (2020) emotion classification, selecting 18 representative emotions
Keyword Combinations: Using common vocabulary combinations such as Place, Great, Korean, Finals, Semester, Math
VAD Numeric Values: Based on Guo and Choi (2021), normalized to -5.0 to +5.0 range
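
The normalization step can be sketched as a linear rescale onto the -5.0 to +5.0 range. The source range is an assumption here: the default of 0.0 to 1.0 and the function name `rescale` are illustrative, since this summary does not give the original scale of the Guo and Choi (2021) values.

```python
def rescale(value: float,
            src_lo: float = 0.0, src_hi: float = 1.0,
            dst_lo: float = -5.0, dst_hi: float = 5.0) -> float:
    """Linearly map value from [src_lo, src_hi] to [dst_lo, dst_hi].

    The default source range is an assumption; pass the real range of the
    lexicon being used.
    """
    if src_hi == src_lo:
        raise ValueError("source range must have non-zero width")
    fraction = (value - src_lo) / (src_hi - src_lo)
    # One decimal place matches the "-5.0 to +5.0" format of the scale.
    return round(dst_lo + fraction * (dst_hi - dst_lo), 1)


if __name__ == "__main__":
    # Illustrative inputs only.
    print(rescale(0.75))            # assumed 0-1 source scale
    print(rescale(5.0, 1.0, 9.0))   # assumed 1-9 source scale
```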

Human Evaluation Design

Participant Recruitment

Platform: Prolific crowdsourcing platform
Sample Size: 200 participants (100 per model)
Criteria: Age 18+, US-based, English fluent
Compensation: $14/hour, approximately 15-minute task

Evaluation Tasks

1. Representation Alignment Assessment

Display one affective representation and four generated sentences
Participants select the sentence best matching the emotion
Each participant answers 10 questions, randomly assigned

2. Accuracy and Authenticity Assessment

5-point Likert scale evaluation:
- "Convey": Degree to which sentence conveys emotion
- "You'd say": Sounds like something the participant would say
- "Someone Else'd say": Sounds like something others would say

Evaluation Metrics

Representation Alignment Metrics

Selection Rate: Percentage of times specific representation is selected
Shannon Entropy: Measures consistency of selections
Self-Alignment: Matching between generation and evaluation using same representation
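
A minimal sketch of the three metrics above, assuming each response records the representation shown to the participant and the representation that generated the sentence they chose. That data layout, the logarithm base, and the function names are assumptions; the summary does not state the base or any normalization behind the reported entropy values.

```python
import math
from collections import Counter
from typing import Dict, Iterable, List, Tuple


def selection_rates(choices: List[str]) -> Dict[str, float]:
    """Fraction of responses in which each representation's sentence was chosen."""
    if not choices:
        raise ValueError("no responses given")
    counts = Counter(choices)
    return {rep: n / len(choices) for rep, n in counts.items()}


def shannon_entropy(choices: List[str], base: float = 2.0) -> float:
    """Entropy of the selection distribution; lower means more agreement."""
    rates = selection_rates(choices)
    return abs(-sum(p * math.log(p, base) for p in rates.values()))


def self_alignment(responses: Iterable[Tuple[str, str]]) -> float:
    """Share of (shown, chosen) pairs where the chosen sentence was generated
    from the same representation that was shown."""
    pairs = list(responses)
    if not pairs:
        raise ValueError("no responses given")
    return sum(1 for shown, chosen in pairs if shown == chosen) / len(pairs)


if __name__ == "__main__":
    # Made-up responses for a question that displayed the Words representation.
    chosen = ["words", "words", "lexical_vad", "words", "emojis"]
    print(selection_rates(chosen))
    print(shannon_entropy(chosen))
    print(self_alignment(("words", c) for c in chosen))
```

With four candidate sentences and base-2 logarithms, entropy ranges from 0 (everyone picks the same sentence) to 2 (choices spread evenly), which is why high entropy signals difficulty reaching consensus.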

Accuracy and Authenticity Metrics

Average Likert scores across three dimensions
ANOVA statistical significance testing
Paired t-tests for post-hoc analysis
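
To make the statistical tests concrete, the following sketch computes a one-way ANOVA F statistic and a paired t statistic with the Python standard library only. It is an illustration under assumptions: the exact ANOVA design used in the paper is not stated in this summary, the ratings in the usage example are made up, and converting the statistics to p-values would additionally require the F and t distributions (for example from a statistics library).

```python
import math
from statistics import mean, stdev
from typing import List, Sequence


def one_way_anova_f(groups: Sequence[Sequence[float]]) -> float:
    """F statistic: between-group mean square over within-group mean square."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    grand_mean = mean([x for g in groups for x in g])
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    ss_within = sum((x - mean(g)) ** 2 for g in groups for x in g)
    return (ss_between / (k - 1)) / (ss_within / (n_total - k))


def paired_t(a: List[float], b: List[float]) -> float:
    """t statistic for paired samples: mean difference over its standard error."""
    if len(a) != len(b):
        raise ValueError("paired samples must have equal length")
    diffs = [x - y for x, y in zip(a, b)]
    return mean(diffs) / (stdev(diffs) / math.sqrt(len(diffs)))


if __name__ == "__main__":
    # Made-up Likert ratings (1-5) for three representations.
    words = [5, 4, 5, 3]
    lexical_vad = [4, 4, 3, 3]
    numeric_vad = [3, 3, 4, 2]
    print(one_way_anova_f([words, lexical_vad, numeric_vad]))
    print(paired_t(words, numeric_vad))
```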

Experimental Results

Primary Results

Representation Alignment Performance

Representation	GPT-4 Selection Rate	LLaMA-3 Selection Rate	GPT-4 Entropy	LLaMA-3 Entropy
Words	61.9%	57.5%	0.32	0.42
Lexical VAD	52.0%	-	0.61	0.72
Numeric VAD	-	-	0.70	0.63
Emojis	-	-	0.67	0.52

Key Findings

Words Representation Optimal: Demonstrates highest self-alignment rates and lowest entropy values across both models
Lexical VAD Secondary: Performs well on GPT-4 but shows diminished effectiveness on LLaMA-3
Numeric VAD Poorest Performance: Highest entropy values, indicating participant difficulty reaching consensus
Cross-Representation Alignment: Emojis and Lexical VAD show alignment on LLaMA-3

Accuracy and Authenticity Results

Statistical Significance

GPT-4: Affective representation significantly affects "Convey" and "You'd say" (p < 0.01)
LLaMA-3: Affective representation significantly affects "Convey" and "Someone Else'd say" (p < 0.05)

Pairwise Comparisons

Words significantly outperforms Numeric VAD on "Convey" dimension (GPT-4, p = 0.002)
Lexical VAD significantly outperforms Numeric VAD on "Convey" dimension (LLaMA-3, p = 0.018)
Words significantly outperforms Emojis (p = 0.005) and Numeric VAD (p = 0.044) on "You'd say" dimension

Emotion-Specific Analysis

Model Differences

GPT-4 markedly outperforms LLaMA-3 in generating "grateful" emotion sentences
Significant performance variations exist for different emotions across representation modalities
Certain emotions (e.g., "excited," "proud") show diminished performance under specific conditions

Representation Adaptability

Positive emotions typically perform better under Words representation
Complex emotional states are better suited to Lexical VAD representation
Numeric VAD encounters difficulties in fine-grained emotion differentiation

Ablation Studies

Keyword Adherence Analysis

Model	1 Keyword	2 Keywords	3 Keywords	Average Accuracy
GPT-4, 1x	1.00	1.00	0.936	0.978
LLaMA-3, 1x	0.908	0.897	0.781	0.862
LLaMA-3, 3x	0.969	0.969	0.850	0.930
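
A keyword adherence check of the kind summarized in the table can be sketched as follows. The matching rule (case-insensitive, whole-word, no handling of inflected forms) and the function names are assumptions; the paper's actual criterion may differ.

```python
import re
from typing import List, Sequence, Tuple


def keywords_present(sentence: str, keywords: Sequence[str]) -> int:
    """Count how many keywords occur in the sentence as whole words."""
    tokens = set(re.findall(r"[a-z0-9']+", sentence.lower()))
    return sum(1 for keyword in keywords if keyword.lower() in tokens)


def adherence_rate(pairs: List[Tuple[str, Sequence[str]]], at_least: int) -> float:
    """Fraction of sentences containing at least `at_least` of their keywords."""
    if not pairs:
        raise ValueError("no sentences given")
    hits = sum(1 for sentence, kws in pairs
               if keywords_present(sentence, kws) >= at_least)
    return hits / len(pairs)


if __name__ == "__main__":
    # Made-up sentences built around keyword sets mentioned in this summary.
    samples = [
        ("Finals wrecked my whole semester, and math was the worst of it.",
         ["Finals", "Semester", "Math"]),
        ("I can't believe how good that place was.",
         ["Place", "Great", "Korean"]),
    ]
    for n in (1, 2, 3):
        print(n, adherence_rate(samples, n))
```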

VAD Training Effects

Providing participants with VAD concept explanations and practice questions improved comprehension accuracy, though cognitive load issues persist.

Related Work

Keyword-Constrained Generation

Early grammar-based systems (Kasper, 1989; Uchimoto et al., 2002)
Sequence models and iterative refinement methods (Mou et al., 2016; He and Li, 2021)
Transformer-era controlled generation techniques (Kumar et al., 2021; Krause et al., 2021)

Emotion-Conditioned Sentence Generation

Early rule-based systems (Polzin and Waibel, 2000)
RNN-based conditional generation (Ghosh et al., 2017; Song et al., 2019)
LLM-era affective generation methods (Li et al., 2024; Mishra et al., 2023)

Value Alignment Research

Normative behavior learning in children's stories (Nahian et al., 2020)
Value integration in reinforcement learning from human feedback (Arzberger et al., 2024)
Value alignment measurement in existing models (Norhashim and Hahn, 2024)

Conclusions and Discussion

Main Conclusions

Importance of Representation Alignment: The degree of alignment between human and LLM conceptual understanding directly impacts application effectiveness
Superiority of Words Representation: English lexical terms provide the strongest alignment effect in affective representation
Complexity of VAD Representation: Lexicalized VAD outperforms numeric VAD, but remains inferior to direct lexical representation
Inter-Model Differences: Significant variations exist among different LLMs in emotion understanding and generation

Limitations

Technical Limitations

Model Selection: Only two LLMs tested, with LLaMA-3 using 8-bit quantization
Language Constraints: Limited to English; other languages may exhibit different results
Participant Representativeness: Does not include actual AAC user populations

Methodological Limitations

VAD Comprehension Burden: Participants require additional VAD concept learning, potentially affecting evaluation results
Emoji Subjectivity: Cross-cultural differences in emoji interpretation exist
Emotion Complexity: 18 emotions may not comprehensively cover the full emotional spectrum

Future Directions

Expanded Model Coverage: Testing additional state-of-the-art LLMs
Multilingual Validation: Verifying conclusions across other language environments
User Personalization: Personalized representation learning for specific AAC user populations
Real-Time Application: Deployment and evaluation in authentic AAC environments

In-Depth Evaluation

Strengths

Methodological Innovation

Novel Representation Alignment Paradigm: Provides systematic methodology for evaluating LLM conceptual understanding
Multi-Dimensional Evaluation Framework: Integrates alignment, accuracy, and authenticity assessment
Practice-Oriented Research: Directly addresses real-world AAC application requirements

Experimental Rigor

Large-Scale Human Evaluation: 200 crowdsourced participants (100 per model) support the reliability of the results
Statistical Rigor: ANOVA and paired t-tests confirm result significance
Multi-Perspective Analysis: Comprehensive evaluation from alignment, accuracy, and authenticity dimensions

Persuasiveness of Results

Consistent Findings: Result trends align across both models
Statistical Significance: Main conclusions pass statistical significance testing
Practical Guidance Value: Provides clear design recommendations for AAC applications

Weaknesses

Methodological Limitations

Evaluation Subjectivity: Reliance on human subjective judgment introduces potential bias
Task Simplification: Keyword-to-sentence generation is relatively simple; actual AAC scenarios are more complex
Static Assessment: Does not account for context dependency in dynamic dialogue

Experimental Design Flaws

Insufficient Participant Training: Rapid VAD concept training may be inadequate
Limited Sample Size: Relatively few respondents per question (3-9 individuals)
Model Version Differences: The specific model versions tested may date the findings as newer models are released

Impact Assessment

Academic Contribution

Pioneering Work: First systematic investigation of LLM representation alignment
Methodological Contribution: Representation alignment evaluation paradigm extensible to other conceptual domains
Interdisciplinary Value: Bridges NLP, psychology, and assistive technology research

Practical Value

AAC Tool Improvement: Provides guidance for affective representation design in AAC applications
LLM Optimization Direction: Offers insights for enhancing LLM-human conceptual alignment
Evaluation Standard Establishment: Establishes evaluation benchmarks for similar applications

Reproducibility

Detailed Method Description: Provides complete experimental setup and parameter configuration
Open Data Commitment: Pledges to release experimental data and code
Standardized Procedures: Establishes reproducible evaluation workflow

Applicable Scenarios

Direct Applications

AAC Tool Development: Design and optimization of affective expression functionality
Dialogue Systems: Enhanced emotional understanding and expression capabilities
Text Generation Evaluation: Establishes human-machine alignment evaluation standards

Extended Applications

Other Conceptual Alignments: Extension to values, cultural concepts, etc.
Multimodal Alignment: Integration of visual, audio, and other multimodal information
Personalized Adaptation: Customized alignment for specific user populations

References

This research cites extensive related work, primarily including:

Demszky et al. (2020): GoEmotions emotion dataset
Guo and Choi (2021): VAD emotion representation learning
Valencia et al. (2023): AI language model applications in AAC
Chen and Wan (2024): LLM lexical constraint generation capability evaluation

Overall Assessment: This is high-quality research that makes a pioneering contribution to the important problem of LLM-human conceptual alignment. The methodology is rigorous, the experimental design is sound, and the results have both theoretical and practical value. While certain limitations exist, the work establishes a solid foundation for future research.