Gaps arise between a language model's use of concepts and people's expectations. This gap is critical when LLMs generate text to help people communicate via Augmentative and Alternative Communication (AAC) tools. In this work, we introduce the evaluation task of Representation Alignment for measuring this gap via human judgment. In our study, we expand keywords and emotion representations into full sentences. We select four emotion representations: Words, Valence-Arousal-Dominance (VAD) dimensions expressed in both Lexical and Numeric forms, and Emojis. In addition to Representation Alignment, we also measure people's judgments of the accuracy and realism of the generated sentences. While representations like VAD break emotions into easy-to-compute components, our findings show that people agree more with how LLMs generate when conditioned on English words (e.g., "angry") rather than VAD scales. This difference is especially visible when comparing Numeric VAD to words. Furthermore, we found that the perception of how much a generated sentence conveys an emotion is dependent on both the representation type and which emotion it is.
Evaluating Human-LLM Representation Alignment: A Case Study on Affective Sentence Generation for Augmentative and Alternative Communication
- Paper ID: 2503.11881
- Title: Evaluating Human-LLM Representation Alignment: A Case Study on Affective Sentence Generation for Augmentative and Alternative Communication
- Authors: Shadab Choudhury, Asha Kumar, Lara J. Martin (University of Maryland, Baltimore County)
- Classification: cs.CL (Computation and Language)
- Publication Date: 2025
- Paper Link: https://arxiv.org/abs/2503.11881
This study addresses the gap between Large Language Models' (LLMs) conceptual usage and human expectations, particularly in the context of Augmentative and Alternative Communication (AAC) tools. The research introduces "Representation Alignment" as an evaluation task, measuring this gap through human judgment. Four affective representation modalities are examined: English lexical terms, lexicalized VAD dimensions, numeric VAD dimensions, and emojis, with evaluation of generated sentence accuracy and authenticity. Results demonstrate that humans show greater alignment with LLM outputs generated under English lexical conditions compared to VAD scales, with this discrepancy being particularly pronounced in numeric VAD versus lexical comparisons.
- Core Issue: LLMs exhibit gaps in conceptual usage relative to human expectations, which is particularly critical in AAC tool applications
- Application Context: AAC tools assist individuals with speech impairments in communication, with communication speed being a primary pain point
- Technical Challenge: Ensuring that LLM-generated text accurately reflects users' affective intentions and expression preferences
- AAC users frequently experience communication delays, which lead to them being overlooked or interrupted
- Current NLP technologies offer potential to enhance AAC tool communication speed
- User concerns regarding LLM controllability, accuracy, and contextual appropriateness persist
- Lack of systematic evaluation of LLM-human alignment in conceptual understanding
- Insufficient empirical evidence for selecting affective representation modalities
- Inadequate consideration of different representation modalities' impact on user experience
- Proposes Representation Alignment Evaluation Paradigm: Introduces an evaluation methodology measuring the alignment between LLM conceptual usage and human mental models through human judgment
- Systematically Compares Four Affective Representations: Comprehensively evaluates the effectiveness of Words, Lexical VAD, Numeric VAD, and Emojis representations
- Identifies Optimal Representation Modality: Demonstrates that English lexical terms and lexicalized VAD perform best in representation alignment, accuracy, and authenticity
- Provides AAC Application Guidance: Furnishes empirical evidence for selecting affective representations in future AAC applications
- Input: Three keywords + one affective representation
- Output: Complete sentence containing keywords and expressing specified affect
- Constraints: Generated sentences should be natural, accurately convey emotion, and avoid direct use of affective vocabulary
- Words: Direct use of English affective vocabulary (e.g., "angry," "happy")
- Lexical VAD: Five-level lexical descriptions (Very High/High/Moderate/Low/Very Low) of the VAD dimensions:
  - Valence: Pleasantness of the emotion
  - Arousal: Degree of emotional activation
  - Dominance: Level of control over emotion
- Numeric VAD: Numeric scale from -5.0 to +5.0 representing the VAD dimensions
- Emojis: Unicode emoji symbols representing emotions
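The relationship between the Numeric and Lexical VAD forms can be sketched as a rescale followed by binning. This is a minimal sketch under stated assumptions: the input range of `[0, 1]`, the cut points in `LEVEL_CUTS`, and the function names are all hypothetical, since the summary gives only the target range (-5.0 to +5.0) and the five level names.

```python
from typing import Dict

# Hypothetical cut points on the -5.0..+5.0 scale (assumption: the source
# does not state where one lexical level ends and the next begins).
LEVEL_CUTS = [
    (-3.0, "Very Low"),
    (-1.0, "Low"),
    (1.0, "Moderate"),
    (3.0, "High"),
]


def rescale(value: float, lo: float = 0.0, hi: float = 1.0) -> float:
    """Linearly map a value from [lo, hi] onto [-5.0, +5.0]."""
    if hi <= lo:
        raise ValueError("hi must be greater than lo")
    return (value - lo) / (hi - lo) * 10.0 - 5.0


def to_level(score: float) -> str:
    """Return the lexical level for a score on the -5.0..+5.0 scale."""
    for upper, label in LEVEL_CUTS:
        if score < upper:
            return label
    return "Very High"


def lexicalize(vad: Dict[str, float]) -> Dict[str, str]:
    """Convert each numeric VAD dimension to its lexical level."""
    return {dim: to_level(score) for dim, score in vad.items()}


# Illustrative values only, not taken from the paper.
example = {"valence": -4.0, "arousal": 3.5, "dominance": 0.5}
print(lexicalize(example))
```

Equal-width bins are only one possible choice; any monotone set of cut points would fit the same interface.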
- GPT-4-Turbo-2024-04-09: Commercial API access
- LLaMA-3.3-70B: 8-bit quantized version, locally deployed
- Words/Emojis: Few-shot prompting
- VAD Representations: Step-back chain-of-thought prompting
- Constraints: Prohibition of direct affective vocabulary use, requirement to "show rather than tell"
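A few-shot prompt for the Words condition might be assembled roughly as follows. The instruction wording, the `build_prompt` name, and the example format are assumptions made for illustration; they are not the prompts used in the study, and the step-back chain-of-thought prompts for the VAD conditions are not reproduced here.

```python
from typing import Sequence, Tuple

# (keywords, emotion, sentence) triples used as few-shot demonstrations.
Example = Tuple[Sequence[str], str, str]


def build_prompt(
    keywords: Sequence[str],
    emotion: str,
    examples: Sequence[Example] = (),
) -> str:
    """Build a hypothetical few-shot prompt from keywords and an emotion."""
    lines = [
        "Write one natural sentence that uses every keyword and conveys the emotion.",
        "Show the emotion rather than naming it; do not use the emotion word itself.",
        "",
    ]
    # Each demonstration follows the same layout as the final query.
    for ex_keywords, ex_emotion, ex_sentence in examples:
        lines.append(f"Keywords: {', '.join(ex_keywords)}")
        lines.append(f"Emotion: {ex_emotion}")
        lines.append(f"Sentence: {ex_sentence}")
        lines.append("")
    lines.append(f"Keywords: {', '.join(keywords)}")
    lines.append(f"Emotion: {emotion}")
    lines.append("Sentence:")
    return "\n".join(lines)


print(build_prompt(["Place", "Great", "Korean"], "angry"))
```

The resulting string would be sent to whichever model is under test; swapping the `emotion` argument for an emoji gives the Emojis condition under the same template.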
- Total of 360 sentences per model (90 per representation modality)
- Coverage of 18 distinct emotions from Demszky et al. (2020) classification
- Two sentences per emotion randomly selected for evaluation
- Emotion Selection: Based on Demszky et al. (2020) emotion classification, selecting 18 representative emotions
- Keyword Combinations: Using common vocabulary combinations such as Place, Great, Korean, Finals, Semester, Math
- VAD Numeric Values: Based on Guo and Choi (2021), normalized to -5.0 to +5.0 range
- Platform: Prolific crowdsourcing platform
- Sample Size: 200 participants (100 per model)
- Criteria: Age 18+, US-based, English fluent
- Compensation: $14/hour, approximately 15-minute task
1. Representation Alignment Assessment
   - Display one affective representation and four generated sentences
   - Participants select the sentence best matching the emotion
   - Each participant answers 10 questions, randomly assigned
2. Accuracy and Authenticity Assessment
   - 5-point Likert scale evaluation:
     - "Convey": Degree to which sentence conveys emotion
     - "You'd say": Sounds like something the participant would say
     - "Someone Else'd say": Sounds like something others would say
- Selection Rate: Percentage of times specific representation is selected
- Shannon Entropy: Measures consistency of selections
- Self-Alignment: Matching between generation and evaluation using same representation
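The three alignment metrics above can be computed from raw participant choices along these lines. This is a sketch under assumptions: the logarithm base and any normalization behind the reported entropy values are not stated in this summary, so base 2 is used as a default, and the data layout (lists of representation names) is hypothetical.

```python
import math
from collections import Counter
from typing import Dict, Sequence, Tuple


def selection_rates(choices: Sequence[str]) -> Dict[str, float]:
    """Fraction of times each representation's sentence was selected."""
    if not choices:
        raise ValueError("choices must not be empty")
    counts = Counter(choices)
    return {rep: n / len(choices) for rep, n in counts.items()}


def shannon_entropy(choices: Sequence[str], base: float = 2.0) -> float:
    """Entropy of the selection distribution; 0 means full agreement."""
    rates = selection_rates(choices)
    return -sum(p * math.log(p, base) for p in rates.values())


def self_alignment(pairs: Sequence[Tuple[str, str]]) -> float:
    """Share of (shown, chosen) pairs where the chosen sentence was
    generated from the same representation that was shown."""
    if not pairs:
        raise ValueError("pairs must not be empty")
    return sum(1 for shown, chosen in pairs if shown == chosen) / len(pairs)


# Made-up responses for one question, for illustration only.
votes = ["Words", "Words", "Words", "Emojis"]
print(selection_rates(votes), shannon_entropy(votes))
```

Lower entropy indicates that participants converged on the same sentence, which is why it serves as a consistency measure alongside the selection rate.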
- Average Likert scores across three dimensions
- ANOVA statistical significance testing
- Paired t-tests for post-hoc analysis
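For the post-hoc comparisons, the paired t statistic on two sets of matched Likert ratings can be computed with the standard library alone. The ratings below are invented for illustration, and the function name is an assumption; the summary does not say which software the authors used.

```python
import math
from statistics import mean, stdev
from typing import Sequence


def paired_t_statistic(a: Sequence[float], b: Sequence[float]) -> float:
    """t statistic for paired samples: mean difference over its standard error."""
    if len(a) != len(b):
        raise ValueError("samples must have the same length")
    if len(a) < 2:
        raise ValueError("need at least two pairs")
    diffs = [x - y for x, y in zip(a, b)]
    sd = stdev(diffs)  # sample standard deviation (n - 1 in the denominator)
    if sd == 0:
        raise ValueError("all differences are identical; t is undefined")
    return mean(diffs) / (sd / math.sqrt(len(diffs)))


# Invented "Convey" ratings for the same items under two representations.
words_ratings = [4, 5, 3, 4, 5]
numeric_vad_ratings = [3, 3, 2, 4, 3]
t = paired_t_statistic(words_ratings, numeric_vad_ratings)
print(f"t = {t:.3f} with {len(words_ratings) - 1} degrees of freedom")
```

Turning the statistic into a p-value requires the t distribution; in practice `scipy.stats.ttest_rel` returns both, and `scipy.stats.f_oneway` covers the one-way ANOVA step.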
| Representation | GPT-4 Selection Rate | LLaMA-3 Selection Rate | GPT-4 Entropy | LLaMA-3 Entropy |
|---|---|---|---|---|
| Words | 61.9% | 57.5% | 0.32 | 0.42 |
| Lexical VAD | 52.0% | - | 0.61 | 0.72 |
| Numeric VAD | - | - | 0.70 | 0.63 |
| Emojis | - | - | 0.67 | 0.52 |
- Words Representation Optimal: Demonstrates highest self-alignment rates and lowest entropy values across both models
- Lexical VAD Secondary: Performs well on GPT-4 but shows diminished effectiveness on LLaMA-3
- Numeric VAD Poorest Performance: Highest entropy on GPT-4 (0.70) and second-highest on LLaMA-3 (0.63, behind Lexical VAD at 0.72), indicating participant difficulty reaching consensus
- Cross-Representation Alignment: Emojis and Lexical VAD show alignment on LLaMA-3
- GPT-4: Affective representation significantly affects "Convey" and "You'd say" (p < 0.01)
- LLaMA-3: Affective representation significantly affects "Convey" and "Someone Else'd say" (p < 0.05)
- Words significantly outperforms Numeric VAD on "Convey" dimension (GPT-4, p = 0.002)
- Lexical VAD significantly outperforms Numeric VAD on "Convey" dimension (LLaMA-3, p = 0.018)
- Words significantly outperforms Emojis (p = 0.005) and Numeric VAD (p = 0.044) on "You'd say" dimension
- GPT-4 markedly outperforms LLaMA-3 in generating "grateful" emotion sentences
- Significant performance variations exist for different emotions across representation modalities
- Certain emotions (e.g., "excited," "proud") show diminished performance under specific conditions
- Positive emotions typically perform better under Words representation
- Complex emotional states are better suited to Lexical VAD representation
- Numeric VAD encounters difficulties in fine-grained emotion differentiation
| Model | 1 Keyword | 2 Keywords | 3 Keywords | Average Accuracy |
|---|---|---|---|---|
| GPT-4, 1x | 1.00 | 1.00 | 0.936 | 0.978 |
| LLaMA-3, 1x | 0.908 | 0.897 | 0.781 | 0.862 |
| LLaMA-3, 3x | 0.969 | 0.969 | 0.850 | 0.930 |
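Assuming the table above reports how often the requested keywords actually appear in the generated sentence (the metric is not spelled out in this summary), a simple coverage check could look like the sketch below. The exact-token matching rule and the `keyword_coverage` name are assumptions; inflected forms such as "places" would not count as a match here.

```python
import re
from typing import Sequence


def keyword_coverage(sentence: str, keywords: Sequence[str]) -> float:
    """Fraction of keywords that occur as whole tokens in the sentence."""
    if not keywords:
        raise ValueError("keywords must not be empty")
    # Lowercase and split into word tokens so matching ignores case and punctuation.
    tokens = set(re.findall(r"[a-z0-9']+", sentence.lower()))
    hits = sum(1 for kw in keywords if kw.lower() in tokens)
    return hits / len(keywords)


# Illustrative sentences, not drawn from the study's data.
print(keyword_coverage("That Korean place was great.", ["Place", "Great", "Korean"]))
print(keyword_coverage("The math finals are next week.", ["Finals", "Semester", "Math"]))
```

Averaging this score over all sentences with one, two, or three keywords would give per-column figures of the kind shown in the table.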
Providing participants with VAD concept explanations and practice questions improved comprehension accuracy, though cognitive load issues persist.
- Early grammar-based systems (Kasper, 1989; Uchimoto et al., 2002)
- Sequence models and iterative refinement methods (Mou et al., 2016; He and Li, 2021)
- Transformer-era controlled generation techniques (Kumar et al., 2021; Krause et al., 2021)
- Early rule-based systems (Polzin and Waibel, 2000)
- RNN-based conditional generation (Ghosh et al., 2017; Song et al., 2019)
- LLM-era affective generation methods (Li et al., 2024; Mishra et al., 2023)
- Normative behavior learning in children's stories (Nahian et al., 2020)
- Value integration in reinforcement learning from human feedback (Arzberger et al., 2024)
- Value alignment measurement in existing models (Norhashim and Hahn, 2024)
- Importance of Representation Alignment: The degree of alignment between human and LLM conceptual understanding directly impacts application effectiveness
- Superiority of Words Representation: English lexical terms provide the strongest alignment effect in affective representation
- Complexity of VAD Representation: Lexicalized VAD outperforms numeric VAD, but remains inferior to direct lexical representation
- Inter-Model Differences: Significant variations exist among different LLMs in emotion understanding and generation
- Model Selection: Only two LLMs tested, with LLaMA-3 using 8-bit quantization
- Language Constraints: Limited to English; other languages may exhibit different results
- Participant Representativeness: Does not include actual AAC user populations
- VAD Comprehension Burden: Participants require additional VAD concept learning, potentially affecting evaluation results
- Emoji Subjectivity: Cross-cultural differences in emoji interpretation exist
- Emotion Complexity: 18 emotions may not comprehensively cover the full emotional spectrum
- Expanded Model Coverage: Testing additional state-of-the-art LLMs
- Multilingual Validation: Verifying conclusions across other language environments
- User Personalization: Personalized representation learning for specific AAC user populations
- Real-Time Application: Deployment and evaluation in authentic AAC environments
- Novel Representation Alignment Paradigm: Provides systematic methodology for evaluating LLM conceptual understanding
- Multi-Dimensional Evaluation Framework: Integrates alignment, accuracy, and authenticity assessment
- Practice-Oriented Research: Directly addresses real-world AAC application requirements
- Large-Scale Human Evaluation: 200 crowdsourced participants (100 per model) support the reliability of the results
- Statistical Rigor: ANOVA and paired t-tests confirm result significance
- Multi-Perspective Analysis: Comprehensive evaluation from alignment, accuracy, and authenticity dimensions
- Consistent Findings: Result trends align across both models
- Statistical Significance: Main conclusions pass statistical significance testing
- Practical Guidance Value: Provides clear design recommendations for AAC applications
- Evaluation Subjectivity: Reliance on human subjective judgment introduces potential bias
- Task Simplification: Keyword-to-sentence generation is relatively simple; actual AAC scenarios are more complex
- Static Assessment: Does not account for context dependency in dynamic dialogue
- Insufficient Participant Training: Rapid VAD concept training may be inadequate
- Limited Sample Size: Relatively few respondents per question (3-9 individuals)
- Model Version Differences: The specific model versions tested may limit how well the results carry over to newer releases
- Pioneering Work: First systematic investigation of LLM representation alignment
- Methodological Contribution: Representation alignment evaluation paradigm extensible to other conceptual domains
- Interdisciplinary Value: Bridges NLP, psychology, and assistive technology research
- AAC Tool Improvement: Provides guidance for affective representation design in AAC applications
- LLM Optimization Direction: Offers insights for enhancing LLM-human conceptual alignment
- Evaluation Standard Establishment: Establishes evaluation benchmarks for similar applications
- Detailed Method Description: Provides complete experimental setup and parameter configuration
- Open Data Commitment: Pledges to release experimental data and code
- Standardized Procedures: Establishes reproducible evaluation workflow
- AAC Tool Development: Design and optimization of affective expression functionality
- Dialogue Systems: Enhanced emotional understanding and expression capabilities
- Text Generation Evaluation: Establishes human-machine alignment evaluation standards
- Other Conceptual Alignments: Extension to values, cultural concepts, etc.
- Multimodal Alignment: Integration of visual, audio, and other multimodal information
- Personalized Adaptation: Customized alignment for specific user populations
This research cites extensive related work, primarily including:
- Demszky et al. (2020): GoEmotions emotion dataset
- Guo and Choi (2021): VAD emotion representation learning
- Valencia et al. (2023): AI language model applications in AAC
- Chen and Wan (2024): LLM lexical constraint generation capability evaluation
Overall Assessment: This is a high-quality research work making pioneering contributions to the important problem of LLM-human conceptual alignment. The research methodology is scientifically rigorous, experimental design is sound, and results possess significant theoretical and practical value. While certain limitations exist, the work establishes a solid foundation for future related research.