Evaluating Human-LLM Representation Alignment: A Case Study on Affective Sentence Generation for Augmentative and Alternative Communication

Choudhury, Kumar, Martin
Gaps arise between a language model's use of concepts and people's expectations. This gap is critical when LLMs generate text to help people communicate via Augmentative and Alternative Communication (AAC) tools. In this work, we introduce the evaluation task of Representation Alignment for measuring this gap via human judgment. In our study, we expand keywords and emotion representations into full sentences. We select four emotion representations: Words, Valence-Arousal-Dominance (VAD) dimensions expressed in both Lexical and Numeric forms, and Emojis. In addition to Representation Alignment, we also measure people's judgments of the accuracy and realism of the generated sentences. While representations like VAD break emotions into easy-to-compute components, our findings show that people agree more with how LLMs generate when conditioned on English words (e.g., "angry") rather than VAD scales. This difference is especially visible when comparing Numeric VAD to words. Furthermore, we found that the perception of how much a generated sentence conveys an emotion is dependent on both the representation type and which emotion it is.

Basic Information

  • Paper ID: 2503.11881
  • Title: Evaluating Human-LLM Representation Alignment: A Case Study on Affective Sentence Generation for Augmentative and Alternative Communication
  • Authors: Shadab Choudhury, Asha Kumar, Lara J. Martin (University of Maryland, Baltimore County)
  • Classification: cs.CL (Computational Linguistics)
  • Publication Date: 2025
  • Paper Link: https://arxiv.org/abs/2503.11881

Abstract

This study addresses the gap between Large Language Models' (LLMs) conceptual usage and human expectations, particularly in the context of Augmentative and Alternative Communication (AAC) tools. The research introduces "Representation Alignment" as an evaluation task, measuring this gap through human judgment. Four affective representation modalities are examined: English lexical terms, lexicalized VAD dimensions, numeric VAD dimensions, and emojis, with evaluation of generated sentence accuracy and authenticity. Results demonstrate that humans show greater alignment with LLM outputs generated under English lexical conditions compared to VAD scales, with this discrepancy being particularly pronounced in numeric VAD versus lexical comparisons.

Research Background and Motivation

Problem Definition

  1. Core Issue: LLMs exhibit gaps in conceptual usage relative to human expectations, which is particularly critical in AAC tool applications
  2. Application Context: AAC tools assist individuals with speech impairments in communication, with communication speed being a primary pain point
  3. Technical Challenge: Ensuring that LLM-generated text accurately reflects users' affective intentions and expression preferences

Research Significance

  • AAC users frequently experience communication delays leading to being overlooked or interrupted
  • Current NLP technologies offer potential to enhance AAC tool communication speed
  • User concerns regarding LLM controllability, accuracy, and contextual appropriateness persist

Limitations of Existing Approaches

  • Lack of systematic evaluation of LLM-human alignment in conceptual understanding
  • Insufficient empirical evidence for selecting affective representation modalities
  • Inadequate consideration of different representation modalities' impact on user experience

Core Contributions

  1. Proposes Representation Alignment Evaluation Paradigm: Introduces an evaluation methodology measuring the alignment between LLM conceptual usage and human mental models through human judgment
  2. Systematically Compares Four Affective Representations: Comprehensively evaluates the effectiveness of Words, Lexical VAD, Numeric VAD, and Emojis representations
  3. Identifies Optimal Representation Modality: Demonstrates that English lexical terms and lexicalized VAD perform best in representation alignment, accuracy, and authenticity
  4. Provides AAC Application Guidance: Furnishes empirical evidence for selecting affective representations in future AAC applications

Methodology Details

Task Definition

  • Input: Three keywords + one affective representation
  • Output: Complete sentence containing keywords and expressing specified affect
  • Constraints: Generated sentences should be natural, accurately convey emotion, and avoid direct use of affective vocabulary
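
The task input described above can be sketched as a small data structure plus a prompt builder. This is only an illustration: the class name, field names, and prompt wording are hypothetical and are not taken from the paper, and the keyword triple is an assumed grouping of the example keywords.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GenerationRequest:
    """One task input: three keywords plus one affective representation."""
    keywords: tuple[str, str, str]
    representation_type: str   # "words", "lexical_vad", "numeric_vad", or "emoji"
    representation_value: str  # e.g. "angry"


def build_prompt(request: GenerationRequest) -> str:
    """Render the request as an instruction string (hypothetical wording)."""
    keyword_list = ", ".join(request.keywords)
    return (
        "Write one natural-sounding sentence that uses the keywords "
        f"[{keyword_list}] and expresses this emotion "
        f"({request.representation_type}): {request.representation_value}. "
        "Show the emotion rather than naming it directly."
    )


# Illustrative input; the keyword grouping is an assumption.
request = GenerationRequest(("Finals", "Semester", "Math"), "words", "angry")
prompt = build_prompt(request)
print(prompt)
```

Keeping the representation type and value as separate fields makes it easy to swap in any of the four modalities without changing the rest of the pipeline.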

Affective Representation Modalities

1. Words Representation

Direct use of English affective vocabulary (e.g., "angry," "happy")

2. Lexical VAD Representation

Five-level lexical descriptions of VAD dimensions:

  • Valence: Very High/High/Moderate/Low/Very Low
  • Arousal: Degree of emotional activation
  • Dominance: Level of control over emotion

3. Numeric VAD Representation

Numeric scale from -5.0 to +5.0 representing VAD dimensions

4. Emojis Representation

Unicode emoji symbols representing emotions
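
To make the relationship between the Numeric and Lexical VAD forms concrete, the sketch below maps a score on the -5.0 to +5.0 scale onto the five lexical levels. The equal-width bins are an assumption made for illustration; this summary does not state which thresholds the authors used, and the function name is hypothetical.

```python
LEVELS = ("Very Low", "Low", "Moderate", "High", "Very High")


def numeric_to_lexical(score: float, low: float = -5.0, high: float = 5.0) -> str:
    """Map a numeric VAD score to one of five levels using equal-width bins.

    The bin boundaries are an assumption, not the paper's definition.
    """
    if not low <= score <= high:
        raise ValueError(f"score {score} is outside [{low}, {high}]")
    width = (high - low) / len(LEVELS)
    # Clamp so that the top of the scale falls into the last bin.
    index = min(int((score - low) / width), len(LEVELS) - 1)
    return LEVELS[index]


# Hypothetical VAD triple for a single emotion.
vad = {"valence": -3.5, "arousal": 4.0, "dominance": 0.5}
lexical = {dimension: numeric_to_lexical(value) for dimension, value in vad.items()}
print(lexical)
```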

Model Architecture and Generation Strategy

Models Used

  • GPT-4-Turbo-2024-04-09: Commercial API access
  • LLaMA-3.3-70B: 8-bit quantized version, locally deployed

Prompting Strategies

  • Words/Emojis: Few-shot prompting
  • VAD Representations: Step-back chain-of-thought prompting
  • Constraints: Prohibition of direct affective vocabulary use, requirement to "show rather than tell"

Data Generation

  • Total of 360 sentences per model (90 per representation modality)
  • Coverage of 18 distinct emotions from Demszky et al. (2020) classification
  • Two sentences per emotion randomly selected for evaluation
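
The counts above can be checked with a few lines of arithmetic, followed by a sketch of the random selection step. Two assumptions are made here: that the 90 sentences per representation are split evenly across the 18 emotions, and that the two-sentence selection happens within each emotion's pool. The placeholder sentence names and the seed are illustrative only.

```python
import random

NUM_EMOTIONS = 18
SENTENCES_PER_REPRESENTATION = 90
REPRESENTATIONS = ("words", "lexical_vad", "numeric_vad", "emoji")

# Assumes an even split across emotions.
per_emotion = SENTENCES_PER_REPRESENTATION // NUM_EMOTIONS
total_per_model = SENTENCES_PER_REPRESENTATION * len(REPRESENTATIONS)


def sample_for_evaluation(pool, k=2, seed=0):
    """Pick k sentences per emotion without replacement (seeded for repeatability)."""
    rng = random.Random(seed)
    return {emotion: rng.sample(sentences, k) for emotion, sentences in pool.items()}


# Placeholder pool for one representation: emotion name -> generated sentences.
pool = {
    f"emotion_{i}": [f"sentence_{i}_{j}" for j in range(per_emotion)]
    for i in range(NUM_EMOTIONS)
}
selected = sample_for_evaluation(pool)
print(per_emotion, total_per_model, len(selected))
```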

Experimental Setup

Dataset Construction

  • Emotion Selection: Based on Demszky et al. (2020) emotion classification, selecting 18 representative emotions
  • Keyword Combinations: Using common vocabulary combinations such as Place, Great, Korean, Finals, Semester, Math
  • VAD Numeric Values: Based on Guo and Choi (2021), normalized to -5.0 to +5.0 range
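
Normalizing VAD values to the -5.0 to +5.0 range can be done with a linear rescale, sketched below. The source range of the original VAD values is not given in this summary, so the [0, 1] range in the example is an assumption, and the function name is hypothetical.

```python
def rescale(value: float, src_min: float, src_max: float,
            dst_min: float = -5.0, dst_max: float = 5.0) -> float:
    """Linearly map value from [src_min, src_max] onto [dst_min, dst_max]."""
    if src_max <= src_min:
        raise ValueError("src_max must be greater than src_min")
    fraction = (value - src_min) / (src_max - src_min)
    return dst_min + fraction * (dst_max - dst_min)


# Assumed source range of [0, 1]; replace with the range of the actual VAD lexicon.
for raw in (0.0, 0.5, 0.75, 1.0):
    print(raw, rescale(raw, 0.0, 1.0))
```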

Human Evaluation Design

Participant Recruitment

  • Platform: Prolific crowdsourcing platform
  • Sample Size: 200 participants (100 per model)
  • Criteria: Age 18+, US-based, fluent in English
  • Compensation: $14/hour, approximately 15-minute task

Evaluation Tasks

1. Representation Alignment Assessment
  • Display one affective representation and four generated sentences
  • Participants select the sentence best matching the emotion
  • Each participant answers 10 questions, randomly assigned
2. Accuracy and Authenticity Assessment
  • 5-point Likert scale evaluation:
    • "Convey": Degree to which sentence conveys emotion
    • "You'd say": Sounds like something the participant would say
    • "Someone Else'd say": Sounds like something others would say

Evaluation Metrics

Representation Alignment Metrics

  • Selection Rate: Percentage of times specific representation is selected
  • Shannon Entropy: Measures consistency of selections
  • Self-Alignment: Matching between generation and evaluation using same representation
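
The three metrics above can be sketched in a few functions. The logarithm base and any normalization behind the entropy values reported in this summary are not stated, so the base-2 entropy with optional normalization by the number of answer options is an assumption, as are the function names and the sample responses.

```python
import math
from collections import Counter


def selection_rates(choices):
    """Fraction of responses that picked each option."""
    counts = Counter(choices)
    return {option: count / len(choices) for option, count in counts.items()}


def shannon_entropy(choices, num_options=None):
    """Entropy of the selection distribution in bits.

    If num_options is given, divide by log2(num_options) so the result lies in [0, 1].
    """
    entropy = -sum(p * math.log2(p) for p in selection_rates(choices).values())
    if num_options is None:
        return entropy
    if num_options < 2:
        raise ValueError("num_options must be at least 2")
    return entropy / math.log2(num_options)


def self_alignment_rate(responses):
    """Share of responses whose chosen sentence came from the representation shown.

    responses: list of (shown_representation, source_of_chosen_sentence) pairs.
    """
    matches = sum(1 for shown, chosen in responses if shown == chosen)
    return matches / len(responses)


# Hypothetical responses to questions that displayed the Words representation.
responses = [("words", "words")] * 6 + [("words", "emoji")] * 2
choices = [chosen for _, chosen in responses]
print(selection_rates(choices))
print(shannon_entropy(choices), shannon_entropy(choices, num_options=4))
print(self_alignment_rate(responses))
```

Lower entropy means participants converged on the same sentence, and a higher self-alignment rate means they converged on the sentence generated from the representation they were shown.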

Accuracy and Authenticity Metrics

  • Average Likert scores across three dimensions
  • ANOVA statistical significance testing
  • Paired t-tests for post-hoc analysis
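
As a rough illustration of the statistics listed above, the sketch below computes a one-way ANOVA F statistic and a paired t statistic using only the standard library. The exact ANOVA design used in the paper is not specified in this summary, so a one-way layout is an assumption, and the ratings are made up for the example. For p-values, SciPy's `scipy.stats.f_oneway` and `scipy.stats.ttest_rel` provide the corresponding tests.

```python
import math
from statistics import mean, stdev


def one_way_anova_f(groups):
    """F statistic: between-group mean square over within-group mean square."""
    all_values = [value for group in groups for value in group]
    grand_mean = mean(all_values)
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    ss_within = sum((value - mean(g)) ** 2 for g in groups for value in g)
    df_between = len(groups) - 1
    df_within = len(all_values) - len(groups)
    return (ss_between / df_between) / (ss_within / df_within)


def paired_t(a, b):
    """t statistic for paired samples: mean difference over its standard error."""
    if len(a) != len(b):
        raise ValueError("paired samples must have the same length")
    diffs = [x - y for x, y in zip(a, b)]
    return mean(diffs) / (stdev(diffs) / math.sqrt(len(diffs)))


# Made-up 5-point Likert ratings for one dimension (e.g. "Convey").
words = [5, 4, 4, 5, 3]
numeric_vad = [3, 3, 2, 4, 2]
emoji = [4, 4, 3, 4, 3]

f_stat = one_way_anova_f([words, numeric_vad, emoji])
t_stat = paired_t(words, numeric_vad)
print(f_stat, t_stat)
```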

Experimental Results

Primary Results

Representation Alignment Performance

Representation | GPT-4 Selection Rate | LLaMA-3 Selection Rate | GPT-4 Entropy | LLaMA-3 Entropy
Words          | 61.9%                | 57.5%                  | 0.32          | 0.42
Lexical VAD    | 52.0%                | -                      | 0.61          | 0.72
Numeric VAD    | -                    | -                      | 0.70          | 0.63
Emojis         | -                    | -                      | 0.67          | 0.52

Key Findings

  1. Words Representation Optimal: Demonstrates highest self-alignment rates and lowest entropy values across both models
  2. Lexical VAD Secondary: Performs well on GPT-4 but shows diminished effectiveness on LLaMA-3
  3. Numeric VAD Poorest Performance: Highest entropy on GPT-4 (0.70), indicating that participants had difficulty reaching consensus; on LLaMA-3, Lexical VAD has the highest entropy (0.72)
  4. Cross-Representation Alignment: Emojis and Lexical VAD show alignment on LLaMA-3

Accuracy and Authenticity Results

Statistical Significance

  • GPT-4: Affective representation significantly affects "Convey" and "You'd say" (p < 0.01)
  • LLaMA-3: Affective representation significantly affects "Convey" and "Someone Else'd say" (p < 0.05)

Pairwise Comparisons

  • Words significantly outperforms Numeric VAD on "Convey" dimension (GPT-4, p = 0.002)
  • Lexical VAD significantly outperforms Numeric VAD on "Convey" dimension (LLaMA-3, p = 0.018)
  • Words significantly outperforms Emojis (p = 0.005) and Numeric VAD (p = 0.044) on "You'd say" dimension

Emotion-Specific Analysis

Model Differences

  • GPT-4 markedly outperforms LLaMA-3 in generating "grateful" emotion sentences
  • Significant performance variations exist for different emotions across representation modalities
  • Certain emotions (e.g., "excited," "proud") show diminished performance under specific conditions

Representation Adaptability

  • Positive emotions typically perform better under Words representation
  • Complex emotional states are better suited to Lexical VAD representation
  • Numeric VAD encounters difficulties in fine-grained emotion differentiation

Ablation Studies

Keyword Adherence Analysis

Model       | 1 Keyword | 2 Keywords | 3 Keywords | Average Accuracy
GPT-4, 1x   | 1.00      | 1.00       | 0.936      | 0.978
LLaMA-3, 1x | 0.908     | 0.897      | 0.781      | 0.862
LLaMA-3, 3x | 0.969     | 0.969      | 0.850      | 0.930
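
A keyword adherence check of this kind can be approximated by testing whether each keyword appears as a token in the generated sentence. The sketch below is a simplification under stated assumptions: it uses exact, case-insensitive token matching (so inflected forms such as "final" for "finals" would be missed), and it does not reproduce how the table's columns were aggregated, which this summary does not describe. The function name and example sentences are hypothetical.

```python
import re


def keyword_adherence(sentence, keywords):
    """Fraction of keywords that occur as whole tokens in the sentence."""
    tokens = set(re.findall(r"[a-z0-9']+", sentence.lower()))
    hits = sum(1 for keyword in keywords if keyword.lower() in tokens)
    return hits / len(keywords)


keywords = ["Finals", "Semester", "Math"]
full = keyword_adherence("My math finals wrecked the whole semester.", keywords)
partial = keyword_adherence("The semester is finally over.", keywords)
print(full, partial)
```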

VAD Training Effects

Providing participants with VAD concept explanations and practice questions improved comprehension accuracy, though cognitive load issues persist.

Related Work

Keyword-Constrained Generation

  • Early grammar-based systems (Kasper, 1989; Uchimoto et al., 2002)
  • Sequence models and iterative refinement methods (Mou et al., 2016; He and Li, 2021)
  • Transformer-era controlled generation techniques (Kumar et al., 2021; Krause et al., 2021)

Emotion-Conditioned Sentence Generation

  • Early rule-based systems (Polzin and Waibel, 2000)
  • RNN-based conditional generation (Ghosh et al., 2017; Song et al., 2019)
  • LLM-era affective generation methods (Li et al., 2024; Mishra et al., 2023)

Value Alignment Research

  • Normative behavior learning in children's stories (Nahian et al., 2020)
  • Value integration in reinforcement learning from human feedback (Arzberger et al., 2024)
  • Value alignment measurement in existing models (Norhashim and Hahn, 2024)

Conclusions and Discussion

Main Conclusions

  1. Importance of Representation Alignment: The degree of alignment between human and LLM conceptual understanding directly impacts application effectiveness
  2. Superiority of Words Representation: English lexical terms provide the strongest alignment effect in affective representation
  3. Complexity of VAD Representation: Lexicalized VAD outperforms numeric VAD, but remains inferior to direct lexical representation
  4. Inter-Model Differences: Significant variations exist among different LLMs in emotion understanding and generation

Limitations

Technical Limitations

  1. Model Selection: Only two LLMs tested, with LLaMA-3 using 8-bit quantization
  2. Language Constraints: Limited to English; other languages may exhibit different results
  3. Participant Representativeness: Does not include actual AAC user populations

Methodological Limitations

  1. VAD Comprehension Burden: Participants require additional VAD concept learning, potentially affecting evaluation results
  2. Emoji Subjectivity: Cross-cultural differences in emoji interpretation exist
  3. Emotion Complexity: 18 emotions may not comprehensively cover the full emotional spectrum

Future Directions

  1. Expanded Model Coverage: Testing additional state-of-the-art LLM models
  2. Multilingual Validation: Verifying conclusions across other language environments
  3. User Personalization: Personalized representation learning for specific AAC user populations
  4. Real-Time Application: Deployment and evaluation in authentic AAC environments

In-Depth Evaluation

Strengths

Methodological Innovation

  1. Novel Representation Alignment Paradigm: Provides systematic methodology for evaluating LLM conceptual understanding
  2. Multi-Dimensional Evaluation Framework: Integrates alignment, accuracy, and authenticity assessment
  3. Practice-Oriented Research: Directly addresses real-world AAC application requirements

Experimental Rigor

  1. Large-Scale Human Evaluation: 200 crowdsourced participants (100 per model) support the reliability of the results
  2. Statistical Rigor: ANOVA and paired t-tests confirm result significance
  3. Multi-Perspective Analysis: Comprehensive evaluation from alignment, accuracy, and authenticity dimensions

Result Convincingness

  1. Consistent Findings: Result trends align across both models
  2. Statistical Significance: Main conclusions pass statistical significance testing
  3. Practical Guidance Value: Provides clear design recommendations for AAC applications

Weaknesses

Methodological Limitations

  1. Evaluation Subjectivity: Reliance on human subjective judgment introduces potential bias
  2. Task Simplification: Keyword-to-sentence generation is relatively simple; actual AAC scenarios are more complex
  3. Static Assessment: Does not account for context dependency in dynamic dialogue

Experimental Design Flaws

  1. Insufficient Participant Training: Rapid VAD concept training may be inadequate
  2. Limited Sample Size: Relatively few respondents per question (3-9 individuals)
  3. Model Version Dependence: Results are tied to the specific model versions tested and may not hold for newer releases

Impact Assessment

Academic Contribution

  1. Pioneering Work: First systematic investigation of LLM representation alignment
  2. Methodological Contribution: Representation alignment evaluation paradigm extensible to other conceptual domains
  3. Interdisciplinary Value: Bridges NLP, psychology, and assistive technology research

Practical Value

  1. AAC Tool Improvement: Provides guidance for affective representation design in AAC applications
  2. LLM Optimization Direction: Offers insights for enhancing LLM-human conceptual alignment
  3. Evaluation Standard Establishment: Establishes evaluation benchmarks for similar applications

Reproducibility

  1. Detailed Method Description: Provides complete experimental setup and parameter configuration
  2. Open Data Commitment: Pledges to release experimental data and code
  3. Standardized Procedures: Establishes reproducible evaluation workflow

Applicable Scenarios

Direct Applications

  1. AAC Tool Development: Design and optimization of affective expression functionality
  2. Dialogue Systems: Enhanced emotional understanding and expression capabilities
  3. Text Generation Evaluation: Establishes human-machine alignment evaluation standards

Extended Applications

  1. Other Conceptual Alignments: Extension to values, cultural concepts, etc.
  2. Multimodal Alignment: Integration of visual, audio, and other multimodal information
  3. Personalized Adaptation: Customized alignment for specific user populations

References

This research cites extensive related work, primarily including:

  • Demszky et al. (2020): GoEmotions emotion dataset
  • Guo and Choi (2021): VAD emotion representation learning
  • Valencia et al. (2023): AI language model applications in AAC
  • Chen and Wan (2024): LLM lexical constraint generation capability evaluation

Overall Assessment: This is a high-quality study that makes an early contribution to the important problem of LLM-human conceptual alignment. The methodology is rigorous, the experimental design is largely sound, and the results have both theoretical and practical value. Despite the limitations noted above, the work establishes a solid foundation for future research in this area.