Joint Modeling of Big Five and HEXACO for Multimodal Apparent Personality-trait Recognition

Masumura, Orihashi, Ihori et al.
This paper proposes a joint modeling method of the Big Five, which has long been studied, and HEXACO, which has recently attracted attention in psychology, for automatically recognizing apparent personality traits from multimodal human behavior. Most previous studies have used the Big Five for multimodal apparent personality-trait recognition. However, no study has focused on apparent HEXACO which can evaluate an Honesty-Humility trait related to displaced aggression and vengefulness, social-dominance orientation, etc. In addition, the relationships between the Big Five and HEXACO when modeled by machine learning have not been clarified. We expect awareness of multimodal human behavior to improve by considering these relationships. The key advance of our proposed method is to optimize jointly recognizing the Big Five and HEXACO. Experiments using a self-introduction video dataset demonstrate that the proposed method can effectively recognize the Big Five and HEXACO.

Basic Information

  • Paper ID: 2510.14203
  • Title: Joint Modeling of Big Five and HEXACO for Multimodal Apparent Personality-trait Recognition
  • Authors: Ryo Masumura, Shota Orihashi, Mana Ihori, Tomohiro Tanaka, Naoki Makishima, Taiga Yamane, Naotaka Kawata, Satoshi Suzuki, Taichi Katayama (NTT, Inc., Japan)
  • Classification: cs.CV cs.CL cs.MM
  • Publication Date: October 16, 2025
  • Paper Link: https://arxiv.org/abs/2510.14203

Abstract

This paper proposes a joint modeling approach that combines the well-established Big Five personality traits with the recently emerging HEXACO personality traits for automatic recognition of apparent personality traits from multimodal human behavior. Previous research on multimodal apparent personality-trait recognition has primarily focused on the Big Five framework, with no studies addressing apparent HEXACO traits. However, HEXACO enables assessment of the Honesty-Humility dimension, which correlates with displaced aggression, revenge motivation, social dominance orientation, and related constructs. Furthermore, the relationship between Big Five and HEXACO in machine learning modeling remains unexplored. By considering these relationships, the authors expect to enhance perception capabilities for multimodal human behavior.

Research Background and Motivation

Problem Definition

  1. Core Issue: Existing multimodal personality-trait recognition research primarily focuses on Big Five, with insufficient attention to HEXACO (particularly the Honesty-Humility dimension)
  2. Significance: The Honesty-Humility trait in HEXACO shows strong negative correlations with displaced aggression, revenge motivation, social dominance orientation, and workplace misconduct, which makes it psychologically important to assess
  3. Existing Limitations:
    • Lack of multimodal recognition research for apparent HEXACO traits
    • The relationship between Big Five and HEXACO in machine learning modeling remains underexplored
    • Existing datasets are primarily designed for Big Five

Research Motivation

By jointly modeling Big Five and HEXACO, leveraging the psychological relationships between the two personality frameworks, the study aims to enhance the robustness and accuracy of multimodal personality-trait recognition.

Core Contributions

  1. First Study: The first research on multimodal apparent HEXACO personality-trait recognition
  2. Joint Modeling Method: Proposes a joint modeling approach for Big Five and HEXACO that improves recognition performance for both frameworks
  3. Relationship Exploration: First investigation of the relationship between Big Five and other personality traits (HEXACO) in multimodal apparent personality-trait recognition
  4. Dataset Contribution: Constructs a self-introduction video dataset with annotations for both Big Five and HEXACO traits

Methodology Details

Task Definition

Given audio-visual video input, jointly estimate Big Five scores $\hat{y} = [\hat{y}_1, \cdots, \hat{y}_5]^\top$ and HEXACO scores $\hat{z} = [\hat{z}_1, \cdots, \hat{z}_6]^\top$:

$$\{\hat{y}, \hat{z}\} = F(S, U; \Theta)$$

where $S$ represents audio features, $U$ represents visual features, and $\Theta$ represents the set of trainable parameters.

Model Architecture

Multimodal Transformer Architecture

The model comprises four encoders: audio encoder, text encoder, visual encoder, and multimodal encoder.

  1. Feature Encoding:
    • Audio encoder: $S \rightarrow A$ (audio representation)
    • Text encoder: $W \rightarrow T$ (text representation, obtained via ASR)
    • Visual encoder: $U \rightarrow V$ (visual representation)
  2. Multimodal Fusion:
    H₀ = TemporalConcat(A,T,V)  # Temporal concatenation
    H'₀ = AddSegment(H₀; θ_segment)  # Add modality segmentation information
    H = TransformerEnc(H'₀; θ_multi)  # Transformer encoding
    
  3. Attentive Pooling:
    h = AttentivePooling(H; θ_pool)
    
  4. Joint Prediction Heads:
    ẑ = Sigmoid(h; θᶻ_head)  # HEXACO prediction
    ŷ = Sigmoid(h; θʸ_head)  # Big Five prediction
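
The fusion, pooling, and prediction steps above can be sketched in NumPy as follows. This is a minimal illustration, not the authors' implementation: the hidden size `D_MODEL`, the random parameters, and the choice of a linear layer before each sigmoid are assumptions, and the Transformer encoder is deliberately omitted so the example stays short.

```python
import numpy as np

rng = np.random.default_rng(0)
D_MODEL = 16  # hypothetical hidden size, chosen only for illustration


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def fuse(A, T, V, seg_emb):
    """TemporalConcat + AddSegment: stack modalities along time, tag each frame."""
    H0 = np.concatenate([A, T, V], axis=0)
    seg_ids = np.repeat([0, 1, 2], [len(A), len(T), len(V)])
    return H0 + seg_emb[seg_ids]


def attentive_pooling(H, w):
    """Softmax-weighted average over time; w is a scoring vector (assumed form)."""
    scores = H @ w
    alpha = np.exp(scores - scores.max())
    alpha /= alpha.sum()
    return alpha @ H


def predict(A, T, V, p):
    H = fuse(A, T, V, p["segment"])
    # TransformerEnc would be applied to H here; omitted in this sketch.
    h = attentive_pooling(H, p["pool"])
    y_hat = sigmoid(p["W_y"] @ h + p["b_y"])  # Big Five head, 5 outputs
    z_hat = sigmoid(p["W_z"] @ h + p["b_z"])  # HEXACO head, 6 outputs
    return y_hat, z_hat


# Randomly initialised stand-ins for the trainable parameters.
params = {
    "segment": rng.normal(size=(3, D_MODEL)),
    "pool": rng.normal(size=D_MODEL),
    "W_y": rng.normal(size=(5, D_MODEL)), "b_y": np.zeros(5),
    "W_z": rng.normal(size=(6, D_MODEL)), "b_z": np.zeros(6),
}
A = rng.normal(size=(12, D_MODEL))  # audio frames
T = rng.normal(size=(4, D_MODEL))   # text tokens
V = rng.normal(size=(6, D_MODEL))   # visual frames
y_hat, z_hat = predict(A, T, V, params)
print(y_hat.shape, z_hat.shape)
```

Both heads read the same pooled vector `h`, which is what lets the shared representation carry information common to the two trait frameworks.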
    

Training Strategy

Joint training using mean absolute error loss:

$$L = \frac{1}{|D|}\sum_{d=1}^{|D|}|\hat{y}_d - y_d| + \frac{1}{|D|}\sum_{d=1}^{|D|}|\hat{z}_d - z_d|$$
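
As a rough illustration of this loss, the sketch below computes it over a batch with NumPy. It assumes that the absolute value of a vector difference means the sum of element-wise absolute differences; the summary does not say whether a sum or a mean over traits is intended, so that choice is an assumption. The function name `joint_mae_loss` is hypothetical.

```python
import numpy as np


def joint_mae_loss(y_hat, y, z_hat, z):
    """Joint loss over a batch.

    y_hat, y: arrays of shape (N, 5) with Big Five scores.
    z_hat, z: arrays of shape (N, 6) with HEXACO scores.
    """
    # Assumed reading: sum absolute errors over traits, average over samples.
    big_five_term = np.abs(y_hat - y).sum(axis=1).mean()
    hexaco_term = np.abs(z_hat - z).sum(axis=1).mean()
    return big_five_term + hexaco_term


# Toy batch of two samples with constant errors of 0.1 and 0.2 per trait.
y, z = np.zeros((2, 5)), np.zeros((2, 6))
loss = joint_mae_loss(y + 0.1, y, z + 0.2, z)
print(loss)
```

The two terms are weighted equally, matching the formula, so neither framework is prioritised during joint training.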

Technical Innovations

  1. Joint Optimization: Simultaneously optimizes Big Five and HEXACO recognition, leveraging psychological relationships to enhance performance
  2. Multimodal Fusion: Employs pre-trained Transformer architecture to process audio, visual, and textual information
  3. Relationship Modeling: Learns latent relationships between Big Five and HEXACO through shared representation learning

Experimental Setup

Dataset

  • Scale: 10,100 self-introduction videos from 1,010 participants
  • Annotation: 200 observers annotated the videos using a 50-item Big Five questionnaire and a 60-item HEXACO questionnaire
  • Split:
    • Training set: 9,030 videos (903 participants)
    • Validation set: 500 videos (50 participants)
    • Test set: 570 videos (57 participants)
  • Video Specifications: Average duration 73.6 seconds, 1280×720 resolution, 25fps
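
The split sizes are consistent with every participant contributing 10 videos and each participant appearing in exactly one subset. The sketch below shows one way such a participant-disjoint split could be produced. Both the 10-videos-per-participant layout and the disjointness are assumptions inferred from the counts, and `split_by_participant` and the ID format are hypothetical.

```python
import random


def split_by_participant(participant_ids, n_val, n_test, seed=0):
    """Assign whole participants to train/val/test so no speaker is shared."""
    ids = sorted(set(participant_ids))
    random.Random(seed).shuffle(ids)
    test = set(ids[:n_test])
    val = set(ids[n_test:n_test + n_val])
    train = set(ids[n_test + n_val:])
    return train, val, test


# Synthetic index: 1,010 participants with 10 videos each (assumed layout).
videos = [(f"p{p:04d}", k) for p in range(1010) for k in range(10)]
train, val, test = split_by_participant([p for p, _ in videos], n_val=50, n_test=57)

counts = {
    name: sum(1 for p, _ in videos if p in group)
    for name, group in (("train", train), ("val", val), ("test", test))
}
print(counts)
```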

Evaluation Metrics

  1. Pearson Correlation Coefficient: Measures linear correlation between predicted and ground-truth values
  2. Accuracy: Computed using ChaLearn First Impressions Challenge methodology: $\text{Accuracy}^k = 1 - \frac{1}{D}\sum_{d=1}^{D}|\hat{y}_d^k - y_d^k|$
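
Both metrics can be computed per trait in a few lines of NumPy. The sketch below is illustrative only: the function names are hypothetical, and scores are assumed to be scaled to [0, 1] so that the accuracy stays in that range.

```python
import numpy as np


def trait_accuracy(pred, target):
    """1 minus the mean absolute error, computed separately for each trait."""
    return 1.0 - np.abs(pred - target).mean(axis=0)


def trait_pearson(pred, target):
    """Pearson correlation between predictions and labels for each trait."""
    pc = pred - pred.mean(axis=0)
    tc = target - target.mean(axis=0)
    return (pc * tc).sum(axis=0) / np.sqrt((pc ** 2).sum(axis=0) * (tc ** 2).sum(axis=0))


# Three samples, two traits; predictions are the labels shifted by 0.1.
target = np.array([[0.2, 0.8], [0.4, 0.6], [0.6, 0.4]])
pred = target + 0.1
acc = trait_accuracy(pred, target)
pcc = trait_pearson(pred, target)
print(acc, pcc)
```

The toy example shows why the two metrics are reported together: a constant offset lowers accuracy to 0.9 while leaving the correlation at 1.0. Multiplying the accuracy by 100 gives a percentage.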

Baseline Methods

  • Big Five-specific model
  • HEXACO-specific model
  • Joint model (proposed method)

Implementation Details

  • Audio Features: 80-dimensional log Mel filterbank coefficients, 10ms frame shift
  • Visual Features: CenterNet face detection, 128×128 cropping, 3fps downsampling
  • Pre-training: Audio encoder (20K hours Japanese speech), text encoder (100G tokens), visual encoder (RAF-DB and AffectNet)
  • Training: Batch size 8, dropout 0.1, RAdam optimizer, NVIDIA A6000 GPU
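
To give a feel for the sequence lengths these settings imply, the sketch below counts audio frames at a 10 ms shift and picks video frames to go from 25 fps to 3 fps. The frame-selection rule and the decision to ignore the filterbank window length are assumptions; the summary does not describe how the authors downsample, and both function names are hypothetical.

```python
def downsample_indices(n_frames, src_fps=25, dst_fps=3):
    """Indices of source frames kept when reducing src_fps to dst_fps."""
    n_out = n_frames * dst_fps // src_fps
    # Integer arithmetic keeps the result deterministic (no float rounding).
    return [i * src_fps // dst_fps for i in range(n_out)]


def num_audio_frames(duration_ms, shift_ms=10):
    """Approximate filterbank frame count, ignoring the analysis window length."""
    return duration_ms // shift_ms


one_second = downsample_indices(25)          # one second of 25 fps video
avg_video = len(downsample_indices(1840))    # 73.6 s * 25 fps = 1840 frames
avg_audio = num_audio_frames(73600)          # 73.6 s of audio
print(one_second, avg_video, avg_audio)
```

Under these assumptions an average-length clip yields about 220 face crops but several thousand audio frames, so the audio stream dominates the raw input length before any subsampling inside the encoder.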

Experimental Results

Main Results

Big Five Recognition Performance

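Each cell reports Pearson correlation / accuracy (%).

| Modality Combination | Openness | Conscientiousness | Extraversion | Agreeableness | Neuroticism |
|---|---|---|---|---|---|
| Audio (Joint) | 0.542/94.4 | 0.614/93.3 | 0.707/91.6 | 0.576/93.4 | 0.530/93.8 |
| Audio+Visual+Text (Joint) | 0.595/94.8 | 0.686/93.9 | 0.757/92.6 | 0.657/94.0 | 0.586/94.2 |
| Human Evaluation | 0.544/92.9 | 0.668/92.7 | 0.770/91.7 | 0.645/92.4 | 0.532/92.1 |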

HEXACO Recognition Performance

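Each cell reports Pearson correlation / accuracy (%).

| Modality Combination | Honesty-Humility | Emotionality | Extraversion | Agreeableness | Conscientiousness | Openness |
|---|---|---|---|---|---|---|
| Audio (Joint) | 0.482/95.2 | 0.639/95.6 | 0.660/92.9 | 0.469/94.0 | 0.549/94.1 | 0.454/93.7 |
| Audio+Visual+Text (Joint) | 0.504/95.2 | 0.645/95.6 | 0.707/93.2 | 0.576/94.3 | 0.579/94.2 | 0.608/94.4 |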

Key Findings

  1. Joint Modeling Advantages: The joint model outperforms task-specific models in most cases
  2. Modality Contributions: Audio features are most effective, with visual features relatively effective for agreeableness recognition
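  3. Performance Comparison: Automatic recognition reaches human evaluation levels on the Big Five. The Audio+Visual+Text joint model exceeds the human evaluation baseline in accuracy on all five traits and in Pearson correlation on every trait except Extraversion (0.757 vs. 0.770)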

Big Five and HEXACO Correlation Analysis

Experimental results demonstrate that the correlation patterns learned by the joint model align with psychological expectations. However, the model captures correlations excessively in certain traits, suggesting that while achieving human-level recognition performance, the model does not fully replicate human impression perception mechanisms.
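
One way to run this kind of comparison is to compute the 5×6 matrix of Pearson correlations between Big Five and HEXACO scores twice, once on human labels and once on model predictions, and inspect the difference. The sketch below does this on synthetic data only; the data generator, the variable names, and the `excess` measure are assumptions for illustration and do not reproduce the paper's analysis or results.

```python
import numpy as np


def cross_correlation(Y, Z):
    """Pearson correlation between every column of Y (N, 5) and Z (N, 6)."""
    Yn = (Y - Y.mean(axis=0)) / Y.std(axis=0)
    Zn = (Z - Z.mean(axis=0)) / Z.std(axis=0)
    return Yn.T @ Zn / len(Y)


rng = np.random.default_rng(0)
mix = rng.normal(size=(5, 6))          # synthetic link between the frameworks
noise = rng.normal(size=(200, 6))

Y_human = rng.normal(size=(200, 5))    # stand-in for human Big Five labels
Z_human = Y_human @ mix + noise        # stand-in for human HEXACO labels
Y_pred = Y_human                       # stand-in predictions
Z_pred = Y_human @ mix + 0.5 * noise   # less trait-specific variation

C_human = cross_correlation(Y_human, Z_human)
C_pred = cross_correlation(Y_pred, Z_pred)
excess = np.abs(C_pred) - np.abs(C_human)  # > 0: model couples traits more tightly
print(excess.round(2))
```

Positive entries in `excess` mark trait pairs where the predictions are more strongly correlated than the labels, which is the pattern described above as capturing correlations excessively.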

Related Work

Multimodal Personality-trait Recognition

  • Early research primarily employed hand-crafted features
  • Recent deep learning methods are widely applied, including deep residual networks and end-to-end approaches
  • Most research focuses on the Big Five framework

HEXACO Research

  • HEXACO serves as an alternative framework to Big Five, comprising six dimensions
  • The Honesty-Humility dimension negatively correlates with various negative behavioral factors
  • Previously, only one study inferred self-reported HEXACO traits from social media text

Conclusions and Discussion

Main Conclusions

  1. Joint modeling of Big Five and HEXACO effectively enhances recognition performance for both frameworks
  2. Multimodal information fusion is crucial for personality-trait recognition
  3. Automatic recognition performance can achieve human evaluation levels

Limitations

  1. Correlation Bias: The model excessively captures correlations between Big Five and HEXACO, failing to fully replicate human perception patterns
  2. Data Limitations: The dataset contains only Japanese self-introduction videos, so generalization to other languages and recording settings remains unverified
  3. Cultural Differences: Does not account for personality-trait expression variations across different cultural backgrounds

Future Directions

  1. Improve models to better replicate human perception of Big Five and HEXACO correlations
  2. Extend to additional languages and cultural contexts
  3. Explore joint modeling of other personality frameworks

In-Depth Evaluation

Strengths

  1. Strong Innovation: First introduction of HEXACO to multimodal personality-trait recognition, filling a research gap
  2. Reasonable Methodology: The joint modeling approach is consistent with psychological theory and its technical implementation is sound
  3. Comprehensive Experiments: Constructs large-scale annotated dataset with reasonable experimental design and convincing results
  4. Practical Value: Achieves human evaluation levels with potential for real-world applications

Weaknesses

  1. Limited Theoretical Depth: Lacks in-depth theoretical analysis of machine learning modeling for Big Five and HEXACO relationships
  2. Generalization Concerns: Validation only on Japanese data; cross-linguistic and cross-cultural generalization remains unknown
  3. Limited Interpretability: Model interpretability is constrained, making specific decision mechanisms difficult to understand

Impact

  1. Academic Contribution: Opens new directions for multimodal personality computing and promotes interdisciplinary research
  2. Practical Value: Applicable to human resources, educational assessment, and mental health domains
  3. Data Contribution: Dual-annotated dataset provides important value for subsequent research

Application Scenarios

  1. Human Resources: Personality assessment in recruitment interviews
  2. Education: Personalized teaching and mental health monitoring for students
  3. Social Media: User profiling and content recommendation
  4. Mental Health: Auxiliary psychological diagnosis and treatment

References

The paper cites 36 relevant references spanning personality psychology theory, multimodal learning, deep learning, and other important works across multiple disciplines, providing a solid theoretical foundation for the research.


Overall Assessment: This is a high-quality interdisciplinary research paper with pioneering significance in multimodal personality computing. While there remains room for improvement in theoretical depth and generalization, its innovation and practical value make it an important contribution to the field.