2025-11-10T02:37:56.044553

Joint Modeling of Big Five and HEXACO for Multimodal Apparent Personality-trait Recognition

Masumura, Orihashi, Ihori et al.

This paper proposes a joint modeling method of the Big Five, which has long been studied, and HEXACO, which has recently attracted attention in psychology, for automatically recognizing apparent personality traits from multimodal human behavior. Most previous studies have used the Big Five for multimodal apparent personality-trait recognition. However, no study has focused on apparent HEXACO which can evaluate an Honesty-Humility trait related to displaced aggression and vengefulness, social-dominance orientation, etc. In addition, the relationships between the Big Five and HEXACO when modeled by machine learning have not been clarified. We expect awareness of multimodal human behavior to improve by considering these relationships. The key advance of our proposed method is to optimize jointly recognizing the Big Five and HEXACO. Experiments using a self-introduction video dataset demonstrate that the proposed method can effectively recognize the Big Five and HEXACO.

Basic Information

Paper ID: 2510.14203
Title: Joint Modeling of Big Five and HEXACO for Multimodal Apparent Personality-trait Recognition
Authors: Ryo Masumura, Shota Orihashi, Mana Ihori, Tomohiro Tanaka, Naoki Makishima, Taiga Yamane, Naotaka Kawata, Satoshi Suzuki, Taichi Katayama (NTT, Inc., Japan)
Classification: cs.CV cs.CL cs.MM
Publication Date: October 16, 2025
Paper Link: https://arxiv.org/abs/2510.14203

Abstract

This paper proposes a joint modeling approach that combines the well-established Big Five personality traits with the recently emerging HEXACO personality traits for automatic recognition of apparent personality traits from multimodal human behavior. Previous research on multimodal apparent personality-trait recognition has primarily focused on the Big Five framework, with no studies addressing apparent HEXACO traits. However, HEXACO enables assessment of the Honesty-Humility dimension, which correlates with displaced aggression, revenge motivation, social dominance orientation, and related constructs. Furthermore, the relationship between Big Five and HEXACO in machine learning modeling remains unexplored. By considering these relationships, the authors expect to enhance perception capabilities for multimodal human behavior.

Research Background and Motivation

Problem Definition

Core Issue: Existing multimodal apparent personality-trait recognition research focuses almost exclusively on the Big Five and has not addressed HEXACO, in particular its Honesty-Humility dimension
Significance: The Honesty-Humility trait in HEXACO shows strong negative correlations with displaced aggression, revenge motivation, social dominance orientation, and workplace misconduct, which gives it practical psychological relevance
Existing Limitations:
- Lack of multimodal recognition research for apparent HEXACO traits
- The relationship between Big Five and HEXACO in machine learning modeling remains underexplored
- Existing datasets are primarily designed for Big Five

Research Motivation

By jointly modeling Big Five and HEXACO, leveraging the psychological relationships between the two personality frameworks, the study aims to enhance the robustness and accuracy of multimodal personality-trait recognition.

Core Contributions

First Study: The first research on multimodal apparent HEXACO personality-trait recognition
Joint Modeling Method: Proposes a joint modeling approach for Big Five and HEXACO that improves recognition performance for both frameworks
Relationship Exploration: First investigation of the relationship between Big Five and other personality traits (HEXACO) in multimodal apparent personality-trait recognition
Dataset Contribution: Constructs a self-introduction video dataset with annotations for both Big Five and HEXACO traits

Methodology Details

Task Definition

Given audio-visual video input, jointly estimate Big Five scores $\hat{y} = [\hat{y}_1, \cdots, \hat{y}_5]^\top$ and HEXACO scores $\hat{z} = [\hat{z}_1, \cdots, \hat{z}_6]^\top$:

$\{\hat{y}, \hat{z}\} = F(S, U; \Theta)$

where $S$ represents audio features, $U$ represents visual features, and $\Theta$ represents the set of trainable parameters.

Model Architecture

Multimodal Transformer Architecture

The model comprises four encoders: an audio encoder, a text encoder, a visual encoder, and a multimodal encoder.

Feature Encoding:
- Audio encoder: $S \rightarrow A$ (audio representation)
- Text encoder: $W \rightarrow T$ (text representation, obtained via ASR)
- Visual encoder: $U \rightarrow V$ (visual representation)

Multimodal Fusion:

- $H_0 = \mathrm{TemporalConcat}(A, T, V)$ (concatenation along the time axis)
- $H'_0 = \mathrm{AddSegment}(H_0; \theta_{\mathrm{segment}})$ (adds modality segment information)
- $H = \mathrm{TransformerEnc}(H'_0; \theta_{\mathrm{multi}})$ (Transformer encoding)
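The first two fusion steps can be illustrated with a minimal sketch. It assumes that `AddSegment` adds one learned vector per modality to every frame of that modality (in the style of BERT segment embeddings); the summary does not specify this, and the function name `temporal_concat_with_segments` and the toy values are hypothetical.

```python
from typing import List

Vector = List[float]
Sequence = List[Vector]


def temporal_concat_with_segments(
    audio: Sequence, text: Sequence, visual: Sequence, segment_emb: List[Vector]
) -> Sequence:
    """Concatenate A, T, V along time and add a per-modality segment vector.

    Assumption: AddSegment is additive; the paper may implement it differently.
    """
    fused: Sequence = []
    for seg_id, seq in enumerate((audio, text, visual)):
        seg = segment_emb[seg_id]
        for frame in seq:
            # Each row of H'_0 = encoder output + segment vector of its modality.
            fused.append([x + s for x, s in zip(frame, seg)])
    return fused


# Toy inputs: 2 audio frames, 1 text token, 3 visual frames, hidden size 2.
A = [[0.1, 0.2], [0.3, 0.4]]
T = [[0.5, 0.6]]
V = [[0.7, 0.8], [0.9, 1.0], [1.1, 1.2]]
segments = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]  # hypothetical learned values

H0_prime = temporal_concat_with_segments(A, T, V, segments)
print(len(H0_prime))  # fused sequence length = 2 + 1 + 3
```

The fused sequence would then be passed to the multimodal Transformer encoder, which can attend across modalities because all frames share one time axis.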

Attentive Pooling:
$h = \mathrm{AttentivePooling}(H; \theta_{\mathrm{pool}})$
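A common form of attentive pooling scores each frame against a learned query vector, normalizes the scores with a softmax, and returns the weighted sum of frames. The sketch below follows that form as an assumption; the exact parameterization of $\theta_{\mathrm{pool}}$ in the paper is not given in this summary, and the query `w` is hypothetical.

```python
import math
from typing import List


def attentive_pooling(H: List[List[float]], w: List[float]) -> List[float]:
    """Softmax-weighted average of frame vectors.

    `w` stands in for the learned pooling parameters (an assumption).
    """
    scores = [sum(hi * wi for hi, wi in zip(frame, w)) for frame in H]
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    dim = len(H[0])
    return [sum(a * frame[j] for a, frame in zip(weights, H)) for j in range(dim)]


H = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
h_uniform = attentive_pooling(H, w=[0.0, 0.0])    # zero query: plain average
h_peaked = attentive_pooling(H, w=[10.0, 10.0])   # favors the third frame
print(h_uniform, h_peaked)
```

With a zero query the pooling reduces to mean pooling; a trained query lets the model emphasize the frames that carry the most trait-relevant evidence.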

Joint Prediction Heads:

- $\hat{z} = \mathrm{Sigmoid}(h; \theta^{z}_{\mathrm{head}})$ (HEXACO prediction)
- $\hat{y} = \mathrm{Sigmoid}(h; \theta^{y}_{\mathrm{head}})$ (Big Five prediction)
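The two heads share the pooled vector $h$ and differ only in their own parameters. The sketch below assumes each head is a single linear layer followed by an element-wise sigmoid, so every trait score lies in (0, 1); the layer shapes and all parameter values are hypothetical.

```python
import math
from typing import List, Tuple

Params = Tuple[List[List[float]], List[float]]  # (weight rows, biases)


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def linear_sigmoid_head(h: List[float], weights: List[List[float]], bias: List[float]) -> List[float]:
    """One output per weight row: sigmoid(w . h + b)."""
    return [sigmoid(sum(w * x for w, x in zip(row, h)) + b) for row, b in zip(weights, bias)]


def joint_heads(h: List[float], big5: Params, hexaco: Params) -> Tuple[List[float], List[float]]:
    """Predict 5 Big Five and 6 HEXACO scores from the same shared vector h."""
    y_hat = linear_sigmoid_head(h, *big5)
    z_hat = linear_sigmoid_head(h, *hexaco)
    return y_hat, z_hat


h = [0.5, -0.5]                                  # toy pooled representation
big5_params = ([[0.0, 0.0]] * 5, [0.0] * 5)      # hypothetical parameters
hexaco_params = ([[1.0, 1.0]] * 6, [0.0] * 6)    # hypothetical parameters
y_hat, z_hat = joint_heads(h, big5_params, hexaco_params)
print(len(y_hat), len(z_hat))
```

Because both heads read the same $h$, gradients from either framework update the shared encoders, which is how joint training can transfer information between the Big Five and HEXACO.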

Training Strategy

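Both heads are trained jointly with a mean absolute error loss that sums the Big Five and HEXACO terms:

$L = \frac{1}{|D|}\sum_{d=1}^{|D|}|\hat{y}_d - y_d| + \frac{1}{|D|}\sum_{d=1}^{|D|}|\hat{z}_d - z_d|$

where $D$ is the training set, and $y_d$ and $z_d$ are the annotated Big Five and HEXACO scores for video $d$.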
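The loss can be written out as a short sketch. Since $\hat{y}_d$ and $\hat{z}_d$ are vectors, the code assumes that $|\cdot|$ denotes the sum of absolute differences over traits; the paper may instead average over traits, which would only rescale each term. The function name `joint_mae_loss` is hypothetical.

```python
from typing import List

Batch = List[List[float]]  # one score vector per video


def _mean_l1(pred: Batch, gold: Batch) -> float:
    """Average over videos of the summed absolute trait errors."""
    per_video = [sum(abs(p - g) for p, g in zip(pv, gv)) for pv, gv in zip(pred, gold)]
    return sum(per_video) / len(per_video)


def joint_mae_loss(y_hat: Batch, y: Batch, z_hat: Batch, z: Batch) -> float:
    """Big Five term plus HEXACO term, equally weighted as in the formula."""
    return _mean_l1(y_hat, y) + _mean_l1(z_hat, z)


# One toy video: every Big Five score is off by 0.25, HEXACO is exact.
loss = joint_mae_loss(
    y_hat=[[0.5] * 5], y=[[0.75] * 5],
    z_hat=[[0.5] * 6], z=[[0.5] * 6],
)
print(loss)  # 1.25
```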

Technical Innovations

Joint Optimization: Simultaneously optimizes Big Five and HEXACO recognition, leveraging psychological relationships to enhance performance
Multimodal Fusion: Employs a pre-trained Transformer architecture to process audio, visual, and textual information
Relationship Modeling: Learns latent relationships between Big Five and HEXACO through shared representation learning

Experimental Setup

Dataset

Scale: 10,100 self-introduction videos from 1,010 participants
Annotation: 200 observers rated the videos using a 50-item Big Five questionnaire and a 60-item HEXACO questionnaire
Split:
- Training set: 9,030 videos (903 participants)
- Validation set: 500 videos (50 participants)
- Test set: 570 videos (57 participants)
Video Specifications: Average duration 73.6 seconds, 1280×720 resolution, 25fps
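The split sizes are consistent with a participant-disjoint split in which every participant contributes 10 videos (903 + 50 + 57 = 1,010 participants). The sketch below reproduces the counts under that assumption; the random assignment, the seed, and the function name `split_by_participant` are hypothetical and do not reflect how the authors actually chose the participants.

```python
import random
from typing import List, Set, Tuple


def split_by_participant(
    participant_ids: List[int], n_valid: int, n_test: int, seed: int = 0
) -> Tuple[Set[int], Set[int], Set[int]]:
    """Assign whole participants to train/valid/test so no speaker is shared."""
    ids = sorted(set(participant_ids))
    random.Random(seed).shuffle(ids)
    test = set(ids[:n_test])
    valid = set(ids[n_test:n_test + n_valid])
    train = set(ids[n_test + n_valid:])
    return train, valid, test


# Synthetic index: 1,010 participants with 10 videos each (assumed uniform).
videos = [(pid, clip) for pid in range(1010) for clip in range(10)]
train, valid, test = split_by_participant([pid for pid, _ in videos], n_valid=50, n_test=57)

counts = {
    "train": sum(1 for pid, _ in videos if pid in train),
    "valid": sum(1 for pid, _ in videos if pid in valid),
    "test": sum(1 for pid, _ in videos if pid in test),
}
print(counts)  # {'train': 9030, 'valid': 500, 'test': 570}
```

Splitting by participant rather than by video keeps the model from being evaluated on people it has already seen during training.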

Evaluation Metrics

Pearson Correlation Coefficient: Measures linear correlation between predicted and ground-truth values
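Accuracy: Computed per trait $k$ following the ChaLearn First Impressions Challenge methodology: $\text{Accuracy}^k = 1 - \frac{1}{D}\sum_{d=1}^{D}|\hat{y}_d^k - y_d^k|$, where $D$ is the number of evaluated videos; the result tables report it as a percentage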
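Both metrics are computed independently for each trait. A minimal sketch with toy scores in [0, 1] follows; the function names and data are illustrative only, and `pearson` assumes neither input is constant.

```python
import math
from typing import List


def pearson(pred: List[float], gold: List[float]) -> float:
    """Pearson correlation coefficient; inputs must not be constant."""
    mp = sum(pred) / len(pred)
    mg = sum(gold) / len(gold)
    cov = sum((p - mp) * (g - mg) for p, g in zip(pred, gold))
    var_p = sum((p - mp) ** 2 for p in pred)
    var_g = sum((g - mg) ** 2 for g in gold)
    return cov / math.sqrt(var_p * var_g)


def accuracy(pred: List[float], gold: List[float]) -> float:
    """One minus the mean absolute error, as in the formula above."""
    return 1.0 - sum(abs(p - g) for p, g in zip(pred, gold)) / len(gold)


gold = [0.25, 0.5, 0.75, 1.0]   # toy reference scores for one trait
pred = [0.5, 0.5, 0.75, 0.75]   # toy model outputs for the same videos

pcc = pearson(pred, gold)
acc = accuracy(pred, gold)
print(round(pcc, 3), acc * 100)  # accuracy shown as a percentage
```

The two metrics are complementary: accuracy can be high simply because predictions stay near the mean score, while the correlation only rewards predictions that rank videos in the right order.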

Baseline Methods

Big Five-specific model
HEXACO-specific model
Joint model (proposed method)

Implementation Details

Audio Features: 80-dimensional log Mel filterbank coefficients, 10ms frame shift
Visual Features: CenterNet face detection, 128×128 cropping, 3fps downsampling
Pre-training: Audio encoder (20K hours Japanese speech), text encoder (100G tokens), visual encoder (RAF-DB and AffectNet)
Training: Batch size 8, dropout 0.1, RAdam optimizer, NVIDIA A6000 GPU
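To get a rough sense of the input sequence lengths these settings imply, the arithmetic below applies the 10 ms audio frame shift and the 3 fps visual rate to the dataset's average clip duration of 73.6 seconds. It ignores window-length edge effects and any subsampling inside the encoders, neither of which is described in this summary; the helper name `frame_count` is hypothetical.

```python
def frame_count(duration_ms: int, frames_per_second: int) -> int:
    """Whole frames in a clip; integer arithmetic avoids float rounding."""
    return duration_ms * frames_per_second // 1000


avg_duration_ms = 73_600  # average clip length reported for the dataset

audio_frames = frame_count(avg_duration_ms, 100)  # 10 ms shift = 100 frames/s
visual_frames = frame_count(avg_duration_ms, 3)   # video downsampled to 3 fps
print(audio_frames, visual_frames)  # 7360 220
```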

Experimental Results

Main Results

Big Five Recognition Performance

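Each cell reports Pearson correlation coefficient / accuracy (%).

| Modality Combination | Openness | Conscientiousness | Extraversion | Agreeableness | Neuroticism |
| --- | --- | --- | --- | --- | --- |
| Audio (Joint) | 0.542/94.4 | 0.614/93.3 | 0.707/91.6 | 0.576/93.4 | 0.530/93.8 |
| Audio+Visual+Text (Joint) | 0.595/94.8 | 0.686/93.9 | 0.757/92.6 | 0.657/94.0 | 0.586/94.2 |
| Human Evaluation | 0.544/92.9 | 0.668/92.7 | 0.770/91.7 | 0.645/92.4 | 0.532/92.1 |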

HEXACO Recognition Performance

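Each cell reports Pearson correlation coefficient / accuracy (%).

| Modality Combination | Honesty-Humility | Emotionality | Extraversion | Agreeableness | Conscientiousness | Openness |
| --- | --- | --- | --- | --- | --- | --- |
| Audio (Joint) | 0.482/95.2 | 0.639/95.6 | 0.660/92.9 | 0.469/94.0 | 0.549/94.1 | 0.454/93.7 |
| Audio+Visual+Text (Joint) | 0.504/95.2 | 0.645/95.6 | 0.707/93.2 | 0.576/94.3 | 0.579/94.2 | 0.608/94.4 |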

Key Findings

Joint Modeling Advantages: The joint model outperforms task-specific models in most cases
Modality Contributions: Audio features are most effective, with visual features relatively effective for agreeableness recognition
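Performance Comparison: On the Big Five, the audio+visual+text joint model matches or exceeds the human evaluation on every trait and metric except the Extraversion correlation (0.757 vs. 0.770)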

Big Five and HEXACO Correlation Analysis

Experimental results demonstrate that the correlation patterns learned by the joint model align with psychological expectations. However, the model captures correlations excessively in certain traits, suggesting that while achieving human-level recognition performance, the model does not fully replicate human impression perception mechanisms.
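One way to inspect such patterns is to correlate every predicted Big Five trait with every predicted HEXACO trait over the test videos and compare the resulting matrix with the one obtained from the human annotations. The sketch below shows that computation on toy data; it is an assumption about how the analysis could be run, not the authors' procedure, and `cross_correlation` is a hypothetical helper.

```python
import math
from typing import List, Sequence


def pearson(a: Sequence[float], b: Sequence[float]) -> float:
    """Pearson correlation coefficient; inputs must not be constant."""
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    var_a = sum((x - ma) ** 2 for x in a)
    var_b = sum((y - mb) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)


def cross_correlation(Y: List[List[float]], Z: List[List[float]]) -> List[List[float]]:
    """Entry [i][j] correlates trait i of Y with trait j of Z across videos."""
    y_cols = list(zip(*Y))  # one tuple of scores per Y trait
    z_cols = list(zip(*Z))  # one tuple of scores per Z trait
    return [[pearson(yc, zc) for zc in z_cols] for yc in y_cols]


# Toy scores for 3 videos: 2 traits in one framework, 1 trait in the other.
Y = [[0.0, 1.0], [0.5, 0.5], [1.0, 0.0]]
Z = [[0.0], [0.5], [1.0]]
matrix = cross_correlation(Y, Z)
print(matrix)
```

A predicted-score matrix whose entries are larger in magnitude than the annotation-based matrix would correspond to the excessive correlations described above.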

Related Work

Multimodal Personality-trait Recognition

Early research primarily employed hand-crafted features
Recent deep learning methods are widely applied, including deep residual networks and end-to-end approaches
Most research focuses on the Big Five framework

HEXACO Research

HEXACO serves as an alternative framework to Big Five, comprising six dimensions
The Honesty-Humility dimension negatively correlates with various negative behavioral factors
Previously, only one study inferred self-reported HEXACO traits from social media text

Conclusions and Discussion

Main Conclusions

Joint modeling of Big Five and HEXACO effectively enhances recognition performance for both frameworks
Multimodal information fusion is crucial for personality-trait recognition
Automatic recognition performance can achieve human evaluation levels

Limitations

Correlation Bias: The model excessively captures correlations between Big Five and HEXACO, failing to fully replicate human perception patterns
Data Limitations: The dataset contains only Japanese self-introduction videos, with generalization potential remaining to be verified
Cultural Differences: Does not account for personality-trait expression variations across different cultural backgrounds

Future Directions

Improve models to better replicate human perception of Big Five and HEXACO correlations
Extend to additional languages and cultural contexts
Explore joint modeling of other personality frameworks

In-Depth Evaluation

Strengths

Strong Innovation: First introduction of HEXACO to multimodal personality-trait recognition, filling a research gap
Reasonable Methodology: Joint modeling approach aligns with psychological theory with sound technical implementation
Comprehensive Experiments: Constructs large-scale annotated dataset with reasonable experimental design and convincing results
Practical Value: Achieves human evaluation levels with potential for real-world applications

Weaknesses

Limited Theoretical Depth: Lacks in-depth theoretical analysis of machine learning modeling for Big Five and HEXACO relationships
Generalization Concerns: Validation only on Japanese data; cross-linguistic and cross-cultural generalization remains unknown
Limited Interpretability: Model interpretability is constrained, making specific decision mechanisms difficult to understand

Impact

Academic Contribution: Opens new directions for multimodal personality computing and promotes interdisciplinary research
Practical Value: Applicable to human resources, educational assessment, and mental health domains
Data Contribution: Dual-annotated dataset provides important value for subsequent research

Application Scenarios

Human Resources: Personality assessment in recruitment interviews
Education: Personalized teaching and mental health monitoring for students
Social Media: User profiling and content recommendation
Mental Health: Auxiliary psychological diagnosis and treatment

References

The paper cites 36 references spanning personality psychology, multimodal learning, and deep learning.

Overall Assessment: This is a high-quality interdisciplinary research paper with pioneering significance in multimodal personality computing. While there remains room for improvement in theoretical depth and generalization, its innovation and practical value make it an important contribution to the field.