Machine learning models for speech-based depression classification offer promise for health care applications. Despite growing work on depression classification, little is understood about how the length of speech-input impacts model performance. We analyze results for speaker-independent depression classification using a corpus of over 1400 hours of speech from a human-machine health screening application. We examine performance as a function of response input length for two NLP systems that differ in overall performance.
Results for both systems show that performance depends on natural length, elapsed length, and ordering of the response within a session. Systems share a minimum length threshold, but differ in a response saturation threshold, with the latter higher for the better system. At saturation, it is better to pose a new question to the speaker than to continue the current response. These and additional reported results suggest how applications can be better designed to both elicit and process optimal input lengths for depression classification.
- Paper ID: 2501.00608
- Title: Optimizing Speech-Input Length for Speaker-Independent Depression Classification
- Authors: Tomasz Rutowski, Amir Harati, Yang Lu, Elizabeth Shriberg (Ellipsis Health, Inc.)
- Classification: cs.CL eess.AS
- Keywords: depression, speech, paralinguistics, affective computing, NLP, health applications, deep learning
This paper investigates the impact of speech input length on the performance of machine learning-based depression classification. Using a large-scale corpus containing over 1,400 hours of speech data, the study analyzes the performance of two NLP systems with different performance levels across varying response input lengths. Results demonstrate that system performance depends on natural length, elapsed time, and the sequence of responses within a session. Both systems share a minimum length threshold, but differ in response saturation thresholds, with the better-performing system exhibiting a higher saturation threshold.
Depression is a prevalent disabling condition and a major global public health issue. Mobile AI technology plays an important role in expanding depression screening, particularly as an auxiliary tool for healthcare providers. Speech technology shows promise due to its naturalness, remote usability, lack of requirement for specialized training, and its capacity to carry information about the speaker's state.
- Practical Need: Despite growing research on speech-based depression classification, there is limited understanding of how speech input length affects model performance
- Practical Considerations: Longer inputs increase patient time costs and system infrastructure costs
- Optimization Requirement: Need to find optimal balance between performance and efficiency
- The first-order assumption that "more speech is better" in most speech technology tasks lacks thorough validation
- Absence of systematic research on the relationship between input length and classification performance
- Insufficient consideration of time and cost constraints in practical applications
- Large-Scale Data Analysis: Systematic analysis using a corpus of over 1,400 hours of speech data
- Multi-Level Length Effect Study: Analysis of length effects at both individual response and multi-response session levels
- Cross-System Comparison: Comparison of two NLP systems with different performance levels to validate the generality of length thresholds
- Practical Design Principles: Specific recommendations for designing and optimizing depression classification applications
- Unexpected Finding: Speakers progressively lengthen their responses over the course of a session
- Input: Spontaneous American English speech, free-form user responses to questions on various topics
- Output: Binary classification task (depressed/non-depressed) based on PHQ-8 scores (≥10 indicates depression)
- Constraint: Speaker-independent classification task
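The labeling rule above can be sketched in a few lines. This is a minimal illustration, assuming the standard PHQ-8 format of eight items scored 0 to 3 each; the function name `phq8_label` is hypothetical and not taken from the paper.

```python
from typing import Sequence

# Cutoff stated in the text: a PHQ-8 total of 10 or more counts as depressed.
PHQ8_THRESHOLD = 10


def phq8_label(item_scores: Sequence[int]) -> int:
    """Return 1 (+dep) if the PHQ-8 total is >= 10, otherwise 0 (-dep)."""
    if len(item_scores) != 8:
        raise ValueError("PHQ-8 has exactly 8 items")
    if any(score not in (0, 1, 2, 3) for score in item_scores):
        raise ValueError("each PHQ-8 item is scored 0-3")
    return int(sum(item_scores) >= PHQ8_THRESHOLD)


# Illustrative questionnaire results (made-up values).
print(phq8_label([1, 1, 1, 1, 1, 1, 1, 2]))  # total 9
print(phq8_label([2, 1, 1, 1, 1, 1, 1, 2]))  # total 10
```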
- Scale: Over 1,400 hours of speech from 9,600 unique users
- Structure: Each session contains 4-6 question responses (average 4.52), with each response averaging 125 words
- Annotation: PHQ-8 scale used as gold standard (PHQ-9 with suicidality question removed)
- Split: No overlapping speakers between training and test sets
- Method: SVM + word embeddings
- Features: Word2Vec vectors with average pooling
- Data: Smaller training set (650 hours, 6,600 users)
- Vocabulary: 7,000 tokens
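To make the average-pooling step concrete, the sketch below turns a tokenized response into a single fixed-size vector that a classifier such as an SVM could consume. The tiny `toy_embeddings` table is a made-up stand-in for real Word2Vec vectors, and skipping out-of-vocabulary tokens is an assumption; the source does not say how the authors handled them.

```python
from typing import Dict, List, Sequence


def average_pool(tokens: Sequence[str],
                 embeddings: Dict[str, List[float]],
                 dim: int) -> List[float]:
    """Average the vectors of all in-vocabulary tokens.

    Out-of-vocabulary tokens are skipped (an assumption of this sketch).
    A response with no known tokens maps to the zero vector.
    """
    vectors = [embeddings[t] for t in tokens if t in embeddings]
    if not vectors:
        return [0.0] * dim
    # zip(*vectors) iterates over dimensions; average each one.
    return [sum(column) / len(vectors) for column in zip(*vectors)]


# Hypothetical 2-dimensional embeddings, for illustration only.
toy_embeddings = {"sad": [1.0, 0.0], "tired": [0.0, 1.0]}
print(average_pool(["sad", "tired", "unknownword"], toy_embeddings, dim=2))
```

One consequence of this design is that the pooled vector has the same size regardless of response length, so length effects show up only through how stable the average is.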
- Method: Deep learning model based on ULMFiT
- Architecture: RNN-LSTM language model, pretrained on large-scale public corpora (e.g., Wikipedia) and fine-tuned
- Data: Complete training set (1,400 hours, 9,600 users)
- Vocabulary: 30,000 tokens
- Cumulative Gated Length Metric: Definition of a new length assessment method showing the amount of information "so far" at any point
- Multi-Dimensional Length Analysis: Simultaneous consideration of natural length, elapsed time, and sequence order within sessions
- Cross-System Threshold Comparison: Validation of finding generality through comparison of systems with different performance levels
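One way to picture the cumulative gated length idea is as truncation: score only the words available "so far". The sketch below assumes gating is done by whitespace-delimited word count, at the response level and cumulatively across a session in question order. Both helper names are hypothetical, and the paper's exact procedure may differ.

```python
from typing import List


def gate_response(transcript: str, max_words: int) -> str:
    """Keep only the first max_words words of a single response."""
    return " ".join(transcript.split()[:max_words])


def gate_session(responses: List[str], max_words: int) -> List[str]:
    """Keep responses in session order until a cumulative word budget is spent.

    The last kept response may be cut partway through; later ones are dropped.
    """
    gated: List[str] = []
    remaining = max_words
    for response in responses:
        if remaining <= 0:
            break
        words = response.split()[:remaining]
        gated.append(" ".join(words))
        remaining -= len(words)
    return gated


# Illustrative session with two short responses (made-up text).
session = ["i have been feeling tired", "work has been stressful lately"]
print(gate_response(session[0], 3))
print(gate_session(session, 7))
```

Sweeping `max_words` over a grid and re-scoring the gated input at each value would yield a performance-versus-length curve of the kind the analysis relies on.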
| Dataset | Total Responses | Training (-dep) | Training (+dep) | Test (-dep) | Test (+dep) |
|---|---|---|---|---|---|
| Smaller (650h) | 32,078 | 12,966 | 4,602 | 11,366 | 3,144 |
| Larger (1400h) | 64,518 | 35,715 | 14,293 | 11,366 | 3,144 |
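As a quick sanity check on the table, the sketch below confirms that the four split counts in each row add up to the reported total and computes the share of +dep responses per split. The dictionary keys and helper names are illustrative choices, not names from the paper.

```python
# Response counts copied from the table above.
COUNTS = {
    "smaller_650h": {"total": 32078, "train_neg": 12966, "train_pos": 4602,
                     "test_neg": 11366, "test_pos": 3144},
    "larger_1400h": {"total": 64518, "train_neg": 35715, "train_pos": 14293,
                     "test_neg": 11366, "test_pos": 3144},
}


def split_sum(row: dict) -> int:
    """Sum the four train/test class counts of one table row."""
    return row["train_neg"] + row["train_pos"] + row["test_neg"] + row["test_pos"]


def positive_rate(neg: int, pos: int) -> float:
    """Fraction of responses labeled +dep."""
    return pos / (neg + pos)


for name, row in COUNTS.items():
    assert split_sum(row) == row["total"], name
    train_rate = positive_rate(row["train_neg"], row["train_pos"])
    test_rate = positive_rate(row["test_neg"], row["test_pos"])
    print(f"{name}: train +dep {train_rate:.3f}, test +dep {test_rate:.3f}")
```

On these counts roughly 22% of test responses are +dep, so the classes are clearly imbalanced.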
- Primary Metric: AUC (Area Under Curve), suitable for binary tasks and imbalanced class distributions
- Secondary Metrics: Specificity and sensitivity for medical domain evaluation
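For readers less familiar with these metrics, here is a dependency-free sketch. It computes AUC as the probability that a randomly chosen +dep example scores higher than a randomly chosen -dep example (ties count one half), and sensitivity and specificity at a chosen score threshold. The toy labels and scores are invented, and the pairwise AUC loop is quadratic, so a library routine would be preferable on a corpus of this size.

```python
from typing import Sequence, Tuple


def roc_auc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """AUC via pairwise comparison of positive and negative scores."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def sensitivity_specificity(labels: Sequence[int], scores: Sequence[float],
                            threshold: float) -> Tuple[float, float]:
    """Predict +dep when score >= threshold; return (sensitivity, specificity)."""
    tp = sum(1 for y, s in zip(labels, scores) if y == 1 and s >= threshold)
    fn = sum(1 for y, s in zip(labels, scores) if y == 1 and s < threshold)
    tn = sum(1 for y, s in zip(labels, scores) if y == 0 and s < threshold)
    fp = sum(1 for y, s in zip(labels, scores) if y == 0 and s >= threshold)
    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("need at least one positive and one negative example")
    return tp / (tp + fn), tn / (tn + fp)


# Toy data, for illustration only.
y_true = [0, 0, 1, 1]
y_score = [0.1, 0.4, 0.35, 0.8]
print(roc_auc(y_true, y_score))
print(sensitivity_specificity(y_true, y_score, threshold=0.5))
```

AUC is threshold-free, whereas sensitivity and specificity depend on the operating point, which is why the two kinds of metric complement each other.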
- Transcription: Google Async ASR
- Speech Rate Estimation: Global average speech rate 2.39 words/second (143.4 words/minute)
- Depression-Related Speech Rate Reduction: Depressed group speech rate approximately 5 words/minute lower than non-depressed group, consistent with literature
- Length-Related Speech Rate Reduction: Longer responses universally show slower speech rates, with differences of approximately 3-4 words/minute
- Minor Effect: Overall differences are small, allowing use of global speech rate estimation
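Because a single global rate is considered adequate, word counts and speaking time can be converted with one constant. The sketch below uses the 2.39 words/second figure reported above; the function names are hypothetical.

```python
# Global average speech rate reported in the text (words per second).
GLOBAL_WORDS_PER_SECOND = 2.39


def words_to_seconds(n_words: float) -> float:
    """Estimate speaking time for a given word count at the global rate."""
    return n_words / GLOBAL_WORDS_PER_SECOND


def seconds_to_words(seconds: float) -> float:
    """Estimate the word count produced in a given speaking time."""
    return seconds * GLOBAL_WORDS_PER_SECOND


for n_words in (30, 50, 250):
    print(f"{n_words} words is about {words_to_seconds(n_words):.1f} s of speech")
```

At this rate a 50-word response corresponds to about 21 seconds of speech and a 250-word response to about 105 seconds. The conversion ignores the small group-level and length-related rate differences listed above.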
- Minimum Length Threshold: Both systems show sharp performance decline below 30-50 words
- Response Saturation Point: AUC for individual responses saturates at approximately 250 words
- Session Saturation Point: Session-level saturation occurs at approximately 1,000 words
- System 2 consistently outperforms System 1
- Session-level performance exceeds single response performance
- Both systems surpass unaided primary care physician performance (87% specificity/54% sensitivity)
- Consistent Minimum Threshold: Regardless of response count, session minimum threshold is 30-50 words
- Diminishing Returns: Benefit of N+1 responses compared to N responses decreases as N increases
- Multi-Response Advantage: Given fixed length, more responses outperform fewer responses
- New Response Benefit: Maximum benefit of starting a new response is approximately 4% AUC
- Early Response Saturation: System 2 saturates at 200 words (System 1 at 120 words)
- Length Increment Pattern: Speakers tend to gradually increase response length throughout sessions
- Short-Long Response Performance Crossover: Long responses ultimately perform better, but short responses perform better initially
- Within-Response Thresholds: Existence of threshold lengths below which current responses should not be interrupted
- System 1: 80 words (continuation threshold) and 120 words (saturation threshold)
- System 2: 150 words (continuation threshold) and 200 words (saturation threshold)
- Optimal Session Length: Approximately 1,000 words of total speech, or about 7 minutes at the reported global rate of 2.39 words/second
- Value of Second Half of Response: 6% AUC higher than first half
- Cross-System Performance Difference: The better system makes more effective use of additional words
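The thresholds above suggest a simple real-time prompting policy. The sketch below is one possible reading of them, not the authors' implementation: keep the speaker going below the continuation threshold, switch to a new question at or above the response saturation threshold, and end once the session reaches about 1,000 words. The class, function, and action names are hypothetical, and the treatment of the zone between the two thresholds is an assumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LengthThresholds:
    continuation: int  # below this, do not interrupt the current response
    saturation: int    # at or above this, a new question is expected to help more


# Word thresholds as reported in the text.
SYSTEM_1 = LengthThresholds(continuation=80, saturation=120)
SYSTEM_2 = LengthThresholds(continuation=150, saturation=200)
SESSION_SATURATION_WORDS = 1000


def next_action(response_words: int, session_words: int,
                thresholds: LengthThresholds) -> str:
    """Suggest what the application should do next.

    session_words counts all words so far, including the current response.
    """
    if session_words >= SESSION_SATURATION_WORDS:
        return "end_session"
    if response_words < thresholds.continuation:
        return "continue"
    if response_words >= thresholds.saturation:
        return "new_question"
    # Between the two thresholds the text gives no firm rule (assumption).
    return "may_switch"


print(next_action(response_words=60, session_words=300, thresholds=SYSTEM_2))
print(next_action(response_words=210, session_words=520, thresholds=SYSTEM_2))
print(next_action(response_words=90, session_words=1010, thresholds=SYSTEM_1))
```

A deployed system would need live word counts from streaming ASR and would likely recalibrate these thresholds for its own model and population.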
The paper cites 34 related works covering depression detection, speech-based affective computing, and multimodal assessment, and highlights the AVEC challenge series as advancing the field. Compared to existing work, this paper focuses on the practical yet overlooked problem of input length.
- Length Thresholds Exist: Clear minimum and saturation length thresholds are present
- System Dependency: Better systems have higher saturation thresholds and better utilize additional information
- Session Strategy: Multiple short responses outperform fewer long responses
- Real-Time Application Guidance: Can provide real-time guidance on when to continue, switch questions, or end sessions
- Data Specificity: Specific length and speech rate values may vary across different datasets, languages, and age groups
- Task Specificity: Results primarily apply to depression classification tasks
- Technology Dependency: Based on specific ASR and NLP technologies
- Cross-Language Validation: Validate findings across different languages and cultural contexts
- Real-Time System Development: Develop adaptive systems that optimize length in real-time
- Multi-Task Extension: Extend findings to other mental health classification tasks
- High Practical Value: Directly addresses critical problems in practical applications
- Large Data Scale: Uses one of the largest datasets in the field to date
- Systematic Methodology: Multi-dimensional and multi-level analytical approach
- Meaningful Findings: Reveals interesting patterns in speaker behavior
- Strong Application Guidance: Provides specific design recommendations
- Limited Technical Innovation: Primarily analytical research with relatively conventional technical methods
- Generalization Validity Pending: Cross-domain generalization capability requires further verification
- Insufficient Theoretical Explanation: Lacks in-depth theoretical explanation of observed phenomena
- Field Contribution: Fills gap in input length research for speech-based depression detection
- Practical Value: Provides important design guidance for real-world system deployment
- Reproducibility: Clear methodology; discussions ongoing with Linguistic Data Consortium regarding data release
- Speech-based mental health screening applications
- Telemedicine and digital health platforms
- Optimization of human-computer dialogue systems
- Speech-based affective computing research
The paper cites 34 related references covering depression detection, speech processing, deep learning, and other relevant domains, providing a solid theoretical foundation for the research.
Overall Assessment: This is a research paper with significant practical value. While technical innovation is relatively limited, it addresses critical problems in practical applications and provides valuable guidance for designing and optimizing speech-based depression detection systems. The research methodology is systematic, the data scale is large, and the conclusions are practical, making it highly meaningful for advancing practical applications in this field.