Optimizing Speech-Input Length for Speaker-Independent Depression Classification

Rutowski, Harati, Lu et al.

Machine learning models for speech-based depression classification offer promise for health care applications. Despite growing work on depression classification, little is understood about how the length of speech input affects model performance. We analyze results for speaker-independent depression classification using a corpus of over 1400 hours of speech from a human-machine health screening application. We examine performance as a function of response input length for two NLP systems that differ in overall performance. Results for both systems show that performance depends on natural length, elapsed length, and ordering of the response within a session. Systems share a minimum length threshold, but differ in a response saturation threshold, with the latter higher for the better system. At saturation, it is better to pose a new question to the speaker than to continue the current response. These and additional reported results suggest how applications can be better designed to both elicit and process optimal input lengths for depression classification.

Basic Information

Paper ID: 2501.00608
Title: Optimizing Speech-Input Length for Speaker-Independent Depression Classification
Authors: Tomasz Rutowski, Amir Harati, Yang Lu, Elizabeth Shriberg (Ellipsis Health, Inc.)
Classification: cs.CL eess.AS
Keywords: depression, speech, paralinguistics, affective computing, NLP, health applications, deep learning

Abstract

This paper investigates the impact of speech input length on the performance of machine learning-based depression classification. Using a large-scale corpus containing over 1,400 hours of speech data, the study analyzes the performance of two NLP systems with different performance levels across varying response input lengths. Results demonstrate that system performance depends on natural length, elapsed time, and the sequence of responses within a session. Both systems share a minimum length threshold, but differ in response saturation thresholds, with the better-performing system exhibiting a higher saturation threshold.

Research Background and Motivation

Problem Definition

Depression is a prevalent disabling condition and a major global public health issue. Mobile AI technology plays an important role in expanding depression screening, particularly as an auxiliary tool for healthcare providers. Speech technology shows promise due to its naturalness, remote usability, lack of requirement for specialized training, and its capacity to carry information about the speaker's state.

Research Motivation

Research Gap: Despite growing research on speech-based depression classification, there is limited understanding of how speech input length affects model performance
Practical Considerations: Longer inputs increase patient time costs and system infrastructure costs
Optimization Requirement: Need to find optimal balance between performance and efficiency

Limitations of Existing Approaches

The first-order assumption that "more speech is better" in most speech technology tasks lacks thorough validation
Absence of systematic research on the relationship between input length and classification performance
Insufficient consideration of time and cost constraints in practical applications

Core Contributions

Large-Scale Data Analysis: Systematic analysis using a corpus of over 1,400 hours of speech data
Multi-Level Length Effect Study: Analysis of length effects at both individual response and multi-response session levels
Cross-System Comparison: Comparison of two NLP systems with different performance levels to validate the generality of length thresholds
Practical Design Principles: Specific recommendations for designing and optimizing depression classification applications
Unexpected Findings: Speakers progressively lengthen their responses over the course of a session

Methodology Details

Task Definition

Input: Spontaneous American English speech, free-form user responses to questions on various topics
Output: Binary classification task (depressed/non-depressed) based on PHQ-8 scores (≥10 indicates depression)
Constraint: Speaker-independent classification task
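
The labeling rule in the task definition can be sketched in a few lines. This is a minimal illustration, not the authors' code: the function names are hypothetical, and the only facts taken from the text are the PHQ-8 cutoff of 10 and the binary output. The 0-3 range per item is the standard PHQ-8 scoring.

```python
PHQ8_CUTOFF = 10  # from the task definition: PHQ-8 >= 10 indicates depression


def phq8_total(item_scores):
    """Sum the eight PHQ-8 item scores (each an integer from 0 to 3)."""
    if len(item_scores) != 8:
        raise ValueError("PHQ-8 has exactly eight items")
    if any(score not in (0, 1, 2, 3) for score in item_scores):
        raise ValueError("each PHQ-8 item is scored 0-3")
    return sum(item_scores)


def is_depressed(total_score, cutoff=PHQ8_CUTOFF):
    """Map a PHQ-8 total (0-24) to the binary label used for classification."""
    return total_score >= cutoff
```

A total of 9 maps to the negative class and a total of 10 to the positive class.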

Dataset Construction

Scale: 1,400 hours of speech, 9,600 independent users
Structure: Each session contains 4-6 question responses (average 4.52), with each response averaging 125 words
Annotation: PHQ-8 scale used as gold standard (PHQ-9 with suicidality question removed)
Split: No overlapping speakers between training and test sets
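
One way to guarantee a speaker-independent split is to partition speakers first and assign responses afterwards. The sketch below assumes each response is a dict with a `speaker_id` key; that layout, the `test_fraction` value, and the function name are hypothetical and are not taken from the paper.

```python
import random
from collections import defaultdict


def split_by_speaker(responses, test_fraction=0.2, seed=0):
    """Split responses so that no speaker appears in both train and test."""
    by_speaker = defaultdict(list)
    for response in responses:
        by_speaker[response["speaker_id"]].append(response)

    # Shuffle speakers (not responses) with a fixed seed for reproducibility.
    speakers = sorted(by_speaker)
    random.Random(seed).shuffle(speakers)

    n_test = max(1, round(len(speakers) * test_fraction))
    test_speakers = speakers[:n_test]
    train_speakers = speakers[n_test:]

    train = [r for s in train_speakers for r in by_speaker[s]]
    test = [r for s in test_speakers for r in by_speaker[s]]
    return train, test
```

Splitting at the response level instead would let the same speaker land on both sides, which would inflate test performance.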

Model Architecture

System 1 (Weaker System)

Method: SVM + word embeddings
Features: Word2Vec vectors with average pooling
Data: Smaller training set (650 hours, 6,600 users)
Vocabulary: 7,000 tokens
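
Average pooling of word vectors, as listed for System 1, can be illustrated without any external library. This is a sketch under stated assumptions: the `embeddings` lookup is a toy dict standing in for trained Word2Vec vectors, out-of-vocabulary tokens are simply skipped, and the function name is hypothetical. The resulting fixed-length vector is what a classifier such as an SVM would take as input.

```python
def average_pool(tokens, embeddings, dim):
    """Return the element-wise mean of the vectors of in-vocabulary tokens.

    Falls back to a zero vector when no token is in the vocabulary.
    """
    vectors = [embeddings[t] for t in tokens if t in embeddings]
    if not vectors:
        return [0.0] * dim
    # zip(*vectors) iterates over dimensions; average each one.
    return [sum(column) / len(vectors) for column in zip(*vectors)]


# Toy two-dimensional vectors, for illustration only.
toy_embeddings = {"tired": [1.0, 0.0], "today": [3.0, 2.0]}
pooled = average_pool(["tired", "today", "unseenword"], toy_embeddings, dim=2)
```

Because pooling discards word order and normalizes by length, a response of any length maps to a vector of the same size.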

System 2 (Stronger System)

Method: Deep learning model based on ULMFiT
Architecture: RNN-LSTM language model, pretrained on large-scale public corpora (e.g., Wikipedia) and fine-tuned
Data: Complete training set (1,400 hours, 9,600 users)
Vocabulary: 30,000 tokens

Technical Innovations

Cumulative Gated Length Metric: Definition of a new length assessment method showing the amount of information "so far" at any point
Multi-Dimensional Length Analysis: Simultaneous consideration of natural length, elapsed time, and sequence order within sessions
Cross-System Threshold Comparison: Validation of finding generality through comparison of systems with different performance levels
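
The summary does not give the exact definition of the cumulative gated length metric. One plausible reading, sketched below, is to keep only the first N words of a session in response order and score the truncated input, so that performance can be plotted against the amount of speech heard "so far". The function name and the list-of-word-lists input format are assumptions made for illustration.

```python
def gate_session(responses, budget):
    """Keep the first `budget` words of a session, preserving response order.

    `responses` is a list of responses, each a list of word strings.
    The last kept response may be truncated mid-response.
    """
    kept = []
    remaining = budget
    for words in responses:
        if remaining <= 0:
            break
        portion = words[:remaining]
        kept.append(portion)
        remaining -= len(portion)
    return kept


session = [["i", "feel", "fine"], ["not", "really"], ["maybe"]]
first_four_words = gate_session(session, budget=4)
```

Sweeping `budget` over a grid of values and scoring each truncated session would yield a performance-versus-length curve of the kind the analysis describes.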

Experimental Setup

Dataset Details

Dataset	Total Responses	Training (-dep)	Training (+dep)	Test (-dep)	Test (+dep)
Smaller (650h)	32,078	12,966	4,602	11,366	3,144
Larger (1400h)	64,518	35,715	14,293	11,366	3,144
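
The class balance implied by the table can be checked with simple arithmetic. The sketch below copies the counts from the table; the dictionary layout and function name are illustrative only.

```python
# Response counts copied from the dataset table above.
COUNTS = {
    "smaller": {"total": 32078, "train_neg": 12966, "train_pos": 4602,
                "test_neg": 11366, "test_pos": 3144},
    "larger": {"total": 64518, "train_neg": 35715, "train_pos": 14293,
               "test_neg": 11366, "test_pos": 3144},
}


def positive_rate(n_pos, n_neg):
    """Fraction of responses that carry the depressed (+dep) label."""
    return n_pos / (n_pos + n_neg)


train_rate_small = positive_rate(COUNTS["smaller"]["train_pos"], COUNTS["smaller"]["train_neg"])
train_rate_large = positive_rate(COUNTS["larger"]["train_pos"], COUNTS["larger"]["train_neg"])
test_rate = positive_rate(COUNTS["larger"]["test_pos"], COUNTS["larger"]["test_neg"])
```

The positive class makes up roughly 26% of the smaller training set, 29% of the larger one, and 22% of the shared test set, so the data are imbalanced in every split.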

Evaluation Metrics

Primary Metric: AUC (area under the ROC curve), suitable for binary tasks and imbalanced class distributions
Secondary Metrics: Specificity and sensitivity for medical domain evaluation
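
For readers less familiar with these metrics, the following sketch computes them directly from labels and classifier scores. It is a plain-Python illustration, not the evaluation code used in the paper; AUC is computed from its pairwise definition, which is quadratic in the number of examples and is only meant for small inputs.

```python
def auc(labels, scores):
    """Probability that a random positive outscores a random negative.

    Ties between a positive and a negative count as half a win.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one example of each class")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def sensitivity_specificity(labels, scores, threshold):
    """Sensitivity (true positive rate) and specificity (true negative rate)."""
    tp = sum(1 for y, s in zip(labels, scores) if y == 1 and s >= threshold)
    fn = sum(1 for y, s in zip(labels, scores) if y == 1 and s < threshold)
    tn = sum(1 for y, s in zip(labels, scores) if y == 0 and s < threshold)
    fp = sum(1 for y, s in zip(labels, scores) if y == 0 and s >= threshold)
    return tp / (tp + fn), tn / (tn + fp)


toy_labels = [0, 0, 1, 1]
toy_scores = [0.1, 0.4, 0.35, 0.8]
toy_auc = auc(toy_labels, toy_scores)
toy_sens, toy_spec = sensitivity_specificity(toy_labels, toy_scores, threshold=0.5)
```

AUC does not depend on a decision threshold, whereas sensitivity and specificity describe one operating point, which is why the latter pair is the natural format for comparison with clinician performance.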

Speech Processing

Transcription: Google Async ASR
Speech Rate Estimation: Global average speech rate 2.39 words/second (143.4 words/minute)
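
Because a single global rate is used, word counts and durations convert with one division. The sketch below uses the reported rate of 2.39 words/second; the function names are illustrative.

```python
GLOBAL_WORDS_PER_SECOND = 2.39  # global average speech rate reported above


def words_to_seconds(n_words, rate=GLOBAL_WORDS_PER_SECOND):
    """Estimate speaking time in seconds for a given word count."""
    return n_words / rate


def words_to_minutes(n_words, rate=GLOBAL_WORDS_PER_SECOND):
    """Estimate speaking time in minutes for a given word count."""
    return words_to_seconds(n_words, rate) / 60.0


words_per_minute = GLOBAL_WORDS_PER_SECOND * 60  # about 143.4
average_response_seconds = words_to_seconds(125)  # 125-word average response
```

At this rate the 125-word average response lasts about 52 seconds, and 1,000 words correspond to roughly 7 minutes of speech.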

Experimental Results

Speech Rate Analysis Findings

Depression-Related Speech Rate Reduction: Depressed group speech rate approximately 5 words/minute lower than non-depressed group, consistent with literature
Length-Related Speech Rate Reduction: Longer responses universally show slower speech rates, with differences of approximately 3-4 words/minute
Minor Effect: Overall differences are small, allowing use of global speech rate estimation

Aggregated Length Effects

Main Findings

Minimum Length Threshold: Both systems show sharp performance decline below 30-50 words
Response Saturation Point: AUC for individual responses saturates at approximately 250 words
Session Saturation Point: Session-level saturation occurs at approximately 1,000 words

System Performance Comparison

System 2 consistently outperforms System 1
Session-level performance exceeds single response performance
Both systems surpass unaided primary care physician performance (87% specificity/54% sensitivity)

Within-Session Length Effects

Response Accumulation Effects

Consistent Minimum Threshold: Regardless of response count, session minimum threshold is 30-50 words
Diminishing Returns: Benefit of N+1 responses compared to N responses decreases as N increases
Multi-Response Advantage: Given fixed length, more responses outperform fewer responses
New Response Benefit: Maximum benefit of starting a new response is approximately 4% AUC
Early Response Saturation: System 2 saturates at 200 words (System 1 at 120 words)

Unexpected Findings

Length Increment Pattern: Speakers tend to gradually increase response length throughout sessions
Short-Long Response Performance Crossover: Long responses ultimately perform better, but short responses perform better initially
Within-Response Thresholds: Existence of threshold lengths below which current responses should not be interrupted
- System 1: 80 words (continuation threshold) and 120 words (saturation threshold)
- System 2: 150 words (continuation threshold) and 200 words (saturation threshold)
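
The thresholds above lend themselves to a simple real-time decision rule. The sketch below is one hypothetical way to encode them, not a policy described in the paper: the action names, the class name, and the use of the approximate 1,000-word session saturation point as a stopping rule are all assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LengthThresholds:
    continuation: int  # words below which the response should not be interrupted
    saturation: int    # words beyond which a new question is preferable


SYSTEM_1 = LengthThresholds(continuation=80, saturation=120)
SYSTEM_2 = LengthThresholds(continuation=150, saturation=200)
SESSION_SATURATION_WORDS = 1000  # approximate session-level saturation point


def next_action(response_words, session_words, thresholds):
    """Suggest what the application should do next (hypothetical action names)."""
    if session_words >= SESSION_SATURATION_WORDS:
        return "end_session"
    if response_words < thresholds.continuation:
        return "keep_listening"
    if response_words < thresholds.saturation:
        return "switch_allowed"
    return "ask_new_question"
```

With these values, a 100-word response is still below the continuation threshold for System 2 but already past it for System 1, which reflects the finding that the stronger system benefits from longer responses.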

Key Numerical Results

Optimal Session Length: Approximately 7 minutes of total speech (about 1,000 words at the global rate of 143.4 words/minute)
Value of Second Half of Response: 6% AUC higher than first half
Cross-System Performance Difference: The better system makes more effective use of additional words

The paper cites 34 related works covering depression detection, speech emotion computing, and multimodal assessment, with particular mention of the AVEC challenge series advancing the field. Compared to existing work, this paper focuses on the practical yet overlooked problem of input length.

Conclusions and Discussion

Main Conclusions

Length Thresholds Exist: Clear minimum and saturation length thresholds are present
System Dependency: Better systems have higher saturation thresholds and better utilize additional information
Session Strategy: Multiple short responses outperform fewer long responses
Real-Time Application Guidance: Can provide real-time guidance on when to continue, switch questions, or end sessions

Limitations

Data Specificity: Specific length and speech rate values may vary across different datasets, languages, and age groups
Task Specificity: Results primarily apply to depression classification tasks
Technology Dependency: Based on specific ASR and NLP technologies

Future Directions

Cross-Language Validation: Validate findings across different languages and cultural contexts
Real-Time System Development: Develop adaptive systems that optimize length in real-time
Multi-Task Extension: Extend findings to other mental health classification tasks

In-Depth Evaluation

Strengths

High Practical Value: Directly addresses critical problems in practical applications
Large Data Scale: Uses one of the largest datasets in the field to date
Systematic Methodology: Multi-dimensional and multi-level analytical approach
Meaningful Findings: Reveals interesting patterns in speaker behavior
Strong Application Guidance: Provides specific design recommendations

Limitations

Limited Technical Innovation: Primarily analytical research with relatively conventional technical methods
Generalization Validity Pending: Cross-domain generalization capability requires further verification
Insufficient Theoretical Explanation: Lacks in-depth theoretical explanation of observed phenomena

Impact

Field Contribution: Fills gap in input length research for speech-based depression detection
Practical Value: Provides important design guidance for real-world system deployment
Reproducibility: Clear methodology; discussions ongoing with Linguistic Data Consortium regarding data release

Applicable Scenarios

Speech-based mental health screening applications
Telemedicine and digital health platforms
Optimization of human-computer dialogue systems
Speech emotion computing research

Overall Assessment: This is a research paper with significant practical value. While technical innovation is relatively limited, it addresses critical problems in practical applications and provides valuable guidance for designing and optimizing speech-based depression detection systems. The research methodology is systematic, the data scale is large, and the conclusions are practical, making it highly meaningful for advancing practical applications in this field.