GPT-4 on Clinical Depression Assessment: An LLM-Based Pilot Study

Lorenzoni, Velmovitsky, Alencar et al.
Depression has impacted millions of people worldwide and has become one of the most prevalent mental disorders. Early mental disorder detection can lead to cost savings for public health agencies and avoid the onset of other major comorbidities. Additionally, the shortage of specialized personnel is a critical issue because clinical depression diagnosis is highly dependent on expert professionals and is time consuming. In this study, we explore the use of GPT-4 for clinical depression assessment based on transcript analysis. We examine the model's ability to classify patient interviews into binary categories: depressed and not depressed. A comparative analysis is conducted considering prompt complexity (e.g., using both simple and complex prompts) as well as varied temperature settings to assess the impact of prompt complexity and randomness on the model's performance. Results indicate that GPT-4 exhibits considerable variability in accuracy and F1-Score across configurations, with optimal performance observed at lower temperature values (0.0-0.2) for complex prompts. However, beyond a certain threshold (temperature >= 0.3), the relationship between randomness and performance becomes unpredictable, diminishing the gains from prompt complexity. These findings suggest that, while GPT-4 shows promise for clinical assessment, the configuration of the prompts and model parameters requires careful calibration to ensure consistent results. This preliminary study contributes to understanding the dynamics between prompt engineering and large language models, offering insights for future development of AI-powered tools in clinical settings.

GPT-4 on Clinical Depression Assessment: An LLM-Based Pilot Study

Basic Information

  • Paper ID: 2501.00199
  • Title: GPT-4 on Clinical Depression Assessment: An LLM-Based Pilot Study
  • Authors: Giuliano Lorenzoni, Pedro Elkind Velmovitsky, Paulo Alencar, Donald Cowan
  • Classification: cs.CL (Computation and Language), cs.AI (Artificial Intelligence)
  • Publication Date: December 31, 2024 (arXiv preprint)
  • Paper Link: https://arxiv.org/abs/2501.00199

Abstract

Depression has affected millions of people globally, becoming one of the most prevalent mental disorders. Early detection of mental illness can reduce costs for public health institutions and prevent other serious complications. Furthermore, the shortage of mental health professionals is a critical issue, as clinical depression diagnosis is highly dependent on expert assessment and is time-consuming.

This study explores the use of GPT-4 for clinical depression assessment based on interview transcripts. The research examines the model's ability to classify patient interviews into binary categories (depressed and non-depressed). Comparative analysis is conducted by considering prompt complexity (simple and complex prompts) and different temperature settings to evaluate the impact of prompt complexity and randomness on model performance.

Results demonstrate significant variability in accuracy and F1 scores across different configurations, with optimal performance observed at lower temperature values (0.0-0.2) with complex prompts. However, beyond a certain threshold (temperature ≥0.3), the relationship between randomness and performance becomes unpredictable, diminishing the benefits of prompt complexity.

Research Background and Motivation

Problem Definition

The core problem addressed in this study is how to leverage the large language model GPT-4 to assist in clinical depression diagnosis, particularly through analyzing patient interview transcripts for binary classification (depressed/non-depressed).

Problem Significance

  1. Global Health Burden: Depression is one of the most prevalent mental disorders globally, affecting millions of people
  2. Value of Early Detection: Early identification can significantly reduce healthcare costs and prevent serious complications
  3. Resource Scarcity: Mental health professionals are in severely short supply, and diagnosis depends on expert assessment and is time-consuming
  4. Technological Opportunity: Development of large language models provides new possibilities for automating mental health assessment

Limitations of Existing Methods

  1. Traditional Machine Learning Approaches: Primarily use SVM, TextCNN, and other methods with limited application on the DAIC-WOZ dataset
  2. Feature Engineering Dependency: Requires manual feature extraction, lacking end-to-end automation capability
  3. Insufficient LLM Application: While some research uses LLMs for depression detection, systematic studies on prompt engineering and parameter optimization are lacking

Research Motivation

By systematically investigating GPT-4's application in clinical depression assessment, particularly focusing on how prompt engineering strategies and model parameters (such as temperature) affect performance, this study aims to provide empirical foundations for AI-assisted mental health diagnosis.

Core Contributions

  1. First systematic study of GPT-4's application in clinical depression binary classification tasks, conducting comprehensive evaluation based on the DAIC-WOZ dataset
  2. Proposes a progressive prompt engineering strategy that moves from simple prompts to example-enhanced prompts to complex prompts, systematically analyzing the impact of each complexity level on performance
  3. In-depth analysis of temperature parameter's impact on model stability and performance, discovering the optimal temperature range of 0.0-0.2
  4. Reveals non-linear relationship between prompt complexity and randomness, providing guidance for parameter optimization in clinical AI applications
  5. Provides practical configuration strategies for AI-assisted mental health diagnosis, emphasizing the importance of reducing false negatives in clinical environments

Methodology Details

Task Definition

  • Input: Transcribed text from patient interviews (from the DAIC-WOZ dataset)
  • Output: Binary classification result ("depressed" or "not depressed")
  • Constraints: Standardized diagnostic criteria based on the PHQ-8 scale
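
As a minimal sketch of how a PHQ-8 score becomes the binary target: the PHQ-8 sums eight items scored 0 to 3, and a total of 10 or more is the conventional cutoff for a positive screen. The cutoff and the function name `phq8_label` are assumptions for illustration; this summary does not state the exact threshold applied in the paper.

```python
def phq8_label(item_scores, cutoff=10):
    """Map eight PHQ-8 item scores (each 0-3) to a binary label.

    The cutoff of 10 is the conventional PHQ-8 threshold; it is an
    assumption here, not a value quoted from the paper.
    """
    if len(item_scores) != 8:
        raise ValueError("PHQ-8 has exactly eight items")
    if any(s not in (0, 1, 2, 3) for s in item_scores):
        raise ValueError("each item score must be 0, 1, 2, or 3")
    total = sum(item_scores)
    return "depressed" if total >= cutoff else "not depressed"


print(phq8_label([1, 1, 1, 1, 1, 1, 1, 1]))  # total of 8, below the cutoff
print(phq8_label([2, 2, 1, 1, 1, 1, 1, 1]))  # total of 10, at the cutoff
```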

Experimental Design Architecture

This study employs a five-stage progressive experimental design:

RQ1: Simple Prompt Baseline

Uses the most basic classification prompt without any context or examples, serving as a performance baseline.

RQ2: Example-Enhanced Prompts

Adds four examples (two depression cases, two non-depression cases) to the simple prompt, employing few-shot learning strategy.

RQ3: Complex Prompt Design

Combines examples with detailed clinical context, simulating the analytical perspective of professional psychopathologists, providing richer guidance information.
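
The three prompt levels in RQ1 to RQ3 might be assembled along these lines. This is only a sketch: the instruction text, the clinical context, and the placeholder examples are hypothetical, since the paper's actual prompt wording is not reproduced in this summary.

```python
# Hypothetical wording; the paper's actual prompts are not reproduced here.
SIMPLE_INSTRUCTION = (
    "Classify the interview below as 'depressed' or 'not depressed'. "
    "Answer with the label only."
)
CLINICAL_CONTEXT = (
    "Act as a psychopathologist reviewing a clinical interview and weigh "
    "mood, sleep, energy, and interest in daily activities."
)


def build_prompt(transcript, examples=(), context=""):
    """Assemble a prompt from optional context, optional examples, and the target."""
    parts = [context] if context else []
    parts.append(SIMPLE_INSTRUCTION)
    for text, label in examples:
        parts.append(f"Transcript: {text}\nLabel: {label}")
    parts.append(f"Transcript: {transcript}\nLabel:")
    return "\n\n".join(parts)


# Placeholder examples: two per class, mirroring the balanced selection.
EXAMPLES = [
    ("<depressed example 1>", "depressed"),
    ("<depressed example 2>", "depressed"),
    ("<non-depressed example 1>", "not depressed"),
    ("<non-depressed example 2>", "not depressed"),
]

target = "<interview transcript>"
simple_prompt = build_prompt(target)                               # RQ1
few_shot_prompt = build_prompt(target, EXAMPLES)                   # RQ2
complex_prompt = build_prompt(target, EXAMPLES, CLINICAL_CONTEXT)  # RQ3
print(complex_prompt)
```

Keeping one builder for all three levels means the only difference between conditions is which optional parts are supplied, which makes the comparison across RQ1 to RQ3 easier to reason about.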

RQ4: Temperature Parameter Optimization

Systematically tests the impact of different temperature values (0.0, 0.1, 0.2, 0.3, 0.5) on model performance.
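
A temperature sweep of this kind can be sketched as a loop over the tested values. The `classify` callable below is a hypothetical wrapper around the model call, and `fake_classify` is a stand-in so the sketch runs offline; neither comes from the paper.

```python
def sweep_temperatures(classify, transcripts, labels, temperatures):
    """Return accuracy per temperature.

    `classify(transcript, temperature)` is a hypothetical callable that wraps
    the model call and returns "depressed" or "not depressed".
    """
    results = {}
    for t in temperatures:
        preds = [classify(x, t) for x in transcripts]
        correct = sum(p == y for p, y in zip(preds, labels))
        results[t] = correct / len(labels)
    return results


# Stand-in classifier so the sketch runs without network access.
# A real run would send the prompt to the model at the given temperature.
def fake_classify(transcript, temperature):
    return "depressed" if "hopeless" in transcript else "not depressed"


demo = sweep_temperatures(
    fake_classify,
    ["I feel hopeless", "I slept well"],
    ["depressed", "not depressed"],
    [0.0, 0.1, 0.2, 0.3, 0.5],
)
print(demo)
```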

RQ5: Stability Analysis

Analyzes the impact of output variability on GPT-4's reliability for clinical diagnosis.

Technical Innovations

  1. Progressive Prompt Complexity Design: Systematic prompt engineering methodology progressing from simple to complex
  2. Temperature-Performance Relationship Modeling: First systematic study of temperature parameter's role in clinical classification tasks
  3. Clinically-Oriented Evaluation Framework: Focuses on reducing false negatives, aligning with clinical practice requirements
  4. Training-Free Direct Inference: Entirely based on pre-trained model's zero-shot and few-shot capabilities

Experimental Setup

Dataset

DAIC-WOZ (Distress Analysis Interview Corpus - Wizard-of-Oz)

  • Scale: 189 interview sessions, with 184-188 actually used (slight variation due to data processing issues)
  • Annotation: Based on PHQ-8 scale, 56 depression cases, approximately 130 non-depression cases
  • Data Type: Interview transcripts
  • Data Distribution: Approximately 30% depression cases, 70% non-depression cases (imbalanced dataset)

Evaluation Metrics

  • Accuracy: Overall classification correctness rate
  • Precision: Proportion of true positives among predicted positives
  • Recall: Proportion of actual positives correctly identified
  • F1 Score: Harmonic mean of precision and recall
  • Confusion Matrix: Detailed distribution of classification results

Implementation Details

  • API Interface: OpenAI GPT-4 API
  • Programming Environment: Python + Pandas + NumPy + scikit-learn + Matplotlib/Seaborn
  • Temperature Values: 0.0, 0.1, 0.2, 0.3, and 0.5
  • Example Selection: Balanced selection of two positive and two negative cases

Experimental Results

Main Results

RQ1: Simple Prompt Baseline Results

Metric      Value
Accuracy    70.74%
Precision   54.55%
Recall      10.71%
F1 Score    17.91%

Confusion Matrix: 127 true negatives, 5 false positives, 50 false negatives, 6 true positives
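
The reported RQ1 metrics can be recomputed from the confusion matrix counts. The sketch below uses plain Python; the function name `metrics_from_counts` and the added majority-class baseline are illustrative additions, not part of the paper.

```python
def metrics_from_counts(tn, fp, fn, tp):
    """Compute standard binary classification metrics from confusion-matrix counts."""
    total = tn + fp + fn + tp
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {
        "accuracy": (tp + tn) / total,
        "precision": precision,
        "recall": recall,
        "f1": f1,
        # Accuracy of always predicting the larger class, for reference.
        "majority_baseline": max(tn + fp, fn + tp) / total,
    }


rq1 = metrics_from_counts(tn=127, fp=5, fn=50, tp=6)
for name, value in rq1.items():
    print(f"{name}: {value:.2%}")
```

On these counts the recomputed values match the table, and the majority-class baseline comes out at about 70.21%, so the simple prompt's accuracy sits only slightly above always answering "not depressed".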

RQ2: Example-Enhanced Prompt Results

Metric      Value
Accuracy    70.49%
Precision   50.00%
Recall      77.78%
F1 Score    60.87%

Key Finding: Recall rose from 10.71% to 77.78% and F1 score from 17.91% to 60.87%, while accuracy stayed nearly unchanged (70.74% to 70.49%) and precision fell slightly (54.55% to 50.00%)

RQ3: Complex Prompt Results

Metric      Value
Accuracy    69.23%
Precision   48.39%
Recall      55.56%
F1 Score    51.72%

Unexpected Finding: Relative to the example-enhanced prompt, the complex prompt lowered recall (77.78% to 55.56%) and F1 score (60.87% to 51.72%), possibly due to excessive randomness introduced by default temperature settings

RQ4: Temperature Optimization Results

Temperature   Accuracy   Precision   Recall   F1 Score
0.0           72.28%     51.95%      74.07%   61.07%
0.1           73.37%     53.09%      79.63%   63.70%
0.2           71.74%     51.16%      81.48%   62.86%
0.3           67.93%     46.67%      64.81%   54.26%
0.5           68.48%     47.56%      72.22%   57.35%
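
As a quick check on the temperature results, the sketch below copies the reported rows, picks the best setting by F1, and confirms that each F1 value agrees with its precision and recall. The tolerance is an assumption to absorb rounding, since the inputs are given to two decimals.

```python
# (temperature, accuracy, precision, recall, f1) in percent, copied from the results.
ROWS = [
    (0.0, 72.28, 51.95, 74.07, 61.07),
    (0.1, 73.37, 53.09, 79.63, 63.70),
    (0.2, 71.74, 51.16, 81.48, 62.86),
    (0.3, 67.93, 46.67, 64.81, 54.26),
    (0.5, 68.48, 47.56, 72.22, 57.35),
]


def f1_from(precision, recall):
    """Harmonic mean of precision and recall (same units in, same units out)."""
    return 2 * precision * recall / (precision + recall)


best_by_f1 = max(ROWS, key=lambda r: r[4])
best_by_recall = max(ROWS, key=lambda r: r[3])
print("best temperature by F1:", best_by_f1[0])
print("best temperature by recall:", best_by_recall[0])

for temp, _acc, prec, rec, f1 in ROWS:
    recomputed = f1_from(prec, rec)
    # Inputs are rounded to two decimals, so allow a small tolerance.
    assert abs(recomputed - f1) < 0.05, (temp, recomputed, f1)
```

Selecting by recall instead of F1 would favor a different setting, which matters if the priority is reducing false negatives.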

Key Experimental Findings

  1. Optimal Temperature Range: The 0.0-0.2 interval shows the best performance, with temperature 0.1 achieving the highest accuracy of 73.37% and F1 score of 63.70%
  2. Non-linear Temperature-Performance Relationship: F1 score drops to 54.26% at temperature 0.3 and only partially recovers to 57.35% at 0.5, so performance at temperature ≥0.3 fluctuates unpredictably rather than declining steadily
  3. Significant Example Learning Effect: Few-shot learning improved F1 score from 17.91% to 60.87%
  4. Complexity Paradox: Overly complex prompts at default temperature actually reduce performance
  5. Clinical Metric Optimization: At temperatures 0.0-0.2, recall stays between 74.07% and 81.48% while precision remains above 51%, giving the best balance between missed cases and false alarms

Ablation Study Analysis

Through progressive experimental design, the contribution of each component becomes clear:

  • Basic Classification Capability: Simple prompts reach 70.74% accuracy, but with only 10.71% recall they miss most depression cases
  • Example Learning Gains: Few-shot learning significantly improves recall (from 10.71% to 77.78%)
  • Temperature Optimization Value: Appropriate temperature settings further optimize performance balance
  • Complexity Cost: Over-engineered prompts may introduce noise

Related Work

Traditional Machine Learning Methods

Existing research primarily employs SVM, TextCNN, and other traditional ML methods on the DAIC-WOZ dataset for depression detection, focusing on speech features and text sentiment analysis, but lacking end-to-end automation capability.

LLM Applications in Mental Health

  • E-DAIC Research: Uses an LLM to predict PHQ-8 scores, achieving a mean absolute error of 3.65
  • Cross-Domain LLM Applications: Demonstrates potential in sentiment analysis and classification tasks in finance, software engineering, and other fields

Relative Advantages of This Study

  1. Systematic Prompt Engineering: First systematic study of prompt complexity's impact on clinical classification
  2. Parameter Sensitivity Analysis: In-depth analysis of temperature parameter's impact on stability
  3. Clinically-Oriented Design: Focuses on reducing false negatives, aligning with clinical practice requirements

Conclusions and Discussion

Main Conclusions

  1. GPT-4 Demonstrates Potential for Clinical Depression Classification: Can achieve 73.37% accuracy and 63.70% F1 score under appropriate configuration
  2. Prompt Engineering Strategies Are Effective: Example enhancement significantly improves performance, particularly recall
  3. Temperature Parameter Is Critical: Low temperature range of 0.0-0.2 provides optimal stability and performance balance
  4. Complexity Requires Careful Balancing: Overly complex prompts may introduce unnecessary variability
  5. Clinical Application Requires Fine-Tuning: Parameter configuration significantly impacts consistency and reliability

Limitations

  1. Dataset Scale Limitations: Only 189 samples, which may affect result generalizability
  2. Data Imbalance Issues: 30% depression rate is much higher than real-world prevalence, potentially introducing bias
  3. Single Data Source: Only uses DAIC-WOZ dataset, lacking cross-dataset validation
  4. Randomness Impact: Model's inherent randomness may affect result consistency
  5. Lack of Professional Validation: No comparison with clinical expert diagnoses

Future Directions

  1. Retrieval-Augmented Generation (RAG): Integrate external medical knowledge bases to improve diagnostic accuracy
  2. Domain-Specific Fine-Tuning: Specialized training using clinical data
  3. Multimodal Fusion: Incorporate multiple modalities including speech and video
  4. Variability Control Strategies: Explore methods for aggregating results from multiple runs
  5. Large-Scale Clinical Validation: Verify on larger and more diverse clinical datasets
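
One simple way to aggregate results from multiple runs is a majority vote over repeated classifications of the same transcript. The sketch below is an assumption about how such aggregation could look, not a method from the paper; breaking ties toward "depressed" is a design choice made here to favor fewer false negatives.

```python
from collections import Counter


def majority_vote(labels):
    """Aggregate repeated predictions for one transcript.

    Ties resolve to "depressed", an assumed choice that favors recall.
    """
    if not labels:
        raise ValueError("need at least one prediction")
    counts = Counter(labels)
    if counts["depressed"] >= counts["not depressed"]:
        return "depressed"
    return "not depressed"


runs = ["depressed", "not depressed", "depressed"]
print(majority_vote(runs))
```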

In-Depth Evaluation

Strengths

  1. Rigorous Research Design: Progressive experimental design clearly demonstrates the impact of various factors
  2. High Practical Value: Provides practical guidance for AI-assisted mental health diagnosis
  3. In-Depth Parameter Analysis: Systematic analysis of temperature parameter's impact on performance
  4. Clear Clinical Orientation: Emphasizes reducing false negatives, aligning with clinical practice
  5. Transparent and Detailed Results: Provides detailed confusion matrices and performance metrics

Weaknesses

  1. Limited Sample Size: 189 samples is a small evaluation set, so the reported metrics carry wide uncertainty
  2. Lack of Statistical Significance Testing: Does not report statistical significance of results
  3. Insufficient Randomness Control: Does not average over multiple runs to control random variation
  4. Limited Baseline Comparisons: Lacks comparison with other LLMs or traditional methods
  5. Missing Clinical Validation: No comparison with actual clinical expert diagnoses
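
To illustrate the missing uncertainty estimates, a Wilson score interval can be attached to a reported accuracy. The sketch assumes the best accuracy of 73.37% corresponds to 135 correct out of 184 sessions; that count is consistent with the reported percentage but is inferred here, not stated for that run.

```python
import math


def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 for about 95%)."""
    p = successes / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return center - half, center + half


# Assumed counts: 135 / 184 = 73.37% accuracy.
low, high = wilson_interval(135, 184)
print(f"accuracy 95% interval: {low:.1%} to {high:.1%}")
```

Under these assumptions the interval spans roughly 67% to 79%, wide enough to overlap most of the other configurations, which is why differences between settings should be read cautiously.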

Impact

  1. Academic Contribution: Provides important reference for LLM applications in mental health
  2. Practical Value: Offers configuration strategy guidance for clinical AI tool development
  3. Methodological Value: Prompt engineering and parameter optimization methods are generalizable to other clinical tasks
  4. Policy Impact: Provides empirical support for regulation and standard-setting in AI-assisted healthcare

Applicable Scenarios

  1. Clinical Auxiliary Diagnosis: As a supplementary tool for mental health specialists
  2. Large-Scale Screening: Initial screening in resource-limited regions
  3. Telemedicine: Supporting online mental health services
  4. Research Tool: Data preprocessing for large-scale mental health research

References

The paper cites 20 relevant references, covering:

  • Research related to the DAIC-WOZ dataset
  • Traditional machine learning applications in depression detection
  • LLM classification and generation tasks across various domains
  • Standardized mental health assessment tools (PHQ-8)

Overall Assessment: This is a high-quality preliminary study that systematically explores the potential of GPT-4 in clinical depression assessment. The research design is sound, experimental results are valuable, and it makes important contributions to the field of AI-assisted mental health diagnosis. Despite limitations in sample size and validation, it provides a solid foundation for subsequent research.