
GPT-4 on Clinical Depression Assessment: An LLM-Based Pilot Study

Lorenzoni, Velmovitsky, Alencar et al.

Depression has impacted millions of people worldwide and has become one of the most prevalent mental disorders. Early mental disorder detection can lead to cost savings for public health agencies and avoid the onset of other major comorbidities. Additionally, the shortage of specialized personnel is a critical issue because clinical depression diagnosis is highly dependent on expert professionals and is time consuming. In this study, we explore the use of GPT-4 for clinical depression assessment based on transcript analysis. We examine the model's ability to classify patient interviews into binary categories: depressed and not depressed. A comparative analysis is conducted considering prompt complexity (e.g., using both simple and complex prompts) as well as varied temperature settings to assess the impact of prompt complexity and randomness on the model's performance. Results indicate that GPT-4 exhibits considerable variability in accuracy and F1-Score across configurations, with optimal performance observed at lower temperature values (0.0-0.2) for complex prompts. However, beyond a certain threshold (temperature >= 0.3), the relationship between randomness and performance becomes unpredictable, diminishing the gains from prompt complexity. These findings suggest that, while GPT-4 shows promise for clinical assessment, the configuration of the prompts and model parameters requires careful calibration to ensure consistent results. This preliminary study contributes to understanding the dynamics between prompt engineering and large language models, offering insights for future development of AI-powered tools in clinical settings.



Basic Information

Paper ID: 2501.00199
Title: GPT-4 on Clinical Depression Assessment: An LLM-Based Pilot Study
Authors: Giuliano Lorenzoni, Pedro Elkind Velmovitsky, Paulo Alencar, Donald Cowan
Classification: cs.CL (Computation and Language), cs.AI (Artificial Intelligence)
Publication Date: December 31, 2024 (arXiv preprint)
Paper Link: https://arxiv.org/abs/2501.00199

Abstract

Depression has affected millions of people globally, becoming one of the most prevalent mental disorders. Early detection of mental illness can reduce costs for public health institutions and prevent other serious complications. Furthermore, the shortage of mental health professionals is a critical issue, as clinical depression diagnosis is highly dependent on expert assessment and is time-consuming.

This study explores the use of GPT-4 for clinical depression assessment based on interview transcripts. The research examines the model's ability to classify patient interviews into binary categories (depressed and non-depressed). Comparative analysis is conducted by considering prompt complexity (simple and complex prompts) and different temperature settings to evaluate the impact of prompt complexity and randomness on model performance.

Results demonstrate significant variability in accuracy and F1 scores across different configurations, with optimal performance observed at lower temperature values (0.0-0.2) with complex prompts. However, beyond a certain threshold (temperature ≥0.3), the relationship between randomness and performance becomes unpredictable, diminishing the benefits of prompt complexity.

Research Background and Motivation

Problem Definition

The core problem addressed in this study is how to leverage the large language model GPT-4 to assist in clinical depression diagnosis, particularly through analyzing patient interview transcripts for binary classification (depressed/non-depressed).

Problem Significance

Global Health Burden: Depression is one of the most prevalent mental disorders globally, affecting millions of people
Value of Early Detection: Early identification can significantly reduce healthcare costs and prevent serious complications
Resource Scarcity: Mental health professionals are in severe shortage, and diagnosis is both time-consuming and dependent on expert assessment
Technological Opportunity: Development of large language models provides new possibilities for automating mental health assessment

Limitations of Existing Methods

Traditional Machine Learning Approaches: Primarily use SVM, TextCNN, and other methods with limited application on the DAIC-WOZ dataset
Feature Engineering Dependency: Requires manual feature extraction, lacking end-to-end automation capability
Insufficient LLM Application: While some research uses LLMs for depression detection, systematic studies on prompt engineering and parameter optimization are lacking

Research Motivation

By systematically investigating GPT-4's application in clinical depression assessment, particularly focusing on how prompt engineering strategies and model parameters (such as temperature) affect performance, this study aims to provide empirical foundations for AI-assisted mental health diagnosis.

Core Contributions

First systematic study of GPT-4's application in clinical depression binary classification tasks, conducting comprehensive evaluation based on the DAIC-WOZ dataset
Proposes progressive prompt engineering strategies, moving from simple prompts to example-enhanced prompts to complex prompts, and systematically analyzes the impact of each complexity level on performance
In-depth analysis of temperature parameter's impact on model stability and performance, discovering the optimal temperature range of 0.0-0.2
Reveals non-linear relationship between prompt complexity and randomness, providing guidance for parameter optimization in clinical AI applications
Provides practical configuration strategies for AI-assisted mental health diagnosis, emphasizing the importance of reducing false negatives in clinical environments

Methodology Details

Task Definition

Input: Transcribed text from patient interviews (from DAIC-WOZ dataset)
Output: Binary classification result ("depressed" or "not depressed")
Constraints: Standardized diagnostic criteria based on PHQ-8 scale
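
Because the label "not depressed" contains the word "depressed", mapping a free-text model reply onto the binary output needs some care. The sketch below shows one possible way to normalize a reply. The function name parse_label and the choice to return None for unparseable replies are illustrative assumptions; the summary does not say how the paper handled this step.

```python
import re
from typing import Optional

# Hypothetical helper: the paper's reply-parsing logic is not described.
_NEGATIVE = re.compile(r"\b(not|non)[\s-]*depressed\b", re.IGNORECASE)
_POSITIVE = re.compile(r"\bdepressed\b", re.IGNORECASE)


def parse_label(reply: str) -> Optional[int]:
    """Map a free-text reply to 1 (depressed), 0 (not depressed), or None."""
    # Check the negative form first, because it contains the positive keyword.
    if _NEGATIVE.search(reply):
        return 0
    if _POSITIVE.search(reply):
        return 1
    return None  # The reply did not contain a recognizable label.
```

Checking the negative pattern first avoids the naive substring bug in which every "not depressed" reply is counted as a positive case.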

Experimental Design Architecture

This study employs a five-stage progressive experimental design:

RQ1: Simple Prompt Baseline

Uses the most basic classification prompt without any context or examples, serving as a performance baseline.

RQ2: Example-Enhanced Prompts

Adds four examples (two depression cases, two non-depression cases) to the simple prompt, employing few-shot learning strategy.

RQ3: Complex Prompt Design

Combines examples with detailed clinical context, simulating the analytical perspective of professional psychopathologists, providing richer guidance information.
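
The three prompt levels described above (simple, example-enhanced, complex) can be sketched as a single builder function. All prompt wording, constant names, and the build_prompt signature below are hypothetical; the paper's actual prompts are not reproduced in this summary.

```python
from typing import Sequence, Tuple

# Illustrative wording only; these are not the prompts used in the paper.
SIMPLE_INSTRUCTION = (
    "Classify the following interview transcript as 'depressed' or "
    "'not depressed'. Answer with the category only."
)
CLINICAL_CONTEXT = (
    "You are an experienced psychopathologist reviewing a clinical interview."
)


def build_prompt(
    transcript: str,
    examples: Sequence[Tuple[str, str]] = (),
    clinical_context: bool = False,
) -> str:
    """Assemble a prompt at one of the three complexity levels."""
    parts = []
    if clinical_context:
        # RQ3: prepend clinical framing to the instruction.
        parts.append(CLINICAL_CONTEXT)
    parts.append(SIMPLE_INSTRUCTION)
    # RQ2: add labeled few-shot examples (the study used two per class).
    for text, label in examples:
        parts.append(f"Transcript: {text}\nLabel: {label}")
    # The transcript to classify always comes last.
    parts.append(f"Transcript: {transcript}\nLabel:")
    return "\n\n".join(parts)
```

Calling build_prompt(t) gives the RQ1 baseline, build_prompt(t, examples) the RQ2 variant, and build_prompt(t, examples, clinical_context=True) the RQ3 variant.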

RQ4: Temperature Parameter Optimization

Systematically tests the impact of different temperature values (0.0, 0.1, 0.2, 0.3, 0.5) on model performance.
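
A temperature sweep of this kind might be organized as in the sketch below. The classifier is passed in as a function so that the loop runs without network access; in practice that function would wrap a call to the model API with the given temperature. The names Classifier, f1_score, and sweep_temperatures are assumptions made for illustration, not the paper's code.

```python
from typing import Callable, Dict, List, Sequence

# Hypothetical signature: (transcript, temperature) -> 1 for "depressed",
# 0 for "not depressed". A real implementation would call the LLM here.
Classifier = Callable[[str, float], int]


def f1_score(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """F1 for the positive class, computed from raw label lists."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def sweep_temperatures(
    transcripts: List[str],
    labels: List[int],
    classify: Classifier,
    temperatures: Sequence[float] = (0.0, 0.1, 0.2, 0.3, 0.5),
) -> Dict[float, float]:
    """Run the classifier once per temperature and record the F1 score."""
    results = {}
    for temp in temperatures:
        preds = [classify(text, temp) for text in transcripts]
        results[temp] = f1_score(labels, preds)
    return results
```

Note that a single pass per temperature, as sketched here, does not separate the effect of temperature from run-to-run noise; repeating each setting several times would be needed for that.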

RQ5: Stability Analysis

Analyzes the impact of output variability on GPT-4's reliability for clinical diagnosis.

Technical Innovations

Progressive Prompt Complexity Design: Systematic prompt engineering methodology progressing from simple to complex
Temperature-Performance Relationship Modeling: First systematic study of temperature parameter's role in clinical classification tasks
Clinically-Oriented Evaluation Framework: Focuses on reducing false negatives, aligning with clinical practice requirements
Training-Free Direct Inference: Entirely based on pre-trained model's zero-shot and few-shot capabilities

Experimental Setup

Dataset

DAIC-WOZ (Distress Analysis Interview Corpus - Wizard-of-Oz)

Scale: 189 interview sessions, with 184-188 actually used (slight variation due to data processing issues)
Annotation: Based on PHQ-8 scale, 56 depression cases, approximately 130 non-depression cases
Data Type: Interview transcripts
Data Distribution: Approximately 30% depression cases, 70% non-depression cases (imbalanced dataset)
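
With a class split of roughly 30/70, accuracy alone can be misleading: a classifier that always answers "not depressed" already scores about 70% while detecting no depression cases at all. The short sketch below makes this concrete using the counts given above; the value 130 is the approximate figure from this summary, and the function name is an illustrative assumption.

```python
from typing import Tuple


def majority_baseline(n_positive: int, n_negative: int) -> Tuple[float, float]:
    """Accuracy and positive-class recall when always predicting the majority class."""
    total = n_positive + n_negative
    predicts_positive = n_positive > n_negative
    correct = n_positive if predicts_positive else n_negative
    accuracy = correct / total
    # Recall on the depressed class is 1.0 only if that class is the majority.
    recall = 1.0 if predicts_positive else 0.0
    return accuracy, recall


# 56 depression cases and approximately 130 non-depression cases.
accuracy, recall = majority_baseline(n_positive=56, n_negative=130)
print(f"accuracy={accuracy:.2%}, recall={recall:.2%}")
```

This is why recall and F1, rather than accuracy, are the more informative metrics for comparing configurations on this dataset.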

Evaluation Metrics

Accuracy: Overall classification correctness rate
Precision: Proportion of true positives among predicted positives
Recall: Proportion of actual positives correctly identified
F1 Score: Harmonic mean of precision and recall
Confusion Matrix: Detailed distribution of classification results

Implementation Details

API Interface: OpenAI GPT-4 API
Programming Environment: Python + Pandas + NumPy + scikit-learn + Matplotlib/Seaborn
Temperature Range: 0.0 to 0.5 in 0.1 steps; results are reported for 0.0, 0.1, 0.2, 0.3, and 0.5
Example Selection: Balanced selection of two positive and two negative cases

Experimental Results

Main Results

RQ1: Simple Prompt Baseline Results

Metric	Value
Accuracy	70.74%
Precision	54.55%
Recall	10.71%
F1 Score	17.91%

Confusion Matrix: 127 true negatives, 5 false positives, 50 false negatives, 6 true positives
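
The reported RQ1 metrics can be recomputed directly from these four confusion-matrix counts. The sketch below uses the standard definitions; the function name is an illustrative choice, and specificity is included as an extra derived value that the summary does not report.

```python
from typing import Dict


def metrics_from_counts(tn: int, fp: int, fn: int, tp: int) -> Dict[str, float]:
    """Compute standard binary classification metrics from confusion-matrix counts."""
    total = tn + fp + fn + tp
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (
        2 * precision * recall / (precision + recall)
        if (precision + recall)
        else 0.0
    )
    return {
        "accuracy": (tp + tn) / total,
        "precision": precision,
        "recall": recall,
        "f1": f1,
        # Not reported in the summary; derived here for completeness.
        "specificity": tn / (tn + fp) if (tn + fp) else 0.0,
    }


# Counts from the RQ1 confusion matrix.
rq1 = metrics_from_counts(tn=127, fp=5, fn=50, tp=6)
for name, value in rq1.items():
    print(f"{name}: {value:.2%}")
```

The four counts sum to 188 sessions, and the recomputed accuracy, precision, recall, and F1 match the table values to two decimal places. The 50 false negatives against only 6 true positives show that the simple prompt almost always answers "not depressed".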

RQ2: Example-Enhanced Prompt Results

Metric	Value
Accuracy	70.49%
Precision	50.00%
Recall	77.78%
F1 Score	60.87%

Key Finding: Recall rose from 10.71% to 77.78% and F1 score from 17.91% to 60.87%, while accuracy stayed essentially unchanged (70.74% vs. 70.49%)

RQ3: Complex Prompt Results

Metric	Value
Accuracy	69.23%
Precision	48.39%
Recall	55.56%
F1 Score	51.72%

Unexpected Finding: Performance fell relative to the example-enhanced prompt (F1 score 51.72% vs. 60.87%, recall 55.56% vs. 77.78%), possibly due to excessive randomness introduced by the default temperature setting

RQ4: Temperature Optimization Results

Temperature	Accuracy	Precision	Recall	F1 Score
0.0	72.28%	51.95%	74.07%	61.07%
0.1	73.37%	53.09%	79.63%	63.70%
0.2	71.74%	51.16%	81.48%	62.86%
0.3	67.93%	46.67%	64.81%	54.26%
0.5	68.48%	47.56%	72.22%	57.35%
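
Which temperature counts as "best" depends on the metric being optimized. The sketch below, which simply copies the table values into a dictionary, picks the top setting per metric and compares mean F1 in the low range (up to 0.2) with the higher settings. The names RESULTS and best_by are illustrative.

```python
from statistics import mean

# (accuracy, precision, recall, f1) in percent, copied from the RQ4 table.
RESULTS = {
    0.0: (72.28, 51.95, 74.07, 61.07),
    0.1: (73.37, 53.09, 79.63, 63.70),
    0.2: (71.74, 51.16, 81.48, 62.86),
    0.3: (67.93, 46.67, 64.81, 54.26),
    0.5: (68.48, 47.56, 72.22, 57.35),
}
RECALL, F1 = 2, 3  # Column indices into each tuple.


def best_by(metric_index: int) -> float:
    """Return the temperature with the highest value for the given metric."""
    return max(RESULTS, key=lambda temp: RESULTS[temp][metric_index])


best_f1_temp = best_by(F1)          # Temperature with the highest F1.
best_recall_temp = best_by(RECALL)  # Temperature with the highest recall.
low_f1 = mean(RESULTS[t][F1] for t in RESULTS if t <= 0.2)
high_f1 = mean(RESULTS[t][F1] for t in RESULTS if t > 0.2)
print(best_f1_temp, best_recall_temp, round(low_f1, 2), round(high_f1, 2))
```

F1 and accuracy peak at temperature 0.1, whereas recall peaks at 0.2, so a deployment that prioritizes avoiding false negatives might reasonably prefer 0.2 over 0.1.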

Key Experimental Findings

Optimal Temperature Range: The 0.0-0.2 interval shows the best performance, with temperature 0.1 achieving the highest accuracy of 73.37% and F1 score of 63.70%
Non-linear Temperature-Performance Relationship: Performance drops at temperature ≥0.3 and fluctuates unpredictably, with F1 score falling to 54.26% at 0.3 and only partly recovering to 57.35% at 0.5
Significant Example Learning Effect: Few-shot learning improved F1 score from 17.91% to 60.87%
Complexity Paradox: Overly complex prompts at default temperature actually reduce performance
Clinical Metric Optimization: Low temperature settings (0.0-0.2) give the highest recall (74.07%-81.48%) while keeping precision above 51%

Ablation Study Analysis

Through progressive experimental design, the contribution of each component becomes clear:

Basic Classification Capability: Simple prompts already possess certain classification ability (70.74% accuracy)
Example Learning Gains: Few-shot learning significantly improves recall (from 10.71% to 77.78%)
Temperature Optimization Value: Appropriate temperature settings further optimize performance balance
Complexity Cost: Over-engineered prompts may introduce noise

Related Work

Traditional Machine Learning Methods

Existing research primarily employs SVM, TextCNN, and other traditional ML methods on the DAIC-WOZ dataset for depression detection, focusing on speech features and text sentiment analysis, but lacking end-to-end automation capability.

LLM Applications in Mental Health

E-DAIC Research: Uses LLM to predict PHQ-8 scores, achieving 3.65 mean absolute error
Cross-Domain LLM Applications: Demonstrates potential in sentiment analysis and classification tasks in finance, software engineering, and other fields

Relative Advantages of This Study

Systematic Prompt Engineering: First systematic study of prompt complexity's impact on clinical classification
Parameter Sensitivity Analysis: In-depth analysis of temperature parameter's impact on stability
Clinically-Oriented Design: Focuses on reducing false negatives, aligning with clinical practice requirements

Conclusions and Discussion

Main Conclusions

GPT-4 Demonstrates Potential for Clinical Depression Classification: Can achieve 73.37% accuracy and 63.70% F1 score under appropriate configuration
Prompt Engineering Strategies Are Effective: Example enhancement significantly improves performance, particularly recall
Temperature Parameter Is Critical: Low temperature range of 0.0-0.2 provides optimal stability and performance balance
Complexity Requires Careful Balancing: Overly complex prompts may introduce unnecessary variability
Clinical Application Requires Careful Calibration: Prompt and parameter configuration significantly impacts consistency and reliability

Limitations

Dataset Scale Limitations: Only 189 samples, which may affect result generalizability
Data Imbalance Issues: 30% depression rate is much higher than real-world prevalence, potentially introducing bias
Single Data Source: Only uses DAIC-WOZ dataset, lacking cross-dataset validation
Randomness Impact: Model's inherent randomness may affect result consistency
Lack of Professional Validation: No comparison with clinical expert diagnoses

Future Directions

Retrieval-Augmented Generation (RAG): Integrate external medical knowledge bases to improve diagnostic accuracy
Domain-Specific Fine-Tuning: Specialized training using clinical data
Multimodal Fusion: Incorporate multiple modalities including speech and video
Variability Control Strategies: Explore methods for aggregating results from multiple runs
Large-Scale Clinical Validation: Verify on larger and more diverse clinical datasets
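
One simple way to aggregate results from multiple runs, as suggested under Variability Control Strategies, is a per-transcript majority vote. The sketch below is only one possible strategy and is not taken from the paper; the function names and the tie-breaking rule (ties resolve to "depressed" to favor fewer false negatives) are assumptions.

```python
from collections import Counter
from typing import List, Sequence


def majority_vote(run_labels: Sequence[int]) -> int:
    """Return the most frequent label across repeated runs for one transcript."""
    counts = Counter(run_labels)
    # Assumption of this sketch: ties go to 1 ("depressed"), which trades
    # extra false positives for fewer false negatives.
    return 1 if counts[1] >= counts[0] else 0


def aggregate(runs: List[List[int]]) -> List[int]:
    """Combine runs, where runs[r][i] is the label from run r for transcript i."""
    return [majority_vote(labels) for labels in zip(*runs)]


# Three hypothetical runs over the same three transcripts.
print(aggregate([[1, 0, 0], [1, 1, 0], [0, 1, 0]]))
```

Using an odd number of runs avoids ties altogether, which makes the tie-breaking rule irrelevant in practice.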

In-Depth Evaluation

Strengths

Rigorous Research Design: Progressive experimental design clearly demonstrates the impact of various factors
High Practical Value: Provides practical guidance for AI-assisted mental health diagnosis
In-Depth Parameter Analysis: Systematic analysis of temperature parameter's impact on performance
Clear Clinical Orientation: Emphasizes reducing false negatives, aligning with clinical practice
Transparent and Detailed Results: Provides detailed confusion matrices and performance metrics

Weaknesses

Limited Sample Size: 189 samples are relatively limited for deep learning research
Lack of Statistical Significance Testing: Does not report statistical significance of results
Insufficient Randomness Control: Does not average over multiple runs to control for random variation
Limited Baseline Comparisons: Lacks comparison with other LLMs or traditional methods
Missing Clinical Validation: No comparison with actual clinical expert diagnoses
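
To illustrate what the missing uncertainty estimates might look like, the sketch below computes a 95% Wilson score interval for the best reported accuracy. The Wilson interval is one common choice for a proportion and is not something the paper reports. The counts (135 correct out of 184) are inferred from the 73.37% accuracy and the 184-188 sessions used, so they should be treated as an assumption.

```python
import math
from typing import Tuple


def wilson_interval(successes: int, n: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a binomial proportion (z=1.96 for about 95%)."""
    p = successes / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half_width = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return center - half_width, center + half_width


# Assumed counts: 135 / 184 reproduces the reported 73.37% accuracy.
low, high = wilson_interval(successes=135, n=184)
print(f"95% interval for accuracy: {low:.3f} to {high:.3f}")
```

With fewer than 200 sessions the interval spans more than ten percentage points, so accuracy differences of one or two points between configurations are hard to distinguish from sampling noise.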

Impact

Academic Contribution: Provides important reference for LLM applications in mental health
Practical Value: Offers configuration strategy guidance for clinical AI tool development
Methodological Value: Prompt engineering and parameter optimization methods are generalizable to other clinical tasks
Policy Impact: Provides empirical support for regulation and standard-setting in AI-assisted healthcare

Applicable Scenarios

Clinical Auxiliary Diagnosis: As a supplementary tool for mental health specialists
Large-Scale Screening: Initial screening in resource-limited regions
Telemedicine: Supporting online mental health services
Research Tool: Data preprocessing for large-scale mental health research

References

The paper cites 20 relevant references, covering:

Research related to the DAIC-WOZ dataset
Traditional machine learning applications in depression detection
LLM classification and generation tasks across various domains
Standardized mental health assessment tools (PHQ-8)

Overall Assessment: This is a high-quality preliminary study that systematically explores the potential of GPT-4 in clinical depression assessment. The research design is sound, experimental results are valuable, and it makes important contributions to the field of AI-assisted mental health diagnosis. Despite limitations in sample size and validation, it provides a solid foundation for subsequent research.