Large language models (LLMs) are increasingly attracting the attention of healthcare professionals for their potential to assist in diagnostic assessments, which could alleviate the strain on the healthcare system caused by a high patient load and a shortage of providers. For LLMs to be effective in supporting diagnostic assessments, it is essential that they closely replicate the standard diagnostic procedures used by clinicians. In this paper, we specifically examine the diagnostic assessment processes described in the Patient Health Questionnaire-9 (PHQ-9) for major depressive disorder (MDD) and the Generalized Anxiety Disorder-7 (GAD-7) questionnaire for generalized anxiety disorder (GAD). We investigate various prompting and fine-tuning techniques to guide both proprietary and open-source LLMs in adhering to these processes, and we evaluate the agreement between LLM-generated diagnostic outcomes and expert-validated ground truth. For fine-tuning, we use the MentalLlama and Llama models, while for prompting, we experiment with proprietary models like GPT-3.5 and GPT-4o, as well as open-source models such as Llama-3.1-8b and Mixtral-8x7b.
Large Language Models for Mental Health Diagnostic Assessments: Exploring The Potential of Large Language Models for Assisting with Mental Health Diagnostic Assessments -- The Depression and Anxiety Case
- Paper ID: 2501.01305
- Title: Large Language Models for Mental Health Diagnostic Assessments: Exploring The Potential of Large Language Models for Assisting with Mental Health Diagnostic Assessments -- The Depression and Anxiety Case
- Authors: Kaushik Roy, Harshul Surana, Darssan Eswaramoorthi, Yuxin Zi, Vedant Palit, Ritvik Garimella, Amit Sheth
- Classification: cs.CL (Computation and Language)
- Publication Date: January 2, 2025 (arXiv preprint)
- Paper Link: https://arxiv.org/abs/2501.01305
- Institutions: University of South Carolina AI Institute, Indian Institute of Research and Science, Indian Institute of Technology
Large Language Models (LLMs) are increasingly attracting the attention of medical professionals for assisting diagnostic assessments, with the potential to alleviate healthcare system pressures caused by patient overload and healthcare provider shortages. For LLMs to play an effective role in supporting diagnostic assessments, they must be capable of closely replicating standard diagnostic procedures used by clinicians. This paper specifically investigates the diagnostic assessment process using the Patient Health Questionnaire-9 (PHQ-9) for Major Depressive Disorder (MDD) and the Generalized Anxiety Disorder-7 (GAD-7) questionnaire for Generalized Anxiety Disorder (GAD). The study explores various prompting and fine-tuning techniques to guide both proprietary and open-source LLMs to follow these diagnostic procedures, and evaluates the consistency between LLM-generated diagnostic results and expert-validated ground truth.
- Healthcare System Pressures: The current healthcare system faces dual pressures of patient overload and healthcare provider shortages
- Mental Health Diagnostic Needs: Increasing prevalence of mental health issues necessitates standardized diagnostic assessment tools
- Potential of LLMs in Healthcare: Large language models perform strongly on natural language processing tasks and show promise in medical dialogue scenarios
- Standardized Diagnosis: PHQ-9 and GAD-7 are widely-used standardized assessment tools in clinical practice
- Automation Needs: Automating diagnostic assessments through LLMs can reduce clinician burden
- Consistency Requirements: LLMs must be capable of replicating standard diagnostic procedures used by clinicians for practical application
- Scoring Methods: Based solely on text relevance scoring, lacking deep understanding
- Explainable AI Methods: Using surrogate models such as LIME/SHAP, but with limited clinical interpretability
- Text Span Identification: Lacking specialized guidance for specific diagnostic criteria
- Novel Specialized Model: Proposes DiagnosticLlama, the first fine-tuned model based on Llama architecture specifically designed for diagnostic criteria assessment
- Comprehensive Evaluation Framework: Establishes a comprehensive evaluation system encompassing both prompting and fine-tuning methods
- High-Quality Dataset: Constructs an expert-validated synthetic dataset annotated by LLMs to facilitate related research
- Multi-Model Comparison: Systematically compares performance of proprietary models (GPT-3.5, GPT-4o) and open-source models (Llama-3.1-8b, Mixtral-8x7b)
- Standardized Methodology: Provides standardized approaches for applying LLMs to PHQ-9 and GAD-7 diagnostic assessments
Input: Social media post text (as a proxy for patient-clinician interaction)
Output: Text span identification and symptom presence judgment for each item in PHQ-9/GAD-7
Constraints: Must strictly adhere to standard diagnostic procedures of PHQ-9 and GAD-7
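The input/output contract above can be pictured as a small data structure. The following is a minimal sketch, not the paper's implementation: the item labels in `PHQ9_ITEMS` are paraphrased shorthand, and `ItemAssessment`, `empty_assessment`, and `record_span` are hypothetical names introduced only for illustration. The one constraint it encodes is that a recorded span must be quoted verbatim from the post.

```python
from dataclasses import dataclass, field
from typing import List

# Assumed shorthand labels for the nine PHQ-9 items (paraphrased, not official wording).
PHQ9_ITEMS = [
    "little interest or pleasure",
    "feeling down or hopeless",
    "sleep problems",
    "tired or low energy",
    "appetite changes",
    "feeling like a failure",
    "trouble concentrating",
    "moving slowly or restlessness",
    "thoughts of self-harm",
]


@dataclass
class ItemAssessment:
    """Per-item output: symptom presence plus the supporting text spans."""
    item: str
    present: bool = False
    spans: List[str] = field(default_factory=list)


def empty_assessment(items):
    # One entry per questionnaire item, all initially marked absent.
    return [ItemAssessment(item=name) for name in items]


def record_span(assessment, item, span, post):
    # Only accept spans that appear verbatim in the post.
    if span not in post:
        raise ValueError("span must be quoted verbatim from the post")
    for entry in assessment:
        if entry.item == item:
            entry.spans.append(span)
            entry.present = True
            return entry
    raise KeyError(item)
```

A GAD-7 assessment would use the same structure with a seven-item label list.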
- Naive Prompting: Direct instructional prompts
- Few-Shot Prompting: Prompts providing a small number of examples
- Guided Prompting: Chain-of-Thought prompts with reasoning step guidance
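The three prompting styles differ mainly in how much scaffolding surrounds the same question. The sketch below is illustrative only: the prompt wording, the function names, and the `(post, span, label)` example format are assumptions, not the templates used in the paper.

```python
def naive_prompt(post, symptom):
    # Direct instruction with no examples and no reasoning guidance.
    return (
        f"Post: {post}\n"
        f"Does the post show the symptom '{symptom}'? "
        "Answer yes or no and quote the supporting text span."
    )


def few_shot_prompt(post, symptom, examples):
    # examples: list of (post, span, label) tuples shown before the real query.
    shots = "\n\n".join(
        f"Post: {ex_post}\nSymptom: {symptom}\nSpan: {ex_span}\nPresent: {ex_label}"
        for ex_post, ex_span, ex_label in examples
    )
    return f"{shots}\n\nPost: {post}\nSymptom: {symptom}\nSpan:"


def guided_prompt(post, symptom):
    # Chain-of-thought style: spell out the reasoning steps to follow.
    steps = [
        f"Restate what the symptom '{symptom}' means in plain language.",
        "List any sentences in the post that could relate to it.",
        "Decide whether those sentences are sufficient evidence.",
        "Answer yes or no and quote the supporting text span verbatim.",
    ]
    numbered = "\n".join(f"Step {i}: {s}" for i, s in enumerate(steps, 1))
    return f"Post: {post}\n\nReason step by step:\n{numbered}"
```

Keeping the three builders behind the same `(post, symptom)` signature makes it straightforward to send an identical query to every model under comparison.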
- Base Model: MentalLlama (trained on 105K mental health instruction data)
- DiagnosticLlama: MentalLlama fine-tuned on the PRIMATE dataset using HuggingFace AutoTrain
- Base Data: PRIMATE dataset (social media posts + PHQ-9 annotations)
- GPT-4o Enhancement: Using GPT-4o to identify text spans corresponding to symptoms
- Expert Validation: Three clinical experts validate GPT-4o outputs (Cohen's Kappa: 0.74 for PHQ-9, 0.72 for GAD-7)
- Quality Control: Retaining only annotations consistently approved by experts
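For readers unfamiliar with the agreement statistic reported above, Cohen's kappa compares the observed agreement between two raters with the agreement expected by chance. The sketch below covers a single pair of raters. How the paper aggregated agreement across its three experts is not stated in this summary, so no aggregation is shown. The function name and label encoding are illustrative.

```python
from collections import Counter


def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters labelling the same items."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("raters must label the same non-empty set of items")
    n = len(rater_a)
    # Observed agreement: fraction of items given identical labels.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement: product of each rater's marginal label frequencies.
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (observed - expected) / (1 - expected)
```

A value of 1 means perfect agreement and 0 means agreement no better than chance.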
- Symptom-Specific Guidance: Specialized prompt templates designed for each symptom in PHQ-9 and GAD-7
- Multi-Level Evaluation: Dual evaluation system combining hits@k ranking and standard classification metrics
- Cross-Model Consistency: Validating method effectiveness across multiple LLMs of different scales and types
- Clinical Validation: Incorporating professional clinicians for quality verification to ensure clinical relevance
- PRIMATE Dataset: Contains social media posts with PHQ-9-related annotations
- Expert-Validated Subset:
  - PHQ-9: 40 GPT-4o-annotated samples verified by experts
  - GAD-7: 17 GPT-4o-annotated samples verified by experts
- Model-Annotated Data: Multi-model annotation results for a total of 1,034 posts
- Hits@k Ranking Metrics:
  - hits@1: Share of cases in which the most similar text span ranks first against the ground truth
  - hits@5: Share of cases in which the most similar text span ranks within the top 5 against the ground truth
- Standard Classification Metrics: Accuracy, Precision, Recall, F1-score
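The two metric families can be sketched as follows, under stated assumptions. The hits@k reading used here is one plausible interpretation: for each predicted span, the ground-truth spans are ranked by similarity, and a hit is counted when the span for the same questionnaire item appears in the top k. The similarity function is a stand-in based on `difflib.SequenceMatcher`, since the paper's similarity measure is not given in this summary. All function names are illustrative.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    # Stand-in similarity (assumption): character-level ratio in [0, 1].
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def hits_at_k(predictions, ground_truth, k):
    """predictions / ground_truth: dicts mapping item name -> text span."""
    if not predictions:
        raise ValueError("no predictions to score")
    hits = 0
    for item, predicted in predictions.items():
        # Rank ground-truth items by how similar their span is to the prediction.
        ranked = sorted(
            ground_truth,
            key=lambda ref: similarity(predicted, ground_truth[ref]),
            reverse=True,
        )
        hits += item in ranked[:k]
    return hits / len(predictions)


def classification_metrics(y_true, y_pred):
    """Binary symptom-presence metrics; labels are 1 (present) or 0 (absent)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == 1 and p == 1 for t, p in pairs)
    fp = sum(t == 0 and p == 1 for t, p in pairs)
    fn = sum(t == 1 and p == 0 for t, p in pairs)
    tn = sum(t == 0 and p == 0 for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    accuracy = (tp + tn) / len(pairs) if pairs else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}
```

Because F1 is the harmonic mean of precision and recall, it always lies between the two, which is a quick sanity check when reading result tables.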
- Proprietary Models: GPT-3.5-Turbo, GPT-4o-mini
- Open-Source Models: Llama-3.1-8b, Mixtral-8x7b
- Fine-Tuned Models: MentalLlama, DiagnosticLlama
- Traditional Methods: BERT, MentalBERT, MentalRoBERTa
- Machine Learning Methods: Logistic Regression, Random Forest, XGBoost
- Code-free fine-tuning using HuggingFace AutoTrain
- Identical prompt structure applied across all models to ensure fair comparison
- Random selection of test subsets due to budget and API constraints
Proprietary Model Performance:
| Model | hits@1 | hits@5 | Accuracy | Precision | Recall | F1-score |
|---|---|---|---|---|---|---|
| GPT-3.5-Turbo | 87% | 98% | 0.93 | 0.89 | 0.96 | 0.92 |
| GPT-4o-mini | 89% | 99% | 0.94 | 0.96 | 0.98 | 0.92 |
Open-Source Model Performance:
| Model | hits@1 | hits@5 | Accuracy | Precision | Recall | F1-score |
|---|---|---|---|---|---|---|
| Llama-3.1-8b | 83% | 88% | 0.84 | 0.86 | 0.78 | 0.82 |
| Mixtral-8x7b | 92% | 99% | 0.92 | 0.96 | 0.95 | 0.93 |
Fine-Tuned Model Performance:
| Model | hits@1 | hits@5 | Accuracy | Precision | Recall | F1-score |
|---|---|---|---|---|---|---|
| MentalLlama | - | - | 0.82 | 0.83 | 0.63 | 0.75 |
| DiagnosticLlama | 68.3% | 76.2% | - | - | - | - |
GAD-7 results demonstrate similar trends to PHQ-9, with both proprietary and open-source models approaching human annotation quality.
- Model Performance Disparities: Newer-generation LLMs significantly outperform older versions
  - Llama2-7b-chat: F1=0.663
  - Mistral-instruct: F1=0.655
- Fine-Tuning Challenges: Fine-tuning LLMs for specialized diagnostic tasks is highly challenging
  - MentalLlama directly repeats input, demonstrating the importance of fine-tuning configuration
  - DiagnosticLlama shows improvement but requires further optimization
- Traditional Method Comparison:
  - BERT: F1=0.69
  - MentalBERT: F1=0.71
  - MentalRoBERTa: F1=0.48
  - Traditional ML methods perform worse than the best BERT-based models (best: XGBoost, F1=0.65)
The paper demonstrates through concrete examples how models identify text spans corresponding to PHQ-9 symptoms, such as identifying "I thought I set myself up for success. Now I believe I was dead wrong for joining" as corresponding to the symptom "feeling like a failure."
- Scoring Methods: Text scoring and ranking based on relevance to PHQ-9/GAD-7 symptoms
- Explainable AI Methods: Using LIME/SHAP techniques to clinically interpret BERT model outputs
- Text Span Identification: Predicting and summarizing text spans for comparison with manual annotations
- Specialized Guidance: Highly specialized model output guidance targeting specific diagnostic criteria
- Novelty: First diagnostic-specific fine-tuned model based on Llama architecture
- Systematicity: Provides systematic comparison of both prompting and fine-tuning methods
- Few-Shot Learning Effectiveness: LLMs in few-shot settings can approach the assessment quality of expert clinicians
- Reasoning Differences: Despite comparable results, LLM reasoning processes differ significantly from clinical reasoning
- Fine-Tuning Challenges: Fine-tuning LLMs for mental health diagnostic assistance still faces major technical challenges
- Practical Potential: The research provides a promising direction for alleviating healthcare system pressures
- Reasoning Consistency: Limited alignment between LLM and clinician reasoning processes
- Dataset Scale: Expert-validated ground truth dataset is relatively small
- Budget Constraints: API costs limit large-scale experimental validation
- Fine-Tuning Complexity: Fine-tuning requires substantial resources and hyperparameter tuning
- Clinical Applications: Develop applications for clinician use
- Extended Assessment: Extend DiagnosticLlama to GAD-7 and increase dataset size
- Complex Questionnaires: Support non-linear structured questionnaires (e.g., CSSRS)
- Safety Constraints: Integrate terminology restrictions and output rewriting to ensure safety
- Strong Clinical Relevance: Directly targets widely-used standardized assessment tools in clinical practice
- Comprehensive Methodology: Encompasses both prompting and fine-tuning mainstream methods
- Rigorous Evaluation: Incorporates professional clinician verification to ensure result credibility
- Open-Source Contribution: Provides models and datasets for community use
- Sufficient Experimentation: Systematic comparison across multiple models and metrics
- Dataset Scale: Expert-validated dataset is relatively small, potentially affecting generalizability of conclusions
- Domain Limitations: Covers only two conditions (depression and anxiety), with limited scope
- Reasoning Analysis: Insufficient analysis of differences between LLM and clinician reasoning processes
- Cost Considerations: Lacks cost-benefit analysis for practical deployment
- Ethical Discussion: Insufficient discussion of ethical issues in AI-assisted mental health diagnosis
- Academic Value: Provides important reference for LLM applications in mental health
- Practical Value: Provides technical foundation for healthcare institutions to deploy AI-assisted diagnostic systems
- Social Significance: Promises to alleviate mental health service resource shortages
- Reproducibility: Open-source code and datasets support research reproduction and extension
- Initial Screening: Suitable for large-scale mental health preliminary screening
- Diagnostic Assistance: Serves as an auxiliary tool rather than replacement for clinicians
- Telemedicine: Supports remote mental health services
- Research Tool: Provides automated analysis tools for mental health research
The paper cites 29 relevant references covering multiple related fields including LLMs, mental health assessment, prompt engineering, and fine-tuning techniques, providing a solid theoretical foundation for the research.
Overall Assessment: This is an important exploratory work in applying LLMs to mental health diagnostic assessment. The paper employs sound methodology, conducts sufficient experiments, and draws credible conclusions, making valuable contributions to this interdisciplinary field. Despite certain limitations, its pioneering significance and practical value make it an important reference for the field.