
An AI-Based Behavioral Health Safety Filter and Dataset for Identifying Mental Health Crises in Text-Based Conversations

Nelson, Wong, Silvestrini et al.

Large language models often mishandle psychiatric emergencies, offering harmful or inappropriate advice and enabling destructive behaviors. This study evaluated the Verily behavioral health safety filter (VBHSF) on two datasets: the Verily Mental Health Crisis Dataset, containing 1,800 simulated messages, and the NVIDIA Aegis AI Content Safety Dataset, subsetted to 794 mental health-related messages. Both datasets were clinician-labelled, and we evaluated performance against the clinician labels. Additionally, we carried out comparative performance analyses against two open-source content moderation guardrails: OpenAI Omni Moderation Latest and NVIDIA NeMo Guardrails. The VBHSF demonstrated well-balanced performance on the Verily Mental Health Crisis Dataset v1.0, achieving high sensitivity (0.990) and specificity (0.992) in detecting any mental health crisis. In identifying specific crisis categories, it achieved an F1-score of 0.939, sensitivity ranging from 0.917 to 0.992, and specificity >= 0.978. When evaluated against the NVIDIA Aegis AI Content Safety Dataset 2.0, the VBHSF retained high sensitivity (0.982) and accuracy (0.921), with reduced specificity (0.859). When compared with the NVIDIA NeMo and OpenAI Omni Moderation Latest guardrails, the VBHSF demonstrated superior performance metrics across both datasets, achieving significantly higher sensitivity in all cases (all p < 0.001) and higher specificity relative to NVIDIA NeMo (p < 0.001), but not relative to OpenAI Omni Moderation Latest (p = 0.094). NVIDIA NeMo and OpenAI Omni Moderation Latest exhibited inconsistent performance across specific crisis types, with sensitivity for some categories falling below 0.10. Overall, the VBHSF demonstrated robust, generalizable performance that prioritizes sensitivity to minimize missed crises, a crucial feature for healthcare applications.



Basic Information

Paper ID: 2510.12083
Title: An AI-Based Behavioral Health Safety Filter and Dataset for Identifying Mental Health Crises in Text-Based Conversations
Authors: Benjamin W. Nelson, Celeste Wong, Matthew T. Silvestrini, Sooyoon Shin, Alanna Robinson, Jessica Lee, Eric Yang, John Torous, Andrew Trister
Categories: cs.CL cs.AI
Publication: no journal or conference is indicated; the paper is a preprint
Paper link: https://arxiv.org/abs/2510.12083

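Abstract

Large language models frequently make mistakes when handling mental health crises, offering harmful or inappropriate advice and even enabling destructive behaviors. This study evaluated the Verily behavioral health safety filter (VBHSF) on two datasets: the Verily Mental Health Crisis Dataset, containing 1,800 simulated messages, and a subset of the NVIDIA Aegis AI Content Safety Dataset containing 794 mental health-related messages. Both datasets were labelled by clinicians. The study also compared performance against two open-source content moderation guardrails: OpenAI Omni Moderation Latest and NVIDIA NeMo Guardrails. On the Verily Mental Health Crisis Dataset v1.0, the VBHSF achieved high sensitivity (0.990) and specificity (0.992) in detecting any mental health crisis. In identifying specific crisis categories, it reached an F1-score of 0.939, sensitivity ranging from 0.917 to 0.992, and specificity >= 0.978. On the NVIDIA Aegis AI Content Safety Dataset 2.0, the VBHSF retained high sensitivity (0.982) and accuracy (0.921), but with reduced specificity (0.859). Compared with the existing guardrails, the VBHSF showed significantly higher sensitivity in all cases (all p < 0.001) and higher specificity than NVIDIA NeMo (p < 0.001), but no significant difference in specificity from OpenAI Omni Moderation Latest (p = 0.094).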
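The reported sensitivity, specificity, accuracy, and F1-score all derive from the same four confusion-matrix counts. The following is a minimal sketch of how such metrics can be computed from clinician labels and filter outputs; the function name `crisis_metrics`, the `BinaryMetrics` container, and the toy labels are illustrative assumptions, not the authors' evaluation code or data.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class BinaryMetrics:
    sensitivity: float  # TP / (TP + FN): share of true crises that were flagged
    specificity: float  # TN / (TN + FP): share of non-crises left unflagged
    accuracy: float     # (TP + TN) / all messages
    f1: float           # harmonic mean of precision and sensitivity


def _ratio(numerator: int, denominator: int) -> float:
    # Return 0.0 instead of raising when a class is absent from the data.
    return numerator / denominator if denominator else 0.0


def crisis_metrics(y_true: Sequence[int], y_pred: Sequence[int]) -> BinaryMetrics:
    """Hypothetical helper: 1 = crisis, 0 = no crisis.

    y_true holds the clinician labels, y_pred the filter's decisions.
    """
    if len(y_true) != len(y_pred):
        raise ValueError("label and prediction lists must have equal length")

    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)

    sensitivity = _ratio(tp, tp + fn)
    precision = _ratio(tp, tp + fp)
    return BinaryMetrics(
        sensitivity=sensitivity,
        specificity=_ratio(tn, tn + fp),
        accuracy=_ratio(tp + tn, len(y_true)),
        f1=_ratio(2 * tp, 2 * tp + fp + fn),
    )


# Toy data for illustration only (not drawn from either dataset).
clinician_labels = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
filter_outputs = [1, 1, 1, 0, 0, 0, 0, 0, 0, 1]
metrics = crisis_metrics(clinician_labels, filter_outputs)
```

A filter tuned to prioritize sensitivity accepts more false positives (lower specificity) in exchange for fewer missed crises, which matches the trade-off the abstract describes on the Aegis subset.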

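Research Background and Motivation

Problem Definition

Identifying and handling mental health crises is an increasingly serious societal problem. The paper's background notes:

Crises are common and rising: psychiatric emergencies are increasingly prevalent, and the trend is upward
Detection is difficult: even clinicians perform only slightly better than chance at detecting crises
Distress is expressed indirectly: individuals often communicate their distress in indirect ways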

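Limitations of Existing Technology

Current large language models have serious shortcomings in handling mental health crises:

High-risk failures: these include missing suicide warning signs, giving unsafe advice, and even enabling harm
General-purpose guardrails fall short: existing safety filters mainly target general risks (such as sexual content and general violence) and are not suited to detecting mental health crises
Lack of clinical validation: existing benchmark datasets lack mental health messages and clinical annotations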

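Research Motivation

The study aims to fill the following key gaps:

Develop a safety filter built specifically for mental health crises
Construct a clinically validated dataset for mental health crisis detection
Establish a standardized evaluation framework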

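Core Contributions

Defined eight mental health crisis dimensions: working with clinical experts, the authors identified the most urgent and high-risk presentations, including abuse, neglect, eating disorder behaviors, psychosis, self-harm, suicide, substance misuse, violence toward others, and mixed presentations
Developed the VBHSF system: a dedicated Transformer-based mental health safety filter that identifies and classifies crisis signals in user messages
Built the Verily Mental Health Crisis Dataset v1.0: 1,800 simulated messages that reflect real digital communication patterns, labelled by two licensed clinicians
Established an evaluation benchmark: performance was evaluated on internal and external datasets and compared against state-of-the-art general-purpose guardrails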
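This summary does not state which statistical test produced the reported p-values. One common choice for comparing two classifiers on the same set of messages is an exact McNemar test on the discordant pairs; the sketch below uses it purely as an assumption for illustration, and the function name `mcnemar_exact` and the toy predictions are hypothetical.

```python
from math import comb
from typing import Sequence, Tuple


def mcnemar_exact(
    y_true: Sequence[int], pred_a: Sequence[int], pred_b: Sequence[int]
) -> Tuple[int, int, float]:
    """Exact two-sided McNemar test for two classifiers on paired messages.

    Returns (b, c, p_value), where b counts messages only classifier A got
    right and c counts messages only classifier B got right.
    """
    if not (len(y_true) == len(pred_a) == len(pred_b)):
        raise ValueError("all three sequences must have equal length")

    b = sum(1 for t, a, o in zip(y_true, pred_a, pred_b) if a == t and o != t)
    c = sum(1 for t, a, o in zip(y_true, pred_a, pred_b) if a != t and o == t)
    n = b + c
    if n == 0:
        return b, c, 1.0  # the classifiers never disagree on correctness

    # Under the null hypothesis each discordant pair favors A or B with
    # probability 0.5, so min(b, c) follows Binomial(n, 0.5).
    tail = sum(comb(n, k) for k in range(min(b, c) + 1)) / 2**n
    return b, c, min(1.0, 2 * tail)


# Toy data for illustration only: 12 crisis messages (all labelled 1).
labels = [1] * 12
filter_a = [1] * 12            # flags every crisis
filter_b = [1, 1] + [0] * 10   # misses 10 of the 12 crises
b, c, p_value = mcnemar_exact(labels, filter_a, filter_b)
```

Restricting the inputs to crisis-positive messages, as in the toy data, compares sensitivity; restricting them to non-crisis messages compares specificity.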