Exploring Distributed Vector Databases Performance on HPC Platforms: A Study with Qdrant

Ockerman, Gueroudji, Oh et al.

Vector databases have rapidly grown in popularity, enabling efficient similarity search over data such as text, images, and video. They now play a central role in modern AI workflows, aiding large language models by grounding model outputs in external literature through retrieval-augmented generation. Despite their importance, little is known about the performance characteristics of vector databases in high-performance computing (HPC) systems that drive large-scale science. This work presents an empirical study of distributed vector database performance on the Polaris supercomputer in the Argonne Leadership Computing Facility. We construct a realistic biological-text workload from BV-BRC and generate embeddings from the peS2o corpus using Qwen3-Embedding-4B. We select Qdrant to evaluate insertion, index construction, and query latency with up to 32 workers. Informed by practical lessons from our experience, this work takes a first step toward characterizing vector database performance on HPC platforms to guide future research and optimization.

Basic Information

Paper ID: 2509.12384
Title: Exploring Distributed Vector Databases Performance on HPC Platforms: A Study with Qdrant
Authors: Seth Ockerman, Amal Gueroudji, Song Young Oh, Robert Underwood, Nicholas Chia, Kyle Chard, Robert Ross, Shivaram Venkataraman
Categories: cs.DC cs.DB
Publication date/venue: SC'25 Workshop Frontiers in Generative AI for HPC Science and Engineering: Foundations, Challenges, and Opportunities
Paper link: https://arxiv.org/abs/2509.12384

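Summary

Vector databases play a central role in modern AI workflows, particularly in retrieval-augmented generation (RAG) systems, where they improve model performance by grounding large language model outputs in external literature. Despite their growing importance in AI applications, little is known about their performance characteristics on high-performance computing (HPC) systems. This study empirically evaluates the distributed vector database Qdrant on the Polaris supercomputer at Argonne National Laboratory. It builds a realistic biological-text workload based on BV-BRC, generates embeddings with the Qwen3-Embedding-4B model, and evaluates insertion, index construction, and query performance with up to 32 workers.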

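Research Background and Motivation

Problem Definition

Core problem: The performance characteristics of vector databases in HPC environments have not been studied in depth; existing work focuses mainly on single-GPU or small-scale environments.
Importance: Large-scale scientific computing increasingly runs on HPC systems, so vector databases must adapt to the distinctive features of HPC environments (specialized interconnects, parallel file systems, deep memory hierarchies, heterogeneous hardware architectures).
Limitations of existing work:
- No performance evaluations of vector databases target HPC environments
- Existing studies mainly compare functional features and lack empirical performance evaluation
- Scientific workloads differ significantly from commercial applications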

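Research Motivation

As AI systems become widely used in scientific research, and RAG techniques in particular become commonplace, understanding how vector databases perform on HPC architectures provides important guidance for system design, performance optimization, and future research.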

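Core Contributions

First HPC evaluation: Evaluates Qdrant's distributed performance on the Polaris supercomputer, measuring insertion, index construction, and query performance with up to 32 workers (across 8 compute nodes)
Realistic scientific workload: Builds a realistic workload from BV-BRC biological data and the peS2o scientific text corpus
Performance characterization: Provides a first systematic analysis of vector database performance characteristics on HPC platforms
Open dataset: Releases the scientific embedding dataset and query workload for future research
Practical guidance: Offers practical recommendations and future research directions based on deployment experience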

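Method Details

Task Definition

The study builds an end-to-end biological RAG workflow consisting of:

Input: 22,723 genome-related terms from BV-BRC
Processing: Each term is used to search the peS2o dataset (8 million full-text papers) for relevant data
Output: Retrieval results that supply context to the RAG system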
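
The input, processing, and output stages above can be illustrated with a minimal, self-contained sketch. It is not the paper's pipeline: the `embed` function is a toy bag-of-words stand-in for Qwen3-Embedding-4B, the search is brute force rather than a Qdrant HNSW query, and the three-document corpus and the function names are hypothetical.

```python
import math
from collections import Counter


def embed(text):
    """Toy stand-in for an embedding model: a bag-of-words count vector."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[token] * b[token] for token in a if token in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def retrieve(term, corpus, k=2):
    """Return up to k documents most similar to the term, dropping zero-score ones."""
    query = embed(term)
    scored = [(cosine(query, embed(doc)), doc) for doc in corpus]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [doc for score, doc in scored[:k] if score > 0]


# Hypothetical miniature corpus standing in for the paper collection.
corpus = [
    "escherichia coli genome sequencing and annotation",
    "antibiotic resistance genes in klebsiella pneumoniae",
    "climate model ensembles on supercomputers",
]

# One input term yields the retrieved context that a RAG prompt would include.
context = retrieve("escherichia coli genome", corpus, k=2)
```

In a real deployment the per-term search would be a nearest-neighbour query against the vector database, and the returned passages would be concatenated into the language model prompt.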

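System Architecture

Distributed Vector Database Architecture

The paper compares two main distributed architectures:

Stateful architecture (used by Qdrant):
- Each worker stores state (index or data) and is responsible for computation
- A worker both "owns" and is responsible for a portion of the dataset
- Queries are broadcast to all workers; each worker runs an ANN search, and the results are then aggregated
Stateless architecture (compute-storage separation):
- Workers perform computation but do not persistently store data
- Data resides in a separate persistent storage layer
- Data is loaded into a cache layer when needed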
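
The stateful query path (broadcast, per-worker search, aggregation) can be sketched as a scatter-gather top-k merge. This is a simplified illustration, not Qdrant's implementation: each shard is a plain dictionary, the local search is exact rather than approximate, and the names `local_search` and `broadcast_query` are hypothetical.

```python
import heapq
from itertools import islice


def local_search(shard, query, k):
    """Search one worker's shard; exact squared-L2 search stands in for ANN."""
    scored = (
        (sum((q - x) ** 2 for q, x in zip(query, vector)), point_id)
        for point_id, vector in shard.items()
    )
    # nsmallest returns the k closest points sorted by ascending distance.
    return heapq.nsmallest(k, scored)


def broadcast_query(shards, query, k):
    """Send the query to every shard, then merge the partial results."""
    partials = [local_search(shard, query, k) for shard in shards]
    # Each partial list is already sorted, so a k-way merge keeps global order.
    merged = heapq.merge(*partials)
    return [point_id for _, point_id in islice(merged, k)]


# Three workers, each owning a disjoint part of a tiny 2-D dataset.
shards = [
    {1: (0.0, 0.0), 2: (5.0, 5.0)},
    {3: (1.0, 0.0), 4: (9.0, 9.0)},
    {5: (0.0, 2.0), 6: (7.0, 7.0)},
]

top = broadcast_query(shards, query=(0.0, 0.0), k=3)
```

Because every worker must return its own top k before the merge, the work per query grows with the number of shards, which is one reason query behaviour at scale is worth measuring.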

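Experimental Platform Configuration

Hardware: Polaris supercomputer
- Per compute node: 2.8 GHz AMD EPYC Milan 7543P 32-core CPU
- Memory: 512 GB DDR4 RAM
- GPUs: 4 NVIDIA A100 GPUs
- Interconnect: HPE Slingshot 11, Dragonfly topology
Software: Qdrant vector database with HNSW indexing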
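
As a rough back-of-the-envelope aid, the sketch below works out what the largest configuration (32 workers across 8 compute nodes) implies per worker on this hardware. It assumes workers are spread evenly over nodes and that CPU cores and RAM are divided equally among co-located workers; the summary does not state how resources were actually partitioned, and the names `NodeSpec` and `per_worker_share` are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class NodeSpec:
    """Per-node resources taken from the hardware list above."""
    cpu_cores: int
    ram_gb: int


def per_worker_share(node, total_workers, compute_nodes):
    """Even-split estimate of resources per worker (an assumption, not a measurement)."""
    if total_workers % compute_nodes != 0:
        raise ValueError("workers do not divide evenly across compute nodes")
    workers_per_node = total_workers // compute_nodes
    return {
        "workers_per_node": workers_per_node,
        "cores_per_worker": node.cpu_cores / workers_per_node,
        "ram_gb_per_worker": node.ram_gb / workers_per_node,
    }


polaris_node = NodeSpec(cpu_cores=32, ram_gb=512)
share = per_worker_share(polaris_node, total_workers=32, compute_nodes=8)
```

Under these assumptions each node hosts 4 workers, with 8 cores and 128 GB of RAM available to each.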