
Project-Level C-to-Rust Translation via Synergistic Integration of Knowledge Graphs and Large Language Models

Yuan, Mao, Chen et al.

Translating C code into safe Rust is an effective way to ensure its memory safety. Compared to rule-based translation, which produces Rust code that remains largely unsafe, LLM-based methods can generate more idiomatic and safer Rust code because LLMs have been trained on vast amounts of human-written idiomatic code. Although promising, existing LLM-based methods still struggle with project-level C-to-Rust translation. They typically partition a C project into smaller units (e.g., functions) based on call graphs and translate them bottom-up to resolve program dependencies. However, this bottom-up, unit-by-unit paradigm often fails to translate pointers due to the lack of a global perspective on their usage. To address this problem, we propose a novel C-Rust Pointer Knowledge Graph (KG) that enriches a code-dependency graph with two types of pointer semantics: (i) pointer-usage information, which records global behaviors such as points-to flows and maps lower-level struct usage to higher-level units; and (ii) Rust-oriented annotations, which encode ownership, mutability, nullability, and lifetime. Synthesizing the KG with LLMs, we further propose PTRMAPPER, which implements a project-level C-to-Rust translation technique. In PTRMAPPER, the KG provides LLMs with comprehensive pointer semantics from a global perspective, thus guiding LLMs towards generating safe and idiomatic Rust code from a given C project. Our experiments show that PTRMAPPER reduces unsafe usages in translated Rust by 99.9% compared to both rule-based translation and traditional LLM-based rewriting, while achieving an average 29.3% higher functional correctness than fuzzing-enhanced LLM methods.



Basic Information

Paper ID: 2510.10956
Title: Project-Level C-to-Rust Translation via Synergistic Integration of Knowledge Graphs and Large Language Models
Authors: Zhiqiang Yuan, Wenjun Mao, Zhuo Chen, Xiyue Shang, Chong Wang, Yiling Lou, Xin Peng
Categories: cs.SE (Software Engineering), cs.AI (Artificial Intelligence)
Published: October 13, 2025
Paper link: https://arxiv.org/abs/2510.10956

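Summary

Translating C code into safe Rust is an effective way to ensure memory safety. Compared with rule-based translation, which produces largely unsafe code, methods based on large language models (LLMs) can generate more idiomatic and safer Rust code. However, existing LLM methods still struggle with project-level C-to-Rust translation, in particular because they lack a global perspective when handling pointers. To address this, the paper proposes a novel C-Rust Pointer Knowledge Graph (KG) that enriches a code-dependency graph with two kinds of pointer semantics: (1) pointer-usage information that records global behaviors; and (2) Rust-oriented annotations that encode ownership, mutability, nullability, and lifetime. Building on this KG, the paper proposes PTRMAPPER, a project-level C-to-Rust translation technique. Experiments show that PTRMAPPER reduces unsafe usages by 99.9% compared with rule-based and traditional LLM-based methods, and achieves 29.3% higher functional correctness on average than fuzzing-enhanced LLM methods.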

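Background and Motivation

1. Problem Definition

C is widely used in operating systems, embedded systems, and performance-critical applications, but its manual memory management and direct pointer manipulation frequently lead to security vulnerabilities such as buffer overflows and memory leaks. Rust, as a modern alternative, guarantees memory safety while retaining C-level performance. Automatically translating legacy C code to Rust has therefore become an urgent need.

2. Limitations of Existing Methods

Rule-based methods: rely on predefined rules; the resulting Rust code makes heavy use of unsafe blocks, raw pointers, and foreign function calls, so safety risks remain
Existing LLM methods: follow a bottom-up, unit-by-unit translation paradigm and lack a global view of pointer usage, which often leads to definition-use conflicts when translating pointers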
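
To make the notion of a definition-use conflict concrete, here is a minimal sketch. The C declarations in the comment and the `Buffer` type are hypothetical illustrations, not taken from the paper or its benchmarks. The point is only that a Rust definition chosen from one use site can be incompatible with another use site seen later.

```rust
/// Hypothetical C original (an assumption for illustration only):
///
///   struct Buffer { char *data; size_t len; };
///   size_t buffer_len(const struct Buffer *b);     // only reads
///   void   buffer_push(struct Buffer *b, char c);  // writes through `data`
///
/// A unit-by-unit translation that has only seen `buffer_len` could
/// plausibly define the field as a read-only borrow (`data: &'a [u8]`).
/// `buffer_push` would then fail to compile against that definition,
/// because it needs to grow and mutate the data. With both uses visible,
/// an owned, growable field satisfies every use.
#[derive(Debug, Default)]
pub struct Buffer {
    data: Vec<u8>,
}

impl Buffer {
    /// Read-only use: a shared borrow of the struct is enough.
    pub fn len(&self) -> usize {
        self.data.len()
    }

    /// Mutating use: requires an exclusive borrow and owned storage.
    pub fn push(&mut self, c: u8) {
        self.data.push(c);
    }
}
```

Calling `Buffer::default()`, then `push(b'x')`, then `len()` exercises both the mutating and the read-only use against the same definition.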

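3. Core Challenge

C and Rust differ fundamentally in how pointers are used:

C: pointers are highly flexible; a definition specifies only the type, and a pointer can be freely read, written, and modified when used
Rust: pointer usage is strictly constrained; a definition must make ownership, mutability, and lifetime explicit, and every use must obey the borrow checker's rules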
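
As a rough illustration of this gap, the sketch below shows how a single C parameter type, `int *`, can correspond to several different Rust types depending on how the pointer is actually used. The C signatures in the comments and all function names are hypothetical; they are not drawn from the paper.

```rust
/// C (hypothetical): int read_value(const int *p);
/// Non-null, read-only, caller keeps ownership: a shared borrow.
pub fn read_value(p: &i32) -> i32 {
    *p
}

/// C (hypothetical): void increment(int *p);
/// The callee writes through the pointer: an exclusive, mutable borrow.
pub fn increment(p: &mut i32) {
    *p += 1;
}

/// C (hypothetical): int value_or(const int *p, int fallback);
/// The pointer may be NULL: nullability becomes an explicit `Option`.
pub fn value_or(p: Option<&i32>, fallback: i32) -> i32 {
    p.copied().unwrap_or(fallback)
}

/// C (hypothetical): int consume(int *p);  /* callee frees p */
/// Ownership moves into the callee: an owning `Box`, dropped on return.
pub fn consume(p: Box<i32>) -> i32 {
    *p
}

/// C (hypothetical): const int *larger(const int *a, const int *b);
/// The returned pointer aliases one of the inputs, so the lifetime
/// relationship has to be spelled out in the signature.
pub fn larger<'a>(a: &'a i32, b: &'a i32) -> &'a i32 {
    if a >= b { a } else { b }
}
```

All five C declarations would use the same pointer type, while each Rust signature commits to a specific combination of ownership, mutability, nullability, and lifetime. Choosing the right combination requires knowing every use of the pointer, not just its declaration.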

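Core Contributions

Proposes a novel C-Rust Pointer Knowledge Graph (KG): it comprehensively models code-unit dependencies, pointer-usage information, and Rust-oriented annotations from project-level context
Designs PTRMAPPER, a project-level translation technique: it combines the C-Rust Pointer KG with LLMs to translate C projects into safe, idiomatic Rust code
Achieves significant improvements: the generated Rust code is more idiomatic, safer, and more often functionally correct
Provides comprehensive experimental validation: ablation studies show that both pointer-usage information and Rust-oriented annotations are important for translation performance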

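Method Details

Task Definition

Given a C project, automatically translate it into a functionally equivalent, memory-safe, and idiomatic Rust project. The input is a set of C source files, and the output is a Rust project that compiles and runs.

Model Architecture

PTRMAPPER consists of three main stages: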

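1. C-Rust Pointer Knowledge Graph Construction

Knowledge graph schema design:

Code-dependency graph: describes dependencies among code units (functions, structs, enums, etc.)
Pointer-usage information: captures project-level pointer behavior, including:
- Entities: Param (parameter), Value (return value), Member (struct member), Pointer (pointer), etc.
- Relations: derivesFrom (derivation), mayAlias/noAlias (aliasing), pointsTo (points-to), etc.
Rust-oriented annotations: carry semantics such as ownership, mutability, nullability, and lifetime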
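
The schema above can be pictured as a small labeled graph. The following is a minimal sketch under stated assumptions. Only the entity kinds (Param, Value, Member, Pointer), the relation names (derivesFrom, mayAlias, noAlias, pointsTo), and the four annotation dimensions come from the summary. The `PointerKg` type, its methods, the string-keyed storage, and the `suggest_type` rendering rule are all hypothetical and are not PTRMAPPER's actual data model or API.

```rust
use std::collections::HashMap;

/// Entity kinds named in the schema.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum EntityKind { Param, Value, Member, Pointer }

/// Relation kinds named in the schema.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Relation { DerivesFrom, MayAlias, NoAlias, PointsTo }

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Ownership { Owned, Borrowed }

/// Rust-oriented annotation attached to one pointer entity.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct RustAnnotation {
    pub ownership: Ownership,
    pub mutable: bool,
    pub nullable: bool,
    pub lifetime: Option<String>, // e.g. Some("a") for 'a
}

/// Hypothetical in-memory KG keyed by entity name.
#[derive(Debug, Default)]
pub struct PointerKg {
    entities: HashMap<String, EntityKind>,
    edges: Vec<(String, Relation, String)>,
    annotations: HashMap<String, RustAnnotation>,
}

impl PointerKg {
    pub fn add_entity(&mut self, name: &str, kind: EntityKind) {
        self.entities.insert(name.to_string(), kind);
    }

    pub fn kind(&self, name: &str) -> Option<EntityKind> {
        self.entities.get(name).copied()
    }

    pub fn relate(&mut self, from: &str, rel: Relation, to: &str) {
        self.edges.push((from.to_string(), rel, to.to_string()));
    }

    pub fn annotate(&mut self, name: &str, ann: RustAnnotation) {
        self.annotations.insert(name.to_string(), ann);
    }

    /// All entities reachable from `from` over one edge labeled `rel`.
    pub fn targets(&self, from: &str, rel: Relation) -> Vec<&str> {
        self.edges
            .iter()
            .filter(|(f, r, _)| f.as_str() == from && *r == rel)
            .map(|(_, _, t)| t.as_str())
            .collect()
    }

    /// Illustrative rule (an assumption): render an annotation as a Rust type.
    pub fn suggest_type(&self, name: &str, pointee: &str) -> Option<String> {
        let ann = self.annotations.get(name)?;
        let base = match ann.ownership {
            Ownership::Owned => format!("Box<{pointee}>"),
            Ownership::Borrowed => {
                let lt = ann
                    .lifetime
                    .as_deref()
                    .map(|l| format!("'{l} "))
                    .unwrap_or_default();
                let m = if ann.mutable { "mut " } else { "" };
                format!("&{lt}{m}{pointee}")
            }
        };
        Some(if ann.nullable { format!("Option<{base}>") } else { base })
    }
}
```

For instance, a parameter annotated as borrowed, mutable, nullable, with lifetime `a` and pointee `Node` would be rendered by this sketch as `Option<&'a mut Node>`. Such a string is the kind of hint a graph lookup could place into an LLM prompt for the unit being translated.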