Augmenting Smart Contract Decompiler Output through Fine-grained Dependency Analysis and LLM-facilitated Semantic Recovery

Liao, Nan, Gao et al.

A decompiler is a specialized reverse engineering tool widely used in program analysis tasks, particularly program comprehension and vulnerability detection. However, current Solidity smart contract decompilers face significant limitations in reconstructing the original source code. In particular, the bottleneck of SOTA decompilers lies in inaccurate method identification, incorrect variable type recovery, and missing contract attributes. These deficiencies hinder downstream tasks and the understanding of program logic. To address these challenges, we propose SmartHalo, a new framework that enhances decompiler output by combining static analysis (SA) and large language models (LLMs). SmartHalo leverages the complementary strengths of SA's accuracy in control and data flow analysis and the LLM's capability in semantic prediction. More specifically, SmartHalo constructs a new data structure, the Dependency Graph (DG), to extract semantic dependencies via static analysis. It then uses the DG to create prompts for LLM optimization. Finally, the correctness of the LLM outputs is validated through symbolic execution and formal verification. Evaluation on a dataset of 465 randomly selected smart contract methods shows that SmartHalo significantly improves the quality of the decompiled code compared to SOTA decompilers (e.g., Gigahorse). Notably, integrating GPT-4o with SmartHalo further enhances its performance, achieving precision rates of 87.39% for method boundaries, 90.39% for variable types, and 80.65% for contract attributes.

Basic Information

Paper ID: 2501.08670
Title: Augmenting Smart Contract Decompiler Output through Fine-grained Dependency Analysis and LLM-facilitated Semantic Recovery
Authors: Zeqin Liao, Yuhong Nan, Zixu Gao, Henglong Liang, Sicheng Hao, Peifan Ren, Zibin Zheng
Category: cs.SE (Software Engineering)
Published: January 2025 (arXiv preprint)
Paper link: https://arxiv.org/abs/2501.08670

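Abstract

Smart contract decompilers are reverse engineering tools widely used in program analysis, and they play an important role in program comprehension and vulnerability detection. However, current Solidity smart contract decompilers have significant limitations in reconstructing the original source code, chiefly in three areas: inaccurate function identification, incorrect variable type recovery, and missing contract attributes. To address these challenges, the paper proposes the SmartHalo framework, which enhances decompiler output by combining static analysis (SA) with large language models (LLMs). SmartHalo exploits SA's accuracy in control-flow and data-flow analysis together with the LLM's capability in semantic prediction. Specifically, the framework constructs a Dependency Graph (DG) data structure to extract semantic dependencies, then builds LLM optimization prompts from the DG, and finally validates the correctness of the LLM outputs through symbolic execution and formal verification.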

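Research Background and Motivation

Problem Definition

Smart contract decompilation faces three core problems:

Inaccurate function boundary identification: existing decompilers cannot determine function boundaries precisely; they often recover several functions as a single function or miss important functions
Incorrect variable type recovery: decompilers produce type errors that violate static domain rules, for example recovering the bytes32 return value of the keccak256 function as uint256
Missing contract attributes: state variables in a smart contract record key contract attributes (such as assets, identities, and routers), but these are entirely absent from the decompiled code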
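The type-recovery problem above can be illustrated with a minimal Python sketch. It assumes a small, hand-written table of Solidity built-ins and their return types and a simplified tuple representation of decompiled assignments; the table, the function name `find_type_violations`, and the variable names `v0` and `v1` are hypothetical and are not taken from SmartHalo.

```python
# Hypothetical lookup table: Solidity built-in -> return type.
# (keccak256 and sha256 return bytes32; ecrecover returns address.)
BUILTIN_RETURN_TYPES = {
    "keccak256": "bytes32",
    "sha256": "bytes32",
    "ecrecover": "address",
}


def find_type_violations(assignments):
    """Flag variables whose declared type disagrees with the built-in they are assigned from.

    `assignments` is a list of (declared_type, variable_name, builtin_name) tuples,
    a simplified stand-in for decompiler output.
    """
    violations = []
    for declared_type, var_name, builtin in assignments:
        expected = BUILTIN_RETURN_TYPES.get(builtin)
        if expected is not None and declared_type != expected:
            violations.append((var_name, declared_type, expected))
    return violations


# Illustrative decompiled declarations: v0 reproduces the keccak256 error.
decompiled = [
    ("uint256", "v0", "keccak256"),
    ("address", "v1", "ecrecover"),
]
violations = find_type_violations(decompiled)
print(violations)
```

A rule table like this only covers cases with an explicit static constraint; the summary's point is that targets without such constraints are where an LLM is brought in.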

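Importance Analysis

These deficiencies seriously hinder downstream tasks:

They reduce the accuracy of vulnerability detection, producing false positives and false negatives
They lower the efficiency of program comprehension
They limit advanced analysis tasks such as cross-contract call-flow analysis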

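Limitations of Existing Methods

SmartDagger: recovers the contract attributes of state variables only partially; it is based on a deep learning model, and its performance degrades on emerging contracts
Neural-FEBI: does not support boundary recovery for modifier functions or inherited functions
SigRec/VarLifter/DeepInfer: recover parameter types only partially, and only for known function signatures; they rely on predefined heuristic rules and have low coverage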

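Research Motivation

The work builds on two key insights:

Software naturalness: programmers tend to use similar code structures, contract attributes, variable types, and function boundaries in similar contexts
Synergy between SA and LLMs: SA is highly accurate when handling complex static constraints, while LLMs are flexible in predicting targets that lack static constraints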

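Core Contributions

Identifies and systematically analyzes the key limitations of current smart contract decompiler output
Proposes the SmartHalo framework, which combines static analysis and large language models to optimize decompiler output
Designs the Dependency Graph (DG) data structure, which extracts three types of semantic dependencies (state dependencies, control-flow dependencies, and type dependencies)
Establishes a strict correctness verification mechanism that uses symbolic execution and formal verification to handle LLM hallucination
Provides a comprehensive evaluation demonstrating SmartHalo's effectiveness in recovering function boundaries, variable types, and contract attributes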

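Method Details

Task Definition

Input: pseudocode generated by a decompiler
Output: optimized decompiled code with accurate function boundaries, variable types, and contract attributes
Constraints: preserve program behavior equivalence and follow Solidity's static typing rules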

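Model Architecture

SmartHalo adopts a three-stage architecture:

1. Dependency-based Semantic Extraction

Control-flow analysis: uses Tree-sitter to build a syntax tree, converts it into a three-address intermediate representation, and generates control-flow and data-flow graphs
Dependency identification:
- Type dependency: the association between a variable's type and other variables or expressions
- State dependency: read/write dependencies between state variables
- Control-flow dependency: dependencies along program execution paths
Dependency graph construction: DG = (Nc, Ec, Xe), where Nc is the node set (variables and expressions), Ec is the edge set (the three kinds of dependencies), and Xe is the labeling function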
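To make the tuple DG = (Nc, Ec, Xe) concrete, here is a minimal Python sketch of such a graph. It is an illustration under assumptions, not SmartHalo's implementation: the class name `DependencyGraph`, its methods, the kind strings, and the example node names (`v0`, `stor_0`) are all hypothetical, and the labeling function Xe is modeled as a plain dictionary from edges to sets of dependency kinds.

```python
from dataclasses import dataclass, field

# The three dependency kinds named in the summary (string labels are an assumption).
DEPENDENCY_KINDS = {"type", "state", "control"}


@dataclass
class DependencyGraph:
    nodes: set = field(default_factory=set)      # Nc: variables and expressions
    edges: set = field(default_factory=set)      # Ec: (source, destination) pairs
    labels: dict = field(default_factory=dict)   # Xe: edge -> set of dependency kinds

    def add_dependency(self, src, dst, kind):
        """Record that `src` depends on `dst` with the given dependency kind."""
        if kind not in DEPENDENCY_KINDS:
            raise ValueError(f"unknown dependency kind: {kind}")
        self.nodes.update((src, dst))
        self.edges.add((src, dst))
        self.labels.setdefault((src, dst), set()).add(kind)

    def dependencies_of(self, node, kind=None):
        """Return the sorted nodes that `node` depends on, optionally filtered by kind."""
        return sorted(
            dst
            for (src, dst) in self.edges
            if src == node and (kind is None or kind in self.labels[(src, dst)])
        )


dg = DependencyGraph()
dg.add_dependency("v0", "keccak256(v1)", "type")   # v0's type follows the hash expression
dg.add_dependency("v0", "stor_0", "state")         # v0 is read from a state variable
print(dg.dependencies_of("v0", "type"))
```

Storing a set of kinds per edge lets the same pair of nodes carry more than one dependency, which a single-label mapping would silently overwrite.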

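2. LLM-driven Semantic Enhancement

Code context generation:
- Variables: performs code slicing based on the DG to extract the code fragments relevant to the target variable
- Functions: searches for the call chain that contains the target function
Inference candidate generation:
- Variable type candidates: built-in types collected from the Solidity documentation
- Contract attribute candidates: Limit, Fee, Flag, Address, Asset, Router, Others
Chain-of-thought (CoT) prompting: converts the dependencies in the DG into descriptions of reasoning steps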
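The slicing and prompting steps above can be sketched roughly as follows. This is a hedged illustration only: the edge representation, the sentence templates, the prompt wording, and the names `reachable_edges`, `build_prompt`, `stor_2`, and `v5` are assumptions made for this example; only the attribute candidate list comes from the summary.

```python
from collections import deque

# Contract attribute candidates as listed in the summary.
ATTRIBUTE_CANDIDATES = ["Limit", "Fee", "Flag", "Address", "Asset", "Router", "Others"]

# Hypothetical sentence templates, one per dependency kind.
STEP_TEMPLATES = {
    "type": "`{src}` has a type dependency on `{dst}`.",
    "state": "`{src}` has a state (read/write) dependency on `{dst}`.",
    "control": "`{src}` has a control-flow dependency on `{dst}`.",
}


def reachable_edges(edges, target):
    """Slice the graph: collect edges reachable from `target` in breadth-first order."""
    seen = {target}
    queue = deque([target])
    result = []
    while queue:
        node = queue.popleft()
        for src, dst, kind in edges:
            if src == node:
                result.append((src, dst, kind))
                if dst not in seen:
                    seen.add(dst)
                    queue.append(dst)
    return result


def build_prompt(edges, target, candidates):
    """Turn the sliced dependencies into numbered reasoning steps plus a candidate list."""
    steps = [
        f"Step {i}: " + STEP_TEMPLATES[kind].format(src=src, dst=dst)
        for i, (src, dst, kind) in enumerate(reachable_edges(edges, target), start=1)
    ]
    lines = [
        f"Target: {target}",
        "Reasoning steps:",
        *steps,
        "Candidates: " + ", ".join(candidates),
        "Answer with exactly one candidate.",
    ]
    return "\n".join(lines)


# Illustrative edges as (source, destination, kind); the last one is unrelated to stor_2.
edges = [
    ("stor_2", "v5", "state"),
    ("v5", "msg.value", "type"),
    ("x", "y", "control"),
]
prompt = build_prompt(edges, "stor_2", ATTRIBUTE_CANDIDATES)
print(prompt)
```

Restricting the prompt to edges reachable from the target keeps unrelated code out of the context, which is the purpose of DG-based slicing described above.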