Instruction Set Migration at Warehouse Scale

Christopher, Crossan, Dobson et al.

Migrating codebases from one instruction set architecture (ISA) to another is a major engineering challenge. A recent example is the adoption of Arm (in addition to x86) across the major Cloud hyperscalers. Yet, this problem has received limited attention from the academic community. Most work has focused on static and dynamic binary translation, and the conventional wisdom has been that this is the primary challenge. In this paper, we show that this is no longer the case. Modern ISA migrations can often build on a robust open-source ecosystem, making it possible to recompile all relevant software from scratch. This introduces a new and multifaceted set of challenges, which are different from binary translation. By analyzing a large-scale migration from x86 to Arm at Google, spanning almost 40,000 code commits, we derive a taxonomy of tasks involved in ISA migration. We show how Google automated many of the steps involved, and demonstrate how AI can play a major role in automatically addressing these tasks. We identify tasks that remain challenging and highlight research challenges that warrant further attention.

Basic Information

Paper ID: 2510.14928
Title: Instruction Set Migration at Warehouse Scale
Authors: Eric Christopher, Kevin Crossan, Wolff Dobson, Chris Kennelly, Drew Lewis, Kun Lin, Martin Maas, Parthasarathy Ranganathan, Emma Rapati, Brian Yang (Google, USA)
Categories: cs.SE (Software Engineering), cs.LG (Machine Learning)
Published: October 16, 2025 (arXiv preprint)
Paper link: https://arxiv.org/abs/2510.14928

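Summary

By analyzing Google's large-scale ISA migration from x86 to Arm (spanning almost 40,000 code commits), this paper challenges the conventional view of instruction set architecture migration. The study shows that the main challenge in a modern ISA migration is no longer code translation but a multifaceted set of engineering tasks. The paper proposes a taxonomy of ISA migration tasks, shows how Google automated many of the migration steps, and demonstrates that AI can play an important role in handling these tasks automatically.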

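Research Background and Motivation

Problem Definition

Core problem: migrating a large codebase to a new instruction set architecture is a major engineering challenge, yet it has received limited attention from the academic community
Practical need: the major cloud providers (Amazon, Google, Microsoft) are all adopting Arm alongside x86 and need a systematic migration methodology
Limits of the traditional view: prior research has focused mainly on static and dynamic binary translation, treating it as the primary challenge of ISA migration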

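Research Motivation

Changed technical environment: modern ISAs are well supported in upstream compilers, runtime libraries, and the Linux kernel, which makes recompiling from source feasible
Missing practical experience: there has been no systematic analysis of the tasks a modern ISA migration actually involves
Automation opportunity: modern software engineering tools and AI techniques open up new possibilities for automating migration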

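Core Contributions

First systematic analysis: provides the first detailed breakdown and taxonomy of large-scale ISA migration tasks, based on 38,156 real commits
Overturning conventional wisdom: shows that the complexity of ISA migration lies not in code translation but mainly in rewriting build and configuration files
Automation framework: shows that many migration tasks are highly automatable, and develops CogniPort, an AI-driven automation tool
Practical guidance: identifies the tasks that remain challenging, giving direction for future research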

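Method Details

Task Definition

The core task studied in this paper is migrating a multi-billion-line codebase from x86 to a multi-architecture environment that supports both x86 and Arm.

Input: a large single-architecture (x86) codebase
Output: a codebase that supports multiple architectures (x86 + Arm)
Constraint: maintain parity in performance, security, and stability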

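Analysis Method Architecture

1. Data Collection and Labeling

Data source: 38,156 commits related to the Arm migration in Google's monorepo
Automatic classification: large-scale commit analysis using the Gemini 2.5 Flash LLM
Classification process:
1. Pass the commit messages and code diffs into the LLM's 1M-token context window
2. For each batch of 100 commits, have the model choose 20 categories
3. Consolidate the 400×20 categories into 50, then manually refine them into 16 final categories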
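
The batching and consolidation steps above can be sketched roughly as follows. This is only an illustration under stated assumptions, not the authors' pipeline: `ask_model` is a hypothetical callable standing in for the Gemini 2.5 Flash call, `fake_model` is a keyword stub so the example runs offline, and the frequency-based `consolidate` is a swapped-in stand-in for the consolidation to 50 categories, whose actual mechanism is not described here. The final manual refinement to 16 categories is not modeled.

```python
import math
from collections import Counter
from typing import Callable, Iterator, Sequence

BATCH_SIZE = 100           # commits per model call (from the text)
CATEGORIES_PER_BATCH = 20  # categories requested per batch (from the text)


def batched(items: Sequence[str], size: int) -> Iterator[Sequence[str]]:
    """Yield consecutive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def propose_categories(
    commits: Sequence[str],
    ask_model: Callable[[Sequence[str], int], list[str]],  # hypothetical LLM wrapper
    batch_size: int = BATCH_SIZE,
    per_batch: int = CATEGORIES_PER_BATCH,
) -> list[str]:
    """Make one model call per batch and collect every proposed label."""
    proposals: list[str] = []
    for batch in batched(commits, batch_size):
        labels = ask_model(batch, per_batch)
        proposals.extend(labels[:per_batch])  # never keep more than requested
    return proposals


def consolidate(proposals: list[str], keep: int) -> list[str]:
    """Assumed stand-in for consolidation: keep the most frequent labels."""
    counts = Counter(label.strip().lower() for label in proposals)
    return [label for label, _ in counts.most_common(keep)]


def fake_model(batch: Sequence[str], k: int) -> list[str]:
    """Offline stub: labels a commit by a keyword instead of calling an LLM."""
    labels = {"build config" if "BUILD" in c else "code change" for c in batch}
    return sorted(labels)[:k]


# 250 synthetic commit descriptions -> 3 batches (100, 100, 50).
commits = [
    f"commit {i}: edit BUILD file" if i % 3 else f"commit {i}: fix intrinsics"
    for i in range(250)
]
proposals = propose_categories(commits, fake_model)
top = consolidate(proposals, keep=50)

print(len(proposals), sorted(top))
print(math.ceil(38_156 / BATCH_SIZE))  # model calls needed for the full data set
```

With batches of 100, the 38,156 commits need 382 model calls, which is presumably what the "400×20" figure above rounds to. Keeping the model behind a plain callable means the batching logic can be exercised with a stub before any real API is wired in.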