NinA: Normalizing Flows in Action. Training VLA Models with Normalizing Flows
Tarasov, Nikulin, Zisman et al.
Recent advances in Vision-Language-Action (VLA) models have established a two-component architecture, where a pre-trained Vision-Language Model (VLM) encodes visual observations and task descriptions, and an action decoder maps these representations to continuous actions. Diffusion models have been widely adopted as action decoders due to their ability to model complex, multimodal action distributions. However, they require multiple iterative denoising steps at inference time or downstream techniques to speed up sampling, limiting their practicality in real-world settings where high-frequency control is crucial. In this work, we present NinA (Normalizing Flows in Action), a fast and expressive alternative to diffusion-based decoders for VLAs. NinA replaces the diffusion action decoder with a Normalizing Flow (NF) that enables one-shot sampling through an invertible transformation, significantly reducing inference time. We integrate NinA into the FLOWER VLA architecture and fine-tune on the LIBERO benchmark. Our experiments show that NinA matches the performance of its diffusion-based counterpart under the same training regime, while achieving substantially faster inference. These results suggest that NinA offers a promising path toward efficient, high-frequency VLA control without compromising performance.
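Recent advances in Vision-Language-Action (VLA) models have established a two-component architecture: a pre-trained Vision-Language Model (VLM) encodes visual observations and task descriptions, and an action decoder maps these representations to continuous actions. Diffusion models have been widely adopted as action decoders because they can model complex, multimodal action distributions. However, they require multiple iterative denoising steps at inference time, which limits their practicality in real-world scenarios that need high-frequency control. This paper proposes NinA (Normalizing Flows in Action), a fast and expressive alternative to diffusion decoders for VLAs. NinA replaces the diffusion action decoder with a Normalizing Flow (NF), enabling one-shot sampling through an invertible transformation and significantly reducing inference time. Experiments show that NinA matches the performance of its diffusion-based counterpart under the same training regime while achieving substantially faster inference.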
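To make the phrase "one-shot sampling through an invertible transformation" concrete, the following is a minimal sketch of a conditional normalizing flow built from affine coupling layers. It is not NinA's actual architecture, which the abstract does not specify: the class names `ConditionalCoupling` and `ToyFlowDecoder`, the 6-dimensional action, the 4-dimensional context vector standing in for a VLM embedding, and the random untrained weights are all assumptions made for illustration. The sketch shows the two properties the abstract relies on: sampling is a single pass through the layers, with no iterative denoising loop, and the same layers can be inverted to compute an exact log-likelihood of a given action.

```python
import math
import random


class ConditionalCoupling:
    """Affine coupling layer (illustrative): one half of the vector is scaled
    and shifted using parameters computed from the other half and a context."""

    def __init__(self, dim, ctx_dim, rng, flip=False):
        assert dim % 2 == 0, "this sketch assumes an even dimension"
        self.half = dim // 2
        self.flip = flip  # alternate which half gets transformed
        n_in = self.half + ctx_dim
        # Random toy weights; a real decoder would learn these.
        self.w_s = [[rng.gauss(0, 0.3) for _ in range(n_in)] for _ in range(self.half)]
        self.w_t = [[rng.gauss(0, 0.3) for _ in range(n_in)] for _ in range(self.half)]

    def _params(self, kept, ctx):
        h = kept + ctx
        # tanh keeps the log-scale bounded, which keeps exp() well behaved.
        s = [math.tanh(sum(w * x for w, x in zip(row, h))) for row in self.w_s]
        t = [sum(w * x for w, x in zip(row, h)) for row in self.w_t]
        return s, t

    def _split(self, v):
        a, b = v[:self.half], v[self.half:]
        return (b, a) if self.flip else (a, b)  # (kept, moved)

    def _join(self, kept, moved):
        return moved + kept if self.flip else kept + moved

    def forward(self, z, ctx):
        """Latent -> action direction. Returns (output, log|det J|)."""
        kept, moved = self._split(z)
        s, t = self._params(kept, ctx)
        out = [m * math.exp(si) + ti for m, si, ti in zip(moved, s, t)]
        return self._join(kept, out), sum(s)

    def inverse(self, a, ctx):
        """Action -> latent direction. Returns (output, log|det J^-1|)."""
        kept, moved = self._split(a)
        s, t = self._params(kept, ctx)
        out = [(m - ti) * math.exp(-si) for m, si, ti in zip(moved, s, t)]
        return self._join(kept, out), -sum(s)


class ToyFlowDecoder:
    """Stack of coupling layers acting as a hypothetical action decoder."""

    def __init__(self, action_dim, ctx_dim, n_layers=4, seed=0):
        rng = random.Random(seed)
        self.dim = action_dim
        self.layers = [
            ConditionalCoupling(action_dim, ctx_dim, rng, flip=(i % 2 == 1))
            for i in range(n_layers)
        ]

    def sample(self, ctx, rng):
        # One-shot sampling: draw Gaussian noise, apply each layer once.
        x = [rng.gauss(0, 1) for _ in range(self.dim)]
        for layer in self.layers:
            x, _ = layer.forward(x, ctx)
        return x

    def to_latent(self, action, ctx):
        x, log_det = list(action), 0.0
        for layer in reversed(self.layers):
            x, ld = layer.inverse(x, ctx)
            log_det += ld
        return x, log_det

    def log_prob(self, action, ctx):
        # Change of variables: log p(a|c) = log N(z; 0, I) + log|det dz/da|.
        z, log_det = self.to_latent(action, ctx)
        log_base = -0.5 * sum(v * v for v in z) - 0.5 * self.dim * math.log(2 * math.pi)
        return log_base + log_det


decoder = ToyFlowDecoder(action_dim=6, ctx_dim=4)
context = [0.1, -0.2, 0.3, 0.5]  # stand-in for a VLM embedding
action = decoder.sample(context, random.Random(1))
print("sampled action:", action)
print("log p(action | context):", decoder.log_prob(action, context))
```

In this sketch the cost of `sample` is one pass over `n_layers` coupling layers, whereas a diffusion decoder would call its denoising network once per denoising step. The exact `log_prob` is what allows a flow to be trained by maximum likelihood; whether NinA uses coupling layers, another flow family, or a different training objective is not stated in the abstract.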