NinA: Normalizing Flows in Action. Training VLA Models with Normalizing Flows
Tarasov, Nikulin, Zisman et al.
Recent advances in Vision-Language-Action (VLA) models have established a two-component architecture, where a pre-trained Vision-Language Model (VLM) encodes visual observations and task descriptions, and an action decoder maps these representations to continuous actions. Diffusion models have been widely adopted as action decoders due to their ability to model complex, multimodal action distributions. However, they require multiple iterative denoising steps at inference time or downstream techniques to speed up sampling, limiting their practicality in real-world settings where high-frequency control is crucial. In this work, we present NinA (Normalizing Flows in Action), a fast and expressive alternative to diffusion-based decoders for VLAs. NinA replaces the diffusion action decoder with a Normalizing Flow (NF) that enables one-shot sampling through an invertible transformation, significantly reducing inference time. We integrate NinA into the FLOWER VLA architecture and fine-tune on the LIBERO benchmark. Our experiments show that NinA matches the performance of its diffusion-based counterpart under the same training regime, while achieving substantially faster inference. These results suggest that NinA offers a promising path toward efficient, high-frequency VLA control without compromising performance.
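To make the one-shot sampling claim concrete, the following is a minimal sketch of a conditional normalizing flow built from affine coupling layers. It is not the NinA architecture: the class names `CouplingLayer` and `ActionFlow`, the random linear conditioner standing in for a trained network, the 7-dimensional action, and the 16-dimensional context vector (a stand-in for VLM features) are all assumptions made for illustration. It shows the two properties the abstract relies on: an action is produced by a single pass through the invertible map, with no iterative denoising, and the same map run in reverse gives an exact log-likelihood that could serve as a training objective.

```python
import math
import random


class CouplingLayer:
    """Affine coupling layer conditioned on a context vector (illustrative only)."""

    def __init__(self, action_dim, context_dim, rng, flip):
        self.split = action_dim // 2
        self.flip = flip  # alternate which half is transformed
        self.n_out = action_dim - self.split
        n_in = self.split + context_dim
        # Random linear conditioner standing in for a trained network (assumption).
        self.weights = [[rng.gauss(0.0, 0.1) for _ in range(n_in)]
                        for _ in range(2 * self.n_out)]

    def _scale_shift(self, kept, context):
        inputs = kept + context
        raw = [sum(w * v for w, v in zip(row, inputs)) for row in self.weights]
        log_scale = [math.tanh(r) for r in raw[:self.n_out]]  # bounded for stability
        shift = raw[self.n_out:]
        return log_scale, shift

    def forward(self, z, context):
        """Latent -> action direction, used for sampling."""
        z = z[::-1] if self.flip else z
        kept, moved = z[:self.split], z[self.split:]
        log_scale, shift = self._scale_shift(kept, context)
        out = kept + [m * math.exp(s) + t
                      for m, s, t in zip(moved, log_scale, shift)]
        return out[::-1] if self.flip else out

    def inverse(self, x, context):
        """Action -> latent direction; also returns log|det dz/dx|."""
        x = x[::-1] if self.flip else x
        kept, moved = x[:self.split], x[self.split:]
        log_scale, shift = self._scale_shift(kept, context)
        out = kept + [(m - t) * math.exp(-s)
                      for m, s, t in zip(moved, log_scale, shift)]
        out = out[::-1] if self.flip else out
        return out, -sum(log_scale)


class ActionFlow:
    """Stack of coupling layers mapping Gaussian noise to an action vector."""

    def __init__(self, action_dim, context_dim, n_layers=4, seed=0):
        rng = random.Random(seed)
        self.action_dim = action_dim
        self.layers = [CouplingLayer(action_dim, context_dim, rng, flip=(i % 2 == 1))
                       for i in range(n_layers)]

    def decode(self, z, context):
        # One pass through the layers: no iterative denoising loop.
        for layer in self.layers:
            z = layer.forward(z, context)
        return z

    def sample(self, context, rng):
        z = [rng.gauss(0.0, 1.0) for _ in range(self.action_dim)]
        return self.decode(z, context)

    def log_prob(self, action, context):
        # Change of variables: log p(x) = log N(z; 0, I) + log|det dz/dx|.
        z, log_det = action, 0.0
        for layer in reversed(self.layers):
            z, layer_log_det = layer.inverse(z, context)
            log_det += layer_log_det
        log_base = (-0.5 * sum(v * v for v in z)
                    - 0.5 * self.action_dim * math.log(2 * math.pi))
        return log_base + log_det


flow = ActionFlow(action_dim=7, context_dim=16)
rng = random.Random(1)
context = [rng.gauss(0.0, 1.0) for _ in range(16)]  # hypothetical VLM features
action = flow.sample(context, rng)
print(len(action), flow.log_prob(action, context))
```

In a trained decoder the conditioner would be a learned network and the negative of `log_prob` on demonstration actions would be minimized; the sketch only illustrates why sampling costs a single forward pass, whereas a diffusion decoder repeats its network evaluation once per denoising step.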