NinA: Normalizing Flows in Action. Training VLA Models with Normalizing Flows
Tarasov, Nikulin, Zisman et al.
Recent advances in Vision-Language-Action (VLA) models have established a two-component architecture, where a pre-trained Vision-Language Model (VLM) encodes visual observations and task descriptions, and an action decoder maps these representations to continuous actions. Diffusion models have been widely adopted as action decoders due to their ability to model complex, multimodal action distributions. However, they require multiple iterative denoising steps at inference time or downstream techniques to speed up sampling, limiting their practicality in real-world settings where high-frequency control is crucial. In this work, we present NinA (Normalizing Flows in Action), a fast and expressive alternative to diffusion-based decoders for VLAs. NinA replaces the diffusion action decoder with a Normalizing Flow (NF) that enables one-shot sampling through an invertible transformation, significantly reducing inference time. We integrate NinA into the FLOWER VLA architecture and fine-tune on the LIBERO benchmark. Our experiments show that NinA matches the performance of its diffusion-based counterpart under the same training regime, while achieving substantially faster inference. These results suggest that NinA offers a promising path toward efficient, high-frequency VLA control without compromising performance.
Recent advances in Vision-Language-Action (VLA) models have established a dual-component architecture: a pre-trained Vision-Language Model (VLM) encodes visual observations and task descriptions, while an action decoder maps these representations to continuous actions. Diffusion models have been widely adopted as action decoders due to their ability to model complex multimodal action distributions. However, they require multiple iterative denoising steps during inference, limiting their practicality in real-world scenarios requiring high-frequency control. This paper proposes NinA (Normalizing Flows in Action), a fast and expressive alternative to VLA diffusion decoders. NinA replaces the diffusion action decoder with normalizing flows (NF), enabling one-shot sampling through invertible transformations and significantly reducing inference time. Experiments demonstrate that NinA matches the performance of diffusion-based counterparts under identical training regimes while achieving substantially faster inference speeds.
Current VLA models widely adopt diffusion models as action decoders. While these decoders can model complex multimodal action distributions, they suffer from high inference latency.
Robot control imposes stringent real-time requirements, and the multi-step sampling mechanism of existing diffusion models becomes a deployment bottleneck. Normalizing flows as generative models offer the following advantages:
Single forward pass for sample generation
Exact likelihood estimation
Support for variational inference and uncertainty quantification
Demonstrated potential in imitation learning and reinforcement learning
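The first two advantages can be illustrated with a single affine coupling layer in the style of Real NVP. This is a minimal sketch, not the NinA architecture: the class name `AffineCoupling`, the layer sizes, the random (untrained) weights, and the random context vector standing in for a VLM embedding are all assumptions made for illustration.

```python
import numpy as np


class AffineCoupling:
    """One Real NVP-style affine coupling layer conditioned on a context vector.

    Hypothetical illustration only: weights are random, not trained.
    """

    def __init__(self, action_dim, context_dim, hidden=32, seed=0):
        rng = np.random.default_rng(seed)
        self.d = action_dim // 2                 # first part passes through unchanged
        in_dim = self.d + context_dim
        out_dim = 2 * (action_dim - self.d)      # log-scale and shift for the rest
        self.W1 = rng.normal(0.0, 0.1, (in_dim, hidden))
        self.b1 = np.zeros(hidden)
        self.W2 = rng.normal(0.0, 0.1, (hidden, out_dim))
        self.b2 = np.zeros(out_dim)

    def _scale_shift(self, x1, context):
        h = np.tanh(np.concatenate([x1, context], axis=-1) @ self.W1 + self.b1)
        log_s, t = np.split(h @ self.W2 + self.b2, 2, axis=-1)
        return np.tanh(log_s), t                 # bounded log-scale for stability

    def sample(self, z, context):
        """Latent z -> action in one pass, with no iterative refinement."""
        z1, z2 = z[..., :self.d], z[..., self.d:]
        log_s, t = self._scale_shift(z1, context)
        return np.concatenate([z1, z2 * np.exp(log_s) + t], axis=-1)

    def log_prob(self, a, context):
        """Exact log-density of an action via the change-of-variables formula."""
        a1, a2 = a[..., :self.d], a[..., self.d:]
        log_s, t = self._scale_shift(a1, context)
        z = np.concatenate([a1, (a2 - t) * np.exp(-log_s)], axis=-1)
        log_base = -0.5 * np.sum(z ** 2 + np.log(2 * np.pi), axis=-1)
        return log_base - np.sum(log_s, axis=-1)  # subtract log|det| of the sampling map


rng = np.random.default_rng(1)
layer = AffineCoupling(action_dim=7, context_dim=16)
context = rng.normal(size=(4, 16))   # stand-in for VLM embeddings (assumption)
z = rng.normal(size=(4, 7))          # one Gaussian draw per action
actions = layer.sample(z, context)   # one forward pass yields the actions
logp = layer.log_prob(actions, context)
```

Sampling needs one pass through the layer, whereas a diffusion decoder would call its network once per denoising step. A practical flow stacks several such layers, alternating which part of the action vector is left unchanged.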
Proposed NinA Framework: First application of normalizing flows to VLA action decoding, enabling efficient one-shot action generation
Dual Architecture Design: Developed two normalizing flow variants based on MLP and Transformer architectures, balancing efficiency and performance
Performance Validation: Demonstrated on the LIBERO benchmark that NinA performs comparably to diffusion models while running inference 7-10× faster
Comprehensive Analysis: Provided detailed ablation studies and hyperparameter analysis, offering guidance for normalizing flow applications in robot control
Given a visual observation o_t and a text instruction g, the VLA model must generate the corresponding robot action a_t. The objective is to maximize the log-likelihood of expert actions given the observation and instruction.
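For a normalizing-flow decoder, this objective can be written out explicitly. The following is a sketch of the standard change-of-variables form rather than an equation copied from the paper. The notation is assumed here: θ for the model parameters, D for the expert demonstration dataset, f_θ for the invertible map from actions to latents, and p_Z for the base density (typically a standard Gaussian).

```latex
\max_{\theta}\; \mathbb{E}_{(o_t,\, g,\, a_t) \sim \mathcal{D}}
  \big[ \log p_\theta(a_t \mid o_t, g) \big],
\qquad
\log p_\theta(a_t \mid o_t, g)
  = \log p_Z\big( f_\theta(a_t;\, o_t, g) \big)
  + \log \left| \det \frac{\partial f_\theta(a_t;\, o_t, g)}{\partial a_t} \right|
```

Both terms on the right can be computed exactly in one pass through the flow, so the likelihood can be optimized directly without a variational bound or a denoising surrogate loss.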
Black et al. π0: A vision-language-action flow model for general robot control
Reuss et al. FLOWER: Democratizing generalist robot policies with efficient vision-language-action flow policies
Dinh et al. Density estimation using Real NVP
Liu et al. LIBERO: Benchmarking knowledge transfer for lifelong robot learning
Ghugare & Eysenbach. Normalizing flows are capable models for RL
Summary: NinA proposes an innovative and practical solution that significantly improves VLA model inference efficiency through normalizing flows while maintaining competitive performance. Although theoretical analysis and complex task validation require further refinement, its potential applications in real-time robot control are substantial, providing valuable technical contributions to the field.