NinA: Normalizing Flows in Action. Training VLA Models with Normalizing Flows

Tarasov, Nikulin, Zisman et al.
Recent advances in Vision-Language-Action (VLA) models have established a two-component architecture, where a pre-trained Vision-Language Model (VLM) encodes visual observations and task descriptions, and an action decoder maps these representations to continuous actions. Diffusion models have been widely adopted as action decoders due to their ability to model complex, multimodal action distributions. However, they require multiple iterative denoising steps at inference time or downstream techniques to speed up sampling, limiting their practicality in real-world settings where high-frequency control is crucial. In this work, we present NinA (Normalizing Flows in Action), a fast and expressive alternative to diffusion-based decoders for VLAs. NinA replaces the diffusion action decoder with a Normalizing Flow (NF) that enables one-shot sampling through an invertible transformation, significantly reducing inference time. We integrate NinA into the FLOWER VLA architecture and fine-tune on the LIBERO benchmark. Our experiments show that NinA matches the performance of its diffusion-based counterpart under the same training regime, while achieving substantially faster inference. These results suggest that NinA offers a promising path toward efficient, high-frequency VLA control without compromising performance.

Basic Information

  • Paper ID: 2508.16845
  • Title: NinA: Normalizing Flows in Action. Training VLA Models with Normalizing Flows
  • Authors: Denis Tarasov, Alexander Nikulin, Ilya Zisman, Albina Klepach, Nikita Lyubaykin, Andrei Polubarov, Alexander Derevyagin, Vladislav Kurenkov
  • Categories: cs.CV cs.AI cs.LG
  • Conference: NeurIPS 2025 Workshop: Space in Vision, Language, and Embodied AI
  • Paper Link: https://arxiv.org/abs/2508.16845

Abstract

Recent advances in Vision-Language-Action (VLA) models have established a dual-component architecture: a pre-trained Vision-Language Model (VLM) encodes visual observations and task descriptions, while an action decoder maps these representations to continuous actions. Diffusion models have been widely adopted as action decoders due to their ability to model complex multimodal action distributions. However, they require multiple iterative denoising steps during inference, limiting their practicality in real-world scenarios requiring high-frequency control. This paper proposes NinA (Normalizing Flows in Action), a fast and expressive alternative to VLA diffusion decoders. NinA replaces the diffusion action decoder with normalizing flows (NF), enabling one-shot sampling through invertible transformations and significantly reducing inference time. Experiments demonstrate that NinA matches the performance of diffusion-based counterparts under identical training regimes while achieving substantially faster inference speeds.

Research Background and Motivation

Problem Definition

Many current VLA models adopt diffusion models as action decoders. While these decoders can model complex multimodal action distributions, they suffer from inference latency:

  1. Inference Efficiency Bottleneck: Diffusion models require an iterative denoising process with multiple forward passes
  2. Real-time Control Requirements: Fine-grained robot control demands high-frequency responses, with latency being a critical limiting factor
  3. Computational Resource Consumption: Multi-step sampling increases computational overhead

Research Motivation

Robot control imposes stringent real-time requirements, and the multi-step sampling mechanism of existing diffusion models becomes a deployment bottleneck. Normalizing flows as generative models offer the following advantages:

  • Single forward pass for sample generation
  • Exact likelihood estimation
  • Support for variational inference and uncertainty quantification
  • Demonstrated potential in imitation learning and reinforcement learning

Core Contributions

  1. Proposed NinA Framework: First application of normalizing flows to VLA action decoding, enabling efficient one-shot action generation
  2. Dual Architecture Design: Developed two normalizing flow variants based on MLP and Transformer architectures, balancing efficiency and performance
  3. Performance Validation: Demonstrated on LIBERO benchmarks that NinA achieves comparable performance to diffusion models while achieving 7-10× inference acceleration
  4. Comprehensive Analysis: Provided detailed ablation studies and hyperparameter analysis, offering guidance for normalizing flow applications in robot control

Methodology

Task Definition

Given visual observation $o_t$ and text instruction $g$, the VLA model must generate corresponding robot actions $a_t$. The objective is to maximize the log-likelihood of expert actions:

$$\mathcal{L}_{VLA}(\theta) = \mathbb{E}_{(o_t, g, a_t) \sim D} \left[ \log \pi_\theta(a_t \mid \text{VLM}(o_t, g)) \right]$$

Model Architecture

Overall Framework

NinA employs a modular design, maintaining the VLM encoder from FLOWER while replacing only the action decoder:

  1. VLM Encoder: $h_t = \text{VLM}(o_t, g)$ generates multimodal embeddings
  2. Normalizing Flow Decoder: $a_t \sim \pi_\theta(\cdot \mid h_t)$ generates action sequences

Normalizing Flow Design

The decoder is based on the RealNVP architecture and implements a sequence of invertible transformations, whose log-likelihood follows from the change-of-variables formula:

$$\log p_\theta(z_K) = \log p_0(z_0) - \sum_{k=1}^K \log \left| \det \frac{\partial f_k}{\partial z_{k-1}} \right|$$

where $z_0 \sim \mathcal{N}(0, I)$ is drawn from the base distribution, $z_k = f_k(z_{k-1})$, and $f_\theta = f_K \circ \cdots \circ f_1$ is the sequence of invertible transformations.
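The change-of-variables formula above can be checked numerically on a toy case. The following is a minimal sketch, assuming scalar affine layers $f_k(z) = a_k z + b_k$ with illustrative coefficients (the `LAYERS` values and function names are hypothetical, not taken from the paper). Because a composition of affine maps of a standard normal is itself Gaussian, the flow log-density can be compared against a closed form.

```python
import math

def standard_normal_logpdf(z):
    """Log-density of the base distribution N(0, 1)."""
    return -0.5 * (z * z + math.log(2.0 * math.pi))

# Illustrative (hypothetical) layers: f_k(z) = a * z + b with a != 0.
LAYERS = [(2.0, 0.5), (0.5, -1.0), (3.0, 0.25)]

def sample_forward(z0):
    """Push z0 through f_K o ... o f_1; return (z_K, sum of log|det|)."""
    z, logdet = z0, 0.0
    for a, b in LAYERS:
        z = a * z + b
        logdet += math.log(abs(a))  # Jacobian of a scalar affine map is a
    return z, logdet

def log_prob(x):
    """Invert the layers in reverse order, then apply the formula."""
    z, logdet = x, 0.0
    for a, b in reversed(LAYERS):
        z = (z - b) / a
        logdet += math.log(abs(a))
    return standard_normal_logpdf(z) - logdet

z0 = 0.3
x, fwd_logdet = sample_forward(z0)      # one pass generates a sample
flow_logp = log_prob(x)                 # exact likelihood of that sample

# Closed form: the composition is x = 3 * z0 - 2, so x ~ N(-2, 3^2).
closed_form = standard_normal_logpdf((x + 2.0) / 3.0) - math.log(3.0)
```

Sampling needs a single pass through the layers, which is the property the summary contrasts with multi-step diffusion sampling.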

Dual Architecture Variants

MLP Variant:

  • Action vectors split element-wise: $(x_1, x_2)$
  • Conditional network: $g_{\phi_k}(x_1, h_t)$ implements conditioning via concatenation
  • Affine transformation: $y_2 = \exp(s) \cdot x_2 + b$
  • Parameter count: 2M, fastest inference speed
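The MLP-variant bullets describe a conditional affine coupling layer. Below is a rough sketch under several assumptions: the conditioner is a single random linear map standing in for the real network, the split passes the first `d` entries through unchanged (the summary does not specify the exact split), and the 7-dimensional action and 4-dimensional embedding are illustrative sizes. All names are hypothetical.

```python
import math
import random

def make_conditioner(in_dim, out_dim, seed):
    """Toy stand-in for g_phi: a random linear map returning (s, b)."""
    rng = random.Random(seed)
    w = [[rng.uniform(-0.5, 0.5) for _ in range(in_dim)]
         for _ in range(2 * out_dim)]

    def g(x1, h):
        inp = list(x1) + list(h)  # conditioning via concatenation
        out = [sum(wi * v for wi, v in zip(row, inp)) for row in w]
        return out[:out_dim], out[out_dim:]  # log-scale s, shift b

    return g

def coupling_forward(x, h, g, d):
    """y1 = x1, y2 = exp(s) * x2 + b; returns (y, log|det J|)."""
    x1, x2 = x[:d], x[d:]
    s, b = g(x1, h)
    y2 = [math.exp(si) * xi + bi for si, xi, bi in zip(s, x2, b)]
    return x1 + y2, sum(s)  # triangular Jacobian: log-det is sum of s

def coupling_inverse(y, h, g, d):
    """Recover x from y; s and b depend only on the untouched half."""
    y1, y2 = y[:d], y[d:]
    s, b = g(y1, h)
    x2 = [(yi - bi) * math.exp(-si) for si, yi, bi in zip(s, y2, b)]
    return y1 + x2

x = [0.1, -0.2, 0.3, 0.0, 0.5, -0.4, 1.0]   # illustrative action vector
h = [0.2, -0.1, 0.4, 0.3]                   # illustrative VLM embedding
d = 3
g = make_conditioner(d + len(h), len(x) - d, seed=0)
y, logdet = coupling_forward(x, h, g, d)
x_rec = coupling_inverse(y, h, g, d)
```

Because `s` and `b` are computed only from the half that passes through unchanged, the layer inverts in closed form without inverting the conditioner itself.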

Transformer Variant:

  • Action sequences split sequentially
  • Conditional network: self-attention + cross-attention mechanisms
  • Enhanced expressiveness and scalability
  • Parameter count: 38M, superior performance

Technical Innovations

  1. Noise Injection Strategy: Adding Gaussian noise $\mathcal{N}(0, \sigma^2_{noise})$ to actions during training as a regularization technique
  2. PLU Layer Integration: Introducing trainable invertible linear layers to enhance expressiveness
  3. Conditioning Mechanism: MLP implements conditioning via concatenation; Transformer via cross-attention
  4. Stability Optimization: Applying tanh activation to scale parameters to prevent training instability
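Three of the items above lend themselves to a short illustration. This is a sketch under assumptions, not the authors' implementation: the `limit` bound on the log-scale, the 3×3 matrices, and all function names are hypothetical. It shows noise added to a training action, a tanh-bounded log-scale, and why a PLU-parameterized linear layer has a cheap log-determinant (the sum of log-magnitudes of the diagonal of U).

```python
import math
import random

def add_training_noise(action, sigma, rng):
    """Noise injection: perturb each action entry with N(0, sigma^2)."""
    return [a + rng.gauss(0.0, sigma) for a in action]

def bounded_log_scale(raw, limit=2.0):
    """Keep the log-scale in (-limit, limit); `limit` is an assumed value."""
    return limit * math.tanh(raw)

def matmul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def det3(m):
    """Direct 3x3 determinant, used only to verify the shortcut."""
    return (m[0][0] * (m[1][1] * m[2][2] - m[1][2] * m[2][1])
            - m[0][1] * (m[1][0] * m[2][2] - m[1][2] * m[2][0])
            + m[0][2] * (m[1][0] * m[2][1] - m[1][1] * m[2][0]))

# W = P @ L @ U: fixed permutation, unit lower-triangular, upper-triangular.
P = [[0.0, 1.0, 0.0], [1.0, 0.0, 0.0], [0.0, 0.0, 1.0]]
L = [[1.0, 0.0, 0.0], [0.3, 1.0, 0.0], [-0.2, 0.5, 1.0]]
U = [[1.5, 0.4, -0.1], [0.0, -0.8, 0.2], [0.0, 0.0, 2.0]]
W = matmul(P, matmul(L, U))

# |det P| = 1 and det L = 1, so only the diagonal of U contributes.
plu_logdet = sum(math.log(abs(U[i][i])) for i in range(3))
direct_logdet = math.log(abs(det3(W)))

noisy = add_training_noise([0.1, -0.2, 0.3], sigma=0.03, rng=random.Random(0))
```

The layer stays invertible as long as every diagonal entry of U is nonzero.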

Experimental Setup

Datasets

Evaluated on the LIBERO benchmark, which comprises five task suites:

  • LIBERO Spatial: Spatial reasoning tasks
  • LIBERO Object: Object manipulation tasks
  • LIBERO Goal: Goal-oriented tasks
  • LIBERO 10: 10-task combination
  • LIBERO 90: 90-task combination

Evaluation Metrics

Task success rate is the primary evaluation metric, reported for each suite together with the average across suites.

Comparison Methods

  • FLOWER (330M): Original diffusion policy model
  • FLOWER (31M): Reduced diffusion model with matched parameter count
  • Ablation Variants: Removing PLU layers, noise injection, robot pre-training, etc.

Implementation Details

  • Hardware: NVIDIA H100 GPU for training, RTX 3060 for inference testing
  • Training: 100 epochs, batch size 80
  • VLM: Florence-2 Large
  • Hyperparameters optimized on LIBERO-10 and applied to all tasks

Experimental Results

Main Results

Model                  | LIBERO Spatial | LIBERO Object | LIBERO Goal | LIBERO 10 | LIBERO 90 | Average
Diffusion (330M)       | 0.982          | 0.976         | 0.942       | 0.906     | 0.954     | 0.952
Diffusion (31M)        | 0.890          | 0.984         | 0.952       | 0.864     | 0.894     | 0.916
NinA Transformer (38M) | 0.970          | 0.978         | 0.938       | 0.920     | 0.887     | 0.938
NinA MLP (2M)          | 0.878          | 0.982         | 0.902       | 0.928     | 0.856     | 0.909
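As a sanity check on the table, the Average column can be recomputed from the five per-suite success rates. This is a small sketch using only the numbers reported above; the dictionary and variable names are illustrative. The recomputed means agree with the reported averages to within rounding of the last digit.

```python
# Per-suite success rates in the order:
# Spatial, Object, Goal, LIBERO 10, LIBERO 90 (values from the table).
SCORES = {
    "Diffusion (330M)":       [0.982, 0.976, 0.942, 0.906, 0.954],
    "Diffusion (31M)":        [0.890, 0.984, 0.952, 0.864, 0.894],
    "NinA Transformer (38M)": [0.970, 0.978, 0.938, 0.920, 0.887],
    "NinA MLP (2M)":          [0.878, 0.982, 0.902, 0.928, 0.856],
}

# Averages as reported in the last column of the table.
REPORTED = {
    "Diffusion (330M)": 0.952,
    "Diffusion (31M)": 0.916,
    "NinA Transformer (38M)": 0.938,
    "NinA MLP (2M)": 0.909,
}

averages = {name: sum(vals) / len(vals) for name, vals in SCORES.items()}

# Gap between the large diffusion baseline and NinA Transformer,
# in absolute success-rate points (computed from reported averages).
gap_points = 100.0 * (REPORTED["Diffusion (330M)"]
                      - REPORTED["NinA Transformer (38M)"])

for name, avg in averages.items():
    print(f"{name}: recomputed {avg:.4f}, reported {REPORTED[name]:.3f}")
```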

Inference Efficiency Comparison

Model                  | Parameter Count | H100 Inference Time (s) | RTX 3060 Inference Time (s)
Diffusion (330M)       | 330M            | 0.110                   | 0.163
Diffusion (31M)        | 31M             | 0.120                   | 0.181
NinA Transformer (38M) | 38M             | 0.021                   | 0.023
NinA MLP (2M)          | 2M              | 0.015                   | 0.019
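The speedup implied by these timings depends on which diffusion baseline and which GPU are used as the reference. The sketch below computes every ratio from the table values so the spread is visible; the variable names are illustrative and nothing beyond the table is assumed.

```python
# (H100 seconds, RTX 3060 seconds) per model, taken from the table.
TIMES = {
    "Diffusion (330M)":       (0.110, 0.163),
    "Diffusion (31M)":        (0.120, 0.181),
    "NinA Transformer (38M)": (0.021, 0.023),
    "NinA MLP (2M)":          (0.015, 0.019),
}
GPUS = ("H100", "RTX 3060")

speedups = {}
for base in ("Diffusion (330M)", "Diffusion (31M)"):
    for nina in ("NinA Transformer (38M)", "NinA MLP (2M)"):
        for i, gpu in enumerate(GPUS):
            # Ratio > 1 means the NinA decoder is faster than the baseline.
            speedups[(base, nina, gpu)] = TIMES[base][i] / TIMES[nina][i]

slowest_gain = min(speedups.values())
largest_gain = max(speedups.values())

# Parameter ratio between the large diffusion decoder and NinA Transformer.
param_ratio = 330 / 38

for key, ratio in sorted(speedups.items()):
    print(key, f"{ratio:.2f}x")
```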

Ablation Studies

Noise Injection Impact:

  • NinA Transformer: 0.938 → 0.896 (without noise)
  • NinA MLP: 0.909 → 0.880 (without noise)

PLU Layer Impact:

  • Slight improvement for Transformer (0.934 without PLU vs 0.938 with PLU)
  • Mixed effects for MLP

Hyperparameter Analysis:

  • Optimal flow depth: 18 for Transformer, 28 for MLP
  • Optimal hidden dimension: 256 for Transformer, 64 for MLP
  • Optimal noise standard deviation: 0.03 for both
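For reference, the reported best settings can be gathered into one configuration object. This is only an organizational sketch: the class and field names are hypothetical, and the values are the ones listed above and under Implementation Details (100 epochs, batch size 80).

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class NinAConfig:
    """Hypothetical container for the hyperparameters reported in the text."""
    variant: str            # "transformer" or "mlp"
    flow_depth: int         # number of stacked flow layers
    hidden_dim: int         # width of the conditional network
    noise_std: float = 0.03 # std of Gaussian noise added to actions
    epochs: int = 100
    batch_size: int = 80

TRANSFORMER_CONFIG = NinAConfig("transformer", flow_depth=18, hidden_dim=256)
MLP_CONFIG = NinAConfig("mlp", flow_depth=28, hidden_dim=64)
```

The summary states that these values were tuned on LIBERO-10 and then reused for all suites, so a single frozen object per variant mirrors that setup.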

Key Findings

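  1. Significant Efficiency Gains: NinA is reported to achieve 7-10× inference acceleration; the ratios implied by the timing table above range from about 5× to 9.5× depending on the baseline and GPU. The Transformer variant uses 8.7× fewer parameters than Diffusion (330M) (38M vs 330M)
  2. Stable Performance: Only a 1.4 percentage-point drop in average success rate for NinA Transformer relative to Diffusion (330M) (0.938 vs 0.952)
  3. Clear Architecture Trade-offs: The MLP variant is faster but slightly less accurate; the Transformer variant balances performance and efficiency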
  4. Noise Injection Critical: Serves as an important regularization technique with significant performance impact

Related Work

VLA Model Development

  • Early Work: RT-1, RT-2 established foundational Vision-Language-Action frameworks
  • Architectural Evolution: π0, π0.5, FLOWER established the dual-component VLM + action expert architecture
  • Diffusion Applications: Current mainstream adopts diffusion models as action decoders

Normalizing Flow Research

  • Theoretical Foundation: NICE, RealNVP established theoretical frameworks for invertible transformations
  • Control Applications: Recent work explores normalizing flows in imitation learning and reinforcement learning
  • Advantageous Properties: Exact likelihood estimation, single-step sampling, variational inference support

Conclusions and Discussion

Main Conclusions

  1. Feasibility Validation: Normalizing flows can serve as an effective alternative to diffusion models
  2. Efficiency Improvement: Significantly reduces inference time and parameter requirements
  3. Performance Maintenance: Maintains competitive performance while substantially improving efficiency
  4. Practical Value: Provides a new technical pathway for real-time robot control

Limitations

  1. Limited Evaluation Scope: Validation only on LIBERO benchmarks, lacking real robot experiments
  2. Missing Pre-training: No complete VLA pre-training, only action decoder fine-tuning
  3. Task Complexity: LIBERO tasks are relatively simple; performance on complex manipulation unknown
  4. Insufficient Theoretical Analysis: Lacks theoretical explanation of normalizing flow advantages in action modeling

Future Directions

  1. Large-scale Pre-training: Explore normalizing flow performance in complete VLA pre-training
  2. Real Deployment Validation: Verify real-time control effectiveness on actual robot systems
  3. Theoretical Investigation: Analyze theoretical advantages of normalizing flows over diffusion models
  4. Application Extension: Explore applications in reinforcement learning and uncertainty estimation

In-depth Evaluation

Strengths

  1. Strong Novelty: First introduction of normalizing flows to VLA models, with a novel and practical approach
  2. Comprehensive Experiments: Provides thorough comparative experiments and ablation studies
  3. High Engineering Value: Significant efficiency improvements have important implications for practical deployment
  4. Method Generality: Can be easily integrated into existing VLA architectures

Weaknesses

  1. Limited Theoretical Depth: Lacks theoretical analysis of method effectiveness
  2. Evaluation Limitations: Testing only in simulation environments, lacking real robot validation
  3. Insufficient Complex Task Verification: LIBERO tasks are relatively simple; complex manipulation capabilities unknown
  4. Long-term Dependency Modeling: Normalizing flow capabilities in long-sequence action modeling require further verification

Impact

  1. Technical Contribution: Provides a new efficient solution for VLA models
  2. Practical Value: Significant inference efficiency improvements have important engineering value
  3. Research Inspiration: Opens new application directions for normalizing flows in robot control
  4. Reproducibility: Open-source code facilitates reproduction and extension

Applicable Scenarios

  1. Real-time Control: Robot control tasks requiring high-frequency responses
  2. Resource-constrained Environments: Edge deployment scenarios with limited computational resources
  3. Uncertainty Quantification: Applications requiring action probability estimation
  4. Online Learning: Online adaptation scenarios requiring fast inference

References

  1. Black et al. π0: A vision-language-action flow model for general robot control
  2. Reuss et al. FLOWER: Democratizing generalist robot policies with efficient vision-language-action flow policies
  3. Dinh et al. Density estimation using Real NVP
  4. Liu et al. LIBERO: Benchmarking knowledge transfer for lifelong robot learning
  5. Ghugare & Eysenbach. Normalizing flows are capable models for RL

Summary: NinA proposes an innovative and practical solution that significantly improves VLA model inference efficiency through normalizing flows while maintaining competitive performance. Although theoretical analysis and complex task validation require further refinement, its potential applications in real-time robot control are substantial, providing valuable technical contributions to the field.