VLA-TMEE: Reshaping Action Error Distributions for Reliable Vision-Language-Action Models

1Institute of Artificial Intelligence and Robotics, Xi'an Jiaotong University
2Beijing Academy of Artificial Intelligence
3Institute of Automation, University of Chinese Academy of Sciences
4School of Artificial Intelligence, University of Chinese Academy of Sciences
5Peking University

*Indicates Equal Contribution

Abstract

In robotic manipulation, vision-language-action (VLA) models have emerged as a promising paradigm for learning generalizable and scalable robot policies. Most existing VLA frameworks rely on standard supervised objectives, typically cross-entropy for discrete actions and mean squared error (MSE) for continuous action regression, which impose strong pointwise constraints on individual predictions. In this work, we focus on continuous-action VLA models and move beyond conventional MSE-based regression by reshaping action error distributions during training. Drawing on information-theoretic principles, we introduce Minimum Error Entropy (MEE) into modern VLA architectures and propose a trajectory-level MEE objective, together with two weighted variants, each combined with MSE for continuous-action VLA training. We evaluate our approach across standard, few-shot, and noisy settings on multiple representative VLA architectures, using simulation benchmarks such as LIBERO and SimplerEnv as well as real-world robotic manipulation tasks. Experimental results demonstrate consistent improvements in success rates and robustness across these settings. Under imbalanced data regimes, the gains persist within a well-characterized operating range. The proposed objectives incur negligible additional training cost and leave inference efficiency unchanged. We further provide theoretical analyses that explain why MEE-based supervision is effective and characterize its practical range.

Introduction

What is Minimum Error Entropy?

    In a standard regression setting, an input $x$ is mapped to an output by a parametric model $f_\theta$. Let $y \in \mathbb{R}^d$ denote the ground-truth target, $\hat y = f_\theta(x)$ the prediction, and $e = y - \hat y$ the prediction error. The Minimum Error Entropy (MEE) principle learns model parameters by minimizing the entropy of the error distribution. When instantiated with Shannon entropy~\cite{shannon1948mathematical}, the objective is
\begin{equation}
\min_{\theta} \; H(e) = - \int p(e) \log p(e)\, de ,
\end{equation}
where $p(e)$ denotes the probability density of the error variable.
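In practice $p(e)$ is unknown, so the entropy is estimated directly from error samples with a Parzen (kernel density) window; the quadratic Rényi instantiation used later in this section admits a particularly simple plug-in estimator. Below is a minimal PyTorch sketch of such an estimator; the function name, the Gaussian kernel choice, and the bandwidth default are illustrative assumptions, not the paper's implementation.

```python
import torch

def error_entropy(errors: torch.Tensor, sigma: float = 1.0) -> torch.Tensor:
    """Plug-in estimate of the quadratic Renyi entropy of the error distribution.

    errors: (N, d) tensor of per-sample prediction errors e_i = y_i - y_hat_i.
    Minimizing this estimate pulls the error samples toward each other,
    i.e. it lowers the entropy of the empirical error distribution.
    """
    sq_dists = torch.cdist(errors, errors).pow(2)          # ||e_i - e_j||^2
    info_potential = torch.exp(-sq_dists / (2 * sigma**2)).mean()
    return -torch.log(info_potential)
```

Note that entropy is invariant to a constant shift of the errors, which is one reason MEE is typically paired with a pointwise loss such as MSE, as done in this work.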

How to adapt MEE to VLA?

    In VLA models, let $\hat{\mathbf{a}}_{b,k}^{\,t} \in \mathbb{R}^D$ and $\mathbf{a}_{b,k}^{\,t}$ denote the predicted and ground-truth actions at the $k$-th step of an action chunk generated at time $t$ for trajectory $b$, with chunk size $K$. We define the action prediction error as $\mathbf{e}_{b,k}^{\,t} = \hat{\mathbf{a}}_{b,k}^{\,t} - \mathbf{a}_{b,k}^{\,t}$. Rather than treating errors independently, we aggregate errors across batch, time, and chunk dimensions and regard them as samples from a shared error distribution. Formally,
\begin{equation}
\mathcal{E} = \left\{ \mathbf{e}_{b,k}^{\,t} \;\middle|\; b=1,\dots,B,\; t=1,\dots,T,\; k=0,\dots,K-1 \right\}.
\end{equation}
    We then apply the quadratic Rényi entropy to this aggregated error distribution. Specifically, all action error vectors $\mathbf{e}_{b,k}^{\,t}$ across the batch, temporal, and chunk dimensions are flattened into a set of $N = B \times T \times K$ samples, denoted $\{\mathbf{e}_i\}_{i=1}^{N}$. This yields the trajectory-level MEE (T-MEE) empirical objective
\begin{equation}\label{eq: t-mee}
\mathcal{L}_{\mathrm{T\text{-}MEE}} = - \log \left( \frac{1}{N^2} \sum_{i=1}^{N} \sum_{j=1}^{N} \exp\left( - \frac{\| \mathbf{e}_i - \mathbf{e}_j \|^2}{2\sigma^2} \right) \right),
\end{equation}
where $\sigma$ denotes the kernel bandwidth.
    Building on T-MEE, we introduce a unified weighted formulation that accounts for the varying reliability of the error samples. Each error sample is assigned a non-negative importance weight based on its magnitude,
\[
w_i = \frac{ \exp\!\left(-\|\mathbf{e}_i\|^2 / 2\sigma_w^2\right) }{ \sum_{k=1}^{N} \exp\!\left(-\|\mathbf{e}_k\|^2 / 2\sigma_w^2\right) },
\]
which downweights unreliable, high-magnitude errors. Using these weights, we define the weighted trajectory-level MEE objective
\[
\mathcal{L}_{\mathrm{W\text{-}TMEE}} = -\log \sum_{i=1}^{N}\sum_{j=1}^{N} \omega_{ij} \exp\!\left( -\frac{\|\mathbf{e}_i-\mathbf{e}_j\|^2}{2\sigma^2} \right),
\]
where the weighting scheme $\omega_{ij}$ specifies how individual error samples contribute to the entropy estimate. Setting $\omega_{ij}=\frac{1}{N^2}w_i$ yields an asymmetric, chunk-weighted variant (Cw-TMEE) that emphasizes reliable action chunks while aggregating errors across trajectories, whereas setting $\omega_{ij}=w_i w_j$ gives a symmetric, element-weighted variant (Ew-TMEE) that emphasizes pairwise interactions between reliable error elements. Together, these two variants provide flexible mechanisms for incorporating error reliability into trajectory-level entropy minimization. Please refer to the paper for insights into how and why T-MEE works.
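The three objectives above differ only in the weights $\omega_{ij}$ applied to the pairwise Gaussian kernel matrix, so they can share one implementation. The following PyTorch sketch follows the equations directly; the function name, bandwidth defaults, variant keyword, tensor shapes, and the 0.1 trade-off coefficient in the usage example are assumptions for illustration rather than the authors' released code.

```python
import torch

def tmee_loss(errors: torch.Tensor, sigma: float = 1.0,
              sigma_w: float = 1.0, variant: str = "tmee") -> torch.Tensor:
    """Trajectory-level MEE objectives from the equations above.

    errors:  (N, D) action errors flattened over batch, time, and chunk
             dimensions, with N = B * T * K.
    variant: "tmee" (uniform omega_ij = 1/N^2), "cw" (omega_ij = w_i / N^2),
             or "ew" (omega_ij = w_i * w_j).
    """
    n = errors.shape[0]
    sq_dists = torch.cdist(errors, errors).pow(2)       # ||e_i - e_j||^2
    kernel = torch.exp(-sq_dists / (2 * sigma**2))      # Gaussian kernel matrix

    if variant == "tmee":
        potential = kernel.mean()                       # (1/N^2) * sum_ij K_ij
    else:
        # Reliability weights w_i: low-magnitude errors receive larger weight.
        w = torch.softmax(-errors.pow(2).sum(dim=1) / (2 * sigma_w**2), dim=0)
        if variant == "cw":                             # chunk-weighted, asymmetric
            potential = (w.unsqueeze(1) * kernel).sum() / n**2
        elif variant == "ew":                           # element-weighted, symmetric
            potential = (w.unsqueeze(1) * w.unsqueeze(0) * kernel).sum()
        else:
            raise ValueError(f"unknown variant: {variant}")
    return -torch.log(potential)


# Usage sketch: combine with MSE as described in the abstract.
pred = torch.randn(8, 16, 7, requires_grad=True)   # (B*T, K, D) predicted chunks
target = torch.randn(8, 16, 7)                     # ground-truth action chunks
err = (pred - target).reshape(-1, 7)               # flatten to (N, D)
loss = torch.nn.functional.mse_loss(pred, target) + 0.1 * tmee_loss(err, variant="ew")
loss.backward()
```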

Contributions

  • We reformulate action prediction errors as structured distributions and introduce trajectory-level MEE objectives for VLA models: the T-MEE objective and its two weighted variants (Cw-TMEE and Ew-TMEE), which explicitly model higher-order error interactions along trajectories.
  • We provide a unified investigation across theory, simulation, and real-world robotic experiments, in which analytical results, controlled simulations, and physical evaluations play mutually reinforcing roles, yielding a coherent picture of the effectiveness and limitations of the proposed approach.

Experiments

Evaluated Models

Architectural taxonomy of continuous-action VLA models evaluated in this work.

Nearly Balanced Setting

1. T-MEE delivers robust gains on LIBERO across all evaluated model scales and VLA frameworks.
2. Experimental results on LIBERO demonstrate that all T-MEE variants consistently surpass regression baselines across various VLA architectures, with the standard objective providing the most stable and significant performance gains.
3. Real-world evaluations on GR00T N1.5 demonstrate that T-MEE substantially enhances execution stability and success rates, showing that its distribution-level regularization transfers from simulation to physical robotic systems.

GR00T N1.5

GR00T N1.5 + T-MEE

Few-shot and Noisy Settings

1. On the LIBERO benchmarks, T-MEE significantly enhances data efficiency through distribution-level supervision, consistently outperforming the GR00T baseline across all task suites and data regimes.
Few-shot evaluation on LIBERO under a 0.2 training ratio.
2. T-MEE effectively mitigates the impact of non-Gaussian observation noise and action outliers, significantly enhancing both fine-tuned and zero-shot robustness across various VLA architectures.

Imbalanced Setting

1. Experimental results on LIBERO demonstrate that T-MEE consistently improves performance across various VLA architectures and task types under moderate data imbalance, though its effectiveness diminishes under extreme imbalance ratios where minority task supervision becomes severely insufficient.
2. Evaluation on SimplerEnv reveals that T-MEE consistently improves performance on dominant tasks even under highly long-tailed data distributions, demonstrating that its benefits persist even when minority tasks are under-represented or not explicitly evaluated.

Analysis Experiments

1. While T-MEE effectively reshapes action errors to cluster near zero, smaller models still exhibit instantaneous outliers during early trajectory phases due to capacity limits, yet they successfully maintain correct directional trends for task completion.
2. Since T-MEE is introduced only after 10k training steps, the two curves largely overlap in the early stage. Once T-MEE is activated, the error entropy decreases more rapidly and converges to a lower level than under the baseline.
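A small sketch of the activation schedule described above, assuming the `tmee_loss` helper from the earlier sketch is in scope; the warm-up threshold matches the 10k steps mentioned here, while the weighting coefficient is an illustrative assumption.

```python
import torch.nn.functional as F

WARMUP_STEPS = 10_000   # T-MEE is switched on after this many steps
LAMBDA_TMEE = 0.1       # assumed trade-off coefficient

def training_loss(pred, target, step):
    """MSE-only warm-up, then MSE + T-MEE (see tmee_loss sketch above)."""
    mse = F.mse_loss(pred, target)
    if step < WARMUP_STEPS:
        return mse
    err = (pred - target).reshape(-1, pred.shape[-1])   # flatten to (N, D)
    return mse + LAMBDA_TMEE * tmee_loss(err)
```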

BibTeX

@article{bai2026reshaping,
  title={Reshaping Action Error Distributions for Reliable Vision-Language-Action Models},
  author={Bai, Shuanghao and Wang, DaKai and Chi, Cheng and Zhou, Wanqi and Lyu, Jing and Zhao, Xiaoguang and Wang, Pengwei and Wang, Zhongyuan and Xing, Lei and Zhang, Shanghang and Chen, Badong},
  journal={arXiv preprint arXiv:},
  year={2026}
}