VLA-TMEE: Reshaping Action Error Distributions for Reliable Vision-Language-Action Models

1Institute of Artificial Intelligence and Robotics, Xi'an Jiaotong University
2Beijing Academy of Artificial Intelligence
3Institute of Automation, University of Chinese Academy of Sciences
4School of Artificial Intelligence, University of Chinese Academy of Sciences
5Peking University

*Indicates Equal Contribution

Abstract

In robotic manipulation, vision-language-action (VLA) models have emerged as a promising paradigm for learning generalizable and scalable robot policies. Most existing VLA frameworks rely on standard supervised objectives, typically cross-entropy for discrete actions and mean squared error (MSE) for continuous action regression, which impose strong pointwise constraints on individual predictions. In this work, we focus on continuous-action VLA models and move beyond conventional MSE-based regression by reshaping action error distributions during training. Drawing on information-theoretic principles, we introduce Minimum Error Entropy (MEE) into modern VLA architectures and propose a trajectory-level MEE objective, together with two weighted variants, each combined with MSE for continuous-action VLA training. We evaluate our approach across standard, few-shot, and noisy settings on multiple representative VLA architectures, using simulation benchmarks such as LIBERO and SimplerEnv as well as real-world robotic manipulation tasks. Experimental results demonstrate consistent improvements in success rates and robustness across these settings. Under imbalanced data regimes, the gains persist within a well-characterized operating range. The proposed objectives incur negligible additional training cost and leave inference efficiency unchanged. We further provide theoretical analyses that explain why MEE-based supervision is effective and characterize its practical range.

Introduction

What is Minimum Error Entropy?

    In a standard regression setting, an input $x$ is mapped to an output by a parametric model $f_\theta$. Let $y \in \mathbb{R}^d$ denote the ground-truth target, $\hat y = f_\theta(x)$ the prediction, and $e = y - \hat y$ the prediction error. The Minimum Error Entropy (MEE) principle learns model parameters by minimizing the entropy of the error distribution. When instantiated with Shannon entropy~\cite{shannon1948mathematical}, the objective is
\begin{equation}
\min_{\theta} \; H(e) = - \int p(e) \log p(e)\, de ,
\end{equation}
where $p(e)$ denotes the probability density of the error variable.
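In practice $p(e)$ is unknown, so the entropy is estimated directly from error samples with a Parzen (kernel density) window; the quadratic Rényi instantiation used later in this section admits a particularly simple plug-in estimator. Below is a minimal PyTorch sketch of such an estimator; the function name, the Gaussian kernel choice, and the bandwidth default are illustrative assumptions, not the paper's implementation.

```python
import torch

def error_entropy(errors: torch.Tensor, sigma: float = 1.0) -> torch.Tensor:
    """Plug-in estimate of the quadratic Renyi entropy of the error distribution.

    errors: (N, d) tensor of per-sample prediction errors e_i = y_i - y_hat_i.
    Minimizing this estimate pulls the error samples toward each other,
    i.e. it lowers the entropy of the empirical error distribution.
    """
    sq_dists = torch.cdist(errors, errors).pow(2)          # ||e_i - e_j||^2
    info_potential = torch.exp(-sq_dists / (2 * sigma**2)).mean()
    return -torch.log(info_potential)
```

Note that entropy is invariant to a constant shift of the errors, which is one reason MEE is typically paired with a pointwise loss such as MSE, as done in this work.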

How to adapt MEE to VLA?

    In VLA models, let $\hat{\mathbf{a}}_{b,k}^{\,t} \in \mathbb{R}^D$ and $\mathbf{a}_{b,k}^{\,t}$ denote the predicted and ground-truth actions at the $k$-th step of an action chunk generated at time $t$ for trajectory $b$, with chunk size $K$. We define the action prediction error as $\mathbf{e}_{b,k}^{\,t} = \hat{\mathbf{a}}_{b,k}^{\,t} - \mathbf{a}_{b,k}^{\,t}$. Rather than treating errors independently, we aggregate errors across batch, time, and chunk dimensions and regard them as samples from a shared error distribution. Formally,
\begin{equation}
\mathcal{E} = \left\{ \mathbf{e}_{b,k}^{\,t} \;\middle|\; b=1,\dots,B,\; t=1,\dots,T,\; k=0,\dots,K-1 \right\}.
\end{equation}
    We then apply the quadratic Rényi entropy to this aggregated error distribution. Specifically, all action error vectors $\mathbf{e}_{b,k}^{\,t}$ across the batch, temporal, and chunk dimensions are flattened into a set of $N = B \times T \times K$ samples, denoted $\{\mathbf{e}_i\}_{i=1}^{N}$. This yields the trajectory-level MEE (T-MEE) empirical objective
\begin{equation}\label{eq: t-mee}
\mathcal{L}_{\mathrm{T\text{-}MEE}} = - \log \left( \frac{1}{N^2} \sum_{i=1}^{N} \sum_{j=1}^{N} \exp\left( - \frac{\| \mathbf{e}_i - \mathbf{e}_j \|^2}{2\sigma^2} \right) \right),
\end{equation}
where $\sigma$ denotes the kernel bandwidth.
    Building on T-MEE, we introduce a unified weighted formulation that accounts for the varying reliability of the error samples. Each error sample is assigned a non-negative importance weight based on its magnitude,
\[
w_i = \frac{ \exp\!\left(-\|\mathbf{e}_i\|^2 / 2\sigma_w^2\right) }{ \sum_{k=1}^{N} \exp\!\left(-\|\mathbf{e}_k\|^2 / 2\sigma_w^2\right) },
\]
which downweights unreliable, high-magnitude errors. Using these weights, we define the weighted trajectory-level MEE objective
\[
\mathcal{L}_{\mathrm{W\text{-}TMEE}} = -\log \sum_{i=1}^{N}\sum_{j=1}^{N} \omega_{ij} \exp\!\left( -\frac{\|\mathbf{e}_i-\mathbf{e}_j\|^2}{2\sigma^2} \right),
\]
where the weighting scheme $\omega_{ij}$ specifies how individual error samples contribute to the entropy estimate. Setting $\omega_{ij}=\frac{1}{N^2}w_i$ yields an asymmetric, chunk-weighted variant (Cw-TMEE) that emphasizes reliable action chunks while aggregating errors across trajectories, whereas setting $\omega_{ij}=w_i w_j$ gives a symmetric, element-weighted variant (Ew-TMEE) that emphasizes pairwise interactions between reliable error elements. Together, these two variants provide flexible mechanisms for incorporating error reliability into trajectory-level entropy minimization. Please refer to the paper for insights into how and why T-MEE works.
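The three objectives above differ only in the weights $\omega_{ij}$ applied to the pairwise Gaussian kernel matrix, so they can share one implementation. The following PyTorch sketch follows the equations directly; the function name, bandwidth defaults, variant keyword, tensor shapes, and the 0.1 trade-off coefficient in the usage example are assumptions for illustration rather than the authors' released code.

```python
import torch

def tmee_loss(errors: torch.Tensor, sigma: float = 1.0,
              sigma_w: float = 1.0, variant: str = "tmee") -> torch.Tensor:
    """Trajectory-level MEE objectives from the equations above.

    errors:  (N, D) action errors flattened over batch, time, and chunk
             dimensions, with N = B * T * K.
    variant: "tmee" (uniform omega_ij = 1/N^2), "cw" (omega_ij = w_i / N^2),
             or "ew" (omega_ij = w_i * w_j).
    """
    n = errors.shape[0]
    sq_dists = torch.cdist(errors, errors).pow(2)       # ||e_i - e_j||^2
    kernel = torch.exp(-sq_dists / (2 * sigma**2))      # Gaussian kernel matrix

    if variant == "tmee":
        potential = kernel.mean()                       # (1/N^2) * sum_ij K_ij
    else:
        # Reliability weights w_i: low-magnitude errors receive larger weight.
        w = torch.softmax(-errors.pow(2).sum(dim=1) / (2 * sigma_w**2), dim=0)
        if variant == "cw":                             # chunk-weighted, asymmetric
            potential = (w.unsqueeze(1) * kernel).sum() / n**2
        elif variant == "ew":                           # element-weighted, symmetric
            potential = (w.unsqueeze(1) * w.unsqueeze(0) * kernel).sum()
        else:
            raise ValueError(f"unknown variant: {variant}")
    return -torch.log(potential)


# Usage sketch: combine with MSE as described in the abstract.
pred = torch.randn(8, 16, 7, requires_grad=True)   # (B*T, K, D) predicted chunks
target = torch.randn(8, 16, 7)                     # ground-truth action chunks
err = (pred - target).reshape(-1, 7)               # flatten to (N, D)
loss = torch.nn.functional.mse_loss(pred, target) + 0.1 * tmee_loss(err, variant="ew")
loss.backward()
```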

Contributions

  • We reformulate action prediction errors as structured distributions and introduce trajectory-level MEE objectives for VLA models: the T-MEE objective and its two weighted variants (Cw-TMEE and Ew-TMEE), which explicitly model higher-order error interactions along trajectories.
  • We provide a unified investigation across theory, simulation, and real-world robotic experiments, in which analytical results, controlled simulations, and physical evaluations play mutually reinforcing roles, yielding a coherent picture of the effectiveness and limitations of the proposed approach.

Experiments

Evaluated Models

Architectural taxonomy of continuous-action VLA models evaluated in this work.

Nearly Balanced Setting

1. T-MEE delivers robust gains on LIBERO across all evaluated model scales and VLA frameworks.
2. Experimental results on LIBERO demonstrate that all T-MEE variants consistently surpass regression baselines across various VLA architectures, with the standard objective providing the most stable and significant performance gains.
3. Real-world evaluations on GR00T N1.5 demonstrate that T-MEE substantially enhances execution stability and success rates, showing that its distribution-level regularization transfers from simulation to physical robotic systems.

GR00T N1.5

GR00T N1.5 + T-MEE

Few-shot and Noisy Settings

1. On the LIBERO benchmarks, T-MEE significantly enhances data efficiency through distribution-level supervision, consistently outperforming the GR00T baseline across all task suites and data regimes.
Few-shot evaluation on LIBERO under a 0.2 training ratio.
2. T-MEE effectively mitigates the impact of non-Gaussian observation noise and action outliers, significantly enhancing both fine-tuned and zero-shot robustness across various VLA architectures.

Imbalanced Setting

1. Experimental results on LIBERO demonstrate that T-MEE consistently improves performance across various VLA architectures and task types under moderate data imbalance, though its effectiveness diminishes under extreme imbalance ratios where minority task supervision becomes severely insufficient.
2. Evaluation on SimplerEnv reveals that T-MEE consistently improves performance on dominant tasks even under highly long-tailed data distributions, demonstrating that its benefits persist even when minority tasks are under-represented or not explicitly evaluated.

Analysis Experiments

1. While T-MEE effectively reshapes action errors to cluster near zero, smaller models still exhibit instantaneous outliers during early trajectory phases due to capacity limits, yet they successfully maintain correct directional trends for task completion.
2. Since T-MEE is introduced only after 10k training steps, the two curves largely overlap in the early stage. Once T-MEE is activated, the error entropy decreases more rapidly and converges to a lower level than under the baseline.
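A small sketch of the activation schedule described above, assuming the `tmee_loss` helper from the earlier sketch is in scope; the warm-up threshold matches the 10k steps mentioned here, while the weighting coefficient is an illustrative assumption.

```python
import torch.nn.functional as F

WARMUP_STEPS = 10_000   # T-MEE is switched on after this many steps
LAMBDA_TMEE = 0.1       # assumed trade-off coefficient

def training_loss(pred, target, step):
    """MSE-only warm-up, then MSE + T-MEE (see tmee_loss sketch above)."""
    mse = F.mse_loss(pred, target)
    if step < WARMUP_STEPS:
        return mse
    err = (pred - target).reshape(-1, pred.shape[-1])   # flatten to (N, D)
    return mse + LAMBDA_TMEE * tmee_loss(err)
```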

BibTeX

@article{bai2026reshaping,
  title={Reshaping Action Error Distributions for Reliable Vision-Language-Action Models},
  author={Bai, Shuanghao and Wang, DaKai and Chi, Cheng and Zhou, Wanqi and Lyu, Jing and Zhao, Xiaoguang and Wang, Pengwei and Wang, Zhongyuan and Xing, Lei and Zhang, Shanghang and Chen, Badong},
  journal={arXiv preprint arXiv:},
  year={2026}
}