Reshaping Action Error Distributions for Reliable Vision-Language-Action Models
Abstract
In robotic manipulation, vision-language-action (VLA) models have emerged as a promising paradigm for learning generalizable and scalable robot policies. Most existing VLA frameworks rely on standard supervised objectives, typically cross-entropy for discrete actions and mean squared error (MSE) for continuous action regression, which impose strong pointwise constraints on individual predictions. In this work, we focus on continuous-action VLA models and move beyond conventional MSE-based regression by reshaping action error distributions during training. Drawing on information-theoretic principles, we introduce Minimum Error Entropy (MEE) into modern VLA architectures and propose a trajectory-level MEE objective, together with two weighted variants, which we combine with MSE for continuous-action VLA training. We evaluate our approaches across standard, few-shot, and noisy settings on multiple representative VLA architectures, using simulation benchmarks such as LIBERO and SimplerEnv as well as real-world robotic manipulation tasks. Experimental results demonstrate consistent improvements in success rates and robustness across these settings. Under imbalanced data regimes, the gains persist within a well-characterized operating range, while incurring negligible additional training cost and no impact on inference efficiency. We further provide theoretical analyses that explain why MEE-based supervision is effective and characterize its practical range.
Introduction
What is Minimum Error Entropy?
In a standard regression setting, an input $x$ is mapped to an output by a parametric model $f_\theta$.
Let $y \in \mathbb{R}^d$ denote the ground-truth target, $\hat y = f_\theta(x)$ the prediction, and $e = y - \hat y$ the prediction error.
The Minimum Error Entropy (MEE) principle learns model parameters by minimizing the entropy of the error distribution.
When instantiated with Shannon entropy~\cite{shannon1948mathematical}, the objective is
\begin{equation}
\min_{\theta} \; H(e)
= - \int p(e) \log p(e)\, de ,
\end{equation}
where $p(e)$ denotes the probability density of the error variable.
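To build intuition for what minimizing $H(e)$ does, consider the scalar Gaussian case; this worked example is our own illustration, not taken from the paper. If $e \sim \mathcal{N}(\mu, s^2)$, then
\begin{equation}
H(e) = \tfrac{1}{2}\log\!\left(2\pi \mathrm{e}\, s^{2}\right),
\end{equation}
where $\mathrm{e}$ denotes Euler's number. The entropy depends only on the spread $s$, not on the mean $\mu$: entropy minimization concentrates the error distribution but is invariant to a constant shift of all errors, which is one reason MEE-style objectives are paired with a pointwise loss such as MSE rather than used alone.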
How to adapt MEE to VLA?
In VLA models, let $\hat{\mathbf{a}}_{b,k}^{\,t} \in \mathbb{R}^D$ and
$\mathbf{a}_{b,k}^{\,t}$ denote the predicted and ground-truth actions at the
$k$-th step of an action chunk generated at time $t$ for trajectory $b$,
with chunk size $K$.
We define the action prediction error as
$\mathbf{e}_{b,k}^{\,t} = \hat{\mathbf{a}}_{b,k}^{\,t} - \mathbf{a}_{b,k}^{\,t}$.
Rather than treating errors independently, we aggregate errors across batch,
time, and chunk dimensions and regard them as samples from a shared error
distribution.
Formally,
\begin{equation}
\mathcal{E} =
\left\{
\mathbf{e}_{b,k}^{\,t}
\;\middle|\;
b=1,\dots,B,\;
t=1,\dots,T,\;
k=0,\dots,K-1
\right\}.
\end{equation}
We then apply the quadratic Rényi entropy to the aggregated action error distribution; unlike Shannon entropy, the quadratic Rényi entropy admits a closed-form Parzen-window (kernel) estimate, so the objective can be computed directly from samples.
Specifically, all action error vectors $\mathbf{e}_{b,k}^{\,t}$ across the batch, temporal, and chunk dimensions are flattened into a set of
$N = B \times T \times K$ samples, denoted as $\{\mathbf{e}_i\}_{i=1}^{N}$.
This yields the following empirical trajectory-level MEE (T-MEE) objective:
\begin{equation}\label{eq: t-mee}
\mathcal{L}_{\mathrm{T\text{-}MEE}} =
- \log \left(
\frac{1}{N^2}
\sum_{i=1}^{N}
\sum_{j=1}^{N}
\exp\left(
- \frac{\| \mathbf{e}_i - \mathbf{e}_j \|^2}{2\sigma^2}
\right)
\right),
\end{equation}
where $\sigma$ denotes the kernel bandwidth.
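A minimal PyTorch sketch of this objective, assuming the aggregated error matrix `e` of shape `(N, D)` from the snippet above; the function name `tmee_loss` and the default bandwidth are our own choices:

```python
import torch

def tmee_loss(e: torch.Tensor, sigma: float = 1.0) -> torch.Tensor:
    """Trajectory-level MEE: negative log of the mean Gaussian kernel
    over all N^2 pairs of aggregated error samples (the empirical
    quadratic Renyi entropy of the error distribution).

    e:     (N, D) aggregated action errors.
    sigma: kernel bandwidth.
    """
    # Pairwise squared distances ||e_i - e_j||^2, shape (N, N).
    sq_dists = torch.cdist(e, e).pow(2)
    # Gaussian kernel matrix; its mean is the information potential.
    kernel = torch.exp(-sq_dists / (2.0 * sigma ** 2))
    # T-MEE objective: -log of the (1/N^2) double sum.
    return -torch.log(kernel.mean())
```

In training, this term would be added to the pointwise regression loss, e.g. `loss = F.mse_loss(pred, target) + lam * tmee_loss(e, sigma)`, where `lam` is a hypothetical trade-off coefficient.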
Building upon the T-MEE objective, we introduce a unified
weighted formulation that accounts for the varying reliability of action error
samples. Let $\{\mathbf{e}_i\}_{i=1}^{N}$ denote the aggregated action error vectors
collected across batch, time, and chunk dimensions. Each error sample is assigned
a non-negative importance weight based on its magnitude,
\[
w_i =
\frac{
\exp\!\left(-\|\mathbf{e}_i\|^2 / 2\sigma_w^2\right)
}{
\sum_{j=1}^{N}
\exp\!\left(-\|\mathbf{e}_j\|^2 / 2\sigma_w^2\right)
},
\]
which downweights unreliable, high-magnitude errors.
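In code, $w_i$ is a softmax over scaled negative squared error magnitudes; a short sketch continuing the snippet above, with `sigma_w` an assumed bandwidth:

```python
# Reliability weights w_i: softmax over -||e_i||^2 / (2 * sigma_w^2), so
# high-magnitude (unreliable) errors receive small weights and the
# weights sum to one over the N aggregated samples.
sigma_w = 1.0  # assumed value; a tunable bandwidth hyperparameter
w = torch.softmax(-e.pow(2).sum(dim=-1) / (2.0 * sigma_w ** 2), dim=0)  # (N,)
```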
Using these weights, we define a weighted trajectory-level MEE objective as
\[
\mathcal{L}_{\mathrm{W\text{-}TMEE}}
=
-\log
\sum_{i=1}^{N}\sum_{j=1}^{N}
\omega_{ij}
\exp\!\left(
-\frac{\|\mathbf{e}_i-\mathbf{e}_j\|^2}{2\sigma^2}
\right),
\]
where the weighting scheme $\omega_{ij}$ specifies how individual error samples
contribute to the entropy estimate.
In particular, setting $\omega_{ij}=\frac{1}{N^2}w_i$ yields an asymmetric,
chunk-weighted variant (Cw-TMEE) that emphasizes reliable action chunks while
aggregating errors across trajectories, whereas setting
$\omega_{ij}=w_i w_j$ results in a symmetric, element-weighted variant (Ew-TMEE)
that emphasizes pairwise interactions between reliable error elements.
Together, these two variants provide flexible mechanisms for incorporating
error reliability into trajectory-level entropy minimization.
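To make the two weighting schemes concrete, the following sketch implements both variants under the same assumptions as the snippets above; the function name, signature, and defaults are ours, not from a code release:

```python
import torch

def weighted_tmee_loss(e: torch.Tensor, sigma: float = 1.0,
                       sigma_w: float = 1.0, variant: str = "cw") -> torch.Tensor:
    """Weighted trajectory-level MEE with the two schemes above.

    variant="cw": omega_ij = w_i / N^2   (asymmetric, Cw-TMEE)
    variant="ew": omega_ij = w_i * w_j   (symmetric, Ew-TMEE)
    """
    n = e.shape[0]
    # Reliability weights w_i (softmax over scaled -||e_i||^2).
    w = torch.softmax(-e.pow(2).sum(dim=-1) / (2.0 * sigma_w ** 2), dim=0)
    # Gaussian kernel matrix over all N^2 error pairs.
    kernel = torch.exp(-torch.cdist(e, e).pow(2) / (2.0 * sigma ** 2))
    if variant == "cw":
        omega = w.unsqueeze(1) / n ** 2          # broadcasts to (N, N)
    else:
        omega = w.unsqueeze(1) * w.unsqueeze(0)  # outer product w_i w_j
    return -torch.log((omega * kernel).sum())
```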
Please refer to the paper for insights into how and why T-MEE works.
Contributions
- We reformulate action prediction errors as structured distributions and introduce trajectory-level MEE objectives for VLA models, including three variants that explicitly model higher-order error interactions along trajectories.
- We provide a unified investigation across theory, simulation, and real-world robotic experiments, in which analytical results, controlled simulations, and physical evaluations play mutually reinforcing roles, yielding a coherent picture of the effectiveness and limitations of the proposed approach.
Experiments
Evaluated Models
Nearly Balanced Setting
GR00T N1.5
Task 1: Place banana
Task 2: Wipe whiteboard
Task 3: Handover cup
Task 4: Handover cup (unseen)
GR00T N1.5 + T-MEE
Task 1: Place banana
Task 2: Wipe whiteboard
Task 3: Handover cup
Task 4: Handover cup (unseen)
Few-Shot and Noisy Settings
Imbalanced Setting
Analysis Experiments
BibTeX
@article{bai2026reshaping,
title={Reshaping Action Error Distributions for Reliable Vision-Language-Action Models},
author={Bai, Shuanghao and Wang, DaKai and Chi, Cheng and Zhou, Wanqi and Lyu, Jing and Zhao, Xiaoguang and Wang, Pengwei and Wang, Zhongyuan and Xing, Lei and Zhang, Shanghang and Chen, Badong},
journal={arXiv preprint arXiv:},
year={2026}
}