Paper Analysis: Robotic World Model

This page is the paper-faithful description of Robotic World Model: A Neural Network Simulator for Robust Policy Optimization in Robotics by Li, Krause, and Hutter (ETH Zurich, arXiv:2501.10100v5). It uses the paper's notation verbatim and includes the equations and quantitative results from the paper.

The implementation as found in the upstream codebase is described separately on the Implementation Analysis page. The mapping between paper claims and code (including discrepancies) is on the Paper-to-Code Synthesis page. This page makes no claims about implementation; it describes only what the paper says.

This page records the current reading of the paper. Exact figure and table values should be rechecked against the paper PDF before they are used as final reproduction targets.

1. Problem and motivation

1.1 The gap RWM addresses

Reinforcement learning has produced strong results in robotic control, but the dominant model-free methods (PPO, SAC) require enormous amounts of environment interaction. For locomotion-scale tasks, training a useful policy can demand on the order of \(10^8\) environment steps. This is acceptable in simulation, where steps are cheap and parallel environments are unlimited. It is impractical on real hardware, where each step is a physical interaction that takes time, risks damage, and produces only one trajectory at a time.

A learned world model offers a different trade-off. If the model can predict environment dynamics accurately enough, it acts as a learned simulator. The policy can then be optimized inside the model rather than against the real environment, drastically reducing the number of real interactions needed.

Two long-standing problems have prevented this from working reliably for low-level robotic control:

  1. Compounding error. When a learned model is used autoregressively (its own predictions become inputs for the next prediction), small errors at one step amplify into large errors at later steps. Long-horizon rollouts diverge from reality.
  2. Inductive bias trade-off. Many existing world-model methods incorporate domain-specific structure (rigid-body dynamics, foot-placement priors, frequency-domain parameterization) to constrain the model. This makes them more accurate within the chosen domain but limits their applicability across robotic platforms and tasks.

RWM proposes a framework that achieves long-horizon prediction stability without requiring domain-specific inductive biases, then uses it for policy optimization through a method the paper calls MBPO-PPO.

1.2 Stated contributions

The paper lists three contributions:

  1. A network architecture and training framework that produces world models capable of stable long autoregressive rollouts.
  2. A comparative evaluation of RWM against MLP, RSSM, and transformer baselines across diverse robotic environments.
  3. A policy optimization framework, MBPO-PPO, that uses the learned world model as a simulator and demonstrates zero-shot deployment of the resulting policy on ANYmal D and Unitree G1 hardware.

2. Foundational concepts

The paper assumes familiarity with several concepts. They are defined here for the page to stand on its own.

2.1 POMDP

A Partially Observable Markov Decision Process is the formal model the paper adopts. It is the tuple \((\mathcal{S}, \mathcal{A}, \mathcal{O}, T, R, \Omega, \gamma)\) where:

  • \(\mathcal{S}\) is the state space (the true configuration of the environment, possibly hidden from the agent).
  • \(\mathcal{A}\) is the action space.
  • \(\mathcal{O}\) is the observation space (what the agent actually sees).
  • \(T : \mathcal{S} \times \mathcal{A} \times \mathcal{S} \to [0, 1]\) is the transition kernel, giving the probability \(p(s_{t+1} \mid s_t, a_t)\).
  • \(R : \mathcal{S} \times \mathcal{A} \times \mathcal{S} \to \mathbb{R}\) is the reward function.
  • \(\Omega : \mathcal{S} \times \mathcal{O} \to [0, 1]\) is the observation kernel, giving the probability \(p(o_t \mid s_t)\).
  • \(\gamma \in [0, 1]\) is the discount factor.

The agent does not see \(s_t\) directly; it sees \(o_t\), which depends probabilistically on \(s_t\). Partial observability is the situation where \(o_t\) does not contain all the information needed to predict the next state. In a robot, this happens when the relevant state includes things like contact forces, internal joint friction, or terrain composition that the proprioceptive observations do not directly measure.

2.2 Policy and return

A policy \(\pi_\theta : \mathcal{O} \to \mathcal{A}\) maps observations to actions. It is parameterized by \(\theta\) (typically the weights of a neural network). The agent's goal is to maximize the expected discounted return:

\[ \mathbb{E}_{\pi_\theta} \left[ \sum_{t \geq 0} \gamma^t r_t \right] \]

where \(r_t\) is the reward at time \(t\) and the expectation is over trajectories generated by \(\pi_\theta\).
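As a minimal arithmetic illustration (plain Python; the reward values are hypothetical placeholders):

```python
# Discounted return: sum over t of gamma^t * r_t.
# The reward sequence here is a hypothetical placeholder.
def discounted_return(rewards, gamma=0.99):
    return sum(gamma**t * r for t, r in enumerate(rewards))

print(discounted_return([1.0, 1.0, 0.5]))  # 1.0 + 0.99*1.0 + 0.99**2*0.5 = 2.48005
```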

2.3 World model

A world model is a learned approximation of the environment dynamics, denoted \(p_\phi\) with parameters \(\phi\). Given a sequence of past observations and actions, it predicts a distribution over the next observation. Once trained, it can be used as a learned simulator: instead of stepping the real environment, the agent steps the model.

Training a world model and then optimizing a policy inside it is called learning in imagination. The standard pattern:

  1. Collect data from the real environment.
  2. Train the world model on that data.
  3. Use the world model to generate trajectories ("imagine") and update the policy on those imagined trajectories.

The reason this can work better than naive model-free RL is that the world model amortizes real interactions: one batch of real data trains the model once, and then the model produces unlimited "imagined" data for policy updates.

2.4 Autoregressive prediction

A model is autoregressive when its outputs become its inputs at the next step. For a world model: the prediction \(\hat{o}_{t+1}\) becomes part of the input that produces \(\hat{o}_{t+2}\), which becomes part of the input that produces \(\hat{o}_{t+3}\), and so on.

Autoregressive prediction is unavoidable for long-horizon rollouts because no real future observation is available; the model must consume its own predictions. This is exactly where compounding error arises.
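A minimal sketch of this loop, assuming hypothetical `model` and `policy` callables rather than any interface from the paper:

```python
# Autoregressive rollout: each prediction becomes part of the input window
# for the next step. `model` and `policy` are hypothetical callables.
def autoregressive_rollout(model, policy, obs_hist, act_hist, horizon):
    obs_hist, act_hist = list(obs_hist), list(act_hist)  # M most recent steps
    trajectory = []
    for _ in range(horizon):
        o_pred = model(obs_hist, act_hist)   # o'_{t+k}
        a_pred = policy(o_pred)              # next action
        # Slide the window forward; the model's own prediction enters it.
        # (Under teacher forcing, a ground-truth observation would enter
        # here instead, but none is available at inference time.)
        obs_hist = obs_hist[1:] + [o_pred]
        act_hist = act_hist[1:] + [a_pred]
        trajectory.append((o_pred, a_pred))
    return trajectory
```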

2.5 Teacher forcing

A common alternative to autoregressive training is teacher forcing: during training, the model receives ground-truth observations as inputs at every step rather than its own predictions. This makes training fast (everything parallelizes) and accurate one step ahead, but the model is never exposed to its own prediction errors during training. At inference time, when it must consume its own predictions, error compounds quickly.

Teacher forcing can be viewed as a special case of autoregressive training with forecast horizon \(N = 1\).

2.6 Privileged information

In robotics, the policy at deployment time often has access to fewer signals than the policy during training. For example, in simulation, contact states or terrain parameters are exactly known. On hardware, they are not. Privileged information refers to such signals: things known during training but not at deployment.

In RWM, privileged information is used as an auxiliary prediction target. The world model is trained to predict these signals (contacts, terminations) in addition to predicting the next observation. The paper argues this implicitly forces the latent representation to encode information that supports accurate long-horizon rollouts, even though the predicted privileged signals themselves are not used at deployment.

3. RWM: the world model

3.1 Setup and notation

The world model \(p_\phi\) takes as input a window of \(M\) historical observation-action pairs and predicts the distribution of the next observation. The predicted observation \(k\) steps into the future is generated by:

\[ o'_{t+k} \sim p_\phi\left(\cdot \mid o_{t-M+k:t},\; o'_{t+1:t+k-1},\; a_{t-M+k:t+k-1}\right) \quad (1) \]

where:

  • \(o_{t-M+k:t}\) is the real observation history available at step \(t\), sliding forward as \(k\) increases.
  • \(o'_{t+1:t+k-1}\) is the predicted observation sequence the model has already produced for steps \(t+1\) through \(t+k-1\). These are autoregressive inputs.
  • \(a_{t-M+k:t+k-1}\) is the action sequence (real for \(\leq t\), given by the policy for \(> t\)).

The notation \(o'\) denotes a predicted (imagined) observation, while \(o\) without prime denotes the real observation. The same convention applies to predicted privileged information \(c'\) versus real privileged information \(c\).

Privileged information \(c\) (contacts, terminations) is predicted in parallel using the same input pattern.
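To make the index bookkeeping of Eq. 1 concrete, the following sketch (plain Python, illustrative only) lists which slots of the conditioning window hold real versus predicted quantities for \(M = 4\), \(k = 3\):

```python
# Slots of the Eq. 1 conditioning window at forecast step k = 3 with
# history length M = 4. Offsets are relative to the anchor step t.
M, k = 4, 3
real_obs = [f"o[t{i:+d}]" for i in range(-M + k, 1)]  # o_{t-M+k:t}
pred_obs = [f"o'[t{i:+d}]" for i in range(1, k)]      # o'_{t+1:t+k-1}
actions  = [f"a[t{i:+d}]" for i in range(-M + k, k)]  # a_{t-M+k:t+k-1}
print(real_obs + pred_obs)  # M observations total: real history, then predictions
print(actions)              # M actions total, ending at a_{t+k-1}
```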

3.2 The training objective

The world model is trained by minimizing the multi-step prediction error:

\[ L = \frac{1}{N} \sum_{k=1}^{N} \alpha^k \left[ L_o(o'_{t+k}, o_{t+k}) + L_c(c'_{t+k}, c_{t+k}) \right] \quad (2) \]

where:

  • \(N\) is the forecast horizon: how many steps into the future the model is trained to predict.
  • \(\alpha\) is a forecast decay factor that down-weights losses at later forecast steps. The paper's reported configuration uses \(\alpha = 1.0\) (no decay), so all forecast steps contribute equally; the decay factor is a hyperparameter that allows tuning if later steps should matter less.
  • \(L_o\) measures the discrepancy between predicted and true observations.
  • \(L_c\) measures the discrepancy between predicted and true privileged information (contacts, terminations).

The loss is averaged over the forecast horizon \(N\). Training data is constructed by sliding a window of size \(M + N\) over collected real trajectories, which gives both the historical context and the prediction targets.
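A sketch of this objective, assuming PyTorch and using mean-squared error as a stand-in for \(L_o\) and \(L_c\) (whose exact forms this page does not restate):

```python
import torch
import torch.nn.functional as F

# Eq. 2 as a sketch. o_pred / c_pred are the model's autoregressive
# predictions for forecast steps 1..N; targets come from a window of
# size M + N slid over stored real trajectories. MSE is a placeholder
# for the paper's L_o and L_c.
def multi_step_loss(o_pred, c_pred, o_target, c_target, alpha=1.0):
    N = o_target.shape[0]
    loss = o_target.new_zeros(())
    for k in range(1, N + 1):
        step = F.mse_loss(o_pred[k - 1], o_target[k - 1]) \
             + F.mse_loss(c_pred[k - 1], c_target[k - 1])
        loss = loss + alpha**k * step
    return loss / N
```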

3.3 The dual-autoregressive mechanism

This is the central architectural contribution of the paper. Two distinct autoregressions are nested.

Inner autoregression (across history)

The GRU base of the world model processes the historical observation-action pairs one step at a time. After each historical step, the GRU's hidden state \(h_t\) is updated to summarize the history seen so far. This is the same recurrence used by any GRU language model or sequence model: the hidden state carries context forward.

Outer autoregression (across forecast)

When the model predicts \(o'_{t+1}\), that prediction is appended to the input window for predicting \(o'_{t+2}\). The window slides forward, but where a teacher-forced model would have \(o_{t+1}\) (the true observation) in that slot, the autoregressive model has \(o'_{t+1}\) (its own prediction). Each subsequent prediction conditions on the model's own previous predictions over the forecast horizon \(N\).

The combination is what the paper calls dual-autoregressive: inner autoregression updates the GRU state across history, outer autoregression feeds predictions back over the forecast horizon. Both happen during training. The key is that during training, the model is exposed to its own prediction distribution (via the outer autoregression), which closes the gap between the training distribution and the deployment distribution.
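A sketch of the two nested recurrences during training, assuming a `torch.nn.GRUCell` base and, for brevity, a deterministic prediction head (the paper's heads are Gaussian; see Section 3.5):

```python
import torch
import torch.nn as nn

# Dual-autoregressive training rollout. obs_hist/act_hist hold the M real
# history steps (shape (M, B, dim)); act_future holds the N forecast-step
# actions. Names and shapes are illustrative.
def dual_autoregressive_rollout(gru, head, obs_hist, act_hist, act_future, N):
    h = torch.zeros(obs_hist.shape[1], gru.hidden_size)
    # Inner autoregression: the GRU state summarizes the history.
    for o, a in zip(obs_hist, act_hist):
        h = gru(torch.cat([o, a], dim=-1), h)
    preds = []
    # Outer autoregression: each prediction is fed back as the next input,
    # so gradients propagate through all N forecast steps.
    for k in range(N):
        o = head(h)                                   # o'_{t+k+1}
        preds.append(o)
        h = gru(torch.cat([o, act_future[k]], dim=-1), h)
    return torch.stack(preds)                         # inputs to Eq. 2
```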

Figure 2 in the paper

Figure 2 contrasts autoregressive training (Fig. 2a) with teacher-forcing training (Fig. 2b) for a setup with history horizon \(H = 3\) (the figure's \(H\) corresponds to the history length \(M\) used elsewhere) and forecast horizon \(N = 2\).

In Figure 2a (autoregressive training): the model uses real observations \(o_{t-2}, o_{t-1}, o_t\) and produces a prediction \(o'_{t+1}\). Then \(o'_{t+1}\) is fed back as input for predicting \(o'_{t+2}\). The dashed arrows in the figure denote these autoregressive prediction steps. The loss \(L_o\) is computed at both forecast steps. The next training instance shifts forward by one step, now using \(o_{t-1}, o_t, o_{t+1}\) as history (with the real \(o_{t+1}\) from the dataset).

In Figure 2b (teacher-forcing): the model still uses the same history, but for predicting \(o'_{t+2}\), the real \(o_{t+1}\) is used as input rather than the predicted \(o'_{t+1}\). This structural simplification restores the step-wise parallelism that (a) gives up: every position becomes an independent one-step prediction problem, and the loss is computed only at the next-step position \(t+1\). With \(N = 1\), autoregressive training reduces exactly to teacher-forcing.

The figure makes one point clearly: in (a), the model is trained on its own outputs and learns to recover from its own errors. In (b), it is not.

Figure S6 in the appendix

Figure S6 visualizes the dual aspect explicitly. It shows the GRU hidden states \(h_{t-2}, h_{t-1}, h_t, h_{t+1}, h_{t+2}\) being updated step-by-step (the inner autoregression across history and across the forecast steps), with the predicted observations \(o'_{t+1}, o'_{t+2}, o'_{t+3}\) feeding back as inputs to subsequent hidden state updates (the outer autoregression). The figure shows two "rolls" of this scheme: one starting from step \(t\) and one starting from step \(t+1\), illustrating how the training data window slides.

3.4 Architecture

The architecture combines a recurrent base with prediction heads (Table S7 in the paper):

base               GRU         hidden sizes 256, 256
heads              MLP         hidden size 128, ReLU activation

The GRU base ingests observation-action pairs sequentially and produces a hidden representation. MLP heads on top of this representation predict the mean and standard deviation of a Gaussian distribution over the next observation, plus the privileged information (contacts, terminations).
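A minimal sketch consistent with these shapes (PyTorch; `obs_dim`, `act_dim`, and `priv_dim` are placeholders):

```python
import torch
import torch.nn as nn

# GRU base (two layers of 256 units) with 128-unit ReLU MLP heads for the
# Gaussian observation prediction and the privileged signals.
class RWMSketch(nn.Module):
    def __init__(self, obs_dim, act_dim, priv_dim):
        super().__init__()
        self.base = nn.GRU(obs_dim + act_dim, 256, num_layers=2, batch_first=True)

        def head(out_dim):
            return nn.Sequential(nn.Linear(256, 128), nn.ReLU(),
                                 nn.Linear(128, out_dim))

        self.mu = head(obs_dim)        # mean of next-observation Gaussian
        self.log_std = head(obs_dim)   # log std of next-observation Gaussian
        self.priv = head(priv_dim)     # contacts, terminations

    def forward(self, obs_act_seq, h=None):
        feat, h = self.base(obs_act_seq, h)   # feat: (B, T, 256)
        last = feat[:, -1]
        return self.mu(last), self.log_std(last), self.priv(last), h
```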

The paper notes the framework is architecture-agnostic. RSSM (with autoregressive training) achieves comparable performance, but the GRU is selected for its simplicity and computational efficiency. Transformers do not scale well under autoregressive training because the multi-step gradient propagation exhausts GPU memory.

3.5 Reparameterization for gradient propagation

The model predicts a Gaussian distribution over the next observation. Sampling from this distribution is non-differentiable in general, which would block gradients from flowing through autoregressive steps. The paper applies the reparameterization trick: instead of sampling \(o'_{t+1}\) directly, the model produces \(o'_{t+1} = \mu_\phi + \sigma_\phi \cdot \epsilon\) where \(\epsilon \sim \mathcal{N}(0, I)\) is sampled independently. The gradient flows through \(\mu_\phi\) and \(\sigma_\phi\) but bypasses \(\epsilon\), enabling end-to-end training across the forecast horizon.
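In code the trick is a few lines (PyTorch sketch; shapes are arbitrary placeholders):

```python
import torch

# o' = mu + sigma * eps, eps ~ N(0, I): gradients reach mu and sigma,
# while eps carries none.
mu = torch.zeros(3, requires_grad=True)
log_std = torch.zeros(3, requires_grad=True)
eps = torch.randn(3)                    # sampled independently of the network
o_pred = mu + log_std.exp() * eps       # differentiable in mu and log_std
o_pred.sum().backward()
print(mu.grad, log_std.grad)            # both populated; eps has no grad
```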

4. MBPO-PPO: policy optimization

4.1 The algorithm

Algorithm 1 in the paper:

1: Initialize policy π_θ, world model p_φ, replay buffer D
2: for learning iterations = 1, 2, ... do
3:    Collect (o, a) pairs in D by interacting with the environment using π_θ
4:    Update p_φ with autoregressive training using data from D, per Eq. 2
5:    Initialize imagination agents with observations sampled from D
6:    Roll out imagination trajectories using π_θ and p_φ for T steps, per Eq. 3
7:    Update π_θ using PPO (or another RL algorithm)
8: end for

The structure is:

  • Real interactions go into a replay buffer \(\mathcal{D}\).
  • The world model \(p_\phi\) is trained on \(\mathcal{D}\) using the multi-step autoregressive loss (Eq. 2).
  • Imagination agents are initialized from observations sampled out of \(\mathcal{D}\). Each imagination agent then rolls forward for \(T\) steps, using the policy \(\pi_\theta\) to choose actions and the world model \(p_\phi\) to produce next observations.
  • The policy is updated using PPO on the imagined trajectories.
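Restated as code, with every helper a hypothetical stand-in supplied by the caller; only the loop structure follows Algorithm 1:

```python
# MBPO-PPO outer loop. collect, train_model, imagine, and ppo_update are
# hypothetical callables, not the paper's API.
def mbpo_ppo(collect, train_model, imagine, ppo_update, buffer, iterations):
    for _ in range(iterations):
        buffer.extend(collect())       # line 3: real (o, a) pairs into D
        train_model(buffer)            # line 4: Eq. 2, autoregressive training
        imagined = imagine(buffer)     # lines 5-6: init from D, roll T steps
        ppo_update(imagined)           # line 7: PPO on imagined trajectories
```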

4.2 Imagined action selection

During imagination, the policy chooses actions conditioned on the imagined observations:

\[ a'_{t+k} \sim \pi_\theta(\cdot \mid o'_{t+k}) \quad (3) \]

where \(o'_{t+k}\) is drawn autoregressively per Eq. 1. The policy does not know whether the observation is imagined or real; it sees the same observation space either way.

Rewards in imagination are computed from the imagined observations and predicted privileged information, not from a ground-truth simulator. This is the operational meaning of "the world model acts as a simulator": it produces both states and the signals needed to compute rewards, so the PPO update can run end-to-end without ever stepping the real environment during the imagination phase.
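A sketch of one imagination step under these semantics, assuming hypothetical `world_model.step`, `policy.sample`, and `compute_reward` interfaces:

```python
# One imagination step: the observation and privileged signals come from the
# world model (Eq. 1), the action from the policy (Eq. 3), and the reward is
# reconstructed from model outputs alone. No simulator is queried.
def imagination_step(world_model, policy, compute_reward, state):
    o_pred, c_pred, state = world_model.step(state)   # o'_{t+k}, c'_{t+k}
    a_pred = policy.sample(o_pred)                    # a'_{t+k} ~ pi(.|o')
    r_pred = compute_reward(o_pred, c_pred, a_pred)   # reward from predictions
    return (o_pred, a_pred, r_pred), state
```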

4.3 The 100-step rollout claim

The paper emphasizes that MBPO-PPO optimizes policies over roughly one hundred autoregressive steps of imagination per iteration, far exceeding what existing methods (MBPO, Dreamer, TD-MPC) achieve. This is enabled directly by RWM's stability over long autoregressive rollouts. Without that stability, longer rollouts would accumulate enough prediction error to make the imagined trajectories useless for policy updates.

The paper's appendix reports the training parameters in Table S11: approximately 4096 imagination environments, 100 imagination steps per iteration, buffer size \(|\mathcal{D}| = 1000\), KL target 0.01, clip range \(\epsilon = 0.2\), discount \(\gamma = 0.99\), 2500 total iterations. These exact numbers should be rechecked against the paper PDF.

4.4 Figure S7 in the paper

Figure S7 visualizes the MBPO-PPO loop. It shows:

  • Real environment interactions on the left, producing observation-action pairs that flow into the replay buffer.
  • Imagination rollouts on the right, where the world model produces \(o'_{t+1}, o'_{t+2}, \ldots, o'_{t+T}\) given the policy's actions \(a_t, a_{t+1}, \ldots\).
  • The policy is updated from the imagined trajectories.

The figure makes explicit that the policy never sees real \(o_{t+1:T}\) during the imagination phase; it only sees \(o'_{t+1:T}\) from the world model.

5. Empirical results

5.1 Autoregressive prediction accuracy

The paper trains RWM with \(M = 32\) history steps and \(N = 8\) forecast steps on ANYmal D trajectories at 50 Hz control frequency. Figure 3a shows predicted trajectories closely tracking ground truth across all observed variables (linear velocity, angular velocity, joint positions). Predictions extend reliably beyond the training forecast horizon of 8 steps, indicating the autoregressive training mechanism extrapolates rather than memorizes.

Figure 3b reports prediction error under Gaussian noise injection (six noise scales: 0.0, 0.1, 0.2, 0.4, 0.5, 0.8). RWM (yellow curves) maintains low prediction error across all noise levels and forecast steps. The MLP baseline (grey curves) diverges: prediction error grows rapidly with forecast step and is sharply worse under any non-zero noise.

5.2 Comparison across architectures

The paper reports autoregressive prediction error \(e\) across diverse robotic environments (manipulation, quadruped locomotion, humanoid locomotion) for five model variants:

  1. MLP baseline.
  2. RSSM (Recurrent State Space Model, the architecture used by Dreamer).
  3. Transformer.
  4. RWM trained with teacher-forcing (RWM-TF).
  5. RWM trained with autoregressive training (RWM-AR).

The result: RWM-AR consistently achieves the lowest prediction error across all environments. The gap is most pronounced on legged locomotion tasks. Critically, RWM-AR significantly outperforms RWM-TF, isolating the contribution of autoregressive training as distinct from the GRU architecture.

When RSSM is also trained with the paper's autoregressive scheme, it matches the GRU-based RWM. The paper still chooses the GRU model for simplicity and computational efficiency. Transformers cannot be trained autoregressively at scale because of GPU memory limits during multi-step gradient propagation.

5.3 Policy learning and hardware deployment

Section 4.4 reports policy training results on ANYmal D and Unitree G1 velocity tracking tasks. MBPO-PPO is compared against:

  • SHAC (Short-Horizon Actor-Critic): a first-order gradient method that backpropagates through the world model.
  • DreamerV3: a latent-space dynamics model with actor-critic.

Figure 5 plots model error and policy reward over training iterations. Findings:

  • MBPO-PPO reduces model error consistently and converges to high reward.
  • SHAC has high and fluctuating model error throughout. The paper attributes this to the discontinuous dynamics of legged locomotion (contact patterns change abruptly), which produces inaccurate gradients for first-order methods. The policy collapses and fails to converge.
  • Dreamer partially converges but lags MBPO-PPO substantially. Its shorter planning horizon limits handling of long-horizon dependencies.

For hardware deployment: SHAC and Dreamer fail to produce a deployable policy due to training collapse. MBPO-PPO produces a policy that deploys zero-shot on both ANYmal D and Unitree G1, tracking velocity commands and maintaining stability under external disturbances and varying terrain.

5.4 Comparison against well-tuned model-free PPO

The paper compares RWM+MBPO-PPO against a well-tuned PPO baseline trained on a high-fidelity simulator (Table 1 in the paper):

Method            State transitions   Total training time   Step inference   Real tracking reward
RWM pretraining   6 M                 50 min                n/a              n/a
MBPO-PPO          n/a                 5 min                 1 ms             0.90 ± 0.04
PPO               250 M               10 min                1 ms             0.90 ± 0.03

The two methods achieve statistically equivalent real tracking rewards. The trade-off is sample efficiency versus end-to-end training time:

  • RWM+MBPO-PPO uses 6 M state transitions for world model pretraining plus an unstated number of real interactions during MBPO-PPO. Total training time is 50 + 5 = 55 minutes.
  • Well-tuned PPO uses 250 M state transitions but trains in 10 minutes total, because those transitions are generated in massively parallel simulation at very high throughput.

The paper is candid: when unlimited near-perfect simulation is available, model-free PPO is competitive in wall-clock time while using vastly more state transitions. The strength of MBRL is in scenarios where simulation is the bottleneck.

5.5 Ablation: history horizon and forecast horizon

Section A.4.1 ablates the choice of \(M\) and \(N\). Figure S8 reports:

  • Left heatmap (prediction error): longer history horizons \(M\) consistently reduce error, with diminishing returns past \(M = 16\). Longer forecast horizons \(N\) also reduce error, more dramatically. At \(N = 1\) (teacher-forcing), prediction errors are an order of magnitude higher than at \(N = 8\).
  • Right heatmap (training time): longer \(N\) values increase training time substantially because of the sequential autoregressive computation. \(N = 1\) is fastest because it parallelizes fully.

The paper selects \(M = 32\), \(N = 8\) as the operating point that balances accuracy and training cost. This justifies the choice that appears in the codebase configuration.

6. Reward formulation

The reward function used by the velocity tracking task on ANYmal D and Unitree G1 is specified in Appendix A.1.2 of the paper. It is a weighted sum of the following terms. The weights \(w_{\cdot}\) are reported in Table S6 of the paper and differ slightly between the two robots; the equations themselves are identical.

Linear velocity tracking (\(x, y\)):

\[ r_{v_{xy}} = w_{v_{xy}} \, e^{-\|c_{xy} - v_{xy}\|_2^2 \,/\, \sigma_{v_{xy}}^2} \]

with \(\sigma_{v_{xy}} = 0.25\), where \(c_{xy}\) is the commanded base linear velocity and \(v_{xy}\) is the current base linear velocity. Exponential shaping rewards small tracking errors heavily and saturates for large errors.

Angular velocity tracking (\(z\)):

\[ r_{\omega_z} = w_{\omega_z} \, e^{-\|c_z - \omega_z\|_2^2 \,/\, \sigma_{\omega_z}^2} \]

with \(\sigma_{\omega_z} = 0.25\), where \(c_z\) is the commanded yaw rate and \(\omega_z\) is the current yaw rate.

Linear velocity penalty (\(z\)):

\[ r_{v_z} = w_{v_z} \, \|v_z\|_2^2 \]

penalizing vertical base motion. The weight \(w_{v_z} = -2.0\) in the paper, so this term is negative.

Angular velocity penalty (\(x, y\)):

\[ r_{\omega_{xy}} = w_{\omega_{xy}} \, \|\omega_{xy}\|_2^2 \]

penalizing roll and pitch rates with \(w_{\omega_{xy}} = -0.05\).

Joint torque penalty:

\[ r_{q\tau} = w_{q\tau} \, \|\tau\|_2^2 \]

penalizing actuator effort with \(w_{q\tau} = -2.5 \times 10^{-5}\).

Joint acceleration penalty:

\[ r_{\ddot{q}} = w_{\ddot{q}} \, \|\ddot{q}\|_2^2 \]

penalizing high joint accelerations with \(w_{\ddot{q}} = -2.5 \times 10^{-7}\).

Action rate penalty:

\[ r_{\dot{a}} = w_{\dot{a}} \, \|a' - a\|_2^2 \]

penalizing rapid changes in commanded actions, where \(a\) is the current action and \(a'\) the previous one, with \(w_{\dot{a}} = -0.01\) on ANYmal D and \(-0.05\) on Unitree G1.

Feet air time bonus:

\[ r_{f_a} = w_{f_a} \, t_{f_a} \]

where \(t_{f_a}\) is the cumulative time the feet spend in the air. This term encourages a stepping gait rather than dragging or shuffling. Weight \(w_{f_a} = 0.5\) on ANYmal D, \(0.0\) on Unitree G1.

Undesired contacts penalty:

\[ r_c = w_c \, c_u \]

where \(c_u\) counts contacts at undesired body parts (e.g., knees), with \(w_c = -1.0\).

Flat orientation penalty:

\[ r_g = w_g \, \|g_{xy}\|_2^2 \]

where \(g_{xy}\) is the projection of gravity onto the robot's \(xy\) plane. The term is zero when the base is upright and grows quadratically with tilt. Weight \(w_g = -5.0\).

Foot clearance bonus:

\[ r_{f_c} = w_{f_c} \, h_{f_c} \]

where \(h_{f_c}\) is the clearance height of the swing feet. Weight \(w_{f_c} = 0.0\) on ANYmal D (effectively disabled) and \(1.0\) on Unitree G1.

Joint deviation penalty:

\[ r_{q_d} = w_{q_d} \, \|q - q_0\|_1 \]

where \(q_0\) is the default joint configuration. The \(L_1\) norm penalizes large deviations from the nominal pose. Weight \(w_{q_d} = 0.0\) on ANYmal D, \(-1.0\) on Unitree G1.

The total task reward is the sum of all terms. The weights are tuned per-robot, which means the policy is shaped slightly differently for ANYmal D versus Unitree G1 even though the underlying objective structure is identical.

These rewards are computed from observable signals (base velocities, joint states, action history) and from privileged contact information that the world model is trained to predict. Inside the MBPO-PPO imagination loop, the rewards are reconstructed from RWM's predicted observations and predicted privileged information rather than queried from the simulator.
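A sketch of a subset of these terms (PyTorch; the tracking weight `w_track` is a placeholder, since the tracking weights live in Table S6 and are not quoted above, and the tensor arguments are illustrative):

```python
import torch

# A few of the Appendix A.1.2 reward terms, using the sigmas and the
# ANYmal D weights quoted above. w_track is a placeholder: the tracking
# weights themselves are not restated on this page.
def reward_terms(cmd_xy, v_xy, v_z, tau, a_cur, a_prev, w_track=1.0):
    r = {}
    r["track_lin_xy"] = w_track * torch.exp(
        -((cmd_xy - v_xy) ** 2).sum() / 0.25**2)              # sigma = 0.25
    r["lin_vel_z"] = -2.0 * v_z**2                            # w_vz = -2.0
    r["torque"] = -2.5e-5 * (tau**2).sum()                    # w_qtau = -2.5e-5
    r["action_rate"] = -0.01 * ((a_cur - a_prev) ** 2).sum()  # w_adot = -0.01
    return r, sum(r.values())
```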

7. Limitations stated by the paper

Section 5 of the paper acknowledges three open issues, all of which are directly relevant to the project's trajectory.

7.1 Underperformance versus tuned model-free PPO on high-fidelity simulators

When near-perfect simulation is available and parallelizable, well-tuned model-free PPO matches or exceeds MBPO-PPO in wall-clock time. The MBRL advantage shows up specifically in the regime where real interactions are expensive or simulation fidelity is limited, which is precisely the deployment regime the paper targets.

7.2 Reliance on simulation pretraining

RWM is pretrained on simulation data before policy optimization begins. The paper notes that training the world model from scratch on online data (only real interactions) is unstable, because an immature policy generates low-value transitions and the model overfits to a narrow distribution. Pretraining stabilizes training but does not eliminate the need for it.

7.3 Online learning on hardware

Real hardware online learning would let the world model and policy continue to refine after deployment. The paper acknowledges this is the long-term goal but identifies several blockers: policies exploit small model errors and produce dangerous behavior, recovery policies are not available for large platforms (ANYmal D, Unitree G1), and privileged information used during training (contact forces) may not be measurable on hardware. The paper explicitly notes that uncertainty-aware world models could be explored as a way to mitigate error exploitation but require additional architectural changes.

The third limitation is the direct motivation for the follow-up paper, which is the subject of a separate analysis.

8. Summary of paper claims

Concretely, the paper's claims and the evidence offered for them:

  1. The dual-autoregressive training scheme produces a world model that predicts robotic dynamics stably over horizons substantially longer than the training forecast horizon.
  2. Trained with autoregressive training, the GRU-based architecture matches or beats RSSM and transformer baselines on prediction error.
  3. MBPO-PPO uses RWM as a learned simulator and produces policies that match well-tuned model-free PPO on the velocity tracking task while using a fraction of the state transitions.
  4. Policies trained inside RWM deploy zero-shot on ANYmal D and Unitree G1 hardware, beating SHAC and DreamerV3 baselines that fail to produce deployable policies.
  5. Ablation supports the choice of \(M = 32\), \(N = 8\) as a balance between accuracy and training cost.

These claims are the targets against which the project's reproduction efforts on lab hardware will be measured. Per-claim verification status is tracked on the Reproduction Status page.