# Paper-to-Code Synthesis
This page aligns the RWM paper's claims with their implementation in the upstream codebase, surfaces the discrepancies between the two, and serves as the cross-reference between the Paper Analysis and Implementation Analysis pages.
The two source pages should be treated as authoritative for their respective domains: paper-analysis for what the paper claims, implementation-analysis for what the code does. This page neither restates them nor adds new claims; it maps one onto the other.
## How to read this page
Each row of the synthesis tables uses the mapping convention from the landing page:
- Mapped: paper concept identified in code.
- Partially mapped: identified with documented gaps.
- Discrepancy noted: code diverges from paper in a documented way.
- Not mapped: no code location identified.
The discrepancies are the higher-leverage findings of the analysis. They are summarized in Section 4 of this page after the mapping tables.
## 1. Method components

| Paper concept | Paper section | Code location | Mapping status |
|---|---|---|---|
| World model \(p_\phi\) | §3.1, Eq. 1 | `rsl_rl/modules/system_dynamics.py` (`SystemDynamicsEnsemble`) | Mapped |
| GRU recurrent base | §3.2, Table S7 | `rsl_rl/modules/architectures/rnn.py` | Mapped |
| State prediction head (mean and std) | §3.2 | `rsl_rl/modules/architectures/mlp.py` | Mapped (with discrepancy on residual form, see §4.1) |
| Privileged information prediction (contacts, terminations) | §3.1, Table S3 | Auxiliary heads in `system_dynamics.py` | Mapped |
| Inner autoregression (GRU update across history) | §3.2, Fig. S6 | GRU recurrent forward pass in `rnn.py` | Mapped |
| Outer autoregression (predicted observations fed back over forecast horizon) | §3.2, Fig. 2a | Forecast loop in `system_dynamics.py` `compute_loss(...)` | Mapped |
| Multi-step prediction loss (Eq. 2) | §3.2 | `compute_loss(...)` aggregation in `system_dynamics.py` | Partially mapped (discrepancy on \(L_o\) form, see §4.2) |
| Reparameterization trick | §3.2 | \(\mu_\phi + \sigma_\phi \cdot \epsilon\) in MLP state head | Mapped |
| Forecast decay \(\alpha\) | §3.2, Eq. 2 | No corresponding parameter in config; loss aggregation uses uniform mean over forecast steps | Partially mapped: code's `torch.mean(...)` over forecast steps is equivalent to paper's \(\alpha = 1.0\), but \(\alpha\) is not a configurable hyperparameter in the implementation |
| MBPO-PPO algorithm (Algorithm 1) | §3.3 | `mbpo_on_policy_runner.py` `learn(...)` | Mapped |
| Imagined action selection (Eq. 3) | §3.3 | Policy invocation inside `imagination_step(...)` | Mapped |
| Imagined reward computation | §3.3, §A.1.2 | `_compute_imagination_reward_terms(...)` in `manager_based_mbrl_env.py` | Mapped |
| Replay buffer \(\mathcal{D}\) | §3.3 | `rsl_rl/storage/` | Mapped |
| Autoregressive evaluation under noise injection | §4.1, Fig. 3b | `evaluate_system_dynamics(...)` in `mbpo_ppo.py`, with `system_dynamics_eval_traj_noise_scale = [0.1, 0.2, 0.4, 0.5, 0.8]` | Mapped: the code rolls out 100 trajectories of length 400 and evaluates per-trajectory autoregressive prediction error at the noise scales reported in the paper's figure |
| Imagination rollout (paper: 100 autoregressive steps per iteration) | §3.3, Table S11 | `imagine()` runs `for i in range(num_imagination_steps_per_env)` with `num_imagination_steps_per_env = 24`; 8192 imagination envs in parallel | Discrepancy in scale: code uses 24-step autoregressive horizon, paper reports 100. Documented in §4.3. |
| Policy observation versus system state | Paper notation uses observation broadly | `policy` group, `system_state` group, and the imagination env's policy-observation reconstruction | Partially mapped: the dynamics model predicts `system_state`, while the policy observation is reconstructed from predicted state plus command and previous action |
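As a compact reference for how several of the rows above interact (outer autoregression, reparameterization, residual mean, and the uniform mean over forecast steps), the following is a minimal sketch under assumed names; `model`, `last_obs`, `actions`, and `targets` are illustrative, and this is not the upstream `compute_loss(...)`.

```python
import torch

def forecast_loss_sketch(model, last_obs, actions, targets):
    """Hedged sketch of the outer autoregression over the forecast horizon.

    `model(obs, action)` is assumed to return (mu, sigma) for the next
    normalized observation and to carry its GRU hidden state internally
    (the inner autoregression). Shapes and names are illustrative only.
    """
    obs = last_obs                            # most recent observation of the history window
    step_losses = []
    horizon = targets.shape[1]                # forecast horizon N (8 in the inspected config)
    for n in range(horizon):
        mu, sigma = model(obs, actions[:, n])
        eps = torch.randn_like(sigma)
        obs = obs + mu + sigma * eps          # residual mean + reparameterized sample (see §4.1)
        step_losses.append(torch.mean((obs - targets[:, n]) ** 2))
        # the predicted observation is fed back in: outer autoregression
    # uniform mean over forecast steps, equivalent to a forecast decay of alpha = 1.0
    return torch.stack(step_losses).mean()
```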
## 2. Reward terms
The paper defines a reward structure for the locomotion task in Section A.1.2. The implementation does not redefine these rewards from scratch; it inherits the standard Isaac Lab locomotion reward set from `velocity_env_cfg.RewardsCfg` and applies project-specific modifications. The active code-level reward set is therefore not identical to the paper's full reward table and should be tracked separately.
The 11 reward terms active in the Pretrain-v0 and Finetune-v0 tasks, with their effective weights and source, are:
| Active code reward term | Effective weight | Source | Status |
|---|---|---|---|
| `track_lin_vel_xy_exp` | +1.0 | inherited | mapped |
| `track_ang_vel_z_exp` | +0.5 | inherited | mapped |
| `lin_vel_z_l2` | -2.0 | inherited | mapped |
| `ang_vel_xy_l2` | -0.05 | inherited | mapped |
| `dof_torques_l2` | -2.5e-5 | inherited, overridden by project | mapped |
| `dof_acc_l2` | -2.5e-7 | inherited | mapped |
| `action_rate_l2` | -0.01 | inherited | mapped |
| `feet_air_time` | +0.5 | inherited, overridden by project | mapped |
| `undesired_contacts` | -1.0 | inherited | mapped |
| `flat_orientation_l2` | -5.0 | inherited (default 0.0), overridden by project | mapped |
| `stand_still` | -1.0 | added by project | active in code, paper correspondence to verify |

`dof_pos_limits` is configured but inactive (weight 0.0). The Init-v0 baseline task reverts the three flat-terrain overrides and therefore has different effective weights for `flat_orientation_l2`, `dof_torques_l2`, and `feet_air_time`.
Reward terms listed in the paper that are not in this active code set should be treated as paper-level terms until their code activation is verified.
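For downstream checks, the same active set can be kept as plain data. The mapping below mirrors the table above; `check_reward_parity` is a hypothetical helper for comparing it against a task's actual weights, not upstream code.

```python
# Effective weights of the 11 active reward terms, mirrored from the table above.
EXPECTED_ACTIVE_REWARD_WEIGHTS = {
    "track_lin_vel_xy_exp": 1.0,
    "track_ang_vel_z_exp": 0.5,
    "lin_vel_z_l2": -2.0,
    "ang_vel_xy_l2": -0.05,
    "dof_torques_l2": -2.5e-5,       # inherited, overridden by project
    "dof_acc_l2": -2.5e-7,
    "action_rate_l2": -0.01,
    "feet_air_time": 0.5,            # inherited, overridden by project
    "undesired_contacts": -1.0,
    "flat_orientation_l2": -5.0,     # inherited default 0.0, overridden by project
    "stand_still": -1.0,             # added by project
}

def check_reward_parity(actual_weights: dict) -> list:
    """Return human-readable mismatches between a task's weights and the table above."""
    mismatches = []
    for name, expected in EXPECTED_ACTIVE_REWARD_WEIGHTS.items():
        actual = actual_weights.get(name)
        if actual is None:
            mismatches.append(f"{name}: missing from task config")
        elif abs(actual - expected) > 1e-12:
            mismatches.append(f"{name}: expected {expected}, found {actual}")
    return mismatches
```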
## 3. Training and architecture parameters
| Parameter | Paper value | Inspected code value | Status |
|---|---|---|---|
| History horizon \(M\) | 32 | 32 | matched |
| Forecast horizon \(N\) | 8 | 8 | matched |
| Forecast decay \(\alpha\) | 1.0 | not a configurable parameter; uniform mean over forecast steps | equivalent to \(\alpha = 1.0\) |
| Step time \(\Delta t\) | 0.02 s | 0.02 s | matched |
| GRU hidden size | 256 | 256 | matched |
| GRU layers | 2 | 2 | matched |
| State mean head | 128 | `[128]` | matched at architecture level |
| State log-std head | 128 | `[128]` | matched at architecture level |
| Contact head | 128 | `[128]` | matched at architecture level |
| Termination head | 128 | `[128]` | matched at architecture level |
| Ensemble size | to verify for base RWM | 1 | active code uses single member |
| Policy learning rate | to verify | 1.0e-3 | code value verified, paper comparison pending |
| System-dynamics learning rate | to verify | 1.0e-3 | code value verified, paper comparison pending |
| System-dynamics weight decay | to verify | 0.0 | code value verified, paper comparison pending |
| System-dynamics mini-batches | to verify | 20 | code value verified |
| System-dynamics mini-batch size | to verify | 5000 | code value verified |
| System-dynamics replay buffer size | to verify | 1000 | code value verified |
| Pretrain max iterations | to verify | 2000 | code value verified |
| Finetune imagination envs | 4096 | 8192 | code value verified, paper comparison pending |
| Finetune imagination steps per env | 100 (per iteration in paper) | 24 (per env per iteration in code) | discrepancy in scale, see §4.3 |
| Discount \(\gamma\) | 0.99 | 0.99 | matched |
| PPO clip range \(\epsilon\) | 0.2 | 0.2 | matched |
| PPO mini-batches | to verify | 4 | code value verified |
| PPO learning epochs | to verify | 5 | code value verified |
| GAE \(\lambda\) | to verify | 0.95 | code value verified |
| KL target | to verify | 0.01 | code value verified |
| Value loss coefficient | to verify | 1.0 | code value verified |
| Max grad norm | to verify | 1.0 | code value verified |
| LR schedule | to verify | adaptive | code value verified |
| Entropy coefficient | to verify | 0.005 | code value verified |
| System-dynamics warmup iterations (Pretrain-v0) | to verify | 0 | code value verified |
| System-dynamics warmup iterations (Finetune-v0) | to verify | 500 | code value verified |
| System-dynamics loss weights (state, sequence, bound, kl, extension, contact, termination) | to verify | (1.0, 1.0, 1.0, 0.1, 1.0, 1.0, 1.0) | code value verified |
| System-dynamics eval trajectories | to verify | 100 | code value verified |
| System-dynamics eval trajectory length | to verify | 400 | code value verified |
| System-dynamics eval noise scales | [0.0, 0.1, 0.2, 0.4, 0.5, 0.8] (paper Fig. 3b) | [0.1, 0.2, 0.4, 0.5, 0.8] | code-side covers the nonzero noise levels reported in the paper figure |
Values marked "to verify" require the paper-side check before being confirmed as matched or noted as discrepancies. The paper values that are listed (history horizon, forecast horizon, GRU dimensions, discount, clip range, imagination envs) are taken from the paper analysis page and remain to be reconfirmed against the paper PDF.
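As a quick sanity check on the matched horizon values, the corresponding wall-clock windows follow directly from the table; a small worked computation (all inputs are table values, variable names are illustrative):

```python
# All inputs are values from the table above; nothing is read from the config.
step_dt = 0.02           # s, step time
history_horizon = 32     # M
forecast_horizon = 8     # N

history_window_s = history_horizon * step_dt      # 0.64 s of real history per model input
forecast_window_s = forecast_horizon * step_dt    # 0.16 s of autoregressive forecast in training
imagination_window_s = 24 * step_dt               # 0.48 s imagined per iteration in the code
paper_imagination_window_s = 100 * step_dt        # 2.0 s for the paper's reported 100 steps

print(history_window_s, forecast_window_s, imagination_window_s, paper_imagination_window_s)
```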
## 4. Discrepancies

This section documents the four substantive discrepancies between the paper and the active code path. Each is the basis for a finding worth surfacing to a supervisor.
### 4.1 State mean predicted as a residual
The paper's Eq. 1 specifies the predicted observation distribution \(p_\phi(\cdot \mid \ldots)\) without prescribing whether the mean is predicted directly or as a residual on the most recent state.
The code predicts a residual: the state head outputs a per-step delta \(\mu_\phi\), and the mean of the predicted next observation is formed as \( o_t + \mu_\phi \) in normalized state space. The MLP head learns the per-step delta rather than the absolute next state. With \(\Delta t = 0.02\) s, per-step state changes are small, so residual prediction is numerically more stable than absolute prediction. This is a sensible implementation choice not specified in the paper.
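A minimal sketch of that residual step, assuming generic tensor names (`prev_state_norm`, `delta_mean`, `delta_log_std`) rather than the upstream head's interface:

```python
import torch

def residual_state_sample(prev_state_norm: torch.Tensor,
                          delta_mean: torch.Tensor,
                          delta_log_std: torch.Tensor) -> torch.Tensor:
    """Hedged sketch of the residual form described above; names are assumptions.

    The MLP head outputs the per-step delta (and its log-std); the absolute
    next-state sample is recovered by adding the delta back onto the most
    recent normalized state.
    """
    eps = torch.randn_like(delta_log_std)
    delta = delta_mean + torch.exp(delta_log_std) * eps  # reparameterized delta sample
    return prev_state_norm + delta                       # residual: previous state + delta
```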
Status: Discrepancy noted, but the discrepancy is at the level of "the paper does not specify, the code makes a reasonable choice" rather than "the paper specifies X, the code does Y."
### 4.2 State loss is sampled MSE, not Gaussian NLL

The architecture predicts both a mean \(\mu_\phi\) and a standard deviation \(\sigma_\phi\) for a Gaussian over the next observation. The natural training objective for such an architecture is the Gaussian negative log-likelihood:

\( L_{\text{NLL}} = \tfrac{1}{2}\left[ \frac{(o_{t+1} - \mu_\phi)^2}{\sigma_\phi^2} + \log \sigma_\phi^2 \right] + \text{const.} \)

The paper's Eq. 2 specifies \(L_o\) abstractly as a discrepancy measure and does not commit to either form. The active code path uses sampled MSE:

\( L_o = \left\lVert o_{t+1} - (\mu_\phi + \sigma_\phi \cdot \epsilon) \right\rVert^2, \qquad \epsilon \sim \mathcal{N}(0, I). \)
The two losses are related but not equivalent: under reparameterization, sampled MSE is a Monte Carlo estimator that uses the predicted std for sampling but does not penalize \(\sigma_\phi\) in the same way NLL does. NLL drives the model to produce calibrated uncertainty (predicted std matches actual error magnitude); sampled MSE relies on the bound regularization term to keep std meaningful.
The codebase has a helper that supports both modes (`compute_regression_loss(loss_type="mse" | "gaussian_nll")`); the active call uses `loss_type="mse"`. Switching the configured mode would activate NLL training without further changes.
Under sampled MSE, the predicted standard deviation affects the sampled prediction. In expectation, this penalizes both mean error and large predicted variance, but it is not the same as Gaussian NLL and does not calibrate uncertainty in the same way.
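A minimal sketch of the two objectives for a diagonal Gaussian, using generic tensor names (`mu`, `sigma`, `target`) rather than the upstream `compute_regression_loss` signature:

```python
import torch

def sampled_mse_loss(mu, sigma, target):
    """Sampled-MSE form described above: reparameterized sample, then plain MSE."""
    eps = torch.randn_like(sigma)
    sample = mu + sigma * eps
    return torch.mean((sample - target) ** 2)

def gaussian_nll_loss(mu, sigma, target):
    """Gaussian NLL form: variance-scaled mean error plus a log-variance term."""
    var = sigma ** 2
    return torch.mean(0.5 * ((target - mu) ** 2 / var + torch.log(var)))
```

Per dimension the sampled form has expectation \((\mu_\phi - o_{t+1})^2 + \sigma_\phi^2\), so large predicted variance is penalized, but without the \(\log \sigma_\phi^2\) term that makes NLL pull the predicted std toward the actual error magnitude.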
Status: Discrepancy noted. This is the most substantive of the four.
### 4.3 Imagination autoregressive horizon is shorter than the paper
Table S11 of the paper reports 4096 imagination environments and 100 imagination steps per iteration. The Finetune-v0 runner config uses 8192 imagination environments and 24 autoregressive steps per imagination collection.
The accounting is now resolved: `num_imagination_steps_per_env = 24` is the literal length of the autoregressive rollout in the runner's `imagine()` loop. The policy is rolled forward through the world model for 24 steps per imagination collection, then PPO updates against that collection (5 learning epochs × 4 mini-batches = 20 gradient steps per collection). Imagination is performed once per outer training iteration, so over the full 2000-iteration training the cumulative autoregressive exposure is large, but any single PPO update batch sees trajectories of length 24, not 100.
The paper's reported 100 is therefore a longer per-iteration horizon than the code uses. This is unlikely to be a fundamental contradiction (the paper shows in Figure 3a that RWM remains accurate well past 24 steps, so a 24-step rollout is in the model's reliable range), but it is a real divergence in scale that should be tracked. Within the rollout, environments that hit predicted termination are reset by sampling fresh history windows from the system replay buffer, so the rollout effectively re-grounds in real data whenever the model predicts an episode end.
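A back-of-envelope restatement of this accounting, using only the numbers quoted above (variable names are illustrative, not the runner config fields):

```python
# All numbers are quoted from §4.3 and the parameter table; names are illustrative.
num_imagination_envs = 8192
steps_per_env_per_iteration = 24      # autoregressive horizon of the imagine() loop
ppo_learning_epochs = 5
ppo_mini_batches = 4

imagined_transitions_per_iteration = num_imagination_envs * steps_per_env_per_iteration
ppo_gradient_steps_per_iteration = ppo_learning_epochs * ppo_mini_batches

print(imagined_transitions_per_iteration)  # 196608 imagined transitions per collection
print(ppo_gradient_steps_per_iteration)    # 20 PPO gradient steps against that collection
# Each trajectory inside a PPO update batch is at most 24 steps long, not 100.
```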
Status: Discrepancy noted in scale. The horizons are different (24 in code vs 100 in the paper), but the values are not in direct contradiction; the code uses a shorter horizon than what the paper reports.
### 4.4 Uncertainty hooks present but inactive
The codebase contains the structural hooks for uncertainty-aware behavior:
- `ensemble_size` parameter on `SystemDynamicsEnsemble` (default 1, currently 1). With `ensemble_size > 1`, the implementation instantiates multiple prediction heads sharing the GRU recurrent base, so epistemic uncertainty in this design reflects disagreement across prediction heads rather than across fully independent dynamics networks.
- `uncertainty_penalty_weight` field on the imagination env (default `-0.0`, currently `-0.0`).
- A penalty term in the imagined reward computation: `rewards += uncertainty_penalty_weight * epistemic_uncertainty * step_dt`. With a negative weight this becomes a penalty.
The paper introduces no uncertainty machinery. With `ensemble_size = 1`, epistemic uncertainty is structurally zero; with `uncertainty_penalty_weight = -0.0`, the penalty term contributes nothing either way.
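A minimal sketch of that penalty term as quoted above; the function wrapper and argument names are illustrative, and only the quoted expression comes from the code.

```python
import torch

def apply_uncertainty_penalty(rewards: torch.Tensor,
                              epistemic_uncertainty: torch.Tensor,
                              uncertainty_penalty_weight: float,
                              step_dt: float) -> torch.Tensor:
    """Hedged sketch of the imagined-reward penalty described above.

    With a single ensemble member the epistemic uncertainty (head disagreement)
    is structurally zero, and with the default weight of -0.0 the term adds
    nothing; a negative weight together with ensemble_size > 1 turns it into a
    genuine penalty on uncertain imagined transitions.
    """
    return rewards + uncertainty_penalty_weight * epistemic_uncertainty * step_dt
```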
These hooks are the structural surface for the RWM-U extension (the second paper). The active configuration runs as RWM. Activating the hooks (`ensemble_size > 1`, `uncertainty_penalty_weight < 0`) is the path forward to the RWM-U + MOPO-PPO configuration. Whether the shared-base ensemble design is sufficient for RWM-U, or whether the second paper assumes fully independent ensemble members, is an open paper-to-code question to resolve during the RWM-U deep dive.
Status: Discrepancy / extension hook noted, with the framing that this is not a discrepancy from the RWM paper (which does not specify uncertainty handling) but the visible surface of the RWM-U extension, switched off in the active base configuration.
## 5. Inactive loss components
The system-dynamics loss has seven configured terms. Four are active in the current GRU configuration, three are inactive:
| Term | Status | Reason |
|---|---|---|
| State (sampled MSE) | Active | Default regression mode. |
| Sequence | Inactive | prediction_type = "single" for the GRU path. |
| Bound | Active | Regularizes learned std bounds. |
| KL | Inactive | Only used for RSSM architectures. |
| Extension | Inactive | No system_extension observation group registered. |
| Contact | Active | BCE on thigh and foot contacts. |
| Termination | Active | BCE on base contact. |
The inactive components are scaffolding that supports alternative architectures or task variants. They are not bugs or omissions; they are configuration-dependent code paths.
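A minimal sketch of how such a weighted seven-term aggregation behaves when terms are inactive; the weight values come from the Section 3 table, while the dictionary layout and function name are assumptions, not the upstream structure.

```python
import torch

# Configured weights (state, sequence, bound, kl, extension, contact, termination),
# as listed in the Section 3 table; key names are illustrative.
LOSS_WEIGHTS = {
    "state": 1.0, "sequence": 1.0, "bound": 1.0, "kl": 0.1,
    "extension": 1.0, "contact": 1.0, "termination": 1.0,
}

def total_system_dynamics_loss(term_losses: dict) -> torch.Tensor:
    """Weighted sum of whichever terms are active in the current configuration.

    Inactive terms (sequence, kl, extension for the GRU path) are simply not
    present in `term_losses`, so they contribute nothing to the total.
    """
    total = torch.zeros(())
    for name, value in term_losses.items():
        total = total + LOSS_WEIGHTS[name] * value
    return total
```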
Status: Partially mapped. The seven-term decomposition is a code-level structure not present in the paper. The four active terms together implement the paper's Eq. 2 plus the bound regularization.
## 6. Summary

The active code path implements the paper's RWM method faithfully in structure, with documented discrepancies in the state loss form (sampled MSE rather than Gaussian NLL), the residual mean prediction, and the imagination horizon: the paper's reported 100-step rollout maps to a 24-step per-env autoregressive horizon per iteration in the code (§4.3). The reward equations from Section A.1.2 of the paper are reconstructed term-by-term inside the imagination loop.
The codebase additionally contains uncertainty hooks reserved for the RWM-U extension. They are switched off in the active configuration and produce RWM behavior. Activating them is the planned path to RWM-U + MOPO-PPO; per-claim status of that activation is tracked on the Reproduction Status page.