ReGuide: From Test-Time Guidance to Self-Improving Diffusion Policies

Tzu-Hsiang Lin, Srinivas Shakkottai, Dileep Kalathil, P. R. Kumar
Texas A&M University

ReGuide turns test-time guidance into reusable on-policy recovery data: it constructs phase targets, rolls out the policy with phase-conditioned guidance, and merges successful guided rollouts back into training — forming an iterative rollout–collect–train loop.

Abstract

Behavior-cloned diffusion policies are expressive but remain vulnerable to covariate shift: small deviations from demonstrated states can compound into task failure. Existing methods address this either by expanding the training distribution through expert corrections or synthetic augmentation, or by steering a frozen policy at test time with guidance from a learned model. The former can be expensive or assumption-dependent, while the latter discards the corrected trajectories after execution.

We introduce ReGuide, a self-improving framework that treats guided rollouts as reusable on-policy recovery data. ReGuide first uses Phase-Conditioned Guidance (PCG) to generate corrective rollouts: it constructs phase-specific latent targets, applies guidance only in the drifted-but-recoverable regime, and guides through the estimated clean action to match the dynamics model's training distribution. Successful guided rollouts are then absorbed back into the policy through ReGuide-FT, which fine-tunes the current checkpoint, or ReGuide-FS, which retrains from scratch on the augmented dataset; the two can also be composed and iterated.

On Robomimic Can, Square, Transport, and Tool Hang, ReGuide improves base-policy success by 1.3–7.7× and outperforms LPB in the test-time-only setting. Matched-data ablations show that the gains come from guided recovery data rather than from additional rollouts alone.

Method

ReGuide turns guidance into a data-generation mechanism for a behavior-cloned diffusion policy without collecting additional expert demonstrations. It consists of three coupled components.

1. Phase-Aware Target Construction

Long-horizon manipulation is phase-structured, and a single global target set can pull a rollout toward the wrong stage or collapse valid behavior modes. ReGuide augments each state with temporal differences [v, p, Δv, Δp], clusters states after PCA, orders clusters by mean timestep, and merges them into macro-phases. Each phase keeps multiple representative centroids as a target set, and a soft-minimum distance preserves multimodality while remaining differentiable.
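The ordering and merging steps can be sketched as follows, assuming cluster labels and per-state timesteps are already available (for example from k-means on the PCA features). The function names are hypothetical, and the merge rule used here (contiguous, near-equal groups of time-ordered clusters) is an assumption; the source does not specify how clusters are merged.

```python
from collections import defaultdict


def order_clusters_by_time(labels, timesteps):
    """Return cluster ids sorted by the mean timestep of their member states."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for cluster, t in zip(labels, timesteps):
        sums[cluster] += t
        counts[cluster] += 1
    return sorted(counts, key=lambda c: sums[c] / counts[c])


def merge_into_macro_phases(ordered_clusters, n_phases):
    """Map each cluster id to a macro-phase index.

    ASSUMED rule: split the time-ordered clusters into n_phases contiguous,
    near-equal groups. The source does not state the actual merge criterion.
    """
    k = len(ordered_clusters)
    return {c: i * n_phases // k for i, c in enumerate(ordered_clusters)}


# Toy example: four clusters whose members occur at different times.
labels = [2, 2, 0, 0, 1, 1, 3, 3]
timesteps = [0, 1, 10, 11, 5, 6, 20, 21]
ordered = order_clusters_by_time(labels, timesteps)
phase_of = merge_into_macro_phases(ordered, n_phases=2)
```

Ordering by mean timestep is what lets a later stage of the task be distinguished from an earlier one even when the two look similar in latent space.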

Phase-aware target construction. Latent states are clustered using temporally augmented features, ordered by trajectory time, and grouped into macro-phases. Representative centroids from each phase define the target sets used for phase-conditioned guidance.
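A soft-minimum distance to a phase's centroid set can be sketched as a negative log-sum-exp over per-centroid distances. This is an illustration only: the squared Euclidean distance and the temperature `tau` are assumptions, since the source does not give the exact form.

```python
import math


def soft_min_distance(z, centroids, tau=0.1):
    """Smooth minimum of squared distances from z to a set of centroids.

    Computes -tau * log(sum_m exp(-d_m / tau)), evaluated stably by factoring
    out the smallest distance. ASSUMPTIONS: squared Euclidean distance and the
    temperature tau are illustrative choices, not taken from the source.
    """
    dists = [sum((zi - ci) ** 2 for zi, ci in zip(z, c)) for c in centroids]
    d_min = min(dists)
    log_sum = math.log(sum(math.exp(-(d - d_min) / tau) for d in dists))
    return d_min - tau * log_sum
```

The value never exceeds the hard minimum and approaches it as `tau` shrinks. Unlike a hard minimum, every nearby centroid contributes to the gradient, which is one way to keep several valid modes in play.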


2. Phase-Conditioned Guidance (PCG)

Dynamics gradients are reliable only near the data distribution. ReGuide guides through the estimated clean action (adapting MPGD) to match the dynamics model's training distribution, and applies a two-threshold gate: guidance is active only in the drifted-but-recoverable regime. The lower threshold avoids perturbing already-correct actions near the demonstration manifold; the upper threshold prevents trusting the dynamics model where it extrapolates. Thresholds are phase-specific percentiles, so the gate adapts to the local geometry of each stage.
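A sketch of what guiding through the estimated clean action can look like, written in standard DDPM notation. The symbols below ($\bar{\alpha}_t$, $\epsilon_\theta$, $f_\phi$, $\eta$, $g$, $d_{\mathrm{soft}}$, $\mathcal{C}_k$) are assumptions introduced for illustration, not notation from the source:

$$
\hat{a}_0 = \frac{a_t - \sqrt{1-\bar{\alpha}_t}\,\epsilon_\theta(a_t, t, o)}{\sqrt{\bar{\alpha}_t}},
\qquad
\hat{a}_0 \leftarrow \hat{a}_0 - \eta\, g\, \nabla_{\hat{a}_0}\, d_{\mathrm{soft}}\big(f_\phi(z, \hat{a}_0),\ \mathcal{C}_k\big)
$$

Here $a_t$ is the noisy action iterate at denoising step $t$, $o$ is the observation, $z$ is the current latent state, $f_\phi$ is the dynamics model, $\mathcal{C}_k$ is the target set of the current phase $k$, $g \in \{0, 1\}$ is the two-threshold gate, and $\eta$ is a step size. The gradient is taken with respect to $\hat{a}_0$ rather than $a_t$, so the dynamics model is only queried on denoised actions of the kind it saw during training.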

Three regimes. Guidance is applied only between the lower and upper distance thresholds — off near the demonstrations (in-distribution) and off in the far extrapolation region.
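The gate can be sketched as a per-phase interval test. In this illustration the thresholds are percentiles of distances measured on demonstration states in each phase; that choice, the percentile values `q_low` and `q_high`, and all function names are assumptions, since the source only says that thresholds are phase-specific percentiles.

```python
def percentile(values, q):
    """Linear-interpolation percentile of a non-empty list, with q in [0, 100]."""
    xs = sorted(values)
    pos = (len(xs) - 1) * q / 100.0
    lo = int(pos)
    hi = min(lo + 1, len(xs) - 1)
    return xs[lo] + (xs[hi] - xs[lo]) * (pos - lo)


def fit_phase_thresholds(dists_by_phase, q_low=50.0, q_high=95.0):
    """Return {phase: (lower, upper)} thresholds.

    ASSUMPTIONS: dists_by_phase holds distances of demonstration states to
    their phase targets, and q_low / q_high are illustrative percentiles.
    """
    return {
        phase: (percentile(d, q_low), percentile(d, q_high))
        for phase, d in dists_by_phase.items()
    }


def guidance_active(dist, phase, thresholds):
    """True only in the drifted-but-recoverable band for this phase."""
    lower, upper = thresholds[phase]
    return lower < dist < upper
```

Because each phase has its own pair of thresholds, a distance that counts as drift in a tightly clustered phase can still count as in-distribution in a more spread-out one.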


3. Iterative Self-Improvement

Successful guided rollouts are filtered by a trajectory-level success signal and merged into the training set. ReGuide absorbs them in two complementary ways, which can also be composed:

  • ReGuide-FT — fine-tunes the current checkpoint on a rehearsal mixture of demonstrations and guided rollouts. Cheap, and preserves existing competence.
  • ReGuide-FS — retrains a fresh policy from scratch on the augmented dataset, escaping the base policy's local minimum at the cost of a full training run.
  • ReGuide-FS→FT — applies ReGuide-FT on top of a ReGuide-FS checkpoint, combining the strengths of both.

The updated policy then generates a new batch of higher-quality recovery data, yielding an iterative rollout–collect–train loop that improves the policy without additional expert demonstrations.
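The loop can be sketched with the environment and trainer passed in as callables. Everything here is hypothetical scaffolding: `guided_rollout` is assumed to return a `(trajectory, success)` pair, `train` is assumed to return the updated policy, and the cumulative dataset is an assumption about how rollouts are merged.

```python
def self_improve(policy, demos, guided_rollout, train, n_iters=2, n_rollouts=100):
    """Rollout-collect-train loop (illustrative sketch, not the authors' code).

    guided_rollout(policy) -> (trajectory, success_flag)   # hypothetical
    train(policy, dataset) -> new_policy                   # hypothetical
    """
    dataset = list(demos)
    for _ in range(n_iters):
        # Roll out the current policy with guidance switched on.
        rollouts = [guided_rollout(policy) for _ in range(n_rollouts)]
        # Keep only trajectories flagged as successful at the trajectory level.
        dataset.extend(traj for traj, ok in rollouts if ok)
        # Fine-tuning would continue from `policy`; training from scratch
        # would ignore it and fit a fresh model on `dataset`.
        policy = train(policy, dataset)
    return policy, dataset
```

Under this framing, ReGuide-FT and ReGuide-FS differ only in what `train` does with its first argument.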

Results

On four Robomimic tasks in a low-data regime, all ReGuide variants substantially improve over the base diffusion policy. The composition ReGuide-FS→FT gives the best result on Can, Square, and Transport, while iterated ReGuide-FT remains slightly stronger on Tool Hang.

Main results. ReGuide improves base-policy success by 1.3–7.7× across tasks. ReGuide-FT and ReGuide-FS are complementary; their composition is strongest on Can, Square, and Transport. Error bars show the standard error of the mean across 50 per-seed success rates.
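For reference, the standard error of the mean over per-seed success rates is the sample standard deviation divided by the square root of the number of seeds. A minimal sketch, with a hypothetical function name:

```python
import math
import statistics


def mean_and_sem(success_rates):
    """Mean and standard error of the mean over per-seed success rates."""
    n = len(success_rates)
    mean = statistics.fmean(success_rates)
    # statistics.stdev uses the sample (n - 1) denominator.
    sem = statistics.stdev(success_rates) / math.sqrt(n)
    return mean, sem
```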


Success rates across Robomimic tasks. PCG, ReGuide-FT, ReGuide-FS, and ReGuide-FS→FT all outperform the base policy and LPB. Best is bold, second-best is underlined.

Task       Base Policy  LPB    PCG    ReGuide-FT (it.1)  ReGuide-FT (it.2)  ReGuide-FS  ReGuide-FS→FT
Can 0.492 0.501 0.601 0.624 0.676 0.723
Square     0.563        0.561  0.584  0.659              0.712              0.714       0.734
Transport  0.466        0.484  0.514  0.646              0.695              0.682       0.706
Tool Hang  0.031        0.031  0.038  0.159              0.240              0.166       0.233
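The quoted 1.3–7.7× range can be checked against the three rows that list all seven values, by dividing the best ReGuide entry by the base-policy entry. The helper name is hypothetical.

```python
# Columns: Base Policy, LPB, PCG, FT (it.1), FT (it.2), FS, FS->FT
rows = {
    "Square": [0.563, 0.561, 0.584, 0.659, 0.712, 0.714, 0.734],
    "Transport": [0.466, 0.484, 0.514, 0.646, 0.695, 0.682, 0.706],
    "Tool Hang": [0.031, 0.031, 0.038, 0.159, 0.240, 0.166, 0.233],
}


def fold_improvement(row):
    """Best ReGuide entry (PCG onward) divided by the base-policy entry."""
    base = row[0]
    best = max(row[2:])
    return best / base


folds = {task: round(fold_improvement(r), 1) for task, r in rows.items()}
```

Square gives the lower end of the range (0.734 / 0.563) and Tool Hang the upper end (0.240 / 0.031).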

Iterative self-improvement

The second iteration of ReGuide-FT improves over the first on all four tasks, showing that the updated policy can generate useful new guided rollouts. Gains are not unbounded: reaching the second-iteration peak requires more total guided rollouts than reaching the first-iteration peak, reflecting diminishing returns as the policy improves and the fixed dynamics model becomes less aligned with it.

Iterative self-improvement. The x-axis shows the cumulative number of guided rollouts collected to update the policy; iteration 2 breaks the single-round ceiling on every task.

Ablations

We isolate the main design choices in ReGuide. Test-time guidance ablations are run on Transport; data-absorption ablations on Can and Square.

Guided rollouts vs. additional data

A natural concern is that ReGuide's gains come from extra data rather than guidance specifically. At matched rollout counts, guided rollouts beat unguided base-policy rollouts on every task under both ReGuide-FS and ReGuide-FT — ruling out the "just more data" explanation.

Guided vs. unguided rollouts at matched dataset size. Guided rollouts (orange) dominate unguided rollouts (gray) at every count, across tasks and both absorption modes.


Guidance design (test-time, Transport)

Three choices drive PCG's test-time gains: an intermediate number of phase targets (preserving multimodality without diluting the signal), the two-threshold gate (the upper threshold contributes most of the benefit), and differentiating through the estimated clean action rather than the noisy iterate.

Phase targets $M$. An intermediate $M\approx50$ is best.

Gating. The full lower+upper gate wins.

Guidance target. Clean-action guidance wins.


Absorbing guided data

The buffer-share ratio ρ in ReGuide-FT is a mild stability/plasticity knob (roughly flat around its optimum). At the composition step, ReGuide-FT and ReGuide-FS reach statistically comparable performance — making ReGuide-FT the better engineering choice for further iterations, since it converges in far fewer steps from an already-competent checkpoint.

Buffer-share ratio ρ. Performance is roughly flat; ρ ≈ 0.7 is best on Can.
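A rehearsal mixture with a buffer-share ratio can be sketched as a fixed split of each training batch. Note the assumption: here `rho` is read as the fraction of each batch drawn from the guided-rollout buffer, with the rest drawn from demonstrations. The source does not say which side of the mixture ρ refers to, and the function name is hypothetical.

```python
import random


def sample_rehearsal_batch(demos, guided, batch_size, rho, rng=random):
    """Draw one batch mixing guided rollouts and demonstrations.

    ASSUMPTION: rho is the share of the batch taken from the guided-rollout
    buffer; the remaining (1 - rho) share comes from demonstrations.
    """
    n_guided = round(rho * batch_size)
    batch = [rng.choice(guided) for _ in range(n_guided)]
    batch += [rng.choice(demos) for _ in range(batch_size - n_guided)]
    rng.shuffle(batch)
    return batch
```

Under this reading, a larger `rho` favors plasticity (more weight on new recovery data) and a smaller one favors stability (more weight on the original demonstrations).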

ReGuide-FT vs. ReGuide-FS. Comparable success rates; FT is cheaper.

BibTeX

@misc{lin2026reguide,
  author    = {Lin, Tzu-Hsiang and Shakkottai, Srinivas and Kalathil, Dileep and Kumar, P. R.},
  title     = {ReGuide: From Test-Time Guidance to Self-Improving Diffusion Policies},
  year      = {2026},
}