CDP: Towards Robust Autoregressive Visuomotor Policy Learning via Causal Diffusion

CoRL 2025

Jiahua Ma1*, Yiran Qin2*†, Yixiong Li1, Xuanqi Liao1, Yulan Guo1, Ruimao Zhang1✉
1Sun Yat-sen University; 2Oxford University;
*Equal Contribution
†Project Leader    ✉Corresponding Author

Motivation


Causal Diffusion Policy: a transformer-based diffusion model that enhances action prediction by conditioning on historical action sequences. (A): When the robot performs the task of “grabbing the barrier” in practice, (B): the quality of its observations is degraded by factors such as sensor noise, occlusions, and hardware limitations. Such degraded yet high-dimensional observations not only fail to provide sufficient spatial constraints for policy planning but also slow planning down. (C): As a result, the robot cannot perform accurate manipulation. (D): In this paper, we leverage historical action sequences to introduce temporally rich context as a supplement, which enables more robust policy generation.

Real World Demos

Collecting Objects

Push T

Abstract

Diffusion Policy (DP) enables robots to learn complex behaviors by imitating expert demonstrations through action diffusion. However, in practical applications, hardware limitations often degrade data quality, while real-time constraints restrict model inference to instantaneous state and scene observations. These limitations seriously reduce the efficacy of learning from expert demonstrations, resulting in failures in object localization, grasp planning, and long-horizon task execution. To address these challenges, we propose Causal Diffusion Policy (CDP), a novel transformer-based diffusion model that enhances action prediction by conditioning on historical action sequences, thereby enabling more coherent and context-aware visuomotor policy learning. To further mitigate the computational cost associated with autoregressive inference, a caching mechanism is also introduced to store attention key-value pairs from previous timesteps, substantially reducing redundant computations during execution. Extensive experiments in both simulated and real-world environments, spanning diverse 2D and 3D manipulation tasks, demonstrate that CDP uniquely leverages historical action sequences to achieve significantly higher accuracy than existing methods. Moreover, even when faced with degraded input observation quality, CDP maintains remarkable precision by reasoning through temporal continuity, which highlights its practical robustness for robotic control under realistic, imperfect conditions.
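
To make the conditioning scheme concrete, the following is a minimal PyTorch sketch, not the released implementation: a transformer denoiser predicts the noise on the current action chunk while attending causally to an embedded history of past actions. Every module and parameter name here (CausalDiffusionPolicy, action_embed, n_timesteps, and so on) is an illustrative assumption.

import torch
import torch.nn as nn

class CausalDiffusionPolicy(nn.Module):
    """Denoises an action chunk conditioned on the observation and past actions."""
    def __init__(self, action_dim=7, obs_dim=512, d_model=256,
                 n_heads=8, n_layers=6, n_timesteps=1000):
        super().__init__()
        self.action_embed = nn.Linear(action_dim, d_model)  # shared for history + noisy chunk
        self.obs_embed = nn.Linear(obs_dim, d_model)        # current (possibly degraded) observation
        self.time_embed = nn.Embedding(n_timesteps, d_model)
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, n_layers)
        self.head = nn.Linear(d_model, action_dim)          # predicts the injected noise

    def forward(self, noisy_actions, t, obs, action_history):
        # noisy_actions: (B, H, A)  t: (B,)  obs: (B, obs_dim)  action_history: (B, K, A)
        hist = self.action_embed(action_history)            # temporally rich context tokens
        cond = self.obs_embed(obs).unsqueeze(1) + self.time_embed(t).unsqueeze(1)
        x = self.action_embed(noisy_actions)
        tokens = torch.cat([hist, cond, x], dim=1)          # [history | condition | chunk]
        # Causal mask: each token attends only to itself and earlier tokens, so the
        # noisy chunk conditions on the history while the history stays untouched.
        L = tokens.shape[1]
        mask = torch.triu(torch.ones(L, L, dtype=torch.bool, device=tokens.device), 1)
        out = self.backbone(tokens, mask=mask)
        return self.head(out[:, -noisy_actions.shape[1]:])  # per-step noise prediction

Placing the history tokens before the condition and the noisy chunk means the causal mask alone enforces that context flows only forward in time.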

Method Overview

Quantitative Results of Simulation Experiments


Simulated Demos


Comprehensive visualization of simulated experimental results on the RoboFactory, DexArt, MetaWorld, and Adroit benchmarks, evaluating model efficacy in gripper and dexterous manipulation, articulated and rigid object handling, and single- and multi-agent settings.

Ablation Studies


Conclusion

Our proposed CDP effectively counteracts the impact of observation-quality degradation (sensor noise, occlusions, and hardware limitations) on reliable robotic manipulation. Built upon a causal-transformer diffusion framework, the method captures critical temporal dependencies, compensating for the spatial information that degraded observations fail to provide and preserving task performance. Extensive evaluations demonstrate that CDP consistently outperforms existing approaches, attesting to its robustness and effectiveness. To facilitate real-time inference, we further introduce a Cache Sharing Mechanism that reuses pre-computed key-value attention tensors, yielding substantial reductions in the computational burden of autoregressive action generation without compromising accuracy.
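
To illustrate how such a cache can work, here is a minimal self-attention sketch under assumptions of our own (CachedSelfAttention and all other names are hypothetical, not the paper's API): the key/value projections of already-processed tokens are stored once and reused, so each autoregressive step projects only the newly generated actions.

import torch
import torch.nn as nn
import torch.nn.functional as F

class CachedSelfAttention(nn.Module):
    """Self-attention that reuses K/V projections computed at earlier steps."""
    def __init__(self, d_model=256, n_heads=8):
        super().__init__()
        self.n_heads, self.d_head = n_heads, d_model // n_heads
        self.qkv = nn.Linear(d_model, 3 * d_model)
        self.out = nn.Linear(d_model, d_model)

    def forward(self, x_new, cache=None):
        # x_new: (B, T_new, D) tokens produced this step; cache: (k, v) from prior steps
        B, T, _ = x_new.shape
        q, k, v = self.qkv(x_new).chunk(3, dim=-1)
        q, k, v = [s.view(B, T, self.n_heads, self.d_head).transpose(1, 2)
                   for s in (q, k, v)]
        if cache is not None:                        # reuse pre-computed history K/V
            k = torch.cat([cache[0], k], dim=2)
            v = torch.cat([cache[1], v], dim=2)
        # New queries attend over cached + new keys; cross-step causality holds
        # because cached tokens precede every new token (intra-chunk masking is
        # omitted here for brevity).
        y = F.scaled_dot_product_attention(q, k, v)
        y = y.transpose(1, 2).reshape(B, T, -1)
        return self.out(y), (k, v)                   # pass the grown cache to the next step

Since the cached keys and values for past tokens never change, the same (k, v) pair can plausibly be shared across all denoising iterations of a step as well, which is where most of the redundant computation is avoided.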

BibTeX


@article{ma2025cdp,
  title={CDP: Towards Robust Autoregressive Visuomotor Policy Learning via Causal Diffusion},
  author={Ma, Jiahua and Qin, Yiran and Li, Yixiong and Liao, Xuanqi and Guo, Yulan and Zhang, Ruimao},
  journal={arXiv preprint arXiv:2506.14769},
  year={2025}
}