Diffusion Policy (DP) enables robots to learn complex behaviors by imitating expert demonstrations through action diffusion. However, in practical applications, hardware limitations often degrade data quality, while real-time constraints restrict model inference to instantaneous state and scene observations. These limitations seriously reduce the efficacy of learning from expert demonstrations, resulting in failures in object localization, grasp planning, and long-horizon task execution. To address these challenges, we propose Causal Diffusion Policy (CDP), a novel transformer-based diffusion model that enhances action prediction by conditioning on historical action sequences, thereby enabling more coherent and context-aware visuomotor policy learning. To further mitigate the computational cost associated with autoregressive inference, a caching mechanism is also introduced to store attention key-value pairs from previous timesteps, substantially reducing redundant computations during execution. Extensive experiments in both simulated and real-world environments, spanning diverse 2D and 3D manipulation tasks, demonstrate that CDP uniquely leverages historical action sequences to achieve significantly higher accuracy than existing methods. Moreover, even when faced with degraded input observation quality, CDP maintains remarkable precision by reasoning through temporal continuity, which highlights its practical robustness for robotic control under realistic, imperfect conditions.
Causal Action Generation of our CDP. (a) During training, the Historical Actions are combined with the Denoising Targets, and the combined sequence is fed into the Causal Action Generation module, which consists of P blocks, for denoising. The Target Actions serve as the training supervision. Before denoising, the Historical Actions are perturbed with small-scale noise, which helps reduce the accumulation of action prediction errors during inference. (b) The Causal Temporal Attention Mask ensures that each Denoising Target can attend to all Historical Actions.
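As a rough illustration of how such a causal temporal attention mask could be built, the sketch below lays out the sequence as [Historical Actions | Denoising Targets] and lets every target token attend to all history tokens, as the caption describes. The function name build_causal_temporal_mask, the assumption that history tokens attend causally among themselves, and the assumption that target tokens attend to one another are our illustrative choices, not the paper's exact implementation.

```python
import torch

def build_causal_temporal_mask(n_hist: int, n_tgt: int) -> torch.Tensor:
    """Boolean attention mask (True = attend) for a sequence laid out as
    [Historical Actions (n_hist) | Denoising Targets (n_tgt)]."""
    n = n_hist + n_tgt
    mask = torch.zeros(n, n, dtype=torch.bool)

    # Assumption: Historical Actions attend causally (lower-triangular) over the history.
    mask[:n_hist, :n_hist] = torch.tril(torch.ones(n_hist, n_hist, dtype=torch.bool))

    # Each Denoising Target has access to all Historical Actions ...
    mask[n_hist:, :n_hist] = True
    # ... and (assumed here) to the other targets denoised jointly in the same chunk.
    mask[n_hist:, n_hist:] = True
    return mask


# Example: 4 historical action tokens, 2 denoising targets. The boolean mask can
# be passed to torch.nn.functional.scaled_dot_product_attention via attn_mask
# (True entries take part in attention, False entries are masked out).
print(build_causal_temporal_mask(4, 2).int())
```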
Chunk-wise Autoregressive inference of our CDP. (a) The orange and purple blocks denote actions whose Key and Value representations have and have not been cached up to the current step, respectively. The yellow block denotes Gaussian noise. During the AR-k step, we perform denoising while simultaneously computing and storing the Key and Value representations of the Uncached Historical Actions. After denoising, the Target Actions generated in the AR-k step are applied to the environment and serve as the Uncached Historical Actions in the AR-(k+1) step.
(b) During Attention Computation, the Uncached Historical Actions are restricted to attending only to actions within their own chunk (the purple line), whereas the Denoising Targets have access to the entire action sequence (the yellow line).
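A minimal PyTorch-style sketch of one attention layer during a single AR step, under the access pattern described above: Key/Value projections are computed only for the uncached historical actions and the current denoising targets, the uncached history attends only within its own chunk, and the targets attend over the cached history plus the current chunk. The function name ar_step_attention, the cache layout, and the projection modules are illustrative assumptions rather than the paper's implementation.

```python
import torch
import torch.nn.functional as F

def ar_step_attention(q_proj, k_proj, v_proj, kv_cache, hist_new, targets):
    """One attention layer during a single AR-k inference step.

    kv_cache : (keys, values) of shape [T_cached, d] for already-cached history, or None.
    hist_new : [T_new, d] Uncached Historical Actions (executed in the previous AR step).
    targets  : [T_tgt, d] noisy Denoising Targets for the current chunk.
    """
    # Compute Key/Value only for tokens whose representations are not yet cached.
    k_new, v_new = k_proj(hist_new), v_proj(hist_new)
    k_tgt, v_tgt = k_proj(targets), v_proj(targets)

    # Uncached Historical Actions attend only within their own chunk (purple line).
    q_hist = q_proj(hist_new)
    out_hist = F.scaled_dot_product_attention(
        q_hist.unsqueeze(0), k_new.unsqueeze(0), v_new.unsqueeze(0)
    ).squeeze(0)

    # Append the newly computed history Key/Value to the cache; target Key/Value are
    # transient (recomputed at each denoising iteration) and therefore not cached.
    if kv_cache is None:
        k_cached, v_cached = k_new, v_new
    else:
        k_cached = torch.cat([kv_cache[0], k_new], dim=0)
        v_cached = torch.cat([kv_cache[1], v_new], dim=0)

    # Denoising Targets attend to the entire sequence: all cached history plus
    # the current chunk's targets (yellow line).
    q_tgt = q_proj(targets)
    k_all = torch.cat([k_cached, k_tgt], dim=0)
    v_all = torch.cat([v_cached, v_tgt], dim=0)
    out_tgt = F.scaled_dot_product_attention(
        q_tgt.unsqueeze(0), k_all.unsqueeze(0), v_all.unsqueeze(0)
    ).squeeze(0)

    return (k_cached, v_cached), out_hist, out_tgt


# Tiny usage example with random projections and data (dimensions are arbitrary).
d = 16
q_proj, k_proj, v_proj = (torch.nn.Linear(d, d) for _ in range(3))
cache = None
cache, _, out = ar_step_attention(
    q_proj, k_proj, v_proj, cache,
    hist_new=torch.randn(4, d), targets=torch.randn(4, d),
)
print(cache[0].shape, out.shape)  # the cached keys grow with every AR step
```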