Diffusion Policy (DP) enables robots to learn complex behaviors by imitating expert demonstrations through action diffusion. However, in practical applications, hardware limitations often degrade data quality, while real-time constraints restrict model inference to instantaneous state and scene observations. These limitations seriously reduce the efficacy of learning from expert demonstrations, resulting in failures in object localization, grasp planning, and long-horizon task execution. To address these challenges, we propose Causal Diffusion Policy (CDP), a novel transformer-based diffusion model that enhances action prediction by conditioning on historical action sequences, thereby enabling more coherent and context-aware visuomotor policy learning. To further mitigate the computational cost associated with autoregressive inference, a caching mechanism is also introduced to store attention key-value pairs from previous timesteps, substantially reducing redundant computations during execution. Extensive experiments in both simulated and real-world environments, spanning diverse 2D and 3D manipulation tasks, demonstrate that CDP uniquely leverages historical action sequences to achieve significantly higher accuracy than existing methods. Moreover, even when faced with degraded input observation quality, CDP maintains remarkable precision by reasoning through temporal continuity, which highlights its practical robustness for robotic control under realistic, imperfect conditions.
Causal Action Generation of our CDP. (a) During training, the Historical Actions are combined with the Denoising Targets, and the combined sequence is fed into the Causal Action Generation module, which consists of P blocks, for denoising. The Target Actions serve as the training supervision. Before denoising, the Historical Actions are perturbed with small-scale noise, which helps reduce the accumulation of action prediction errors during inference. (b) The Causal Temporal Attention Mask ensures that each Denoising Target can attend to all Historical Actions.
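As a rough illustration of how such a causal temporal attention mask could be built, the sketch below lays out the sequence as [Historical Actions | Denoising Targets] and lets every target token attend to all history tokens, as the caption describes. The function name build_causal_temporal_mask, the assumption that history tokens attend causally among themselves, and the assumption that target tokens attend to one another are our illustrative choices, not the paper's exact implementation.

```python
import torch

def build_causal_temporal_mask(n_hist: int, n_tgt: int) -> torch.Tensor:
    """Boolean attention mask (True = attend) for a sequence laid out as
    [Historical Actions (n_hist) | Denoising Targets (n_tgt)]."""
    n = n_hist + n_tgt
    mask = torch.zeros(n, n, dtype=torch.bool)

    # Assumption: Historical Actions attend causally (lower-triangular) over the history.
    mask[:n_hist, :n_hist] = torch.tril(torch.ones(n_hist, n_hist, dtype=torch.bool))

    # Each Denoising Target has access to all Historical Actions ...
    mask[n_hist:, :n_hist] = True
    # ... and (assumed here) to the other targets denoised jointly in the same chunk.
    mask[n_hist:, n_hist:] = True
    return mask


# Example: 4 historical action tokens, 2 denoising targets. The boolean mask can
# be passed to torch.nn.functional.scaled_dot_product_attention via attn_mask
# (True entries take part in attention, False entries are masked out).
print(build_causal_temporal_mask(4, 2).int())
```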
Chunk-wise Autoregressive inference of our CDP. (a) The orange and purple blocks denote actions whose Key and Value representations have and have not been cached up to the current step, respectively. The yellow block denotes Gaussian noise. During the AR-k step, we perform denoising while simultaneously computing and storing the Key and Value representations of the Uncached Historical Actions. After denoising, the Target Actions generated in the AR-k step are applied to the environment and serve as the Uncached Historical Actions in the AR-(k+1) step.
(b) During Attention Computation, the Uncached Historical Actions are restricted to attending only to actions within their own chunk (the purple line), whereas the Denoising Targets have access to the entire action sequence (the yellow line).
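A minimal PyTorch-style sketch of one attention layer during a single AR step, under the access pattern described above: Key/Value projections are computed only for the uncached historical actions and the current denoising targets, the uncached history attends only within its own chunk, and the targets attend over the cached history plus the current chunk. The function name ar_step_attention, the cache layout, and the projection modules are illustrative assumptions rather than the paper's implementation.

```python
import torch
import torch.nn.functional as F

def ar_step_attention(q_proj, k_proj, v_proj, kv_cache, hist_new, targets):
    """One attention layer during a single AR-k inference step.

    kv_cache : (keys, values) of shape [T_cached, d] for already-cached history, or None.
    hist_new : [T_new, d] Uncached Historical Actions (executed in the previous AR step).
    targets  : [T_tgt, d] noisy Denoising Targets for the current chunk.
    """
    # Compute Key/Value only for tokens whose representations are not yet cached.
    k_new, v_new = k_proj(hist_new), v_proj(hist_new)
    k_tgt, v_tgt = k_proj(targets), v_proj(targets)

    # Uncached Historical Actions attend only within their own chunk (purple line).
    q_hist = q_proj(hist_new)
    out_hist = F.scaled_dot_product_attention(
        q_hist.unsqueeze(0), k_new.unsqueeze(0), v_new.unsqueeze(0)
    ).squeeze(0)

    # Append the newly computed history Key/Value to the cache; target Key/Value are
    # transient (recomputed at each denoising iteration) and therefore not cached.
    if kv_cache is None:
        k_cached, v_cached = k_new, v_new
    else:
        k_cached = torch.cat([kv_cache[0], k_new], dim=0)
        v_cached = torch.cat([kv_cache[1], v_new], dim=0)

    # Denoising Targets attend to the entire sequence: all cached history plus
    # the current chunk's targets (yellow line).
    q_tgt = q_proj(targets)
    k_all = torch.cat([k_cached, k_tgt], dim=0)
    v_all = torch.cat([v_cached, v_tgt], dim=0)
    out_tgt = F.scaled_dot_product_attention(
        q_tgt.unsqueeze(0), k_all.unsqueeze(0), v_all.unsqueeze(0)
    ).squeeze(0)

    return (k_cached, v_cached), out_hist, out_tgt


# Tiny usage example with random projections and data (dimensions are arbitrary).
d = 16
q_proj, k_proj, v_proj = (torch.nn.Linear(d, d) for _ in range(3))
cache = None
cache, _, out = ar_step_attention(
    q_proj, k_proj, v_proj, cache,
    hist_new=torch.randn(4, d), targets=torch.randn(4, d),
)
print(cache[0].shape, out.shape)  # the cached keys grow with every AR step
```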