Denoising Reuse: Exploiting Inter-frame Motion Consistency in Efficient Video Latent Generation

Chenyu Wang*, Shuo Yan*, Yixuan Chen*, Xianwei Wang, Yujiang Wang, Mingzhi Dong, Xiaochen Yang, Dongsheng Li, Rui Zhu, David Clifton, Robert P. Dick, Qin Lv, Fan Yang, Tun Lu, Ning Gu, Li Shang

*Indicates Equal Contribution

Video Generation Results


Prompts: A person riding a horse; A big beautiful mountain with a waterfall, a long view; A hot air balloon takes to the sky; A boy playing guitar; A minion waved his hand and the UFO flew over; In the universe, the earth revolves around the sun; Red car running, a close-up video; A sailboat sailing on the sea; PeppaPig, Cartoon style, two pig running on the grassland; The plane went through the white clouds.

Abstract

Denoising-based diffusion models have achieved impressive image synthesis quality; however, applying them to videos incurs prohibitive computational costs due to per-frame denoising operations. In pursuit of efficient video generation, we present the Diffusion Reuse MOtion (Dr. Mo) network to accelerate the video denoising process. Our key observation is that the latent representations at early denoising steps of adjacent video frames are highly consistent, differing mainly by inter-frame motion. Motivated by this observation, we accelerate video denoising by incorporating lightweight, learnable motion features. Specifically, Dr. Mo computes all denoising steps only for base frames. For a non-base frame, Dr. Mo propagates the pre-computed base latent at a particular step with inter-frame motions to quickly estimate its coarse-grained latent representation, from which denoising continues to recover frame-specific, fine-grained representations. On top of this, Dr. Mo employs a meta-network, the Denoising Step Selector (DSS), to dynamically determine the step at which motion-based propagation is performed for each frame, trading off quality against efficiency. Extensive evaluations on video generation and editing tasks show that Dr. Mo delivers widely applicable acceleration for diffusion-based video generation while effectively preserving visual quality and style.

Method Overview


Dr. Mo consists of two main components: the Motion Transformation Network (MTN) and the Denoising Step Selector (DSS). The MTN learns motion matrices from semantic latents extracted from the U-Net and predicts motion matrices for future frames. The DSS is a meta-network that determines the appropriate transition step (denoted \( t^* \)) for switching from motion-based propagation to denoising. After the transition step, the propagated noisy latents are processed by the remaining denoising steps of the diffusion model to generate the video.
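To make the pipeline concrete, below is a minimal sketch of the denoising-reuse loop under this design, assuming the diffusion sampler and the Dr. Mo components are supplied as callables. All names here (full_denoise, denoise_from, mtn, dss, apply_motion) are illustrative placeholders, not the released implementation.

```python
import torch

def generate_video(full_denoise, denoise_from, mtn, dss, apply_motion, prompts):
    """Denoise the base frame fully, then reuse its cached latents for the
    remaining frames via motion-based propagation.

    full_denoise(prompt)            -> list of intermediate latents, one per step
    denoise_from(latent, t, prompt) -> fully denoised latent for that frame
    """
    # Base frame: run every denoising step and cache the intermediate latents.
    base_latents = full_denoise(prompts[0])
    frames = [base_latents[-1]]
    for prompt in prompts[1:]:
        # MTN predicts inter-frame motion matrices from the semantic U-Net latents.
        motion = mtn(base_latents, prompt)
        # DSS picks the transition step t*, trading quality against efficiency.
        t_star = dss(motion)
        # Warp the cached base latent at step t* into a coarse latent estimate.
        coarse = apply_motion(base_latents[t_star], motion)
        # Run only the remaining denoising steps for this frame.
        frames.append(denoise_from(coarse, t_star, prompt))
    return torch.stack(frames)
```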


The DSS module consists of a bidirectional RNN and an MLP, designed to predict the optimal transition timestep \( \hat{t} \) from the input motion matrices at different timesteps. The model is trained with a cross-entropy loss between \( \hat{t} \) and the ground-truth timestep \( t^* \).
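For illustration, a DSS-style selector could be sketched as follows. The GRU recurrence, feature dimensions, and per-step motion feature vectors are assumptions made for this example rather than the paper's exact architecture.

```python
import torch
import torch.nn as nn

class DenoisingStepSelector(nn.Module):
    """Bidirectional RNN + MLP that scores each candidate transition timestep."""

    def __init__(self, feat_dim=256, hidden=128):
        super().__init__()
        self.rnn = nn.GRU(feat_dim, hidden, bidirectional=True, batch_first=True)
        self.mlp = nn.Sequential(
            nn.Linear(2 * hidden, hidden), nn.ReLU(), nn.Linear(hidden, 1)
        )

    def forward(self, motion_feats):
        """motion_feats: [B, T, feat_dim], one feature vector per denoising step.
        Returns logits of shape [B, T]; argmax gives the predicted transition step."""
        h, _ = self.rnn(motion_feats)   # [B, T, 2*hidden]
        return self.mlp(h).squeeze(-1)  # [B, T]

# Training step: cross-entropy between predicted logits and the ground-truth t*.
dss = DenoisingStepSelector()
feats = torch.randn(8, 50, 256)        # 8 sequences, 50 denoising steps
t_star = torch.randint(0, 50, (8,))    # ground-truth transition steps
loss = nn.functional.cross_entropy(dss(feats), t_star)
loss.backward()
```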

Motion Matrix


Hover over the patches to see the corresponding motion transformations between two adjacent frames.
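As a rough illustration of how such a patch-level motion matrix can act on a latent, the sketch below warps a base-frame latent by mixing its patches with a row-stochastic motion matrix. The patch size and the soft-assignment form are assumptions for this example, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def apply_motion_matrix(base_latent, motion, patch=2):
    """base_latent: [C, H, W]; motion: [N, N] row-stochastic matrix, where
    N = (H // patch) * (W // patch) is the number of latent patches."""
    C, H, W = base_latent.shape
    # Split the latent into non-overlapping patches: [N, C * patch * patch].
    patches = F.unfold(base_latent.unsqueeze(0), kernel_size=patch, stride=patch)
    patches = patches.squeeze(0).transpose(0, 1)
    # Each target-frame patch is a weighted mixture of base-frame patches.
    warped = motion @ patches
    # Re-assemble the warped patches into a latent map.
    warped = warped.transpose(0, 1).unsqueeze(0)
    out = F.fold(warped, output_size=(H, W), kernel_size=patch, stride=patch)
    return out.squeeze(0)

# Toy usage: an identity motion matrix leaves the latent unchanged.
z = torch.randn(4, 32, 32)
n = (32 // 2) * (32 // 2)
assert torch.allclose(apply_motion_matrix(z, torch.eye(n)), z, atol=1e-5)
```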


Combined with AnimateDiff [5]

With Dr. Mo
Without Dr. Mo
Prompts: A white swan swimming in the water; Dramatic ocean sunset; An apple is falling from a tree.

Combined with SimDA [4]

With Dr. Mo
Without Dr. Mo
Prompts: A white swan swimming in the water; Dramatic ocean sunset; An apple is falling from a tree.

Video Generation Comparison

CogVideo [1]
Latent-Shift [2]
Dr. Mo (Ours)
Prompts: A person playing piano; A person doing handstand pushups; A person performing a bench press; A person knitting.
VDM [3]
SimDA [4]
Dr. Mo (Ours)
Prompts: Mountain river; Path in a tropical forest; Forest in Autumn; Dramatic ocean sunset.

Style Transfer Results

Original video (left), style-transferred first frame (middle), and the resulting stylized video (right).