Denoising Reuse: Exploiting Inter-frame Motion Consistency in Efficient Video Latent Generation

Anonymous author

Additional Experimental Results

Long Video Generation

Prompt: "In the universe, the earth revolves around the sun."

Combined with AnimateDiff [5]

Comparison with and without Dr. Mo. Prompts: "A white swan swimming in the water"; "Dramatic ocean sunset"; "An apple is falling from a tree".

Combined with SimDA [4]

Comparison with and without Dr. Mo. Prompts: "A white swan swimming in the water"; "Dramatic ocean sunset"; "An apple is falling from a tree".

The speed metrics are based on generating 16-frame videos at a resolution of 512×512. CLIPSIM is the average CLIP similarity between the generated video frames and the text prompt.
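For reference, below is a minimal sketch of computing CLIPSIM with the Hugging Face transformers CLIP API; the checkpoint name and preprocessing are illustrative assumptions, not necessarily the configuration behind the reported numbers.

```python
import torch
from transformers import CLIPModel, CLIPProcessor

def clipsim(frames, prompt, model_name="openai/clip-vit-base-patch32"):
    """Average CLIP cosine similarity between each video frame and the prompt.

    frames: list of PIL.Image video frames (e.g., the 16 generated frames).
    prompt: the text prompt used for generation.
    model_name: illustrative checkpoint choice, not necessarily the one used.
    """
    model = CLIPModel.from_pretrained(model_name)
    processor = CLIPProcessor.from_pretrained(model_name)
    inputs = processor(text=[prompt], images=frames,
                       return_tensors="pt", padding=True)
    with torch.no_grad():
        out = model(**inputs)
    # Normalize embeddings and average the per-frame image-text cosine similarities.
    img = out.image_embeds / out.image_embeds.norm(dim=-1, keepdim=True)
    txt = out.text_embeds / out.text_embeds.norm(dim=-1, keepdim=True)
    return (img @ txt.T).mean().item()
```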

Experimental results of training the DSS network at every 100 timesteps.

\( e_t \) represents the transformation error at timestep \( t \).

Explanation of Normalized Mutual Information (NMI)
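For reference, a standard definition of NMI between two discrete random variables \( X \) and \( Y \), using one common normalization (the paper may adopt an equivalent variant), is:

\[
\mathrm{NMI}(X, Y) = \frac{2\, I(X; Y)}{H(X) + H(Y)},
\qquad
I(X; Y) = \sum_{x, y} p(x, y) \log \frac{p(x, y)}{p(x)\, p(y)},
\]

where \( H(\cdot) \) denotes Shannon entropy. With this normalization, NMI lies in \([0, 1]\), and higher values indicate stronger statistical dependence. In Dr. Mo, NMI scores derived from the motion matrices serve as input features to the DSS (see Method Overview).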

Video Generation Results


Prompts: "A person riding a horse"; "Big beautiful mountain with waterfall, a long view"; "White clouds floating in the sky over the valley river"; "A boy playing guitar"; "A white swan swimming in the water"; "An apple is falling from a tree"; "Red car running, a close-up video"; "Sailboat sailing on the sea at dusk"; "A football player shooting"; "A person walking front with his friends on the grass".

Abstract

Video generation using diffusion-based models is constrained by high computational costs due to the frame-wise iterative diffusion process. This work presents a Diffusion Reuse MOtion (Dr. Mo) network to accelerate latent video generation. Our key discovery is that coarse-grained noises in earlier denoising steps demonstrate high motion consistency across consecutive video frames. Following this observation, Dr. Mo propagates those coarse-grained noises onto the next frame by incorporating carefully designed, lightweight inter-frame motions, eliminating massive computational redundancy in frame-wise diffusion models. The more sensitive and fine-grained noises are still acquired via later denoising steps, which remain essential for retaining visual quality. Deciding at which intermediate step to switch from motion-based propagation to denoising is therefore a crucial problem and a key trade-off between efficiency and quality. Dr. Mo employs a meta-network named Denoising Step Selector (DSS) to dynamically determine the desirable intermediate step across video frames. Extensive evaluations on video generation and editing tasks show that Dr. Mo substantially accelerates diffusion models in video tasks while improving visual quality.

Method Overview


Dr. Mo consists of two main components: the Motion Transformation Network (MTN) and the Denoising Step Selector (DSS). The MTN learns motion matrices from semantic latents extracted from the U-Net. The DSS is a meta-network that determines the appropriate transition step (denoted \( t^* \)) for switching from motion-based propagation to denoising. After the transition step, the propagated latent noise is processed by the remaining steps of the diffusion model to generate the video frames.
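To make the data flow concrete, below is a minimal sketch of the per-frame generation loop described above, assuming a diffusers-style U-Net and scheduler interface; the module names (mtn, dss), the caching scheme, and apply_motion are hypothetical placeholders rather than the released implementation.

```python
import torch

def apply_motion(z, motion):
    # Hypothetical stand-in for the MTN's motion transformation: a dense
    # linear map over flattened spatial positions, shared across channels.
    b, c, h, w = z.shape
    return (z.flatten(2) @ motion.transpose(1, 2)).view(b, c, h, w)

@torch.no_grad()
def generate_next_frame(cached, unet, scheduler, mtn, dss, cond):
    """Sketch of one Dr. Mo frame step (all names are placeholders).

    cached["latents"][t]: intermediate latent z_t saved while denoising the
                          previous frame.
    cached["feats"][t]:   semantic U-Net features at timestep t, from which
                          the MTN predicts the inter-frame motion matrix.
    """
    # 1) MTN: one motion matrix per cached timestep.
    motions = {t: mtn(cached["feats"][t]) for t in cached["feats"]}

    # 2) DSS: pick the switch step t* from motion-matrix statistics
    #    (timestep indices and NMI scores; see the DSS description below).
    t_star = dss(motions)

    # 3) Reuse: propagate the previous frame's coarse latent at t* onto the
    #    new frame with the predicted motion, instead of re-denoising T..t*.
    z = apply_motion(cached["latents"][t_star], motions[t_star])

    # 4) Denoise only the remaining fine-grained steps below t*.
    for t in scheduler.timesteps:
        if t >= t_star:
            continue
        eps = unet(z, t, encoder_hidden_states=cond).sample
        z = scheduler.step(eps, t, z).prev_sample
    return z
```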


To learn \( t^* \), the DSS takes as input statistics derived from the motion matrices \( \{M_{\delta z_t}^{i,j}\}_{t=1}^{T} \), including the corresponding timestep indices and NMI scores. A recurrent neural network then outputs \( \hat{t} \), the estimated most suitable switch step. The DSS is updated with the cross-entropy loss between the predicted switch step \( \hat{t} \) and the ground truth \( t^* \).
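A minimal sketch of such a selector is given below: a GRU over per-timestep features (here simply the timestep index and NMI score) followed by a classification head trained with cross-entropy. The exact feature set and dimensions are illustrative assumptions.

```python
import torch
import torch.nn as nn

class DenoisingStepSelector(nn.Module):
    """Illustrative DSS: a GRU over per-timestep motion-matrix statistics,
    followed by a classification head over candidate switch steps."""

    def __init__(self, feat_dim=2, hidden_dim=64, num_timesteps=1000):
        super().__init__()
        self.rnn = nn.GRU(feat_dim, hidden_dim, batch_first=True)
        self.head = nn.Linear(hidden_dim, num_timesteps)

    def forward(self, feats):
        # feats: (batch, T, feat_dim), e.g. [timestep index, NMI score] per
        # motion matrix M^{i,j}_{delta z_t}, t = 1..T.
        _, h = self.rnn(feats)          # h: (1, batch, hidden_dim)
        return self.head(h.squeeze(0))  # logits over candidate switch steps

# Training step: cross-entropy between predicted \hat{t} and ground truth t*.
dss = DenoisingStepSelector()
feats = torch.randn(8, 1000, 2)              # toy batch of statistics
t_star = torch.randint(0, 1000, (8,))        # ground-truth switch steps
loss = nn.functional.cross_entropy(dss(feats), t_star)
loss.backward()
```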

Style Transfer Results

Vanilla video (left), style-transferred first frame (middle), new-style video (right).

Motion Matrix


Visualization of the patch-wise motion transformations between two consecutive frames.
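One way to read the visualization, consistent with the Method Overview notation, is as a patch-wise transformation of frame \( i \)'s intermediate latent into frame \( j \)'s at the same denoising step \( t \); the exact operator is defined by the MTN in the paper:

\[
z_t^{\,j} \;\approx\; M_{\delta z_t}^{i,j} \otimes z_t^{\,i},
\]

where \( \otimes \) denotes the patch-wise application of the motion matrix predicted by the MTN.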


Video Generation Comparison

CogVideo [1] vs. Latent-Shift [2] vs. Dr. Mo (Ours). Prompts: "A person playing piano"; "A person doing handstand pushups"; "A person performing a bench press"; "A person knitting".

VDM [3] vs. SimDA [4] vs. Dr. Mo (Ours). Prompts: "Mountain river"; "Path in a tropical forest"; "Forest in Autumn"; "Dramatic ocean sunset".