Denoising Reuse: Exploiting Inter-frame Motion Consistency in Efficient Video Latent Generation

Anonymous author

Additional Experimental Results

Long Video Generation

Prompt: "In the universe, the earth revolves around the sun."

Combined with AnimateDiff [5]

Comparison with and without Dr. Mo. Prompts: "A white swan swimming in the water"; "Dramatic ocean sunset"; "An apple is falling from a tree".

Combined with SimDA [4]

Comparison with and without Dr. Mo. Prompts: "A white swan swimming in the water"; "Dramatic ocean sunset"; "An apple is falling from a tree".

The speed metrics are based on generating 16-frame videos at a resolution of 512×512. CLIPSIM is the average CLIP similarity between the generated video frames and the text prompt.
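For reference, below is a minimal sketch of computing CLIPSIM with the Hugging Face transformers CLIP API; the checkpoint name and preprocessing are illustrative assumptions, not necessarily the configuration behind the reported numbers.

```python
import torch
from transformers import CLIPModel, CLIPProcessor

def clipsim(frames, prompt, model_name="openai/clip-vit-base-patch32"):
    """Average CLIP cosine similarity between each video frame and the prompt.

    frames: list of PIL.Image video frames (e.g., the 16 generated frames).
    prompt: the text prompt used for generation.
    model_name: illustrative checkpoint choice, not necessarily the one used.
    """
    model = CLIPModel.from_pretrained(model_name)
    processor = CLIPProcessor.from_pretrained(model_name)
    inputs = processor(text=[prompt], images=frames,
                       return_tensors="pt", padding=True)
    with torch.no_grad():
        out = model(**inputs)
    # Normalize embeddings and average the per-frame image-text cosine similarities.
    img = out.image_embeds / out.image_embeds.norm(dim=-1, keepdim=True)
    txt = out.text_embeds / out.text_embeds.norm(dim=-1, keepdim=True)
    return (img @ txt.T).mean().item()
```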

Experimental results of training the DSS network at every 100 timesteps.

\( e_t \) represents the transformation error at timestep \( t \).

Explanation of Normalized Mutual Information (NMI)
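For reference, a standard definition of NMI between two discrete random variables \( X \) and \( Y \), using one common normalization (the paper may adopt an equivalent variant), is:

\[
\mathrm{NMI}(X, Y) = \frac{2\, I(X; Y)}{H(X) + H(Y)},
\qquad
I(X; Y) = \sum_{x, y} p(x, y) \log \frac{p(x, y)}{p(x)\, p(y)},
\]

where \( H(\cdot) \) denotes Shannon entropy. With this normalization, NMI lies in \([0, 1]\), and higher values indicate stronger statistical dependence. In Dr. Mo, NMI scores derived from the motion matrices serve as input features to the DSS (see Method Overview).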

Video Generation Results


Prompts: "A person riding a horse"; "Big beautiful mountain with waterfall, a long view"; "White clouds floating in the sky over the valley river"; "A boy playing guitar"; "A white swan swimming in the water"; "An apple is falling from a tree"; "Red car running, a close-up video"; "Sailboat sailing on the sea at dusk"; "A football player shooting"; "A person walking front with his friends on the grass".

Abstract

Video generation using diffusion-based models is constrained by high computational costs due to the frame-wise iterative diffusion process. This work presents a Diffusion Reuse MOtion (Dr. Mo) network to accelerate latent video generation. Our key discovery is that coarse-grained noises in earlier denoising steps demonstrate high motion consistency across consecutive video frames. Following this observation, Dr. Mo propagates those coarse-grained noises onto the next frame by incorporating carefully designed, lightweight inter-frame motions, eliminating massive computational redundancy in frame-wise diffusion models. The more sensitive and fine-grained noises are still acquired via later denoising steps, which remain essential for retaining visual quality. Deciding at which intermediate step to switch from motion-based propagation to denoising is therefore a crucial problem and a key trade-off between efficiency and quality. Dr. Mo employs a meta-network named Denoising Step Selector (DSS) to dynamically determine the desirable intermediate step across video frames. Extensive evaluations on video generation and editing tasks show that Dr. Mo substantially accelerates diffusion models in video tasks while improving visual quality.

Method Overview


Dr. Mo consists of two main components: the Motion Transformation Network (MTN) and the Denoising Step Selector (DSS). The MTN learns motion matrices from semantic latents extracted from the U-Net. The DSS is a meta-network that determines the appropriate transition step (denoted \( t^* \)) for switching from motion-based propagation to denoising. After the transition step, the propagated latent noise is processed by the remaining steps of the diffusion model to generate the video frames.
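To make the data flow concrete, below is a minimal sketch of the per-frame generation loop described above, assuming a diffusers-style U-Net and scheduler interface; the module names (mtn, dss), the caching scheme, and apply_motion are hypothetical placeholders rather than the released implementation.

```python
import torch

def apply_motion(z, motion):
    # Hypothetical stand-in for the MTN's motion transformation: a dense
    # linear map over flattened spatial positions, shared across channels.
    b, c, h, w = z.shape
    return (z.flatten(2) @ motion.transpose(1, 2)).view(b, c, h, w)

@torch.no_grad()
def generate_next_frame(cached, unet, scheduler, mtn, dss, cond):
    """Sketch of one Dr. Mo frame step (all names are placeholders).

    cached["latents"][t]: intermediate latent z_t saved while denoising the
                          previous frame.
    cached["feats"][t]:   semantic U-Net features at timestep t, from which
                          the MTN predicts the inter-frame motion matrix.
    """
    # 1) MTN: one motion matrix per cached timestep.
    motions = {t: mtn(cached["feats"][t]) for t in cached["feats"]}

    # 2) DSS: pick the switch step t* from motion-matrix statistics
    #    (timestep indices and NMI scores; see the DSS description below).
    t_star = dss(motions)

    # 3) Reuse: propagate the previous frame's coarse latent at t* onto the
    #    new frame with the predicted motion, instead of re-denoising T..t*.
    z = apply_motion(cached["latents"][t_star], motions[t_star])

    # 4) Denoise only the remaining fine-grained steps below t*.
    for t in scheduler.timesteps:
        if t >= t_star:
            continue
        eps = unet(z, t, encoder_hidden_states=cond).sample
        z = scheduler.step(eps, t, z).prev_sample
    return z
```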


To learn \( t^* \), the DSS takes as input statistics derived from the motion matrices \( \{M_{\delta z_t}^{i,j}\}_{t=1}^{T} \), including the corresponding timestep indices and NMI scores. A recurrent neural network then outputs \( \hat{t} \), the estimated most suitable switch step. The DSS is updated with the cross-entropy loss between the predicted switch step \( \hat{t} \) and the ground truth \( t^* \).
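A minimal sketch of such a selector is given below: a GRU over per-timestep features (here simply the timestep index and NMI score) followed by a classification head trained with cross-entropy. The exact feature set and dimensions are illustrative assumptions.

```python
import torch
import torch.nn as nn

class DenoisingStepSelector(nn.Module):
    """Illustrative DSS: a GRU over per-timestep motion-matrix statistics,
    followed by a classification head over candidate switch steps."""

    def __init__(self, feat_dim=2, hidden_dim=64, num_timesteps=1000):
        super().__init__()
        self.rnn = nn.GRU(feat_dim, hidden_dim, batch_first=True)
        self.head = nn.Linear(hidden_dim, num_timesteps)

    def forward(self, feats):
        # feats: (batch, T, feat_dim), e.g. [timestep index, NMI score] per
        # motion matrix M^{i,j}_{delta z_t}, t = 1..T.
        _, h = self.rnn(feats)          # h: (1, batch, hidden_dim)
        return self.head(h.squeeze(0))  # logits over candidate switch steps

# Training step: cross-entropy between predicted \hat{t} and ground truth t*.
dss = DenoisingStepSelector()
feats = torch.randn(8, 1000, 2)              # toy batch of statistics
t_star = torch.randint(0, 1000, (8,))        # ground-truth switch steps
loss = nn.functional.cross_entropy(dss(feats), t_star)
loss.backward()
```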

Style Transfer Results

Vanilla video (left), style-transferred first frame (middle), new-style video (right).

Motion Matrix


Visualization of the patch-wise motion transformations between two consecutive frames.
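One way to read the visualization, consistent with the Method Overview notation, is as a patch-wise transformation of frame \( i \)'s intermediate latent into frame \( j \)'s at the same denoising step \( t \); the exact operator is defined by the MTN in the paper:

\[
z_t^{\,j} \;\approx\; M_{\delta z_t}^{i,j} \otimes z_t^{\,i},
\]

where \( \otimes \) denotes the patch-wise application of the motion matrix predicted by the MTN.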


Video Generation Comparison

CogVideo [1] vs. Latent-Shift [2] vs. Dr. Mo (Ours). Prompts: "A person playing piano"; "A person doing handstand pushups"; "A person performing a bench press"; "A person knitting".

VDM [3] vs. SimDA [4] vs. Dr. Mo (Ours). Prompts: "Mountain river"; "Path in a tropical forest"; "Forest in Autumn"; "Dramatic ocean sunset".