Motion Consistency Model: Accelerating Video Diffusion with Disentangled Motion-Appearance Distillation


Yuanhao Zhai1, Kevin Lin2, Zhengyuan Yang2, Linjie Li2, Jianfeng Wang2, Chung-Ching Lin2, David Doermann1, Junsong Yuan1, Lijuan Wang2

1State University of New York at Buffalo   |   2Microsoft

TL;DR: Our motion consistency model not only distills the motion prior from the teacher to accelerate text-to-video diffusion sampling, but also can leverage an additional high-quality image dataset to improve the frame quality of generated videos.

🔥 News

[06/2024] Our MCM achieves strong performance (using 4 sampling steps) on the ChronoMagic-Bench! Check out the leaderboard here.

[06/2024] Training code, pre-trained checkpoint, Gradio demo, and Colab demo release.

[06/2024] Paper and project page release.

Getting started

Environment setup

Instead of installing diffusers, peft, and open_clip from the official repos, we use our modified versions specified in the requirements.txt file. This is particularly important for diffusers and open_clip, due to the former's current limited support for video diffusion model LoRA loading, and the latter's distributed training dependency.

To set up the environment, run the following commands:

pip install torch==2.1.2 torchvision==0.16.2 torchaudio==2.1.2 --index-url https://download.pytorch.org/whl/cu118  # please modify the cuda version according to your env 
pip install -r requirements.txt
pip install scipy==1.11.1
pip install https://github.com/podgorskiy/dnnlib/releases/download/0.0.1/dnnlib-0.0.1-py3-none-any.whl
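After installation, a quick sanity check (our suggestion, not part of the official setup) confirms the pinned packages from requirements.txt are picked up and CUDA is visible:

import torch
import diffusers, peft  # the modified diffusers/peft from requirements.txt should be imported here
import open_clip        # modified open_clip for distributed training

print("torch", torch.__version__, "| CUDA available:", torch.cuda.is_available())
print("diffusers", diffusers.__version__, "| peft", peft.__version__)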

Data preparation

Please prepare the video and optional image datasets in the webdataset format.

Specifically, please wrap the video/image files and their corresponding .json format metadata into .tar files. Here is an example structure of the video .tar file:

.
├── video_0.json
├── video_0.mp4
...
├── video_n.json
└── video_n.mp4

The .json files contain video/image captions in key-value pairs, for example: {"caption": "World map in gray - world map with animated circles and binary numbers"}.
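For reference, here is a minimal sketch of packing such video/caption pairs into a .tar shard with Python's standard tarfile module (the directory, shard, and caption names below are illustrative, not from the repo):

import json
import tarfile
from pathlib import Path

video_dir = Path("raw_videos")  # hypothetical folder containing video_0.mp4, video_1.mp4, ...

with tarfile.open("videos_00000.tar", "w") as tar:
    for mp4 in sorted(video_dir.glob("*.mp4")):
        key = mp4.stem                                  # e.g. "video_0"
        meta_path = mp4.with_suffix(".json")
        meta_path.write_text(json.dumps({"caption": f"caption for {key}"}))  # replace with the real caption
        tar.add(mp4, arcname=f"{key}.mp4")              # webdataset groups samples by shared basename
        tar.add(meta_path, arcname=f"{key}.json")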

We provide our generated anime, realistic, and 3D cartoon style image datasets here (coming soon). Due to the dataset agreements, we cannot publicly release the WebVid and LAION-aes datasets.

DINOv2 and CLIP checkpoint download

We provide a script, scripts/download.py, to download the DINOv2 and CLIP checkpoints.

python scripts/download.py

Wandb integration

Please input your wandb API key in utils/wandb.py to enable wandb logging. If you do not use wandb, please remove wandb from the --report_to argument in the training command.
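If you prefer not to hard-code the key, a minimal alternative (an assumption about your setup, not part of the repo) is to log in once via the wandb API, with the key exported as the WANDB_API_KEY environment variable:

import os
import wandb

# alternative to editing utils/wandb.py: log in once with a key taken from the environment
wandb.login(key=os.environ["WANDB_API_KEY"])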

Training

We leverage accelerate for distributed training, and we support two base text-to-video diffusion models: ModelScopeT2V and AnimateDiff. For both models, we train LoRA weights instead of fine-tuning all parameters.

ModelScopeT2V

For ModelScopeT2V, our code supports both pure video diffusion distillation training and frame quality improvement training.

By default, the training script requires 8 GPUs, each with 80GB of GPU memory, to fit a batch size of 4. The minimal GPU memory requirement is 32GB for a batch size of 1. Please adjust the --train_batch_size argument accordingly for different GPU memory sizes.

Before running the scripts, please modify the data path in the environment variables defined at the top of each script.

Diffusion distillation

We provide the training script in scripts/modelscopet2v_distillation.sh.

bash scripts/modelscopet2v_distillation.sh

Frame quality improvement

We provide the training script in scripts/modelscopet2v_improvement.sh. Before running, please assign the IMAGE_DATA_PATH in the script.

bash scripts/modelscopet2v_improvement.sh

AnimateDiff

Due to its higher resolution, training MCM with the AnimateDiff base model requires at least 70GB of GPU memory to fit a single batch.

We provide the diffusion distillation training script in scripts/animatediff_distillation.sh.

bash scripts/animatediff_distillation.sh

Inference

We provide our pre-trained checkpoints here, a Gradio demo here, and a Colab demo here. demo.py showcases how to run our MCM on a local machine. Feel free to try it out!
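For a rough idea of what few-step sampling looks like, below is a minimal sketch assuming the released checkpoint is a diffusers-compatible LoRA on top of ModelScopeT2V (demo.py is the authoritative reference; the LoRA path, prompt, and scheduler choice here are illustrative):

import torch
from diffusers import DiffusionPipeline, LCMScheduler
from diffusers.utils import export_to_video

# load the ModelScopeT2V base model
pipe = DiffusionPipeline.from_pretrained(
    "ali-vilab/text-to-video-ms-1.7b", torch_dtype=torch.float16
).to("cuda")

# consistency-style few-step sampling: swap the scheduler and load the MCM LoRA
# (the checkpoint path below is a placeholder; use the released weights)
pipe.scheduler = LCMScheduler.from_config(pipe.scheduler.config)
pipe.load_lora_weights("path/to/mcm_modelscopet2v_lora")

result = pipe(
    "a corgi running on the beach",  # example prompt
    num_inference_steps=4,           # 4-step sampling
    guidance_scale=1.0,              # CFG is typically disabled for consistency-distilled models
)
export_to_video(result.frames[0], "output.mp4")  # frames layout may vary across diffusers versions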

MCM weights

We provide our pre-trained checkpoint here.

Acknowledgement

Some of our implementations are borrowed from the great repos below.

  1. Diffusers
  2. StyleGAN-T
  3. GMFlow

Citation

@article{zhai2024motion,
  title={Motion Consistency Model: Accelerating Video Diffusion with Disentangled Motion-Appearance Distillation},
  author={Zhai, Yuanhao and Lin, Kevin and Yang, Zhengyuan and Li, Linjie and Wang, Jianfeng and Lin, Chung-Ching and Doermann, David and Yuan, Junsong and Wang, Lijuan},
  year={2024},
  journal={arXiv preprint arXiv:2406.06890},
  website={https://yhzhai.github.io/mcm/},
}