Motion Consistency Model: Accelerating Video Diffusion with Disentangled Motion-Appearance Distillation


Yuanhao Zhai1, Kevin Lin2, Zhengyuan Yang2, Linjie Li2, Jianfeng Wang2, Chung-Ching Lin2, David Doermann1, Junsong Yuan1, Lijuan Wang2

1State University of New York at Buffalo   |   2Microsoft

TL;DR: Our motion consistency model not only distills the motion prior from the teacher to accelerate text-to-video diffusion sampling, but also can leverage an additional high-quality image dataset to improve the frame quality of generated videos.

🔥 News

[06/2024] Our MCM achieves strong performance (using 4 sampling steps) on the ChronoMagic-Bench! Check out the leaderboard here.

[06/2024] Training code, pre-trained checkpoint, Gradio demo, and Colab demo release.

[06/2024] Paper and project page release.

Getting started

Environment setup

Instead of installing diffusers, peft, and open_clip from the official repos, we use our modified versions specified in the requirements.txt file. This is particularly important for diffusers and open_clip, due to the former's current limited support for video diffusion model LoRA loading, and the latter's distributed training dependency.

To set up the environment, run the following commands:

pip install torch==2.1.2 torchvision==0.16.2 torchaudio==2.1.2 --index-url https://download.pytorch.org/whl/cu118  # please modify the cuda version according to your env 
pip install -r requirements.txt
pip install scipy==1.11.1
pip install https://github.com/podgorskiy/dnnlib/releases/download/0.0.1/dnnlib-0.0.1-py3-none-any.whl
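After installation, a quick sanity check (our suggestion, not part of the official setup) confirms the pinned packages from requirements.txt are picked up and CUDA is visible:

import torch
import diffusers, peft  # the modified diffusers/peft from requirements.txt should be imported here
import open_clip        # modified open_clip for distributed training

print("torch", torch.__version__, "| CUDA available:", torch.cuda.is_available())
print("diffusers", diffusers.__version__, "| peft", peft.__version__)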

Data preparation

Please prepare the video and optional image datasets in the webdataset format.

Specifically, please wrap the video/image files and their corresponding .json format metadata into .tar files. Here is an example structure of the video .tar file:

.
├── video_0.json
├── video_0.mp4
...
├── video_n.json
└── video_n.mp4

The .json files contain video/image captions in key-value pairs, for example: {"caption": "World map in gray - world map with animated circles and binary numbers"}.
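For reference, here is a minimal sketch of packing such video/caption pairs into a .tar shard with Python's standard tarfile module (the directory, shard, and caption names below are illustrative, not from the repo):

import json
import tarfile
from pathlib import Path

video_dir = Path("raw_videos")  # hypothetical folder containing video_0.mp4, video_1.mp4, ...

with tarfile.open("videos_00000.tar", "w") as tar:
    for mp4 in sorted(video_dir.glob("*.mp4")):
        key = mp4.stem                                  # e.g. "video_0"
        meta_path = mp4.with_suffix(".json")
        meta_path.write_text(json.dumps({"caption": f"caption for {key}"}))  # replace with the real caption
        tar.add(mp4, arcname=f"{key}.mp4")              # webdataset groups samples by shared basename
        tar.add(meta_path, arcname=f"{key}.json")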

We provide our generated anime, realistic, and 3D cartoon style image datasets here (coming soon). Due to the dataset agreements, we cannot publicly release the WebVid and LAION-aes datasets.

DINOv2 and CLIP checkpoint download

We provide a script, scripts/download.py, to download the DINOv2 and CLIP checkpoints.

python scripts/download.py

Wandb integration

Please input your wandb API key in utils/wandb.py to enable wandb logging. If you do not use wandb, please remove wandb from the --report_to argument in the training command.
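If you prefer not to hard-code the key, a minimal alternative (an assumption about your setup, not part of the repo) is to log in once via the wandb API, with the key exported as the WANDB_API_KEY environment variable:

import os
import wandb

# alternative to editing utils/wandb.py: log in once with a key taken from the environment
wandb.login(key=os.environ["WANDB_API_KEY"])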

Training

We leverage accelerate for distributed training, and we support two base text-to-video diffusion models: ModelScopeT2V and AnimateDiff. For both models, we train LoRA weights instead of fine-tuning all parameters.

ModelScopeT2V

For ModelScopeT2V, our code supports both pure video diffusion distillation training and frame quality improvement training.

By default, the training script requires 8 GPUs, each with 80GB of GPU memory, to fit a batch size of 4. The minimal GPU memory requirement is 32GB for a batch size of 1. Please adjust the --train_batch_size argument accordingly for different GPU memory sizes.

Before running the scripts, please modify the data path in the environment variables defined at the top of each script.

Diffusion distillation

We provide the training script in scripts/modelscopet2v_distillation.sh.

bash scripts/modelscopet2v_distillation.sh

Frame quality improvement

We provide the training script in scripts/modelscopet2v_improvement.sh. Before running, please assign the IMAGE_DATA_PATH in the script.

bash scripts/modelscopet2v_improvement.sh

AnimateDiff

Due to its higher resolution, training MCM with the AnimateDiff base model requires at least 70GB of GPU memory to fit a single batch.

We provide the diffusion distillation training script in scripts/animatediff_distillation.sh.

bash scripts/animatediff_distillation.sh

Inference

We provide our pre-trained checkpoints here, a Gradio demo here, and a Colab demo here. demo.py showcases how to run our MCM on a local machine. Feel free to try it out!
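For a rough idea of what few-step sampling looks like, below is a minimal sketch assuming the released checkpoint is a diffusers-compatible LoRA on top of ModelScopeT2V (demo.py is the authoritative reference; the LoRA path, prompt, and scheduler choice here are illustrative):

import torch
from diffusers import DiffusionPipeline, LCMScheduler
from diffusers.utils import export_to_video

# load the ModelScopeT2V base model
pipe = DiffusionPipeline.from_pretrained(
    "ali-vilab/text-to-video-ms-1.7b", torch_dtype=torch.float16
).to("cuda")

# consistency-style few-step sampling: swap the scheduler and load the MCM LoRA
# (the checkpoint path below is a placeholder; use the released weights)
pipe.scheduler = LCMScheduler.from_config(pipe.scheduler.config)
pipe.load_lora_weights("path/to/mcm_modelscopet2v_lora")

result = pipe(
    "a corgi running on the beach",  # example prompt
    num_inference_steps=4,           # 4-step sampling
    guidance_scale=1.0,              # CFG is typically disabled for consistency-distilled models
)
export_to_video(result.frames[0], "output.mp4")  # frames layout may vary across diffusers versions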

MCM weights

We provide our pre-trained checkpoint here.

Acknowledgement

Some of our implementations are borrowed from the great repos below.

  1. Diffusers
  2. StyleGAN-T
  3. GMFlow

Citation

@article{zhai2024motion,
  title={Motion Consistency Model: Accelerating Video Diffusion with Disentangled Motion-Appearance Distillation},
  author={Zhai, Yuanhao and Lin, Kevin and Yang, Zhengyuan and Li, Linjie and Wang, Jianfeng and Lin, Chung-Ching and Doermann, David and Yuan, Junsong and Wang, Lijuan},
  year={2024},
  journal={arXiv preprint arXiv:2406.06890},
  website={https://yhzhai.github.io/mcm/},
}