https://github.com/bytedance/x-dyna
[ArXiv 2024] X-Dyna: Expressive Dynamic Human Image Animation
Science Score: 23.0%
This score indicates how likely this project is to be science-related based on various indicators:
- ○ CITATION.cff file
- ✓ codemeta.json file (found codemeta.json file)
- ○ .zenodo.json file
- ○ DOI references
- ✓ Academic publication links (links to: arxiv.org, scholar.google)
- ○ Academic email domains
- ○ Institutional organization owner
- ○ JOSS paper metadata
- ○ Scientific vocabulary similarity (low similarity, 7.0%, to scientific vocabulary)
Keywords
Repository
[ArXiv 2024] X-Dyna: Expressive Dynamic Human Image Animation
Basic Info
- Host: GitHub
- Owner: bytedance
- License: apache-2.0
- Language: Python
- Default Branch: main
- Homepage: https://x-dyna.github.io/xdyna.github.io/
- Size: 18.1 MB
Statistics
- Stars: 137
- Watchers: 5
- Forks: 9
- Open Issues: 4
- Releases: 0
Topics
Metadata Files
README.md
X-Dyna: Expressive Dynamic Human Image Animation
Di Chang1,2 · Hongyi Xu2* · You Xie2* · Yipeng Gao1* · Zhengfei Kuang3* · Shengqu Cai3* · Chenxu Zhang2* · Guoxian Song2 · Chao Wang2 · Yichun Shi2 · Zeyuan Chen2,5 · Shijie Zhou4 · Linjie Luo2 · Gordon Wetzstein3 · Mohammad Soleymani1
1University of Southern California  2ByteDance Inc.  3Stanford University  4University of California Los Angeles  5University of California San Diego
* denotes equal contribution
This repo is the official PyTorch implementation of X-Dyna, which generates temporally consistent human motions with expressive dynamics.
📑 Open-source Plan
- [x] Project Page
- [x] Paper
- [x] Inference code for Dynamics Adapter
- [x] Checkpoints for Dynamics Adapter
- [x] Inference code for S-Face ControlNet
- [x] Checkpoints for S-Face ControlNet
- [ ] Evaluation code (DTFVD, Face-Cos, Face-Det, FID, etc.)
- [ ] Dynamic Texture Eval Data (self-collected from Pexels)
- [ ] Alignment code for inference
- [ ] Gradio Demo
Abstract
We introduce X-Dyna, a novel zero-shot, diffusion-based pipeline for animating a single human image using facial expressions and body movements derived from a driving video, that generates realistic, context-aware dynamics for both the subject and the surrounding environment. Building on prior approaches centered on human pose control, X-Dyna addresses key factors underlying the loss of dynamic details, enhancing the lifelike qualities of human video animations. At the core of our approach is the Dynamics-Adapter, a lightweight module that effectively integrates reference appearance context into the spatial attentions of the diffusion backbone while preserving the capacity of motion modules in synthesizing fluid and intricate dynamic details. Beyond body pose control, we connect a local control module with our model to capture identity-disentangled facial expressions, facilitating accurate expression transfer for enhanced realism in animated scenes. Together, these components form a unified framework capable of learning physical human motion and natural scene dynamics from a diverse blend of human and scene videos. Comprehensive qualitative and quantitative evaluations demonstrate that X-Dyna outperforms state-of-the-art methods, creating highly lifelike and expressive animations.
Architecture
We leverage a pretrained diffusion UNet backbone for controlled human image animation, enabling expressive dynamic details and precise motion control. Specifically, we introduce a dynamics adapter that seamlessly integrates the reference image context as a trainable residual to the spatial attention, in parallel with the denoising process, while preserving the original spatial and temporal attention mechanisms within the UNet. In addition to body pose control via a ControlNet, we introduce a local face control module that implicitly learns facial expression control from a synthesized cross-identity face patch. We train our model on a diverse dataset of human motion videos and natural scene videos simultaneously.
Dynamics Adapter
Architecture Designs for Human Video Animation
a) IP-Adapter encodes the reference image as an image CLIP embedding and injects the information into the cross-attention layers in SD as the residual. b) ReferenceNet is a trainable parallel UNet and feeds the semantic information into SD via concatenation of self-attention features. c) Dynamics-Adapter encodes the reference image with a partially shared-weight UNet. The appearance control is realized by learning a residual in the self-attention with trainable query and output linear layers. All other components share the same frozen weight with SD.
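To make the residual design concrete, here is a minimal, single-head PyTorch sketch of the idea (illustrative only, not the repository's implementation; module and parameter names are assumptions): frozen projections stand in for the pretrained SD self-attention, and only a new query projection and output projection are trained to inject the reference appearance as a zero-initialized residual.
```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DynamicsAdapterSelfAttention(nn.Module):
    """Frozen SD-style self-attention plus a trainable residual branch that
    attends from the denoising features to reference-image features."""

    def __init__(self, dim: int):
        super().__init__()
        # Frozen projections standing in for the pretrained SD self-attention.
        self.to_q = nn.Linear(dim, dim)
        self.to_k = nn.Linear(dim, dim)
        self.to_v = nn.Linear(dim, dim)
        self.to_out = nn.Linear(dim, dim)
        for module in (self.to_q, self.to_k, self.to_v, self.to_out):
            for p in module.parameters():
                p.requires_grad_(False)
        # Trainable residual branch: only a new query and output projection.
        self.to_q_ref = nn.Linear(dim, dim)
        self.to_out_ref = nn.Linear(dim, dim)
        nn.init.zeros_(self.to_out_ref.weight)  # residual starts at zero, so the
        nn.init.zeros_(self.to_out_ref.bias)    # frozen SD behavior is preserved

    def forward(self, hidden: torch.Tensor, ref_hidden: torch.Tensor) -> torch.Tensor:
        # hidden:     (B, N, C) denoising-UNet tokens
        # ref_hidden: (B, M, C) tokens from the shared-weight reference branch
        base = self.to_out(F.scaled_dot_product_attention(
            self.to_q(hidden), self.to_k(hidden), self.to_v(hidden)))
        # Appearance residual: trainable queries attend to reference keys/values
        # computed with the frozen (shared) projections.
        residual = self.to_out_ref(F.scaled_dot_product_attention(
            self.to_q_ref(hidden), self.to_k(ref_hidden), self.to_v(ref_hidden)))
        return base + residual
```
Because the output projection is zero-initialized, this sketch reproduces the frozen self-attention exactly at the start of training and gradually learns how much reference appearance to blend in.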
https://github.com/user-attachments/assets/a4a679fd-b8e1-4f9a-ad9c-24adb0ca33eb
📈 Results
Comparison
To evaluate the dynamic texture generation performance of X-Dyna in human video animation, we compare its results with MagicPose (a ReferenceNet-based method) and MimicMotion (an SVD-based method). For a fair comparison, all generated videos share the same resolution of 896 x 512 (height x width).
https://github.com/user-attachments/assets/436a6d6c-9579-446d-831e-6ff2195147c3
https://github.com/user-attachments/assets/5369163a-d0f6-4389-baf4-b77fcd2b7527
https://github.com/user-attachments/assets/0d1f14b3-92ad-4df8-8c34-5f9185be2905
https://github.com/user-attachments/assets/566fd91c-b488-46fc-8841-6f9462b22b26
https://github.com/user-attachments/assets/ac6a8463-0684-469c-b5ba-513697c715d7
Ablation
To evaluate the effectiveness of mixed-data training in our pipeline, we present a visualized ablation study.
https://github.com/user-attachments/assets/064a6cf7-979d-459f-aa76-32f479d09ecc
🎥 More Demos
📜 Requirements
- An NVIDIA GPU with CUDA support is required.
- We have tested on a single A100 GPU.
- In our experiment, we used CUDA 11.8.
- Minimum: The minimum GPU memory required is 20GB for generating a single video (batch_size=1) of 16 frames.
- Recommended: We recommend using a GPU with 80GB of memory.
- Operating system: Linux Debian 11 (bullseye)
🛠️ Dependencies and Installation
Clone the repository:
```shell
git clone https://github.com/Boese0601/X-Dyna
cd X-Dyna
```
Installation Guide
We provide a requirements.txt file for setting up the environment.
Run the following commands in your terminal:
```shell
# 1. Prepare the conda environment
conda create -n xdyna python==3.10

# 2. Activate the environment
conda activate xdyna

# 3. Install dependencies
bash env_torch2_install.sh
```
Note: the install script installs PyTorch twice with different versions. Installing only the final versions directly (torch==2.0.1+cu118, torchaudio==2.0.2+cu118, torchvision==0.15.2+cu118) did not work in our setup, and we have not tracked down why. If you manage to fix this, please open an issue and let us know.
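After installation, a quick sanity check like the following (a sketch, not part of the repository) can confirm that the expected PyTorch/CUDA combination is active and that the GPU meets the memory requirements listed above:
```python
import torch

print("torch:", torch.__version__)          # expected: 2.0.1+cu118
print("CUDA runtime:", torch.version.cuda)  # expected: 11.8
assert torch.cuda.is_available(), "No CUDA-capable GPU detected."

mem_gb = torch.cuda.get_device_properties(0).total_memory / 1024**3
print(f"GPU 0 memory: {mem_gb:.1f} GB")
if mem_gb < 20:
    print("Warning: below the 20 GB minimum for a single 16-frame video (batch_size=1).")
```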
🧱 Download Pretrained Models
Due to restrictions, we are not able to release the model pre-trained with in-house data. Instead, we re-train our model on public datasets (e.g., HumanVid) and other human video data available for research use (e.g., Pexels).
We follow the implementation details in our paper and release pretrained weights and other network modules in this huggingface repository. After downloading, please put all of them under the pretrained_weights folder.
The Stable Diffusion 1.5 UNet can be found here; place it under pretrained_weights/initialization/unet_initialization/SD.
Your file structure should look like this:
```bash
X-Dyna
|----...
|----pretrained_weights
  |----controlnet
    |----controlnet-checkpoint-epoch-5.ckpt
  |----controlnet_face
    |----controlnet-face-checkpoint-epoch-2.ckpt
  |----unet
    |----unet-checkpoint-epoch-5.ckpt
  |----initialization
    |----controlnets_initialization
      |----controlnet
        |----control_v11p_sd15_openpose
      |----controlnet_face
        |----controlnet2
    |----unet_initialization
      |----IP-Adapter
        |----IP-Adapter
      |----SD
        |----stable-diffusion-v1-5
|----...
```
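Before running inference, a short check like this (a sketch that assumes the layout above and is run from the repository root) can verify that the downloaded weights ended up in the expected locations:
```python
from pathlib import Path

# Paths follow the file structure shown above.
expected = [
    "pretrained_weights/controlnet/controlnet-checkpoint-epoch-5.ckpt",
    "pretrained_weights/controlnet_face/controlnet-face-checkpoint-epoch-2.ckpt",
    "pretrained_weights/unet/unet-checkpoint-epoch-5.ckpt",
    "pretrained_weights/initialization/unet_initialization/SD/stable-diffusion-v1-5",
]

missing = [p for p in expected if not Path(p).exists()]
print("All expected weights found." if not missing else f"Missing: {missing}")
```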
Inference
Using Command Line
```bash
cd X-Dyna
bash scripts/inference.sh
```
More Configurations
We list some explanations of configurations below:
| Argument | Default | Description |
|:----------------------:|:------------------------:|:-----------------------------------------:|
| `--gpus` | 0 | GPU ID for inference |
| `--output` | ./output | Path to save the generated video |
| `--test_data_file` | ./examples/example.json | Path to reference and driving data |
| `--cfg` | 7.5 | Classifier-free guidance scale |
| `--height` | 896 | Height of the generated video |
| `--width` | 512 | Width of the generated video |
| `--infer_config` | ./configs/xdyna.yaml | Path to the inference model config file |
| `--neg_prompt` | None | Negative prompt for generation |
| `--length` | 192 | Length of the generated video |
| `--stride` | 1 | Stride of the driving pose and video |
| `--save_fps` | 15 | FPS of the generated video |
| `--global_seed` | 42 | Random seed |
| `--face_controlnet` | False | Use the Face ControlNet for inference |
| `--cross_id` | False | Cross-identity generation |
| `--no_head_skeleton` | False | Do not visualize head skeletons |
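For reference, here is a hedged sketch of passing the same flags from Python rather than editing scripts/inference.sh; it assumes the inference entry point accepts the arguments exactly as listed above, and the script name `inference.py` is only illustrative:
```python
import subprocess

cmd = [
    "python", "inference.py",  # illustrative entry point; scripts/inference.sh wraps the real one
    "--gpus", "0",
    "--output", "./output",
    "--test_data_file", "./examples/example.json",
    "--infer_config", "./configs/xdyna.yaml",
    "--cfg", "7.5",
    "--height", "896",
    "--width", "512",
    "--length", "192",
    "--save_fps", "15",
    "--global_seed", "42",
]
subprocess.run(cmd, check=True)
```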
Alignment
Appropriate alignment between the driving video and the reference image is necessary for good generation quality; see the examples below:
From left to right: Reference Image, Extracted Pose from Reference Image, Driving Video, Aligned Driving Pose.
Examples
We provide some examples of aligned driving videos, human poses, and reference images here. If you would like to try it on your own data, please specify the paths in this file.
🔗 BibTeX
If you find X-Dyna useful for your research and applications, please cite X-Dyna using this BibTeX:
```BibTeX
@article{chang2025x,
  title={X-Dyna: Expressive Dynamic Human Image Animation},
  author={Chang, Di and Xu, Hongyi and Xie, You and Gao, Yipeng and Kuang, Zhengfei and Cai, Shengqu and Zhang, Chenxu and Song, Guoxian and Wang, Chao and Shi, Yichun and others},
  journal={arXiv preprint arXiv:2501.10021},
  year={2025}
}
```
License
Our code is distributed under the Apache-2.0 license. See LICENSE.txt file for more information.
Acknowledgements
We appreciate the contributions from AnimateDiff, MagicPose, MimicMotion, Moore-AnimateAnyone, MagicAnimate, IP-Adapter, ControlNet, HumanVid, and I2V-Adapter and their open-sourced research. We also thank Quankai Gao, Qiangeng Xu, Shen Sang, and Tiancheng Zhi for their suggestions and discussions.
IP Statement
This work is intended for research purposes only. The images and videos used in these demos are from public sources. If there is any infringement or offense, please get in touch with us (dichang@usc.edu), and we will remove the content promptly.
Owner
- Name: Bytedance Inc.
- Login: bytedance
- Kind: organization
- Location: Singapore
- Website: https://opensource.bytedance.com
- Twitter: ByteDanceOSS
- Repositories: 255
- Profile: https://github.com/bytedance
GitHub Events
Total
- Issues event: 11
- Watch event: 236
- Member event: 1
- Issue comment event: 15
- Push event: 4
- Public event: 1
- Fork event: 23
Last Year
- Issues event: 11
- Watch event: 236
- Member event: 1
- Issue comment event: 15
- Push event: 4
- Public event: 1
- Fork event: 23
Issues and Pull Requests
Last synced: 6 months ago
All Time
- Total issues: 7
- Total pull requests: 0
- Average time to close issues: 3 days
- Average time to close pull requests: N/A
- Total issue authors: 7
- Total pull request authors: 0
- Average comments per issue: 0.0
- Average comments per pull request: 0
- Merged pull requests: 0
- Bot issues: 0
- Bot pull requests: 0
Past Year
- Issues: 7
- Pull requests: 0
- Average time to close issues: 3 days
- Average time to close pull requests: N/A
- Issue authors: 7
- Pull request authors: 0
- Average comments per issue: 0.0
- Average comments per pull request: 0
- Merged pull requests: 0
- Bot issues: 0
- Bot pull requests: 0
Top Authors
Issue Authors
- joeyjoker (1)
- LinYuOu (1)
- Jandown (1)
- RookieCaoChao (1)
- Jeremy-J-J (1)
- zhangwenzhao (1)
- H-Black-H (1)