Jiehui Huang1 · Yuechen Zhang2 · Xu He3 · Yuan Gao4 · Zhi Cen4 · Bin Xia2 · Yan Zhou4 · Xin Tao4 · Pengfei Wan4 · Jiaya Jia1,†
1HKUST · 2CUHK · 3Tsinghua University · 4Kling Team, Kuaishou Technology
†Corresponding Author
- [2025.12.15] Part of the OpenUni dataset is now open-sourced! Check it out on Hugging Face
- [2025.12.08] arXiv paper released!
UnityVideo is a unified generalist framework for multi-task, multi-modal video generation and understanding that enables:
- Text-to-Video Generation: Create high-quality videos from text descriptions
- Controllable Generation: Fine-grained control over video generation with various modalities
- Modality Estimation: Estimate depth, normal, and other modalities from video
- Zero-Shot Generalization: Strong generalization to novel objects and styles without additional training
Our unified architecture achieves state-of-the-art performance across multiple video generation benchmarks while maintaining efficiency and scalability.
- Unified Framework: Single model handles multiple video understanding tasks
- Multi-Modal Support: Seamlessly processes text, image, and video inputs
- World-Aware Generation: Enhanced physical understanding and consistency
- Flexible Control: Support for various control signals (depth, edge, pose, etc.)
- High Quality: State-of-the-art visual quality and temporal consistency
- Efficient Training: Joint multi-task learning improves data efficiency
UnityVideo employs a unified multi-modal multi-task learning framework that consists of:
- Multi-Modal Encoder: Processes diverse input modalities (text, image, video)
- Unified Transformer Backbone: Shared representation learning across tasks
- Task-Specific Heads: Specialized decoders for different generation and estimation tasks
- Joint Training Strategy: Simultaneous optimization across all tasks
This architecture enables knowledge sharing and improves generalization across different video understanding tasks.
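Since the official code is not yet released, the interplay of these components can be illustrated with a minimal, hypothetical sketch: per-modality encoders project inputs into a shared token space, a shared backbone mixes all tokens jointly, and a lightweight head maps the shared representation to each task's output. All names, dimensions, and the attention stand-in below are assumptions for illustration, not UnityVideo's actual API.

```python
# Illustrative sketch of a unified multi-modal, multi-task pipeline.
# NOT the released UnityVideo code -- every name and shape here is assumed.
import numpy as np

rng = np.random.default_rng(0)
D = 64  # shared token dimension

# Per-modality linear projections (stand-ins for real encoders).
PROJ = {
    "text":  rng.standard_normal((32, D)) / np.sqrt(32),
    "video": rng.standard_normal((128, D)) / np.sqrt(128),
}

# Task-specific heads (stand-ins for specialized decoders).
HEADS = {
    "t2v":   rng.standard_normal((D, 128)) / np.sqrt(D),  # video latents
    "depth": rng.standard_normal((D, 1))   / np.sqrt(D),  # per-token depth
}

def encode(modality, x):
    """Map an input modality to a sequence of D-dim shared tokens."""
    return x @ PROJ[modality]

def backbone(tokens):
    """Shared 'transformer' stand-in: one softmax self-attention mixing step."""
    attn = tokens @ tokens.T / np.sqrt(D)
    attn = np.exp(attn - attn.max(axis=-1, keepdims=True))
    attn /= attn.sum(axis=-1, keepdims=True)
    return attn @ tokens

def run(task, inputs):
    """Encode all modalities, share one backbone, decode with a task head."""
    tokens = np.concatenate([encode(m, x) for m, x in inputs.items()])
    h = backbone(tokens)      # representation shared across all tasks
    return h @ HEADS[task]    # only the head is task-specific

text  = rng.standard_normal((8, 32))    # 8 text tokens
video = rng.standard_normal((16, 128))  # 16 video patch tokens
print(run("t2v", {"text": text}).shape)       # (8, 128)
print(run("depth", {"video": video}).shape)   # (16, 1)
```

The key design point this sketch mirrors is that `backbone` has no task-specific parameters, so gradients from every task update the same shared weights during joint training, which is what enables the knowledge sharing described above.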
| More examples coming soon |
- Release training code
- Release inference code
- Release pretrained models
- Add Gradio demo, Colab notebook, and more usage examples
- Release data
- Release arXiv paper
This repository is released under the Apache-2.0 license as found in the LICENSE file.
Follow this project to get notified when we release the code!
If you find this work useful for your research, please cite:
@article{huang2025unityvideo,
title={UnityVideo: Unified Multi-Modal Multi-Task Learning for Enhancing World-Aware Video Generation},
author={Huang, Jiehui and Zhang, Yuechen and He, Xu and Gao, Yuan and Cen, Zhi and Xia, Bin and Zhou, Yan and Tao, Xin and Wan, Pengfei and Jia, Jiaya},
journal={arXiv preprint arXiv:2512.07831},
year={2025}
}

