UnityVideo Logo

UnityVideo: Unified Multi-Modal Multi-Task Learning for Enhancing World-Aware Video Generation

arXiv Project Page License Model Dataset 量子位

Jiehui Huang1 · Yuechen Zhang2 · Xu He3 · Yuan Gao4 · Zhi Cen4 · Bin Xia2 ·
Yan Zhou4 · Xin Tao4 · Pengfei Wan4 · Jiaya Jia1,✉

1HKUST · 2CUHK · 3Tsinghua University · 4Kling Team, Kuaishou Technology

✉ Corresponding Author


📢 Code will be released soon! Stay tuned! 🚀


📢 News

  • [2025.12.15] 🎉 Part of the OpenUni dataset is now open-sourced! Check it out on 🤗 Hugging Face (a loading sketch follows this list)
  • [2025.12.08] 🔥 arXiv paper released!
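
The dataset card on Hugging Face is the authoritative reference. As a rough sketch (the repo id below is a hypothetical placeholder, not given in this README), the released subset could be loaded with the Hugging Face datasets library like this:

# Minimal loading sketch for the open-sourced OpenUni subset.
# NOTE: "JIA-Lab-research/OpenUni" is a HYPOTHETICAL repo id; use the id
# from the actual dataset card linked above.
from datasets import load_dataset

ds = load_dataset("JIA-Lab-research/OpenUni", split="train")
print(ds)            # features and number of rows
print(ds[0].keys())  # fields of a single sample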

📖 Introduction

UnityVideo is a unified, generalist framework for multi-modal, multi-task video generation and understanding that enables:

  • 🎨 Text-to-Video Generation: Create high-quality videos from text descriptions
  • 🎮 Controllable Generation: Fine-grained control over video generation with various modalities
  • 🔍 Modality Estimation: Estimate depth, normal, and other modalities from video
  • 🌟 Zero-Shot Generalization: Strong generalization to novel objects and styles without additional training

Our unified architecture achieves state-of-the-art performance across multiple video generation benchmarks while maintaining efficiency and scalability.


🔥 Highlights

  • ✅ Unified Framework: Single model handles multiple video understanding tasks
  • ✅ Multi-Modal Support: Seamlessly processes text, image, and video inputs
  • ✅ World-Aware Generation: Enhanced physical understanding and consistency
  • ✅ Flexible Control: Support for various control signals (depth, edge, pose, etc.)
  • ✅ High Quality: State-of-the-art visual quality and temporal consistency
  • ✅ Efficient Training: Joint multi-task learning improves data efficiency

🎯 Method

UnityVideo employs a unified multi-modal multi-task learning framework that consists of:

  1. Multi-Modal Encoder: Processes diverse input modalities (text, image, video)
  2. Unified Transformer Backbone: Shared representation learning across tasks
  3. Task-Specific Heads: Specialized decoders for different generation and estimation tasks
  4. Joint Training Strategy: Simultaneous optimization across all tasks

This architecture enables knowledge sharing and improves generalization across different video understanding tasks.
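
The paper is the definitive reference for the architecture. As an illustrative sketch only (module names, dimensions, the task set, and the equally weighted joint loss below are assumptions, not the released implementation), the four components could be wired together in PyTorch roughly like this:

import torch
import torch.nn as nn

class UnityVideoSketch(nn.Module):
    """Illustrative sketch of a unified multi-modal, multi-task video model.
    All module names and sizes are assumptions for exposition only."""

    def __init__(self, dim=512, num_layers=8, num_heads=8):
        super().__init__()
        # 1. Multi-modal encoders: project each modality into a shared token space.
        self.text_proj = nn.Linear(768, dim)    # e.g. tokens from a text encoder
        self.video_proj = nn.Linear(1024, dim)  # e.g. patch tokens from a video tokenizer
        # 2. Unified transformer backbone shared across all tasks.
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=num_heads, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, num_layers=num_layers)
        # 3. Task-specific heads (generation, depth, normal, ...).
        self.heads = nn.ModuleDict({
            "rgb": nn.Linear(dim, 1024),
            "depth": nn.Linear(dim, 1024),
            "normal": nn.Linear(dim, 1024),
        })

    def forward(self, text_tokens, video_tokens, task):
        # Concatenate modality tokens and run them through the shared backbone.
        tokens = torch.cat([self.text_proj(text_tokens), self.video_proj(video_tokens)], dim=1)
        hidden = self.backbone(tokens)
        # Decode only the video-token positions with the requested task head.
        return self.heads[task](hidden[:, text_tokens.size(1):])

# 4. Joint training strategy: accumulate per-task losses in a single optimization
#    step (equal weighting here is a simplification).
model = UnityVideoSketch()
text = torch.randn(2, 16, 768)
video = torch.randn(2, 64, 1024)
targets = {t: torch.randn(2, 64, 1024) for t in ["rgb", "depth", "normal"]}
loss = sum(nn.functional.mse_loss(model(text, video, t), targets[t]) for t in targets)
loss.backward()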


📊 Results Gallery

🎬 Text-to-Video Generation

More examples coming soon.

🎮 Controllable Generation

More examples coming soon.

🔍 Modality Estimation

More examples coming soon.

🗓️ TODO List

  • Release training code
  • Release inference code
  • Release pretrained models
  • Add Gradio demo, Colab notebook, and more usage examples
  • Release data
  • Release arXiv paper

⚖️ License

This repository is released under the Apache-2.0 license as found in the LICENSE file.

🚀 Stay Tuned for Updates!

Follow this project to get notified when we release the code!


📚 Citation

If you find this work useful for your research, please cite:

@article{huang2025unityvideo,
  title={UnityVideo: Unified Multi-Modal Multi-Task Learning for Enhancing World-Aware Video Generation},
  author={Huang, Jiehui and Zhang, Yuechen and He, Xu and Gao, Yuan and Cen, Zhi and Xia, Bin and Zhou, Yan and Tao, Xin and Wan, Pengfei and Jia, Jiaya},
  journal={arXiv preprint arXiv:2512.07831},
  year={2025}
}
