UniGP

Taming Diffusion Transformer for Prior-Preserved Unified Generation and Perception

Qin Guo¹, Hao Luo²,³,⁴, Dongxu Yue⁵, Weixuan Jin⁵, Xiao Fu⁶, Fan Wang², Dan Xu¹,⁷†

¹The Hong Kong University of Science and Technology ²DAMO Academy, Alibaba Group, Zhejiang, China ³Hupan Lab, Zhejiang Province ⁴Zhejiang University, Zhejiang, China ⁵Tsinghua University ⁶The Chinese University of Hong Kong ⁷Zeekr Automobile R&D Co., Ltd.

†Corresponding author


UniGP models the RGB distribution and dense-prediction distributions (such as depth and surface normals) in a single diffusion transformer, supporting joint generation, dense perception, and any-condition text-to-image generation.

Abstract

Diffusion models have recently shown impressive performance in controllable image generation and dense prediction. However, existing approaches typically treat diffusion-based controllable generation and dense prediction as separate tasks, overlooking the potential benefits of jointly modeling heterogeneous distributions. We introduce UniGP, a framework built upon MMDiT that unifies controllable generation and dense prediction through simple joint training, without complex task-specific designs or losses, while preserving the backbone's versatile priors.

Controllable Generation

Generate RGB images from spatial conditions such as depth, normal, sketch, or pose.

Dense Perception

Estimate depth and surface normals from RGB images using the same unified model.

Joint Generation

Sample aligned RGB, depth, and normal outputs from text in a single generative process.

Main Results

UniGP is evaluated across controllable generation, dense prediction, joint generation, and multi-condition joint generation. A single model handles all these settings while preserving geometric consistency and high-quality RGB synthesis.

Methodology

UniGP adds a Disentangled Unified Generation and Perception (DUGP) branch to MMDiT. The backbone remains a strong generative prior, while the DUGP branch, initialized as a copy of the MMDiT image branch, learns the non-RGB visual distributions needed for depth and surface-normal modeling.

The MMDiT block and initialization of DUGP

Compared with duplicating the entire backbone or directly fine-tuning the generative model, UniGP reuses only the image branch needed for additional visual distributions and isolates dense-prediction gradients from the generative backbone.
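The branch-copy idea described above can be illustrated with a toy sketch. This is not the UniGP implementation: `Branch`, `ToyMMDiTBlock`, `add_dense_branch`, and `trainable_weight_count` are hypothetical names, the weights are placeholder numbers, and gradient isolation is modelled here simply as freezing the backbone branches, which is an assumption (the actual isolation mechanism may differ).

```python
import copy
from dataclasses import dataclass
from typing import List


@dataclass
class Branch:
    """Toy stand-in for one modality branch of an MMDiT block (hypothetical)."""
    name: str
    weights: List[float]
    trainable: bool = True


@dataclass
class ToyMMDiTBlock:
    """Toy two-branch block: one text branch and one image branch."""
    text_branch: Branch
    image_branch: Branch


def add_dense_branch(block: ToyMMDiTBlock, freeze_backbone: bool = True) -> Branch:
    """Create a dense-prediction branch initialized from the image branch."""
    # Deep copy so later updates to the new branch never touch backbone weights.
    dense = copy.deepcopy(block.image_branch)
    dense.name = "dense"
    dense.trainable = True
    if freeze_backbone:
        # Assumption: gradient isolation is modelled as freezing the backbone.
        block.text_branch.trainable = False
        block.image_branch.trainable = False
    return dense


def trainable_weight_count(branches: List[Branch]) -> int:
    """Count weights that would receive gradient updates."""
    return sum(len(b.weights) for b in branches if b.trainable)


block = ToyMMDiTBlock(
    text_branch=Branch("text", [0.1, 0.2]),
    image_branch=Branch("image", [0.3, 0.4, 0.5]),
)
dense_branch = add_dense_branch(block)
all_branches = [block.text_branch, block.image_branch, dense_branch]
# Only the copied branch contributes trainable weights in this toy setup.
print(trainable_weight_count(all_branches))
```

In this sketch only the image branch is copied, so the added capacity scales with that branch rather than with the whole backbone, and the original text and image weights stay untouched while the new branch is trained.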

Ablation Study

Joint training creates complementary benefits: perception supervision sharpens structural alignment for generation, while generation training improves perception outputs with richer visual details.

BibTeX

@inproceedings{guo2026unigp,
  title     = {{UniGP}: Taming Diffusion Transformer for Prior-Preserved Unified Generation and Perception},
  author    = {Guo, Qin and Luo, Hao and Yue, Dongxu and Jin, Weixuan and Fu, Xiao and Wang, Fan and Xu, Dan},
  booktitle = {European Conference on Computer Vision (ECCV)},
  year      = {2026}
}