UniGP

Taming Diffusion Transformer for Prior-Preserved Unified Generation and Perception

1 The Hong Kong University of Science and Technology; 2 DAMO Academy, Alibaba Group, Zhejiang, China; 3 Hupan Lab, Zhejiang Province; 4 Zhejiang University, Zhejiang, China; 5 Tsinghua University; 6 The Chinese University of Hong Kong; 7 Zeekr Automobile R&D Co., Ltd.
†Corresponding author
UniGP teaser figure

UniGP simultaneously models RGB and dense distributions in a single diffusion transformer, supporting joint generation, dense perception, and any-condition text-to-image generation.

Abstract

Recent advances in diffusion models have shown impressive performance in controllable image generation and dense prediction tasks. However, existing approaches typically treat diffusion-based controllable generation and dense prediction as separate tasks, overlooking the potential benefits of jointly modeling heterogeneous distributions. We introduce UniGP, a framework built upon MMDiT that unifies controllable generation and dense prediction through simple joint training, without complex task-specific designs or losses, while preserving the backbone's versatile priors.

Controllable Generation

Generate RGB images from spatial conditions such as depth, normal, sketch, or pose.

Dense Perception

Estimate depth and surface normals from RGB images using the same unified model.

Joint Generation

Sample aligned RGB, depth, and normal outputs from text in a single generative process.
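
The three capabilities above differ only in which modalities are supplied and which are produced. The following is a minimal sketch of that routing, not the project's API: the task names, the `TaskSpec` type, and the `describe` helper are hypothetical, and treating text as an input to controllable generation is an assumption based on the "any-condition text-to-image" description.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TaskSpec:
    inputs: tuple   # modalities supplied to the model
    outputs: tuple  # modalities the model produces


# Hypothetical task names; the modality roles follow the descriptions above.
TASKS = {
    # "condition" stands for one spatial map: depth, normal, sketch, or pose.
    "controllable_generation": TaskSpec(inputs=("text", "condition"), outputs=("rgb",)),
    "dense_perception": TaskSpec(inputs=("rgb",), outputs=("depth", "normal")),
    "joint_generation": TaskSpec(inputs=("text",), outputs=("rgb", "depth", "normal")),
}


def describe(task_name):
    """Render one task as 'inputs -> outputs' for quick inspection."""
    spec = TASKS[task_name]
    return f"{' + '.join(spec.inputs)} -> {' + '.join(spec.outputs)}"


for name in TASKS:
    print(f"{name}: {describe(name)}")
```

All three entries are meant to be served by the same model; only the roles of the modalities change.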

Main Results

UniGP is evaluated across controllable generation, dense prediction, joint generation, and multi-condition joint generation. A single model handles all these settings while preserving geometric consistency and high-quality RGB synthesis.

UniGP qualitative results

Methodology

UniGP adds a Disentangled Unified Generation and Perception (DUGP) branch to MMDiT. The backbone remains a strong generative prior, while the DUGP branch, initialized as a copy of the backbone's image branch, learns non-RGB visual distributions for depth and surface-normal modeling.
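
The copy-based initialization can be illustrated with a toy example. This is a sketch under stated assumptions rather than the released implementation: `Branch` and `init_dense_branch` are hypothetical names, the weights are placeholder lists instead of tensors, and keeping the backbone frozen is an assumption (the page only states that the backbone's prior is preserved).

```python
import copy


class Branch:
    """Toy stand-in for one MMDiT branch: named weights plus a trainable flag."""

    def __init__(self, weights, trainable):
        self.weights = dict(weights)
        self.trainable = trainable


def init_dense_branch(image_branch):
    """Return a trainable deep copy of the backbone's image branch."""
    dense = copy.deepcopy(image_branch)  # independent weights, same starting values
    dense.trainable = True
    return dense


# Assumption: the backbone image branch stays frozen during training.
backbone_image = Branch({"attn.qkv": [0.1, -0.2], "mlp.fc1": [0.3]}, trainable=False)
dense_branch = init_dense_branch(backbone_image)

# Simulate one update to the copied branch; the backbone must not change.
dense_branch.weights["mlp.fc1"][0] += 0.05

print(backbone_image.weights["mlp.fc1"])  # backbone value is untouched
print(dense_branch.weights["mlp.fc1"])    # copied branch has moved
```

Because the copy is deep, updates to the new branch never write into the backbone's weights, which is the property the prior-preserving design relies on.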

Figures: framework of UniGP; UniGP representative design paradigms; the MMDiT block and initialization of DUGP.

Compared with duplicating the entire backbone or directly fine-tuning the generative model, UniGP reuses only the image branch needed for additional visual distributions and isolates dense-prediction gradients from the generative backbone.
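
The trade-off between the three design options can be made concrete with a small bookkeeping sketch. The strategy names, the `training_footprint` helper, and the parameter counts are all hypothetical; the sizes are made-up arbitrary units, not measurements of any real model. The sketch also assumes that a duplicated backbone leaves the original frozen.

```python
def training_footprint(branch_params, strategy):
    """Return (extra_params_added, backbone_receives_dense_gradients)."""
    total = sum(branch_params.values())
    if strategy == "duplicate_backbone":
        # A full second copy is trained; the original is assumed frozen.
        return total, False
    if strategy == "finetune_backbone":
        # No new parameters, but dense-prediction gradients reach the backbone.
        return 0, True
    if strategy == "copy_image_branch":
        # Only the image branch is copied; the backbone is left untouched.
        return branch_params["image"], False
    raise ValueError(f"unknown strategy: {strategy}")


# Made-up branch sizes in arbitrary units, for illustration only.
toy_params = {"text": 4, "image": 6}

for name in ("duplicate_backbone", "finetune_backbone", "copy_image_branch"):
    extra, touched = training_footprint(toy_params, name)
    print(f"{name}: extra={extra}, backbone_updated={touched}")
```

Under these assumptions, copying only the image branch is the one option that both adds fewer parameters than a full duplicate and keeps dense-prediction gradients away from the backbone.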

Ablation Study

Joint training creates complementary benefits: perception supervision sharpens structural alignment for generation, while generation training improves perception outputs with richer visual details.

UniGP ablation study

BibTeX

@inproceedings{guo2026unigp,
  title     = {{UniGP}: Taming Diffusion Transformer for Prior-Preserved Unified Generation and Perception},
  author    = {Guo, Qin and Luo, Hao and Yue, Dongxu and Jin, Weixuan and Fu, Xiao and Wang, Fan and Xu, Dan},
  booktitle = {European Conference on Computer Vision (ECCV)},
  year      = {2026}
}