UniGP jointly models RGB images and dense maps (depth and surface normals) in a single diffusion transformer, supporting joint generation, dense perception, and any-condition text-to-image generation.
Generate RGB images from spatial conditions such as depth, normal, sketch, or pose.
Estimate depth and surface normals from RGB images using the same unified model.
Sample aligned RGB, depth, and normal outputs from text in a single generative process.
UniGP is evaluated across controllable generation, dense prediction, joint generation, and multi-condition joint generation. A single model handles all these settings while preserving geometric consistency and high-quality RGB synthesis.
UniGP adds a Disentangled Unified Generation and Perception branch to MMDiT, built as a copy of the backbone's image branch. The backbone is kept as a strong generative prior, while the copied branch learns the non-RGB visual distributions needed for depth and surface-normal modeling.
Compared with duplicating the entire backbone or directly fine-tuning the generative model, UniGP reuses only the image branch needed for additional visual distributions and isolates dense-prediction gradients from the generative backbone.
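The idea of copying only the image branch and keeping dense-prediction updates out of the backbone can be illustrated with a deliberately tiny toy. This is a sketch under stated assumptions, not the UniGP implementation: each "branch" is a single scalar weight, the loss is a squared error, and every name (`backbone`, `dense_branch`, `train_step`) is hypothetical.

```python
import copy

# Toy stand-in for an MMDiT-style model: each branch is a dict holding one
# scalar weight. All names are hypothetical, not taken from the UniGP code.
backbone = {
    "text_branch": {"w": 0.5},
    "image_branch": {"w": 1.5},
}

# Copy only the image branch; the text branch is not duplicated.
dense_branch = copy.deepcopy(backbone["image_branch"])


def dense_loss_grad(branch, x, target):
    """Gradient of 0.5 * (w * x - target)**2 with respect to w."""
    return (branch["w"] * x - target) * x


def train_step(branch, x, target, lr=0.1):
    """Gradient step on the given branch only; the backbone is never touched."""
    branch["w"] -= lr * dense_loss_grad(branch, x, target)


# Snapshot the backbone so we can check that it is unaffected by training.
snapshot = copy.deepcopy(backbone)

for _ in range(50):
    train_step(dense_branch, x=2.0, target=1.0)

backbone_unchanged = backbone == snapshot
```

In this toy, the copied branch starts from the backbone's image-branch weight and drifts toward the value that fits the dense target, while `backbone` stays identical to its snapshot. In a real diffusion transformer the same separation would have to be enforced inside the autograd graph, for example by freezing backbone parameters or detaching shared activations; which mechanism UniGP uses is not stated here.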
Joint training creates complementary benefits: perception supervision sharpens structural alignment for generation, while generation training improves perception outputs with richer visual details.
@inproceedings{guo2026unigp,
title = {{UniGP}: Taming Diffusion Transformer for Prior-Preserved Unified Generation and Perception},
author = {Guo, Qin and Luo, Hao and Yue, Dongxu and Jin, Weixuan and Fu, Xiao and Wang, Fan and Xu, Dan},
booktitle = {European Conference on Computer Vision (ECCV)},
year = {2026}
}