IMPRINT: Generative Object Compositing by Learning Identity-Preserving Representation

Purdue University, Adobe

In CVPR 2024

Figure 1. Top: Comparison with three prior works, i.e., Paint-by-Example, ObjectStitch, and TF-ICON. Our method IMPRINT outperforms others in terms of identity preservation and color/geometry harmonization. Bottom: Given a coarse mask, IMPRINT can change the pose of the object to follow the shape of the mask.


Generative object compositing emerges as a promising new avenue for compositional image editing. However, the requirement of object identity preservation poses a significant challenge, limiting the practical usage of most existing methods. In response, this paper introduces IMPRINT, a novel diffusion-based generative model trained with a two-stage learning framework that decouples the learning of identity preservation from that of compositing. The first stage targets context-agnostic, identity-preserving pretraining of the object encoder, enabling the encoder to learn an embedding that is both view-invariant and conducive to enhanced detail preservation. The subsequent stage leverages this representation to learn seamless harmonization of the object composited into the background. In addition, IMPRINT incorporates a shape-guidance mechanism offering user-directed control over the compositing process. Extensive experiments demonstrate that IMPRINT significantly outperforms existing methods and various baselines in identity preservation and composition quality.

The Two-stage Pipeline

Stage 1

Context-agnostic ID-preserving stage: we design a novel image encoder (with pre-trained DINOv2 as the backbone), trained on multi-view object pairs, to learn a view-invariant, ID-preserving representation.
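The exact pretraining objective is not spelled out on this page; as an illustration only, one plausible view-invariance loss over encoder embeddings of two views of the same object (hypothetical helper names `cosine` and `view_invariance_loss`, plain Python vectors standing in for encoder outputs) could look like:

```python
import math

def cosine(u, v):
    # Cosine similarity between two embedding vectors.
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def view_invariance_loss(emb_view_a, emb_view_b):
    # Pulls the embeddings of two views of the SAME object together:
    # the loss reaches zero when the encoder maps both views to the
    # same direction, i.e., the representation is view-invariant.
    return 1.0 - cosine(emb_view_a, emb_view_b)

# Two aligned embeddings (same direction) incur zero loss;
# orthogonal embeddings incur the maximum penalty of 1.0.
aligned = view_invariance_loss([1.0, 0.0], [2.0, 0.0])      # → 0.0
orthogonal = view_invariance_loss([1.0, 0.0], [0.0, 1.0])   # → 1.0
```

This sketches only the invariance term; the actual training presumably also includes a reconstruction or detail-preservation signal so that the embedding retains object identity, not just viewpoint invariance.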

Stage 2

Object-compositing stage: taking the image encoder learned in the first stage and freezing its backbone, the full model is trained to composite the object into the masked region (see Fig. 3 for the blending process).

Figure 2. The two-stage training pipeline of the proposed IMPRINT.

The Background-blending Process

Figure 3. Illustration of the background-blending process. At each denoising step, the background area of the denoised latent is masked out and blended with the unmasked area of the clean background (intuitively, the model only denoises the foreground).
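The blending step described in the caption can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: `blend_background` is a hypothetical helper, latents are flattened to plain Python lists, and the mask is binary with 1 marking the foreground (object) region:

```python
def blend_background(denoised, clean_bg, mask):
    """Per-element blend applied after each denoising step.

    denoised  -- model's denoised latent at this step (flat list of floats)
    clean_bg  -- latent of the clean background image (same shape)
    mask      -- 1 inside the object region, 0 in the background

    Foreground positions keep the model's prediction; background
    positions are overwritten from the clean background latent, so
    the diffusion model effectively denoises only the foreground.
    """
    return [m * d + (1 - m) * b
            for d, b, m in zip(denoised, clean_bg, mask)]

# Toy example: positions 0 and 2 are foreground, position 1 is background.
blended = blend_background([1.0, 2.0, 3.0], [9.0, 8.0, 7.0], [1, 0, 1])
# → [1.0, 8.0, 3.0]
```

In a real sampler this blend would run once per denoising step on the latent tensor (and the background latent would be noised to the current timestep before blending, a detail omitted here for brevity).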

Shape-guided Controllable Compositing

Figure 4. More shape-control results. IMPRINT offers additional user control by taking a user-provided mask as input. Inspired by SmartBrush, we define four types of masks (including bounding boxes). Beyond object compositing, our model can also edit the input object: depending on the shape of the coarse mask, IMPRINT performs different types of editing, such as changing the viewpoint of the object or applying non-rigid transformations to it.
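Of the four mask types, only the bounding-box variant is named above, so as an assumption-laden sketch, here is how one coarse mask type could be derived from a precise object mask (hypothetical helper `bbox_mask`, masks as 2D lists of 0/1):

```python
def bbox_mask(obj_mask):
    """Coarsen a precise object mask into its tight bounding-box mask.

    obj_mask -- 2D list of 0/1, where 1 marks object pixels.
    Returns a mask of the same shape filled over the bounding box,
    one of the coarse mask types (the others could be, e.g., dilated
    or hand-drawn masks; the paper's exact definitions are not given
    on this page).
    """
    rows = [i for i, row in enumerate(obj_mask) if any(row)]
    cols = [j for j in range(len(obj_mask[0]))
            if any(row[j] for row in obj_mask)]
    r0, r1, c0, c1 = min(rows), max(rows), min(cols), max(cols)
    return [[1 if r0 <= i <= r1 and c0 <= j <= c1 else 0
             for j in range(len(obj_mask[0]))]
            for i in range(len(obj_mask))]

precise = [[0, 0, 0, 0],
           [0, 1, 0, 0],
           [0, 0, 1, 0],
           [0, 0, 0, 0]]
coarse = bbox_mask(precise)
# → [[0, 0, 0, 0],
#    [0, 1, 1, 0],
#    [0, 1, 1, 0],
#    [0, 0, 0, 0]]
```

Training on progressively coarser masks is what lets the model treat the mask as a soft shape hint rather than an exact silhouette, which is why a loose mask can induce pose or viewpoint changes at inference time.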

The ID-preserving Representation

Figure 5. Top: Results of context-agnostic ID-preserving pretraining (after the first stage); IMPRINT generates viewpoint and pose changes while preserving the details of the object. Bottom: Diverse poses of the object after the second stage.


BibTeX

@article{song2024imprint,
    title={IMPRINT: Generative Object Compositing by Learning Identity-Preserving Representation},
    author={Song, Yizhi and Zhang, Zhifei and Lin, Zhe and Cohen, Scott and Price, Brian and Zhang, Jianming and Kim, Soo Ye and Zhang, He and Xiong, Wei and Aliaga, Daniel},
    journal={arXiv preprint arXiv:2403.10701},
    year={2024}
}