Personalized image generation has emerged from recent advances in generative models. However, generated personalized images often suffer from localized artifacts, such as incorrect logos, that reduce the fidelity and fine-grained identity details of the results, and little prior work has tackled this problem. To improve identity details in personalized image generation, we introduce a new task: reference-guided artifact refinement. We present Refine-by-Align, a first-of-its-kind diffusion-based framework that addresses this challenge. Our model consists of two stages, the Alignment Stage and the Refinement Stage, which share the weights of a single unified network. Given a generated image, a mask over the artifact region, and a reference image, the alignment stage identifies and extracts the corresponding regional features in the reference, which the refinement stage then uses to fix the artifacts. Our model-agnostic pipeline requires no test-time tuning or optimization. It automatically enhances image fidelity and reference identity in the generated image, and generalizes well to existing models on various tasks including, but not limited to, customization, generative compositing, view synthesis, and virtual try-on. Extensive experiments and comparisons demonstrate that our pipeline greatly pushes the boundary of fine detail in image synthesis models.
Figure 2. Top: During training, we train a diffusion model for object completion, guided by a reference image. In alignment mode, the reference is a complete object, so the model learns to locate the relevant region of the reference for object completion, maximizing the spatial correlation in the attention maps. In refinement mode, this region is provided directly as the reference. Bottom: During inference, the inputs are a generated image with its artifacts marked and a reference object. In the alignment stage, we run the cross-attention alignment algorithm to compute the correspondence map. In the refinement stage, the correspondence map is used to locate the region of the reference that corresponds to the artifacts, which guides the refinement.
Figure 3. Top: Visualization of our cross-attention alignment algorithm. The artifact mask is used to extract the spatial correlations between the artifacts and the reference; the output of the algorithm, the correspondence map, indicates the region of the reference that corresponds to the artifact area. Middle and Bottom: Correspondence maps across different transformer layers and timesteps.
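The alignment step described above can be sketched as follows: average the cross-attention rows whose query positions fall inside the artifact mask, yielding a map over reference positions. This is a minimal illustration, not the paper's implementation; the attention layout (generated-image tokens as queries, reference tokens as keys) and the normalization are our assumptions, and `correspondence_map` is a hypothetical helper name.

```python
import numpy as np

def correspondence_map(attn, artifact_mask):
    """Sketch of cross-attention alignment (assumed layout).

    attn: (N_gen, N_ref) cross-attention weights, where N_gen and N_ref are
          the flattened token counts of the generated and reference images.
    artifact_mask: (N_gen,) binary mask marking the artifact region.
    Returns a (N_ref,) correspondence map highlighting the reference region
    that attends most to the masked artifact tokens.
    """
    # Keep only the attention rows whose queries lie inside the artifact mask.
    rows = attn[artifact_mask.astype(bool)]      # (n_masked, N_ref)
    corr = rows.mean(axis=0)                     # aggregate over masked queries
    # Normalize to [0, 1] for thresholding / visualization (assumption).
    corr = (corr - corr.min()) / (corr.max() - corr.min() + 1e-8)
    return corr
```

Thresholding this map then selects the reference region that is cropped and fed to the refinement stage.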
Figure 4.
Figure 5. Running the cross-attention alignment algorithm on the test set to find the best combination of timestep and transformer layer. Left: mIoU across all timesteps, averaged over all layers and images. Right: mIoU across all layers, averaged over all timesteps and images.
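The sweep in the figure above amounts to scoring each (timestep, layer) pair by the mean IoU between thresholded correspondence maps and ground-truth reference masks, then picking the best pair. A minimal sketch, assuming precomputed correspondence maps keyed by (timestep, layer) and a hypothetical threshold of 0.5:

```python
import numpy as np

def iou(pred, gt):
    """Intersection-over-union of two binary masks."""
    pred, gt = pred.astype(bool), gt.astype(bool)
    union = np.logical_or(pred, gt).sum()
    return np.logical_and(pred, gt).sum() / union if union else 0.0

def best_timestep_and_layer(corr_maps, gt_masks, threshold=0.5):
    """corr_maps: dict mapping (timestep, layer) -> list of correspondence
    maps (one per test image); gt_masks: matching list of ground-truth
    reference-region masks. Returns the (timestep, layer) pair whose
    thresholded maps achieve the highest mean IoU."""
    scores = {
        key: np.mean([iou(m >= threshold, g) for m, g in zip(maps, gt_masks)])
        for key, maps in corr_maps.items()
    }
    return max(scores, key=scores.get)
```

Averaging the same scores over layers (or over timesteps) gives the two curves shown in the figure.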
Figure 6. Qualitative comparisons. Note that the accurate reference regions corresponding to the artifacts (not the complete reference) are provided to PbE, OS, and AnyDoor. In the second row of references, we overlay the correspondence maps. Compared with the baselines, our model not only preserves identity (most similar to the second row) but also generates smooth and natural results in which the artifacts are significantly reduced.