Refine-by-Align: Reference-Guided Artifacts Refinement through Semantic Alignment

Purdue University, Adobe

Figure 1. Refine-by-Align. Given a generated image (with artifacts), a free-form mask indicating the artifact region in the generated image, and a high-quality reference image containing important details such as an identity logo or font, our model automatically refines the artifacts in the generated image by leveraging the corresponding details from the reference. The proposed method benefits various applications (e.g., DreamBooth for text-to-image customization, IDM-VTON for virtual try-on, AnyDoor for object composition, and Zero123++ for novel view synthesis).

Abstract

Personalized image generation has emerged from recent advances in generative models. However, the generated personalized images often suffer from localized artifacts such as incorrect logos, which reduce the fidelity and fine-grained identity details of the results, and little prior work has tackled this problem. To improve these identity details in personalized image generation, we introduce a new task: reference-guided artifacts refinement. We present Refine-by-Align, a first-of-its-kind model that employs a diffusion-based framework to address this challenge. Our model consists of two stages, an Alignment Stage and a Refinement Stage, which share the weights of a unified neural network. Given a generated image, a masked artifact region, and a reference image, the alignment stage identifies and extracts the corresponding regional features in the reference, which the refinement stage then uses to fix the artifacts. Our model-agnostic pipeline requires no test-time tuning or optimization. It automatically enhances image fidelity and reference identity in the generated image, and generalizes well to existing models on various tasks including, but not limited to, customization, generative compositing, view synthesis, and virtual try-on. Extensive experiments and comparisons demonstrate that our pipeline greatly pushes the boundary of fine details in image synthesis models.

The Pipeline

Figure 2. Top: During training, we train a diffusion model for object completion, guided by a reference image. In alignment mode, the reference is a complete object, so the model learns to locate the relevant region in the reference for object completion, thus maximizing the spatial correlation in the attention maps. In refinement mode, this region is directly provided as the reference. Bottom: During inference, the inputs are a generated image with the artifacts marked and a reference object. In the alignment stage, we run the cross-attention alignment algorithm to obtain the correspondence map. In the refinement stage, the correspondence map is used to find the region in the reference that corresponds to the artifacts, which then guides the refinement.
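To make the two-stage flow concrete, below is a minimal sketch of the inference loop. It is written under our own assumptions, not against the released code: `model.align`, `model.refine`, and `crop_by_heatmap` are hypothetical placeholders for the shared-weight model's two modes and for cropping the reference by the correspondence map (an alignment sketch follows Figure 4).

```python
# Minimal sketch of Refine-by-Align inference; `model.align`, `model.refine`,
# and `crop_by_heatmap` are hypothetical placeholders, not a released API.
import torch

@torch.no_grad()
def refine_by_align(generated, artifact_mask, reference, model):
    """generated: (3, H, W) image with artifacts; artifact_mask: (H, W) binary;
    reference: (3, H, W) high-quality reference object."""
    # Alignment stage: run the model in alignment mode and derive a
    # correspondence map over the reference from its cross-attention.
    corr_map = model.align(generated, artifact_mask, reference)
    # Crop the reference region that the artifacts correspond to.
    ref_region = crop_by_heatmap(reference, corr_map, threshold=0.5)
    # Refinement stage: the same weights inpaint the masked area, now
    # conditioned on the extracted region instead of the full object.
    return model.refine(generated, artifact_mask, ref_region)
```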

Visualization of the Proposed Algorithm

Figure 3. Top: Visualization of our cross-attention alignment algorithm. The artifact mask is used to extract the spatial correlations between the artifacts and the reference; the output of the algorithm, the correspondence map, indicates the region in the reference that corresponds to the artifact area. Middle and Bottom: Correspondence maps across different transformer layers and timesteps.
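Overlays like those in Figure 3 can be reproduced by upsampling the low-resolution correspondence map to the reference resolution and alpha-blending it as a heatmap; the following is our own visualization sketch, not the authors' code.

```python
# Upsample a (h, w) correspondence map and blend it onto the reference image.
import numpy as np
import torch
import torch.nn.functional as F
import matplotlib.cm as cm
from PIL import Image

def overlay_correspondence(reference: Image.Image, corr_map: torch.Tensor,
                           alpha: float = 0.5) -> Image.Image:
    # PIL size is (W, H); F.interpolate expects (H, W).
    heat = F.interpolate(corr_map[None, None].float(), size=reference.size[::-1],
                         mode="bilinear", align_corners=False)[0, 0]
    heat = (heat - heat.min()) / (heat.max() - heat.min() + 1e-8)
    # Colorize with a jet colormap and alpha-blend onto the reference.
    rgba = cm.jet(heat.cpu().numpy())[..., :3]  # (H, W, 3) in [0, 1]
    base = np.asarray(reference.convert("RGB"), dtype=np.float32) / 255.0
    blended = (1.0 - alpha) * base + alpha * rgba
    return Image.fromarray((blended * 255).astype(np.uint8))
```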

Algorithm of Cross-Attention Alignment

Figure 4. The cross-attention alignment algorithm.
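As a concrete companion to the figure, here is one plausible implementation of the alignment step. The tensor layout is an assumption: `attn` holds cross-attention weights of shape (heads, Hg*Wg, Hr*Wr) between generated-image queries and reference keys, collected at a single layer and timestep (see Figure 5 for how that choice is made); how the attention is hooked out of the model is omitted.

```python
# One plausible reading of the cross-attention alignment step, sketched under
# assumed tensor shapes; not the authors' reference implementation.
import torch
import torch.nn.functional as F

def cross_attention_alignment(attn: torch.Tensor, artifact_mask: torch.Tensor,
                              gen_hw: tuple, ref_hw: tuple) -> torch.Tensor:
    """attn: (heads, Hg*Wg, Hr*Wr); artifact_mask: (H, W) binary.
    Returns a (Hr, Wr) correspondence map over the reference image."""
    # Downsample the binary artifact mask to the attention resolution and
    # flatten it to index query positions inside the artifact region.
    mask = F.interpolate(artifact_mask[None, None].float(), size=gen_hw,
                         mode="nearest")[0, 0].flatten() > 0.5
    assert mask.any(), "artifact mask is empty at attention resolution"
    # Average over heads and over the masked query positions: each reference
    # location is scored by how strongly the artifact region attends to it.
    corr = attn.mean(dim=0)[mask].mean(dim=0)  # (Hr*Wr,)
    corr = corr.reshape(ref_hw)
    # Normalize to [0, 1] so the map can be thresholded or visualized.
    return (corr - corr.min()) / (corr.max() - corr.min() + 1e-8)
```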

mIoU across Timestep and Transformer Layer

Figure 5. Running the cross-attention alignment algorithm on the test set to find the best combination of timestep and transformer layer. Left: mIoU across all timesteps, averaged over all layers and images; Right: mIoU across all layers, averaged over all timesteps and images.
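The sweep behind Figure 5 amounts to a grid search: for each (timestep, layer) pair, threshold the correspondence map, compute its IoU against a ground-truth reference-region mask, and average over the test set. The sketch below reuses the alignment function above; `model.collect_attention` and the test-set layout are assumptions, not the released interface.

```python
# Grid search over (timestep, layer) pairs, reporting mean IoU per pair.
import torch

def iou(pred: torch.Tensor, gt: torch.Tensor) -> float:
    inter = (pred & gt).sum().item()
    union = (pred | gt).sum().item()
    return inter / union if union > 0 else 0.0

def sweep_miou(test_set, model, timesteps, layers, gen_hw, ref_hw, thr=0.5):
    miou = torch.zeros(len(timesteps), len(layers))
    # Each sample pairs the inputs with a ground-truth (Hr, Wr) region mask.
    for generated, artifact_mask, reference, gt_mask in test_set:
        for i, t in enumerate(timesteps):
            for j, l in enumerate(layers):
                attn = model.collect_attention(generated, artifact_mask,
                                               reference, timestep=t, layer=l)
                corr = cross_attention_alignment(attn, artifact_mask,
                                                 gen_hw, ref_hw)
                miou[i, j] += iou(corr > thr, gt_mask.bool())
    miou /= len(test_set)
    # Row/column means correspond to the two curves plotted in Figure 5.
    return miou, miou.mean(dim=1), miou.mean(dim=0)
```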

Qualitative Comparisons

Figure 6. Qualitative comparisons. Note that the accurate reference regions corresponding to the artifacts (not the complete reference) are provided to PbE, OS, and AnyDoor. In the second row, we overlay the correspondence maps on the references. Compared with the baselines, our model not only preserves identity (most similar to the second row) but also generates smooth and natural results in which the artifacts are significantly reduced.

BibTeX


Coming soon!