Diagnostic Benchmark and Iterative Inpainting for Layout-Guided Image Generation

1University of North Carolina at Chapel Hill, 2Microsoft Research

Summary

Teaser
LayoutBench evaluates layout-guided image generation models with out-of-distribution (OOD) layouts in four skills: number, position, size, and shape. Existing models (b) LDM and (c) ReCo fail on OOD layouts by misplacing objects. (d) IterInpaint is our new baseline with better generalization to OOD layouts.
Teaser
IterInpaint is a new baseline for layout-guided image generation. Unlike previous methods that generate all objects in a single step, IterInpaint decomposes the image generation process into multiple steps and uses an inpainting model to update regions step-by-step. This decomposition makes each generation step easier by allowing the model to focus on generating a single foreground object or background.

Abstract

Spatial control is a core capability of controllable image generation: the goal is to generate images that follow given spatial input configurations. Advances in layout-to-image generation have shown promising results on in-distribution (ID) datasets with similar spatial configurations. However, it is unclear how these models perform when facing out-of-distribution (OOD) samples with arbitrary, unseen layouts.

In this paper, we propose LayoutBench, a diagnostic benchmark that examines four categories of spatial control skills: number, position, size, and shape. We benchmark two recent representative layout-guided image generation methods and observe that good ID layout control may not generalize well to arbitrary layouts in the wild (e.g., objects at the image boundary).

Next, we propose IterInpaint, a new baseline that generates foreground and background regions in a step-by-step manner via inpainting, demonstrating stronger generalizability than existing models on OOD layouts in LayoutBench.

We perform quantitative and qualitative evaluation and fine-grained analysis on the four LayoutBench skills to pinpoint the weaknesses of existing models. Lastly, we show comprehensive ablation studies on IterInpaint, including training task ratio, crop&paste vs. repaint, and generation order.

LayoutBench: a New Diagnostic Benchmark for Layout-Guided Image Generation

ID vs. OOD Layouts in four spatial control skills

We measure 4 spatial control skills (number, position, size, shape), where each skill consists of 2 OOD layout splits, i.e., 8 tasks in total = 4 skills x 2 splits. To disentangle spatial control from other aspects of image generation, such as generating diverse objects, LayoutBench keeps the object configurations of CLEVR and changes only the spatial layouts. The images below show example ID (CLEVR) and OOD (LayoutBench) layouts. GT boxes are shown in blue.
Teaser

Evaluation Process with LayoutBench

We test the OOD layout skills of models trained on the CLEVR (ID) dataset. First, 1) we query the image generation models with OOD layouts. Then, 2) we detect objects in the generated images with an object detector and calculate layout accuracy as average precision (AP). As shown below, existing models often misplace objects given OOD layouts from LayoutBench, which motivates us to propose IterInpaint, a new baseline for layout-guided image generation.
Teaser
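The protocol can be summarized in a short loop. Below is a minimal Python sketch of the two evaluation steps, where model.generate and detector are hypothetical stand-ins for a layout-to-image model and a CLEVR-trained object detector; AP/AP50 are computed with torchmetrics.

# Minimal sketch of the LayoutBench evaluation loop. `model.generate` and
# `detector` are hypothetical stand-ins, not APIs from the paper's codebase.
from torchmetrics.detection import MeanAveragePrecision

def layout_accuracy(model, detector, ood_layouts):
    """ood_layouts: dicts with 'boxes' (xyxy tensors) and 'labels' (class ids)."""
    metric = MeanAveragePrecision(box_format="xyxy")
    for layout in ood_layouts:
        # Step 1: query the generator with an OOD layout.
        image = model.generate(boxes=layout["boxes"], labels=layout["labels"])
        # Step 2: detect objects in the generated image and score them
        # against the input layout as if it were ground truth.
        pred = detector(image)  # {'boxes': ..., 'scores': ..., 'labels': ...}
        metric.update(
            preds=[pred],
            target=[{"boxes": layout["boxes"], "labels": layout["labels"]}],
        )
    scores = metric.compute()
    return scores["map"], scores["map_50"]  # AP and AP50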

IterInpaint: a New Baseline for Layout-Guided Image Generation

We propose IterInpaint, a new baseline for layout-guided image generation. Unlike previous methods that generate all objects in a single step, IterInpaint decomposes the image generation process into multiple steps and uses a text-guided inpainting model to update foreground and background regions step-by-step. This decomposition makes each generation step easier by allowing the model to focus on generating a single foreground object or background. We implement IterInpaint by extending Stable Diffusion, a public text-to-image model based on LDM. To enable inpainting, we extend the U-Net of Stable Diffusion to take the mask and a context image as additional inputs.
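As a concrete illustration of that U-Net extension, here is a minimal sketch following the standard inpainting recipe in the HuggingFace diffusers implementation of Stable Diffusion: the first convolution is widened from 4 latent channels to 9 (4 noisy latent + 1 downsampled mask + 4 context-image latent), with the new channels zero-initialized so training starts from the pretrained text-to-image behavior. The checkpoint id is illustrative.

# Sketch: widen the Stable Diffusion U-Net input from 4 channels (noisy
# latent) to 9 (+1 mask, +4 context-image latent). Layer names follow the
# HuggingFace diffusers implementation; the checkpoint id is illustrative.
import torch
import torch.nn as nn
from diffusers import UNet2DConditionModel

unet = UNet2DConditionModel.from_pretrained(
    "CompVis/stable-diffusion-v1-4", subfolder="unet")

old = unet.conv_in  # originally Conv2d(4, 320, kernel_size=3, padding=1)
new = nn.Conv2d(9, old.out_channels, old.kernel_size, old.stride, old.padding)
with torch.no_grad():
    new.weight.zero_()              # new channels contribute nothing at init
    new.weight[:, :4] = old.weight  # reuse pretrained latent-channel weights
    new.bias.copy_(old.bias)
unet.conv_in = new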

Training

We use a single objective to cover both foreground and background inpainting by giving IterInpaint different context images and masks: (1) foreground inpainting - from N GT objects, sample context objects to show, then sample an object to generate; (2) background inpainting - mask out all objects and generate the background.
Teaser
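A minimal sketch of how one training example might be built, with assumed (hypothetical) helpers paste(canvas, image, boxes), which copies the given box regions of an image onto a canvas, and box_mask(boxes), a binary mask that is 1 inside the boxes. The "Add ..." prompt format and the 50/50 task split are illustrative (the paper ablates the training task ratio), and the real pipeline operates on Stable Diffusion latents rather than pixels.

import random

# paste() and box_mask() are assumed helpers; see the lead-in above.
def make_training_example(image, objects, blank, p_bg=0.5):
    """objects: list of (caption, box) from the GT scene.
    Returns (context, mask, prompt, target) for the inpainting objective."""
    boxes = [b for _, b in objects]
    if random.random() < p_bg:
        # (2) Background inpainting: all objects are visible in the context;
        # the model fills in everything outside their boxes.
        context = paste(blank, image, boxes)
        return context, 1 - box_mask(boxes), "Add background", image
    # (1) Foreground inpainting: show a random subset of objects as context,
    # then ask the model to paint one held-out object inside its box.
    random.shuffle(objects)
    k = random.randint(0, len(objects) - 1)
    caption, box = objects[k]
    context = paste(blank, image, [b for _, b in objects[:k]])
    target = paste(blank, image, [b for _, b in objects[:k + 1]])
    return context, box_mask([box]), f"Add {caption}", target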


Inference

We start from a blank image and iteratively update foreground objects and the background. At each generation step, we provide the inpainting model with (1) a context image (initialized as a blank image for the first object), (2) a text prompt, and (3) a binary mask, to obtain a generated image. We then update the image by composing the generated output with the context image using the mask. IterInpaint generates the final image in N+1 iterations (N foreground objects + 1 background). Users can control the generation order of each region and interactively manipulate the image from an intermediate generation step.
Teaser
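The loop below sketches this procedure end to end, reusing the assumed helpers from the training sketch above (box_mask, the "Add ..." prompt format) and writing the crop&paste composite explicitly; inpaint stands for the extended Stable Diffusion model, with images and masks as tensors in [0, 1].

# Sketch of IterInpaint inference: N foreground steps + 1 background step.
# `inpaint(context, mask, prompt)` is the extended Stable Diffusion model.
def iterinpaint(layout, inpaint, blank):
    """layout: list of (caption, box); returns the final generated image."""
    image = blank  # (1) the context starts as a blank canvas
    for caption, box in layout:  # the generation order is user-controllable
        mask = box_mask([box])
        gen = inpaint(context=image, mask=mask, prompt=f"Add {caption}")
        image = mask * gen + (1 - mask) * image  # crop&paste composite
    # Final step: fill in the background around all generated objects.
    bg_mask = 1 - box_mask([box for _, box in layout])
    gen = inpaint(context=image, mask=bg_mask, prompt="Add background")
    return bg_mask * gen + (1 - bg_mask) * image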

Evaluation Results

We evaluate two recent and strong layout-guided image generation models, LDM and ReCo, and our IterInpaint. For quantitative evaluation, we measure the layout accuracy with average precision (AP) and image quality with FID/SceneFID. We also conduct qualitative evaluations, and more detailed fine-grained skill analysis.

Quantitative Evaluation - Layout Accuracy

The first row shows the layout accuracy based on GT images. Our object detector achieves high accuracy on both CLEVR and LayoutBench, showing its high reliability. The second row (GT shuffled) shows a setting where, given a target layout, we randomly sample an image from the GT images and treat it as the generated image. The 0% AP on both CLEVR and LayoutBench means that it is impossible to obtain high AP by generating high-fidelity images in the wrong layouts.

As shown in the bottom half of the table, while all three models achieve high layout accuracy on CLEVR, the layout accuracy drops by large margins on LayoutBench, showing the ID-OOD layout gap. Specifically, LDM and ReCo fail substantially on LayoutBench across all skill splits, with an average AP50 drop of 57~70% per skill compared to their high AP on the in-domain CLEVR validation split. In contrast, IterInpaint generalizes better to OOD layouts in LayoutBench, while maintaining or even slightly improving layout accuracy on ID layouts in CLEVR.

Teaser
Layout accuracy in AP/AP50 (%) on CLEVR and LayoutBench. Best (highest) values are bolded.


Quantitative Evaluation - Image Quality

On CLEVR, LDM/ReCo achieve better FID/SceneFID than IterInpaint, indicating that the strong layout control of IterInpaint comes with a trade-off in these image quality metrics. However, on LayoutBench, the three models achieve similar FID scores despite the significant layout errors of LDM and ReCo, which suggests that image quality metrics alone are not sufficient for evaluating layout-guided image generation and further justifies using layout accuracy to examine layout control closely.
Teaser
Image Quality in FID/SceneFID on CLEVR and LayoutBench. Best (lowest) values are bolded.


Qualitative Evaluation

On CLEVR, all three models can follow the ID layout inputs to place the correct objects precisely. On LayoutBench, LDM and ReCo often make mistakes, such as generating objects much smaller (e.g., Number-few) or bigger (e.g., Size-tiny, Position-center) than the given bounding boxes, and missing some objects (e.g., Number-many, Position-center, Position-boundary, Size-large). In contrast, IterInpaint generally generates objects that are more accurately aligned with the given bounding boxes, which is consistent with its higher layout accuracy. Especially for the extremely small bounding boxes in Size-tiny, only IterInpaint, among the three models, generates objects that fit.
Teaser
Comparison of generated images on CLEVR (ID) and LayoutBench (OOD). GT boxes are shown in blue.


Fine-grained Skill Analysis

We perform a more detailed analysis on each LayoutBench skill to better understand the challenges it presents and to examine the weaknesses of each method. Specifically, we divide the 4 skills into more fine-grained splits that cover both ID (CLEVR) and OOD (LayoutBench) configurations. We sample 200 images for each split and report layout accuracy. Comparing across the 4 skills, most Size splits (except size=2) are the least challenging, while the Position and Number skills are the most challenging. IterInpaint significantly outperforms LDM and ReCo on all splits. Between the other two, LDM generally scores slightly higher than ReCo.
Teaser
Detailed layout accuracy analysis with fine-grained splits of the 4 LayoutBench skills. In-distribution splits (same attributes as CLEVR) are colored in gray. For the Shape skill, the splits are named after their height/width ratio (e.g., the H2W1 split consists of objects with a 2:1 height:width ratio).

Additional IterInpaint Generation Examples

User-defined Layouts

We show three input layouts: (1) two rows of objects with different sizes, (2) the text ‘AI’, and (3) a heart shape. While ReCo often ignores or misplaces some objects, IterInpaint places objects significantly more accurately. We show the generation process of these images at the bottom.

Teaser

Interactive Image Manipulation

With IterInpaint, users can interactively manipulate images with a binary mask and a text prompt to add or remove objects at arbitrary locations.

Teaser

Crop & Paste vs. Repaint

Instead of the default crop&paste, we experiment with repainting the entire image during inference. The repaint-based update encodes/decodes the whole image at each step and suffers from error propagation (i.e., early objects get distorted as steps progress). The two update rules differ only in how the model's output is merged back into the running image; a schematic contrast is sketched below.
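A minimal sketch of the contrast, assuming images and masks are tensors in [0, 1]:

def crop_and_paste_update(context, gen, mask):
    # Default: copy only the masked region from the new generation; pixels
    # outside the mask are untouched, so earlier objects never degrade.
    return mask * gen + (1 - mask) * context

def repaint_update(context, gen, mask):
    # Repaint: keep the full decoder output. Since the whole image is
    # re-encoded/decoded by the VAE at every step, reconstruction errors
    # in earlier objects accumulate over iterations.
    return gen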

Crop & Paste (Default)

Teaser

Repaint

Teaser

Arbitrary-order Generation with User-defined Layouts

Existing iterative text-to-image generation models either update entire images at each step (e.g., Stable Diffusion) or update fixed-shaped patches and are sensitive to generation order (e.g., DALL-E, X-LXMERT). In contrast, IterInpaint can generate images by updating regions of different shapes and sizes in arbitrary orders. We experiment with different generation orders (e.g., random, top-to-bottom, bottom-to-top, big-to-small, small-to-big) and find that IterInpaint is not sensitive to generation order. This allows users to manipulate object layouts without re-generating images from scratch.

Two rows (8 objects)

Teaser
Teaser
Teaser
Teaser
Teaser

"AI" (16 objects)

Teaser
Teaser
Teaser
Teaser
Teaser

Heart (20 objects)

Teaser
Teaser
Teaser
Teaser
Teaser

COCO Experiments

Although our main focus is constructing the diagnostic LayoutBench benchmark with full scene control and evaluating layout-guided image generation models, we also test whether IterInpaint can perform well on real images from the COCO dataset. Below, we first show image generation samples from ReCo and IterInpaint on in-distribution COCO layouts, where both models place objects in the correct positions. We then show some arbitrary custom (out-of-distribution) layouts with COCO objects. While both models place objects in the correct locations, ReCo sometimes places wrong objects that are frequent in a given layout, while IterInpaint renders the specified objects more faithfully.

In-Distribution Layouts

Teaser

Out-of-Distribution Layouts

Teaser

Citation

Please cite our paper if you use our dataset and/or method in your projects.

@article{Cho2023LayoutBench,
  author = {Jaemin Cho and Linjie Li and Zhengyuan Yang and Zhe Gan and Lijuan Wang and Mohit Bansal},
  title = {Diagnostic Benchmark and Iterative Inpainting for Layout-Guided Image Generation},
  year = {2023},
}