Text-guided diffusion models such as DALL-E 2, Imagen, eDiff-I, and Stable Diffusion can generate an effectively endless variety of images given only a short text prompt describing the desired content, and in many cases the images are of very high quality. However, these models often struggle to compose scenes containing several key objects, such as characters, in specified positional relationships. The missing capability to "direct" the placement of characters and objects both within and across images is crucial in storytelling, as recognized in the literature on film and animation theory. In this work, we take a particularly straightforward approach to providing the needed direction. Drawing on the observation that the cross-attention maps for prompt words reflect the spatial layout of the objects denoted by those words, we introduce an optimization objective that produces "activation" at desired positions in these cross-attention maps. The resulting approach is a step toward generalizing the applicability of text-guided diffusion models beyond single images to collections of related images, as in storybooks. Directed Diffusion provides easy high-level positional control over multiple objects, while making use of an existing pre-trained model and maintaining a coherent blend between the positioned objects and the background. Moreover, it requires only a few lines of code to implement.
Directed Diffusion (DD) builds on the following observation: the overall position and shape of a synthesized object are established near the beginning of the diffusion denoising process, while the later steps refine details without changing this layout.
DD can be divided into two stages: Attention Editing and Conventional SD Denoising. Attention Editing directly strengthens and weakens the activations in a specified region of selected cross-attention maps through optimization. Trailing attention maps are those that do not correspond to words in the prompt; they play a central role in establishing consistency between the subjects and the surrounding environment shown in the image. The number of trailing attention maps is chosen by minimizing our objective, selecting the setting that best directs the object in the prompt attention maps. A rough sketch of this stage is given below.
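As a rough illustration only (not the exact objective or implementation used in the paper), the sketch below shows one way such an attention-guidance objective could look: it rewards attention mass inside a target bounding box and penalizes it outside, for the selected prompt-token and trailing maps, and the latents are nudged by a gradient step during the early denoising iterations only. The tensor shapes, helper names, and the simple inside/outside loss are assumptions made for illustration.

```python
import torch

def region_mask(bbox, H, W, device):
    """Binary mask for a bounding box given as fractions (x0, y0, x1, y1)."""
    x0, y0, x1, y1 = bbox
    m = torch.zeros(H, W, device=device)
    m[int(y0 * H):int(y1 * H), int(x0 * W):int(x1 * W)] = 1.0
    return m.flatten()                          # shape (H*W,)

def direction_loss(attn_maps, bbox, token_ids, H, W):
    """attn_maps: assumed (heads, H*W, tokens) cross-attention probabilities.
    Rewards activation inside the box and penalizes it outside, for the
    selected prompt-token and trailing attention maps."""
    mask = region_mask(bbox, H, W, attn_maps.device)
    sel = attn_maps[:, :, token_ids]            # maps to be directed
    inside = (sel * mask[None, :, None]).mean()
    outside = (sel * (1.0 - mask)[None, :, None]).mean()
    return outside - inside                     # minimizing moves activation into the box

# Conceptual use inside the sampling loop (early denoising steps only);
# the remaining steps run conventional Stable Diffusion denoising untouched:
#   loss = direction_loss(attn, bbox, prompt_token_ids + trailing_ids, H, W)
#   latents = latents - lr * torch.autograd.grad(loss, latents)[0]
```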
The following clip shows the directed object placed in each of the four image quadrants as well as at the image center. The directed objects are a large cabin, an insect robot, and the sun; the conventional SD result is shown at the beginning for comparison.
The following clip shows the directed object moving from left to right. All results are generated with a bounding box spanning the full image height and 40% of the image width, with directed objects A stone castle, A white cat, and A white dog; the conventional SD result is shown at the beginning for comparison. Example bounding-box coordinates for both clips are sketched below.
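For concreteness, the bounding boxes used in these two clips can be written in fractional (x0, y0, x1, y1) image coordinates; the exact values below are illustrative and would be passed as `bbox` to the sketch above.

```python
# Quadrant and center placements (first clip): each box covers one quadrant
# of the image, or a centered region.
quadrants = {
    "top-left":     (0.0, 0.0, 0.5, 0.5),
    "top-right":    (0.5, 0.0, 1.0, 0.5),
    "bottom-left":  (0.0, 0.5, 0.5, 1.0),
    "bottom-right": (0.5, 0.5, 1.0, 1.0),
    "center":       (0.25, 0.25, 0.75, 0.75),
}

# Left-to-right sweep (second clip): full image height, 40% of the width,
# sliding from the left edge toward the right edge.
sweep = [(x, 0.0, x + 0.4, 1.0) for x in (0.0, 0.2, 0.4, 0.6)]
```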
In some cases the artist may wish to experiment with different object positions after obtaining a desirable image. However, when the object's bounding box is moved, DD may generate a somewhat different object. Our Placement Finetuning (PF) method addresses this problem, allowing the artist to immediately experiment with different positions for an object while keeping its identity, without requiring any model fine-tuning or other optimization.
Compositionality is a well-known failure case for SD; for example, the Hugging Face page specifically mentions the prompt "A red cube on top of a blue sphere". Here we use DD to produce this example successfully, using two bounding boxes, one for the cube and one for the sphere, as illustrated below.
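One possible setup for this two-object example, reusing the illustrative `direction_loss` sketch above, is shown here; the token indices and box coordinates are hypothetical choices, not the exact values used in the paper.

```python
prompt = "A red cube on top of a blue sphere"

directions = [
    # (prompt token indices, bounding box as x0, y0, x1, y1 fractions)
    ([3], (0.25, 0.05, 0.75, 0.50)),   # "cube"   -> upper half of the image
    ([9], (0.25, 0.50, 0.75, 0.95)),   # "sphere" -> lower half of the image
]

# During the early denoising steps, the per-object losses would simply be
# summed before the gradient step on the latents:
#   loss = sum(direction_loss(attn, bbox, toks, H, W) for toks, bbox in directions)
```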
While Directed Diffusion can be used to address Stable Diffusion's problem with composing multiple objects, it inherits Stable Diffusion's other limitations -- in particular the need for experimentation with prompts and hyperparameters. As one practitioner put it: "This is the thing with AI tools right now - it takes a ton of time and a willingness to keep experimenting." In practice our method would be used in conjunction with other algorithms such as DreamBooth. The examples in the paper do not make use of these other algorithms, in order to provide a stand-alone evaluation of our algorithm.
The aim of this project is to extend the capability of text-to-image models to visual storytelling. In common with most other technologies, there is potential for misuse. In particular, generative models reflect the biases of their training data, and there is a risk that malicious parties can use text-to-image methods to generate misleading images. The authors believe it is important that these risks be addressed. This might be done by penalizing malicious behavior through the introduction of appropriate laws, or by limiting the capabilities of generative models to serve this behavior.
We thank Jason Baldridge and the anonymous reviewers for helpful feedback. We also thank the Hugging Face Diffusers project for providing an efficient and reliable basis for our experiments.
If you find our DirectedDiffusion interesting, please consider citing our article:
@misc{ma2023directed,
  title={Directed Diffusion: Direct Control of Object Placement through Attention Guidance},
  author={Wan-Duo Kurt Ma and J. P. Lewis and Avisek Lahiri and Thomas Leung and W. Bastiaan Kleijn},
  year={2023},
  eprint={2302.13153},
  archivePrefix={arXiv},
  primaryClass={cs.CV}
}