TrailBlazer: Trajectory Control for Diffusion-Based Video Generation

Wan-Duo Kurt Ma¹, J. P. Lewis², W. Bastiaan Kleijn¹

¹Victoria University of Wellington, ²NVIDIA Research

TrailBlazer performs diffusion-based text-to-video editing using a pre-trained model, without further model training, finetuning, or online optimization. It supports the various user experiences depicted.

Abstract

Within recent approaches to text-to-video (T2V) generation, achieving controllability in the synthesized video is often a challenge. Typically, this issue is addressed by providing low-level per-frame guidance in the form of edge maps, depth maps, or an existing video to be altered. However, the process of obtaining such guidance can be labor-intensive. This paper focuses on enhancing controllability in video synthesis by employing straightforward bounding boxes to guide the subject in various ways, all without the need for neural network training, finetuning, optimization at inference time, or the use of pre-existing videos.

Our algorithm, TrailBlazer, is built upon a pre-trained T2V model and is easy to implement. The subject is directed by a bounding box through the proposed spatial and temporal attention map editing. Moreover, we introduce the concept of keyframing, allowing the subject trajectory, morphing, and overall appearance to be guided by both a moving bounding box and corresponding prompts, without the need to provide a detailed mask. The method is efficient, with negligible additional computation relative to the underlying pre-trained model. Despite the simplicity of the bounding box guidance, the resulting motion is surprisingly natural, with emergent effects including perspective and movement toward the virtual camera as the box size increases.

Update

[2024/Apr/08] TrailBlazer has a new v2 preprint on arXiv.
[2024/Mar/22] TrailBlazer will update both the arXiv paper and its codebase in the upcoming week.
[2024/Feb/06] We now have a Gradio web app on Hugging Face Spaces!
[2024/Jan/03] TrailBlazer v1 is released on arXiv.
[2023/Dec/31] TrailBlazer is submitted to arXiv.

Core Method

The central components of TrailBlazer are spatial cross-attention editing (left, in the almond-colored section) and temporal cross-frame attention editing (right, in the blue section). These edits are applied only in the early stage of the denoising process. The objective is to alter the attention map within a user-specified bounding box (bbox). For more detail, please consult the main text of our paper.
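The idea of editing an attention map inside a bbox during the early denoising steps can be illustrated with a minimal sketch. It assumes a single 2D attention map for the subject token and a simple multiplicative rule; the function names, the scaling rule, and the constants `edit_steps`, `boost`, and `damp` are illustrative assumptions, not the formulation used in the paper.

```python
from typing import List, Tuple

# (left, top, right, bottom) as fractions of the frame width/height in [0, 1].
BBox = Tuple[float, float, float, float]


def bbox_mask(bbox: BBox, height: int, width: int) -> List[List[bool]]:
    """True for attention-map cells whose centre falls inside the bbox."""
    left, top, right, bottom = bbox
    return [
        [
            left <= (x + 0.5) / width < right and top <= (y + 0.5) / height < bottom
            for x in range(width)
        ]
        for y in range(height)
    ]


def edit_attention(
    attn: List[List[float]],
    bbox: BBox,
    step: int,
    edit_steps: int = 5,  # hypothetical: number of early denoising steps that are edited
    boost: float = 2.0,   # hypothetical gain applied inside the bbox
    damp: float = 0.5,    # hypothetical gain applied outside the bbox
) -> List[List[float]]:
    """Return an edited copy of the subject token's attention map (early steps only)."""
    if step >= edit_steps:
        return [row[:] for row in attn]  # later steps are left untouched
    mask = bbox_mask(bbox, len(attn), len(attn[0]))
    return [
        [a * (boost if inside else damp) for a, inside in zip(a_row, m_row)]
        for a_row, m_row in zip(attn, mask)
    ]
```

Gating on `step` mirrors the statement that editing happens only in the early stage; the remaining denoising steps run on the unmodified attention.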

Scene compositing

Scene compositing allows the motion of several subjects to be controlled simultaneously. The algorithm first computes the initial denoising steps of each subject individually. The first figure below shows the synthesis of “a white cat” and “a yellow dog” individually, serving as a sanity check for the quality of the subjects.

Then, these per-subject intermediate results are composited and processed by a global denoising under the control of a complete prompt (“a white cat and a yellow dog…”) that includes a description of the environment (e.g., “...on the moon”). Note that interactions between the background and subjects appear plausible, as seen in the consistent shadows across all samples.
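The compositing step can be pictured with a small sketch. It assumes that compositing means copying each subject's bbox region from its own intermediate latent into a shared latent, and it uses a single-channel, single-frame 2D grid for brevity; `composite_latents` and the integer `CellBox` convention are hypothetical, and the paper should be consulted for the actual procedure.

```python
from typing import List, Sequence, Tuple

Latent = List[List[float]]
# (left, top, right, bottom) in latent cells; right and bottom are exclusive.
CellBox = Tuple[int, int, int, int]


def composite_latents(
    background: Latent,
    subjects: Sequence[Tuple[Latent, CellBox]],
) -> Latent:
    """Paste each subject's bbox region into a copy of the background latent."""
    out = [row[:] for row in background]
    for latent, (left, top, right, bottom) in subjects:
        for y in range(top, bottom):
            for x in range(left, right):
                out[y][x] = latent[y][x]
    return out
```

The composited latent would then be passed to the global denoising pass under the complete prompt, which is what lets the background and subjects interact.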

Keyframing

The bounding boxes and prompts can be animated via keyframes, enabling users to alter the trajectory and coarse behavior of the subject along the timeline. The resulting subject(s) fit seamlessly in the specified environment, providing a viable pipeline for video storytelling by casual users.

Please be aware that the annotated bounding boxes in all experiments below have been manually animated to enhance the viewing experience.

TrailBlazer features a novel way to guide the synthesized subject through bbox keyframing. For instance, the user can animate a fish swimming toward the camera and then away from it, or control a cat's running speed through keyframing.
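As a rough illustration of bbox keyframing, the sketch below turns a few keyframes into one bbox per frame. Linear interpolation, normalized keyframe times in [0, 1], and the name `interpolate_bboxes` are assumptions made for this example.

```python
from bisect import bisect_right
from typing import List, Sequence, Tuple

# (left, top, right, bottom) as fractions of the frame width/height in [0, 1].
BBox = Tuple[float, float, float, float]


def interpolate_bboxes(
    keyframes: Sequence[Tuple[float, BBox]],
    num_frames: int,
) -> List[BBox]:
    """Linearly interpolate (time, bbox) keyframes, sorted by time, to per-frame bboxes."""
    if len(keyframes) < 2:
        raise ValueError("need at least two keyframes")
    times = [t for t, _ in keyframes]
    boxes: List[BBox] = []
    for f in range(num_frames):
        t = f / (num_frames - 1) if num_frames > 1 else 0.0
        t = min(max(t, times[0]), times[-1])  # clamp to the keyframed range
        # Index of the keyframe that ends the segment containing t.
        i = max(1, min(bisect_right(times, t), len(times) - 1))
        (t0, b0), (t1, b1) = keyframes[i - 1], keyframes[i]
        w = (t - t0) / (t1 - t0) if t1 > t0 else 0.0
        boxes.append(tuple(a + w * (b - a) for a, b in zip(b0, b1)))
    return boxes
```

Under these assumptions, placing two keyframes closer together in time makes the subject cover the same distance in fewer frames, which is one way to read the control over running speed; enlarging the bbox between keyframes corresponds to the subject approaching the camera.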

In addition, TrailBlazer demonstrates subject morphing via prompt keyframing. Examples include transformations from a cat to a dog, cat to fish, parrot to penguin, and tiger to elephant, as depicted below.
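One plausible way to realize prompt keyframing is to blend the text embeddings of two prompts over time. The sketch below assumes each prompt has already been encoded into a flat vector and that the blend is linear per frame; `blend_prompt_embeddings` is a hypothetical helper, and this page does not state how the blend is actually computed.

```python
from typing import List, Sequence


def blend_prompt_embeddings(
    start: Sequence[float],
    end: Sequence[float],
    num_frames: int,
) -> List[List[float]]:
    """Per-frame embeddings that move linearly from `start` to `end`."""
    if len(start) != len(end):
        raise ValueError("embeddings must have the same length")
    frames: List[List[float]] = []
    for f in range(num_frames):
        w = f / (num_frames - 1) if num_frames > 1 else 0.0
        frames.append([(1.0 - w) * s + w * e for s, e in zip(start, end)])
    return frames
```

With this scheme, the first frame is conditioned entirely on the first prompt (for example, the cat) and the last frame entirely on the second (the dog), with intermediate frames conditioned on a mixture.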

Extreme Conditions with Peekaboo

Here, we present a comparative analysis between TrailBlazer and the previous approach, Peekaboo, under controlled conditions. In particular, we examine the manipulation of bounding box (bbox) keyframing in extreme scenarios, including rapid changes in bbox size, irregular bbox trajectories, swift motion determined by the middle keyframe, the number of keyframes required to move the subject to its opposite side, and a small static bbox.

Please see our paper for more detail, including the full metric comparisons (e.g., FID, FVD, mIoU) and the reasoning behind them.
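For readers unfamiliar with mIoU, the sketch below shows the standard intersection-over-union between two boxes and its mean over frames. It assumes the metric compares the target bbox with a detected subject bbox in each frame; how detections are obtained, and the exact protocol used in the paper, are not covered here, and the function names are illustrative.

```python
from typing import Sequence, Tuple

BBox = Tuple[float, float, float, float]  # (left, top, right, bottom)


def iou(a: BBox, b: BBox) -> float:
    """Intersection-over-union of two axis-aligned boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def mean_iou(targets: Sequence[BBox], detections: Sequence[BBox]) -> float:
    """Average IoU over paired per-frame target and detected boxes."""
    if len(targets) != len(detections) or not targets:
        raise ValueError("need equally sized, non-empty box lists")
    return sum(iou(t, d) for t, d in zip(targets, detections)) / len(targets)
```

A higher mean value indicates that the synthesized subject stays closer to the requested bbox across the clip.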

E.g., Our elephant remains stationary for the initial 75% of the video before it starts to move.

Prompt: An elephant walking on the moon

E.g., The whale gracefully descends into the ocean during the latter part of its jumping motion.

Prompt: a photorealistic whale jumping out of water while smoking a cigar

E.g., The horse accurately follows a zigzag path, simulating a galloping motion.

Prompt: A horse galloping fast on a street

E.g., Remarkably, the dog seamlessly follows a large number of keyframes (8) within a 24-frame video clip, covering the distance from one boundary to the opposite in approximately 2 time frames.

Prompt: A dog is running on the grass

E.g., The clownfish fits into a tiny bbox.

Prompt: A clownfish swimming in a coral reef

Limitations

TrailBlazer inherits the limitations of the underlying pre-trained model (ZeroScope). These include animals with an incorrect number of limbs and other issues common to a number of diffusion-based T2I and T2V methods.

Conclusion

Our contributions are listed below:

Novelty: We introduce a novel approach, TrailBlazer, employing high-level bounding boxes to guide the subject in diffusion-based video synthesis. This approach is suitable for casual users, as it avoids the need to record or draw a frame-by-frame positioning control signal. In contrast, the low-level guidance signals (detailed masks, edge maps) used by some other approaches have two disadvantages: it is difficult for non-artists to draw these shapes, and processing existing videos to obtain these signals limits the available motion to copies of existing sources.

Trajectory control: TrailBlazer enables users to position the subject by keyframing its bounding box. The size of the bbox can be similarly controlled, thereby producing directional motion and perspective effects. Finally, users can also keyframe the text prompt to influence the behavior of the subject in the synthesized video.

Simplicity: TrailBlazer operates by directly editing the spatial and temporal attention in the pre-trained denoising UNet. It requires no training or optimization, and the core algorithm can be implemented in fewer than 200 lines of code.

Citation

TrailBlazer will be further enhanced to improve its quality and usability. If you find our work interesting, please cite our article.

BibTeX


      @misc{ma2023trailblazer,
            title={TrailBlazer: Trajectory Control for Diffusion-Based Video Generation},
            author={Wan-Duo Kurt Ma and J. P. Lewis and W. Bastiaan Kleijn},
            year={2023},
            eprint={2401.00896},
            archivePrefix={arXiv},
            primaryClass={cs.CV}
      }