TrailBlazer: Trajectory Control for Diffusion-Based Video Generation

1Victoria University of Wellington,   2NVIDIA Research

ACM SIGGRAPH Asia 2024 @ Tokyo

TrailBlazer provides text-to-video (T2V) diffusion-based video editing using a pre-trained model, without further model training, fine-tuning, or online optimization. It supports a variety of user experiences, as depicted below.

Abstract

Large text-to-video (T2V) models such as Sora have the potential to revolutionize visual effects and the creation of some types of movies. However, current T2V models require tedious trial-and-error experimentation to achieve desired results. This motivates the search for methods to directly control desired attributes. In this work, we take a step toward this goal, introducing a method for high-level, temporally coherent control over the basic trajectories and appearance of objects. Our algorithm, TrailBlazer, allows the general positions and (optionally) appearance of objects to be controlled simply by keyframing approximate bounding boxes and (optionally) their corresponding prompts.

Importantly, our method does not require a pre-existing control video signal that already contains an accurate outline of the desired motion, yet the synthesized motion is surprisingly natural with emergent effects including perspective and movement toward the virtual camera as the box size increases. The method is efficient, making use of a pre-trained T2V model and requiring no training or fine-tuning, with negligible additional computation. Specifically, the bounding box controls are used as soft masks to guide manipulation of the self-attention and cross-attention modules in the video model. While our visual results are limited by those of the underlying model, the algorithm may generalize to future models that use standard self- and cross-attention components.

Update

  • [2024/Apr/08] TrailBlazer has a new v2 preprint on arXiv.
  • [2024/Mar/22] We will update both the arXiv paper and the codebase in the coming week.
  • [2024/Feb/06] We now have a Gradio web app on Hugging Face Spaces!
  • [2024/Jan/03] TrailBlazer v1 is released on arXiv.
  • [2023/Dec/31] TrailBlazer is submitted to arXiv.


Core Method

The figure highlights the central components of TrailBlazer: spatial cross-attention editing (left, in the almond-colored section) and temporal cross-frame attention editing (right, in the blue section). This editing is applied exclusively during the early stage of the denoising process. The objective is to alter the attention map within a user-specified bounding box (bbox). For more in-depth information, please consult the main text of our paper.
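To make this concrete, below is a minimal sketch, not the released implementation, of one way a bounding box can be turned into a soft mask that biases the cross-attention weights of the subject's prompt tokens during the early denoising steps. The function names (bbox_soft_mask, edit_cross_attention), the subject_token_ids argument, and the Gaussian falloff are illustrative assumptions.

    import torch

    def bbox_soft_mask(h, w, bbox, sigma=0.1):
        """Soft spatial mask: 1.0 inside the bbox, decaying smoothly outside.
        bbox = (left, top, right, bottom) in normalized [0, 1] coordinates."""
        ys = torch.linspace(0, 1, h).view(h, 1).expand(h, w)
        xs = torch.linspace(0, 1, w).view(1, w).expand(h, w)
        left, top, right, bottom = bbox
        dx = torch.clamp(torch.maximum(left - xs, xs - right), min=0)
        dy = torch.clamp(torch.maximum(top - ys, ys - bottom), min=0)
        return torch.exp(-(dx ** 2 + dy ** 2) / (2 * sigma ** 2))  # (h, w)

    def edit_cross_attention(attn, h, w, bbox, subject_token_ids, strength=0.5):
        """attn: (batch * heads, h * w, n_text_tokens) cross-attention weights.
        Pull the subject tokens' attention toward the bbox, then renormalize."""
        mask = bbox_soft_mask(h, w, bbox).flatten().to(attn)  # (h * w,)
        for tok in subject_token_ids:
            attn[:, :, tok] = (1 - strength) * attn[:, :, tok] + strength * mask
        # Each query's weights should still sum to 1 over the text tokens.
        return attn / attn.sum(dim=-1, keepdim=True)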


Scene compositing

Scene compositing allows the motion of several subjects to be controlled simultaneously. The algorithm first computes the initial denoising steps of each subject individually. The first figure below shows the synthesis of “a white cat” and “a yellow dog” individually, serving as a sanity check for the quality of the subjects.

Then, these per-subject intermediate results are composited and processed by a global denoising under the control of a complete prompt (“a white cat and a yellow dog…”) that includes a description of the environment (e.g., “...on the moon”). Note that interactions between the background and subjects appear plausible, as seen in the consistent shadows across all samples.
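A rough sketch of this compositing step is given below. It assumes a hypothetical denoise_steps helper that runs the pre-trained T2V UNet over a range of denoising steps (applying the attention edits above when a bbox is passed), and it reuses the bbox_soft_mask helper from the previous sketch; the exact step split and blending are simplifications of the procedure described in the paper.

    def composite_subjects(init_noise, subjects, full_prompt,
                           num_subject_steps, num_total_steps, denoise_steps):
        """init_noise: shared initial latents of shape (1, C, T, H, W).
        subjects: list of dicts with keys 'prompt' and 'bbox' (normalized)."""
        _, _, _, H, W = init_noise.shape
        # Background/environment denoised under the full prompt for the early steps.
        composite = denoise_steps(init_noise.clone(), full_prompt,
                                  steps=range(0, num_subject_steps))
        for s in subjects:
            # Each subject denoised individually for the same early steps.
            z = denoise_steps(init_noise.clone(), s['prompt'],
                              steps=range(0, num_subject_steps), bbox=s['bbox'])
            m = bbox_soft_mask(H, W, s['bbox']).view(1, 1, 1, H, W).to(z)
            # Paste the subject's intermediate latents inside its bbox.
            composite = m * z + (1 - m) * composite
        # Continue global denoising with the complete prompt (subjects + environment).
        return denoise_steps(composite, full_prompt,
                             steps=range(num_subject_steps, num_total_steps))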




Keyframing

The bounding boxes and prompts can be animated via keyframes, enabling users to alter the trajectory and coarse behavior of the subject along the timeline. The resulting subject(s) fit seamlessly in the specified environment, providing a viable pipeline for video storytelling by casual users.

Please be aware that the annotated bounding boxes in all experiments below have been manually animated to enhance the viewing experience.




TrailBlazer features a novel way to guide the synthesized subject through bbox keyframing. For instance, the user can animate the fish swimming towards the camera and then away, or control the cat's running speed through keyframing.
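One simple way to realize this, sketched below, is to linearly interpolate the keyframed boxes to obtain a bbox for every frame; the per-frame box then drives the attention edits described above. The keyframe format (frame index mapped to a normalized (left, top, right, bottom) box) is an assumption for illustration, not the exact interface of the released code.

    def interpolate_bboxes(keyframes, num_frames):
        """keyframes: dict {frame_index: (left, top, right, bottom)} with at least
        the first and last frame specified. Returns one bbox per frame."""
        frames = sorted(keyframes)
        boxes = []
        for f in range(num_frames):
            # Find the keyframes surrounding frame f.
            lo = max(k for k in frames if k <= f)
            hi = min(k for k in frames if k >= f)
            if lo == hi:
                boxes.append(keyframes[lo])
                continue
            t = (f - lo) / (hi - lo)
            a, b = keyframes[lo], keyframes[hi]
            boxes.append(tuple((1 - t) * x + t * y for x, y in zip(a, b)))
        return boxes

    # Example: the fish starts small at the left, grows while moving right
    # (appearing to swim toward the camera), then shrinks again as it goes away.
    keys = {0: (0.05, 0.4, 0.25, 0.6),
            12: (0.3, 0.2, 0.8, 0.9),
            23: (0.6, 0.4, 0.8, 0.6)}
    per_frame_boxes = interpolate_bboxes(keys, num_frames=24)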




In addition, TrailBlazer demonstrates subject morphing via prompt keyframing. Examples include transformations from a cat to a dog, a cat to a fish, a parrot to a penguin, and a tiger to an elephant, as depicted below.
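One plausible way to implement such prompt keyframing, sketched below, is to blend the text-encoder embeddings of the two keyframed prompts across the video frames and use the blended embedding as per-frame conditioning. The tokenizer and text_encoder arguments stand in for the model's CLIP text stack, and the simple linear blend is an illustrative assumption rather than the paper's exact scheme.

    import torch

    @torch.no_grad()
    def keyframed_prompt_embeddings(tokenizer, text_encoder,
                                    prompt_a, prompt_b, num_frames):
        """Blend text embeddings from prompt_a to prompt_b across the frames."""
        def encode(prompt):
            tokens = tokenizer(prompt, padding="max_length", truncation=True,
                               return_tensors="pt")
            return text_encoder(tokens.input_ids)[0]  # (1, n_tokens, dim)

        emb_a, emb_b = encode(prompt_a), encode(prompt_b)
        embeddings = []
        for f in range(num_frames):
            t = f / max(num_frames - 1, 1)
            embeddings.append((1 - t) * emb_a + t * emb_b)  # per-frame conditioning
        return embeddings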




Extreme Conditions with Peekaboo

Here, we present a comparative analysis between TrailBlazer and a previous approach, Peekaboo, under controlled conditions. In particular, we examine bounding box (bbox) keyframing in extreme scenarios, including rapid changes in bbox size, irregular bbox trajectories, swift motion determined by the middle keyframe, the number of keyframes required to move the subject to the opposite side, and a static small bbox.

Please see our paper for more details, including the full metric comparisons (e.g., FID, FVD, mIoU) and the accompanying analysis.



E.g., The elephant maintains a stationary position for the initial 75% of the video before initiating movement.

Prompt: An elephant walking on the moon



E.g., The whale gracefully descends into the ocean during the latter part of its jumping motion.

Prompt: a photorealistic whale jumping out of water while smoking a cigar



E.g., The horse accurately follows a zigzag path, simulating a galloping motion.

Prompt: A horse galloping fast on a street



E.g., Remarkably, the dog seamlessly follows a large number of keyframes (8 keyframes within a 24-frame video clip), covering the distance from one boundary to the opposite in approximately two frames.

Prompt: A dog is running on the grass



E.g., The clownfish fits into a tiny bbox.

Prompt: A clownfish swimming in a coral reef




Limitations

TrailBlazer inherits the limitations of the underlying pre-trained model (ZeroScope). These include animals with an incorrect number of limbs and other issues common to a number of diffusion-based T2I and T2V methods.

Conclusion

Our contributions are listed below:

  • Novelty: We introduce a novel approach, TrailBlazer, employing high-level bounding boxes to guide the subject in diffusion-based video synthesis. This approach is suitable for casual users, as it avoids the need to record or draw a frame-by-frame positioning control signal. In contrast, the low-level guidance signals (detailed masks, edge maps) used by some other approaches have two disadvantages: it is difficult for non-artists to draw these shapes, and processing existing videos to obtain these signals limits the available motion to copies of existing sources.
  • Trajectory control: TrailBlazer enables users to position the subject by keyframing its bounding box. The size of the bbox can be similarly controlled, thereby producing directional motion and perspective effects. Finally, users can also keyframe the text prompt to influence the behavior of the subject in the synthesized video.
  • Simplicity: TrailBlazer operates by directly editing the spatial and temporal attention in the pre-trained denoising UNet. It requires no training or optimization, and the core algorithm can be implemented in less than 200 lines of code.



Citation

TrailBlazer will be further enhanced to improve its quality and usability. If you find our work interesting, please cite our article.

BibTeX


      @misc{ma2023trailblazer,
            title={TrailBlazer: Trajectory Control for Diffusion-Based Video Generation},
            author={Wan-Duo Kurt Ma and J. P. Lewis and W. Bastiaan Kleijn},
            year={2023},
            eprint={2401.00896},
            archivePrefix={arXiv},
            primaryClass={cs.CV}
      }