TrailBlazer: Trajectory Control for Diffusion-Based Video Generation

1Victoria University of Wellington,   2NVIDIA Research

TrailBlazer provides text-to-video (T2V) diffusion-based video editing using a pre-trained model, without further model training, finetuning, or online optimization. It supports a variety of user experiences, as depicted.

Abstract

With recent approaches to text-to-video (T2V) generation, achieving controllability in the synthesized video is often a challenge. Typically, this issue is addressed by providing low-level per-frame guidance in the form of edge maps, depth maps, or an existing video to be altered. However, the process of obtaining such guidance can be labor-intensive, and the restriction to existing videos limits creativity. This paper focuses on providing controllability in video synthesis by employing straightforward bounding boxes to guide the subject in various ways, all without the use of pre-existing videos or the need for neural network training, finetuning, or optimization at inference time.

Our algorithm, TrailBlazer, is built upon a pre-trained T2V model and is easy to implement. The subject is directed by a bounding box through the proposed spatial and temporal attention map editing. Moreover, we support guiding the object trajectory and its appearance by keyframing a moving bounding box and corresponding prompts, without the need to provide a detailed mask. The method is efficient, with negligible additional computation relative to the underlying pre-trained model. Despite the simplicity of the bounding box guidance, the resulting motion is surprisingly natural, with emergent effects including perspective and movement toward the virtual camera as the box size increases.

Core Method

The core of TrailBlazer consists of two components: spatial cross-attention editing (left, in the golden section) and temporal cross-frame attention editing (right, in the blue section). These edits are applied exclusively during the early stage of the denoising process, and their objective is to alter the attention maps within a user-specified bounding box (bbox). For more in-depth information, please consult the main text of our paper.
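As a rough illustration of the spatial editing idea, the PyTorch sketch below biases a cross-attention map so that the prompt tokens describing the subject attend more strongly to locations inside the bbox. This is only a minimal sketch, not the paper's exact formulation: the function name edit_cross_attention, the additive blending scheme, and the strength parameter are assumptions for illustration, and such an edit would be applied only during the first few denoising steps.

import torch

def edit_cross_attention(attn_probs, bbox, subject_token_ids,
                         latent_h, latent_w, strength=0.5):
    """Illustrative sketch (not the paper's exact method): bias a
    cross-attention map so the subject's prompt tokens attend more
    strongly to pixels inside a user-specified bounding box.

    attn_probs: (batch*heads, latent_h*latent_w, num_text_tokens)
    bbox:       (left, top, right, bottom) in normalized [0, 1] coords
    """
    left, top, right, bottom = bbox
    # Build a spatial mask that is 1 inside the bbox and 0 outside.
    ys = torch.linspace(0, 1, latent_h)
    xs = torch.linspace(0, 1, latent_w)
    inside_y = (ys >= top) & (ys <= bottom)
    inside_x = (xs >= left) & (xs <= right)
    mask = (inside_y[:, None] & inside_x[None, :]).flatten().float()  # (h*w,)

    edited = attn_probs.clone()
    for tok in subject_token_ids:
        # Push the subject tokens' attention toward the box interior
        # and away from the exterior.
        edited[:, :, tok] = (1 - strength) * edited[:, :, tok] \
                            + strength * mask.to(edited.device)
    # Renormalize so each spatial location still sums to 1 over tokens.
    edited = edited / edited.sum(dim=-1, keepdim=True).clamp(min=1e-8)
    return edited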


Scene compositing

Scene compositing allows the motion of several subjects to be controlled simultaneously. The algorithm first computes the initial denoising steps for each subject individually. The first figure below shows the synthesis of “a white cat” and “a yellow dog” individually, serving as a sanity check on the quality of the subjects.

Then, these per-subject intermediate results are composited and processed by a global denoising pass under the control of a complete prompt (“a white cat and a yellow dog…”) that includes a description of the environment (e.g., “...on the moon”). Note that interactions between the background and the subjects appear plausible, as seen in the consistent shadows across all samples.
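The compositing step can be pictured with the following minimal sketch: each subject's partially denoised latent is copied into a shared latent inside its bbox, and the combined latent is then handed to the remaining denoising steps under the full prompt. The function name and the simple copy-paste compositing are assumptions for illustration, not the authors' exact procedure.

import torch

def composite_subject_latents(subject_latents, subject_bboxes,
                              background_latent):
    """Illustrative sketch of scene compositing (names and the
    copy-paste scheme are assumptions, not the exact algorithm).

    subject_latents:   list of (C, H, W) latents after the initial steps
    subject_bboxes:    list of (left, top, right, bottom) in latent pixels
    background_latent: (C, H, W) latent used outside all bboxes
    """
    composite = background_latent.clone()
    for latent, (l, t, r, b) in zip(subject_latents, subject_bboxes):
        # Paste each subject's region into the shared latent.
        composite[:, t:b, l:r] = latent[:, t:b, l:r]
    return composite

# The composited latent would then be denoised further, conditioned on
# the complete prompt, e.g. "a white cat and a yellow dog ... on the moon".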


Keyframing

The bounding boxes and prompts can be animated via keyframes, enabling users to alter the trajectory and coarse behavior of the subject along the timeline. The resulting subject(s) fit seamlessly in the specified environment, providing a viable pipeline for video storytelling by casual users.

TrailBlazer features a novel way to guide the synthesized subject through bbox keyframing, and this keyframing is surprisingly powerful. For example, a fish swimming toward the camera and then away can be animated simply by increasing and then reducing the bbox size, and the running speed of a cat can be controlled by the spacing of its keyframes.
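As a sketch of how bbox keyframes might be turned into per-frame boxes, the snippet below linearly interpolates between user-specified keyframes. The function name interpolate_bbox, the dictionary-based keyframe format, and the purely linear interpolation are assumptions for illustration, not the paper's exact scheme.

def interpolate_bbox(keyframes, frame_idx):
    """Illustrative sketch: linearly interpolate a bounding box between
    user keyframes. `keyframes` maps a frame index to a box given as
    (left, top, right, bottom) in normalized coordinates.
    """
    frames = sorted(keyframes)
    if frame_idx <= frames[0]:
        return keyframes[frames[0]]
    if frame_idx >= frames[-1]:
        return keyframes[frames[-1]]
    # Find the surrounding keyframes and blend between them.
    for f0, f1 in zip(frames, frames[1:]):
        if f0 <= frame_idx <= f1:
            t = (frame_idx - f0) / (f1 - f0)
            b0, b1 = keyframes[f0], keyframes[f1]
            return tuple((1 - t) * a + t * b for a, b in zip(b0, b1))

# Example: a fish approaching the camera and then receding, expressed
# by growing and then shrinking the box over 24 frames.
fish_keys = {0:  (0.4, 0.4, 0.6, 0.6),
             12: (0.2, 0.2, 0.8, 0.8),
             24: (0.4, 0.4, 0.6, 0.6)}
bbox_at_frame_6 = interpolate_bbox(fish_keys, 6)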


Limitations

TrailBlazer inherits the limitations of the underlying pre-trained model (ZeroScope). These include animals with an incorrect number of limbs and other issues common to a number of diffusion-based T2I and T2V methods.

Conclusion

Our contributions are listed below:

  1. Novelty: We introduce a novel approach, TrailBlazer, employing high-level bounding boxes to guide the subject in diffusion-based video synthesis. This approach is suitable for casual users, as it avoids the need to record or draw a frame-by-frame positioning control signal. In contrast, the low-level guidance signals (detailed masks, edge maps) used by some other approaches have two disadvantages: it is difficult for non-artists to draw these shapes, and processing existing videos to obtain these signals limits the available motion to copies of existing sources.
  2. Trajectory control: TrailBlazer enables users to position the subject by keyframing its bounding box. The size of the bbox can be similarly controlled, thereby producing directional motion and perspective effects. Finally, users can also keyframe the text prompt to influence the behavior of the subject in the synthesized video.
  3. Simplicity: TrailBlazer operates by directly editing the spatial and temporal attention in the pre-trained denoising UNet. It requires no training or optimization, and the core algorithm can be implemented in less than 200 lines of code.

Citation

TrailBlazer will be further enhanced to improve its quality and usability. If you find our work interesting, please cite our article.

BibTeX


      @misc{ma2023trailblazer,
            title={TrailBlazer: Trajectory Control for Diffusion-Based Video Generation}, 
            author={Wan-Duo Kurt Ma and J. P. Lewis and W. Bastiaan Kleijn},
            year={2023},
            eprint={2401.00896},
            archivePrefix={arXiv},
            primaryClass={cs.CV}
      }