Large text-to-video (T2V) models such as Sora have the potential to revolutionize visual effects and the creation of some types of movies. However, current T2V models require tedious trial-and-error experimentation to achieve desired results. This motivates the search for methods to directly control desired attributes. In this work, we take a step toward this goal, introducing a method for high-level, temporally coherent control over the basic trajectories and appearance of objects. Our algorithm, TrailBlazer, allows the general positions and (optionally) appearance of objects to be controlled simply by keyframing approximate bounding boxes and (optionally) their corresponding prompts.
Importantly, our method does not require a pre-existing control video signal that already contains an accurate outline of the desired motion, yet the synthesized motion is surprisingly natural with emergent effects including perspective and movement toward the virtual camera as the box size increases. The method is efficient, making use of a pre-trained T2V model and requiring no training or fine-tuning, with negligible additional computation. Specifically, the bounding box controls are used as soft masks to guide manipulation of the self-attention and cross-attention modules in the video model. While our visual results are limited by those of the underlying model, the algorithm may generalize to future models that use standard self- and cross-attention components.
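As a rough illustration of the "soft mask" idea, the sketch below converts a single normalized bounding box into a soft spatial mask at the attention-map resolution. The box format (left, top, right, bottom in [0, 1]), the Gaussian edge falloff, and the name soft_bbox_mask are our own assumptions for illustration, not the paper's exact formulation.

```python
import torch

def soft_bbox_mask(box, height, width, sigma=2.0):
    """Return a (height, width) mask that is ~1 inside `box` and decays smoothly outside."""
    left, top, right, bottom = box                      # normalized coordinates in [0, 1]
    ys = torch.linspace(0.0, 1.0, height).view(-1, 1)   # row (vertical) coordinates
    xs = torch.linspace(0.0, 1.0, width).view(1, -1)    # column (horizontal) coordinates

    # Hard indicator of the box interior.
    inside = ((xs >= left) & (xs <= right) & (ys >= top) & (ys <= bottom)).float()

    # Distance (in grid cells) from each location to the box, used for a Gaussian edge falloff.
    dx = (torch.clamp(left - xs, min=0.0) + torch.clamp(xs - right, min=0.0)) * width
    dy = (torch.clamp(top - ys, min=0.0) + torch.clamp(ys - bottom, min=0.0)) * height
    falloff = torch.exp(-(dx ** 2 + dy ** 2) / (2.0 * sigma ** 2))

    return torch.maximum(inside, falloff)

# Example: a box covering the left third of a 32x32 attention map.
mask = soft_bbox_mask((0.0, 0.25, 0.33, 0.75), 32, 32)
```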
The diagram highlights the central components of TrailBlazer: spatial cross-attention editing (left, in the almond-colored section) and temporal cross-frame attention editing (right, in the blue section). These operations are applied only during the early stage of the denoising process, with the objective of altering the attention map within the user-specified bounding box (bbox). For more in-depth information, please consult the main text of our paper.
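To make this concrete, here is a minimal sketch of what such an edit could look like, assuming pre-softmax cross-attention scores of shape (batch*heads, H*W, num_text_tokens). The bias formulation, the number of edited steps, and the names are illustrative assumptions rather than the paper's exact procedure; the temporal cross-frame attention edit follows the same spirit with per-frame masks.

```python
import torch

def edit_cross_attention(logits, mask, subject_token_ids, step,
                         num_edit_steps=10, scale=4.0):
    """
    logits: (batch*heads, H*W, num_text_tokens) pre-softmax cross-attention scores.
    mask:   (H, W) soft bbox mask for the current frame, values in [0, 1].
    step:   index of the current denoising step (0 = most noisy).
    """
    if step >= num_edit_steps:                  # intervene only during early denoising
        return logits.softmax(dim=-1)

    flat = mask.flatten().to(logits.dtype)      # (H*W,)
    bias = scale * (flat - 0.5)                 # positive inside the box, negative outside
    bias = bias.view(1, -1, 1)                  # broadcast over heads and tokens

    edited = logits.clone()
    edited[..., subject_token_ids] += bias      # pull the subject's tokens toward the bbox
    return edited.softmax(dim=-1)
```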
Scene compositing allows the motion of several subjects to be controlled simultaneously. The algorithm first computes the initial denoising steps for each subject individually. The first figure below shows the synthesis of “a white cat” and “a yellow dog” individually, serving as a sanity check on the quality of the subjects.
Then, these per-subject intermediate results are composited and processed by a global denoising under the control of a complete prompt (“a white cat and a yellow dog…”) that includes a description of the environment (e.g., “...on the moon”). Note that interactions between the background and subjects appear plausible, as seen in the consistent shadows across all samples.
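The sketch below summarizes this compositing pipeline under simplifying assumptions: denoise_step and encode_prompt are hypothetical placeholders for the underlying model's denoising and text-encoding calls, and the background regions here simply retain the shared initial latents.

```python
import torch

def composite_subjects(denoise_step, encode_prompt, latents, timesteps,
                       subject_prompts, subject_masks, full_prompt,
                       num_subject_steps=10):
    # 1) Run the first few denoising steps for each subject on its own.
    intermediates = []
    for prompt in subject_prompts:
        z, cond = latents.clone(), encode_prompt(prompt)
        for t in timesteps[:num_subject_steps]:
            z = denoise_step(z, t, cond)
        intermediates.append(z)

    # 2) Composite the per-subject intermediate latents using their soft bbox masks.
    z = latents.clone()
    for z_subj, mask in zip(intermediates, subject_masks):
        m = mask.to(z.dtype)                    # broadcasts over batch/channel/frame dims
        z = m * z_subj + (1.0 - m) * z

    # 3) Finish denoising globally, conditioned on the complete prompt
    #    ("a white cat and a yellow dog ... on the moon").
    cond = encode_prompt(full_prompt)
    for t in timesteps[num_subject_steps:]:
        z = denoise_step(z, t, cond)
    return z
```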
The bounding boxes and prompts can be animated via keyframes, enabling users to alter the trajectory and coarse behavior of the subject along the timeline. The resulting subject(s) fit seamlessly in the specified environment, providing a viable pipeline for video storytelling by casual users.
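For illustration, bbox keyframing can be realized by interpolating the keyframed box coordinates across the timeline. The dictionary-based keyframe format and the use of plain linear interpolation below are our assumptions; they simply show how a handful of user-specified boxes expands into a box per frame.

```python
import numpy as np

def interpolate_bboxes(keyframes, num_frames):
    """keyframes: {frame_index: (left, top, right, bottom)} with normalized coordinates."""
    frame_ids = sorted(keyframes)
    coords = np.array([keyframes[f] for f in frame_ids])       # (num_keyframes, 4)
    boxes = np.empty((num_frames, 4))
    for c in range(4):                                         # interpolate each coordinate
        boxes[:, c] = np.interp(np.arange(num_frames), frame_ids, coords[:, c])
    return boxes

# Example: a subject that moves right and grows over a 24-frame clip
# (a growing box reads as movement toward the camera).
boxes = interpolate_bboxes({0: (0.05, 0.4, 0.25, 0.6),
                            12: (0.30, 0.3, 0.60, 0.7),
                            23: (0.20, 0.1, 0.90, 0.9)}, num_frames=24)
```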
Please be aware that the annotated bounding boxes in all experiments below have been manually animated to enhance the viewing experience.
TrailBlazer features a novel way to guide the synthesized subject through bbox keyframing. For instance, the user can animate the fish to swim towards the camera and then move away, or control the cat's running speed through keyframing.
In addition, TrailBlazer demonstrates subject morphing via prompt keyframing. Examples include transformations from a cat to a dog, cat to fish, parrot to penguin, and tiger to elephant, as depicted below.
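One plausible way to realize such prompt keyframing is to blend the text embeddings of the start and end prompts on a per-frame schedule, as sketched below; encode_prompt is a placeholder for the model's text encoder, and the exact blending scheme used by TrailBlazer is detailed in the paper.

```python
import torch

def keyframed_prompt_embeddings(encode_prompt, prompt_a, prompt_b, num_frames):
    """Blend two prompts' embeddings frame by frame (e.g., cat at frame 0, dog at the end)."""
    emb_a = encode_prompt(prompt_a)                 # e.g., (num_tokens, dim)
    emb_b = encode_prompt(prompt_b)
    weights = torch.linspace(0.0, 1.0, num_frames)  # 0 -> pure prompt_a, 1 -> pure prompt_b
    return [torch.lerp(emb_a, emb_b, float(w)) for w in weights]
```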
Here, we present a comparative analysis between TrailBlazer and the prior approach, Peekaboo, under controlled conditions. In particular, we examine bounding box (bbox) keyframing in extreme scenarios, including rapid changes in bbox size, irregular bbox trajectories, swift motion determined by the middle keyframe, the number of keyframes required to move the subject to the opposite side, and a static small bbox.
Please see our paper for more details, the full metric comparisons (e.g., FID, FVD, mIoU), and the accompanying analysis.
E.g., The elephant remains stationary for the initial 75% of the video before initiating movement.
Prompt: An elephant walking on the moon
E.g., The whale gracefully descends into the ocean during the latter part of its jumping motion.
Prompt: a photorealistic whale jumping out of water while smoking a cigar
E.g., The horse accurately follows a zigzag path, simulating a galloping motion.
Prompt: A horse galloping fast on a street
E.g., Remarkably, the dog seamlessly follows a large number of keyframes (8 keyframes) within a 24-frame video clip, covering the distance from one boundary to the opposite in approximately 2 time frames.
Prompt: A dog is running on the grass
E.g., The clownfish fits into a tiny bbox.
Prompt: A clownfish swimming in a coral reef
TrailBlazer inherits the limitations of the underlying pre-trained model (ZeroScope). These include animals with an incorrect number of limbs and other issues common to a number of diffusion-based T2I and T2V methods.
Our contributions are listed below:
- High-level, temporally coherent control over subject trajectories and appearance, achieved simply by keyframing approximate bounding boxes and their corresponding prompts.
- A training-free approach: the bbox controls act as soft masks guiding the self- and cross-attention modules of a pre-trained T2V model, with negligible additional computation.
- Prompt keyframing for subject morphing, and scene compositing for simultaneously controlling several subjects.
TrailBlazer will be further enhanced to improve its quality and usability. If you find our work interesting, please cite our article.
@misc{ma2023trailblazer,
    title={TrailBlazer: Trajectory Control for Diffusion-Based Video Generation},
    author={Wan-Duo Kurt Ma and J. P. Lewis and W. Bastiaan Kleijn},
    year={2023},
    eprint={2401.00896},
    archivePrefix={arXiv},
    primaryClass={cs.CV}
}