Within recent approaches to text-to-video (T2V) generation, achieving controllability in the synthesized video is often a challenge. Typically, this issue is addressed by providing low-level per-frame guidance in the form of edge maps, depth maps, or an existing video to be altered. However, the process of obtaining such guidance can be labor-intensive. This paper focuses on enhancing controllability in video synthesis by employing straightforward bounding boxes to guide the subject in various ways, all without the need for neural network training, finetuning, optimization at inference time, or the use of pre-existing videos.
Our algorithm, TrailBlazer, is built on a pre-trained T2V model and is easy to implement. The subject is directed by a bounding box through the proposed spatial and temporal attention map editing. Moreover, we introduce the concept of keyframing, allowing the subject trajectory, morphing, and overall appearance to be guided by both a moving bounding box and corresponding prompts, without the need to provide a detailed mask. The method is efficient, with negligible additional computation relative to the underlying pre-trained model. Despite the simplicity of the bounding box guidance, the resulting motion is surprisingly natural, with emergent effects including perspective and movement toward the virtual camera as the box size increases.
The TrailBlazer pipeline has two central components: spatial cross-attention editing (left, in the almond-colored section) and temporal cross-frame attention editing (right, in the blue section). This editing is applied only during the early steps of the denoising process. The objective is to alter the attention map within a user-specified bounding box (bbox). For more detail, please consult the main text of our paper.
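To make the idea of bbox-guided attention editing concrete, here is a minimal sketch in Python with NumPy. It is not the formulation from the paper: the function names, the `boost` and `damp` factors, and the `num_edit_steps` cutoff are all assumptions chosen for illustration. It only shows the general pattern of strengthening a subject token's attention inside the bbox, weakening it outside, and doing so only during early denoising steps.

```python
import numpy as np

def bbox_mask(height, width, bbox):
    """Binary mask for a bbox given as (left, top, right, bottom) fractions in [0, 1]."""
    left, top, right, bottom = bbox
    mask = np.zeros((height, width), dtype=np.float64)
    r0, r1 = int(round(top * height)), int(round(bottom * height))
    c0, c1 = int(round(left * width)), int(round(right * width))
    mask[r0:r1, c0:c1] = 1.0
    return mask

def edit_attention_early(attn, bbox, step, num_edit_steps, boost=2.0, damp=0.5):
    """Rescale one token's spatial attention map during early denoising steps.

    `boost`, `damp`, and `num_edit_steps` are hypothetical parameters,
    not values taken from the paper.
    """
    if step >= num_edit_steps:
        return attn  # later steps are left untouched
    mask = bbox_mask(attn.shape[0], attn.shape[1], bbox)
    # Multiply by `boost` inside the bbox and by `damp` outside it.
    return attn * (boost * mask + damp * (1.0 - mask))

# Toy example: a uniform 8x8 attention map and a centered bbox.
attn = np.full((8, 8), 1.0 / 64)
edited = edit_attention_early(attn, (0.25, 0.25, 0.75, 0.75), step=0, num_edit_steps=5)
```

In a real model the same kind of edit would be applied per attention layer and per frame; the sketch operates on a single map to keep the mechanics visible.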
Scene compositing allows the motion of several subjects to be controlled simultaneously. The algorithm first computes the initial denoising steps of each subject individually. The first figure below shows the synthesis of “a white cat” and “a yellow dog” individually, serving as a sanity check for the quality of the subjects.
Then, these per-subject intermediate results are composited and processed by a global denoising under the control of a complete prompt (“a white cat and a yellow dog…”) that includes a description of the environment (e.g., “...on the moon”). Note that interactions between the background and subjects appear plausible, as seen in the consistent shadows across all samples.
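The compositing step can be pictured with a small sketch. The page does not spell out how the per-subject intermediate results are merged, so the masked pasting below is an assumption, as are the function name `composite_latents` and the toy latent shapes. It pastes each subject's latent into a shared latent inside that subject's bbox mask, after which a global denoising pass would continue from the composited result.

```python
import numpy as np

def composite_latents(background, subject_latents, masks):
    """Paste each subject latent (C, H, W) into the background where its mask (H, W) is 1.

    Assumed compositing rule for illustration only.
    """
    out = background.copy()
    for latent, mask in zip(subject_latents, masks):
        out = mask * latent + (1.0 - mask) * out  # mask broadcasts over channels
    return out

# Toy example: the "cat" occupies the left half, the "dog" the right half.
channels, height, width = 4, 4, 8
background = np.zeros((channels, height, width))
cat_latent = np.ones((channels, height, width))
dog_latent = np.full((channels, height, width), 2.0)

cat_mask = np.zeros((height, width))
cat_mask[:, :4] = 1.0
dog_mask = np.zeros((height, width))
dog_mask[:, 4:] = 1.0

composited = composite_latents(background, [cat_latent, dog_latent], [cat_mask, dog_mask])
```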
The bounding boxes and prompts can be animated via keyframes, enabling users to alter the trajectory and coarse behavior of the subject along the timeline. The resulting subject(s) fit seamlessly in the specified environment, providing a viable pipeline for video storytelling by casual users.
Please be aware that the annotated bounding boxes in all experiments below have been manually animated to enhance the viewing experience.
TrailBlazer features a novel way to guide the synthesized subject through bbox keyframing. For instance, the user can animate a fish swimming towards the camera and then away from it. Alternatively, the user can control the running speed of a cat through keyframing.
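As a rough illustration of bbox keyframing, the sketch below linearly interpolates a bbox between user-specified keyframes to obtain one box per video frame. Linear interpolation, the `(left, top, right, bottom)` fraction format, and the function name `interpolate_bbox` are assumptions made for this example, not details confirmed by the page.

```python
from bisect import bisect_right

def interpolate_bbox(keyframes, frame):
    """Linearly interpolate a bbox at `frame`.

    `keyframes` is a list of (frame_index, (left, top, right, bottom)),
    sorted by frame_index. Frames outside the keyframed range are clamped.
    """
    frames = [f for f, _ in keyframes]
    if frame <= frames[0]:
        return keyframes[0][1]
    if frame >= frames[-1]:
        return keyframes[-1][1]
    i = bisect_right(frames, frame)
    (f0, box0), (f1, box1) = keyframes[i - 1], keyframes[i]
    t = (frame - f0) / (f1 - f0)
    return tuple((1.0 - t) * a + t * b for a, b in zip(box0, box1))

# A fish that approaches the camera (box grows) and then swims away (box shrinks).
fish_keyframes = [
    (0, (0.4, 0.4, 0.6, 0.6)),
    (12, (0.1, 0.1, 0.9, 0.9)),
    (23, (0.4, 0.4, 0.6, 0.6)),
]
boxes = [interpolate_bbox(fish_keyframes, f) for f in range(24)]
```

Placing keyframes closer together in time makes the box, and so the subject, move faster over that interval, which is one way to control the running speed mentioned above.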
In addition, TrailBlazer demonstrates subject morphing via prompt keyframing. Examples include transformations from a cat to a dog, cat to fish, parrot to penguin, and tiger to elephant, as depicted below.
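One plausible way to realize prompt keyframing is to blend the text embeddings of the start and end prompts across frames, so that each frame is conditioned on a slightly different embedding. The sketch below assumes exactly that; the page does not state the mechanism, and the function name `per_frame_embeddings` and the toy embedding shapes are hypothetical.

```python
import numpy as np

def per_frame_embeddings(emb_start, emb_end, num_frames):
    """Return an array of shape (num_frames, *emb_start.shape) that blends
    linearly from `emb_start` (first frame) to `emb_end` (last frame)."""
    weights = np.linspace(0.0, 1.0, num_frames)
    # Reshape so the weights broadcast over every embedding dimension.
    weights = weights.reshape(-1, *([1] * emb_start.ndim))
    return (1.0 - weights) * emb_start[None] + weights * emb_end[None]

# Toy stand-ins for the embeddings of "a cat ..." and "a dog ..." (tokens x dim).
rng = np.random.default_rng(0)
cat_embedding = rng.normal(size=(5, 8))
dog_embedding = rng.normal(size=(5, 8))
frame_embeddings = per_frame_embeddings(cat_embedding, dog_embedding, num_frames=24)
```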
Here, we present a comparative analysis between TrailBlazer and the previous approach, Peekaboo, under controlled conditions. In particular, we examine bounding box (bbox) keyframing in extreme scenarios, including rapid changes in bbox size, irregular bbox trajectories, swift motion determined by the middle keyframe, the number of keyframes required to move the subject to its opposite side, and a small static bbox.
Please see our paper for more detail, the full metric comparisons (e.g., FID, FVD, mIoU), and the accompanying discussion.
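For readers unfamiliar with mIoU in this setting, the sketch below computes the standard intersection-over-union between a target bbox and a detected bbox, averaged over frames. The exact evaluation protocol of the paper, including how the subject is detected in each frame, is not given on this page, so the pairing of boxes here is an assumption and the function names are illustrative.

```python
def iou(box_a, box_b):
    """Intersection over union of two (left, top, right, bottom) boxes."""
    left = max(box_a[0], box_b[0])
    top = max(box_a[1], box_b[1])
    right = min(box_a[2], box_b[2])
    bottom = min(box_a[3], box_b[3])
    inter = max(0.0, right - left) * max(0.0, bottom - top)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def mean_iou(target_boxes, detected_boxes):
    """Average IoU over per-frame (target, detected) box pairs."""
    scores = [iou(t, d) for t, d in zip(target_boxes, detected_boxes)]
    return sum(scores) / len(scores)

# Two frames: a perfect match and a partial overlap.
targets = [(0.0, 0.0, 2.0, 2.0), (0.0, 0.0, 2.0, 2.0)]
detections = [(0.0, 0.0, 2.0, 2.0), (1.0, 1.0, 3.0, 3.0)]
score = mean_iou(targets, detections)
```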
E.g., Our elephant remains stationary for the initial 75% of the video before it starts to move.
Prompt: An elephant walking on the moon
E.g., The whale gracefully descends into the ocean during the latter part of its jumping motion.
Prompt: a photorealistic whale jumping out of water while smoking a cigar
E.g., The horse accurately follows a zigzag path, simulating a galloping motion.
Prompt: A horse galloping fast on a street
E.g., Remarkably, the dog seamlessly follows a large number of keyframes (8 keyframes) within a 24-frame video clip, covering the distance from one boundary to the opposite in approximately 2 time frames.
Prompt: A dog is running on the grass
E.g., The clownfish fits into a tiny bbox.
Prompt: A clownfish swimming in a coral reef
TrailBlazer inherits the limitations of the underlying pre-trained model (ZeroScope). These include animals with an incorrect number of limbs and other issues common to a number of diffusion-based T2I and T2V methods.
Our contributions are listed below:
TrailBlazer will be further enhanced to improve the quality and usability. If you find our work interesting, please cite our article.
@misc{ma2023trailblazer,
title={TrailBlazer: Trajectory Control for Diffusion-Based Video Generation},
author={Wan-Duo Kurt Ma and J. P. Lewis and W. Bastiaan Kleijn},
year={2023},
eprint={2401.00896},
archivePrefix={arXiv},
primaryClass={cs.CV}
}