Cinematic, middle-aged female athlete

Our method maintains consistent subject identities across shots and follows the text prompts as well as the pretrained model. VideoCrafter2 shows diverse motion but inconsistent characters; Tokenflow Encoder mainly affects coloring and causes blurring; ConsiS Im2vid shows degraded motion. VSTAR fails to render the first scene, briefly flashes the last scene, and mostly transitions to and stays on the middle biking scene, with changing identities and extensive motion.

[Figure panels. Methods: Ours, VideoCrafter2, Tokenflow Encoder, ConsiS Im2vid, VSTAR. Shot prompts: "podium, tears"; "mountain biking"; "serve, tennis".]