Unreal Engine animated scene, bird

Our method maintains consistent subject identities across shots and follows the text prompts as well as the pretrained model does. VideoCrafter2 shows diverse motion but inconsistent characters. Tokenflow Encoder mainly affects coloring and causes blurring. ConsiS Im2vid struggles with motion alignment and introduces inconsistent identities. VSTAR maintains good identity but struggles to adhere to the text prompts and shows extensive non-specific motion.

[Figure panels. Methods compared: Ours, VideoCrafter2, Tokenflow Encoder, ConsiS Im2vid, VSTAR. Shot prompts: "baking cookies", "riding a roller", "playing w. trees".]