[Figure: generated shots for the subject prompt "Cheerful 90's anime-style girl" performing "throwing frisbee", "playing Wii", and "paddling board"; methods compared: Ours, VideoCrafter2, Tokenflow Encoder, ConsiS Im2vid, VSTAR.]

Our method maintains consistent subject identities across shots and follows the text prompts. VideoCrafter2 shows diverse motion but inconsistent characters, Tokenflow Encoder mainly alters coloring, and ConsiS Im2vid introduces inconsistent identities. VSTAR struggles with prompt adherence: it only briefly displays the initial and final scenes while transitioning to the middle sequence. It does, however, maintain identity well and show extensive motion.