Stop-motion, anthropomorphic Lego sloth

Our method maintains consistent subject identities across shots. VideoCrafter2 shows diverse motion but inconsistent characters. Tokenflow Encoder causes blurring. ConsiS Im2vid shows degraded motion and inconsistent identities (see the different facial features). VSTAR maintains good identity but struggles to adhere to the text prompts and shows extensive non-specific motion.

[Figure: rows show methods (Ours, VideoCrafter2, Tokenflow Encoder, ConsiS Im2vid, VSTAR); columns show shots (car race, stacking a tower, blow out candles).]