Qualitative Comparisons with Baselines

Ours (top row) demonstrates improved character consistency across shots while maintaining natural motion.

VideoCrafter2 (second row), the vanilla model, shows diverse motion but inconsistent characters between shots.

Tokenflow-Encoder (third row) preserves original motion but struggles with character consistency and introduces coloring artifacts.

ConsiS Im2Vid (bottom row) fails to maintain consistency across shots, and its motion adheres only weakly to the text prompts.

VSTAR struggles with prompt adherence: it may briefly show the initial or final scene before transitioning to the middle sequences. It maintains identity well and shows extensive but non-specific motion.