Our method maintains consistent subject identities across shots and follows the text prompts as well as the pretrained model does. VideoCrafter2 shows diverse motion but inconsistent characters. Tokenflow Encoder mainly affects coloring and causes blurring. ConsiS Im2vid struggles with motion alignment and introduces inconsistent identities. VSTAR maintains good identity but struggles to adhere to the text prompts and shows extensive non-specific motion.