AniClipart: Clipart Animation with Text-to-Video Priors


City University of Hong Kong, Monash University

Abstract

Clipart, a pre-made graphic art form, offers a convenient and efficient way of illustrating visual content. Traditional workflows to convert static clipart images into motion sequences are laborious and time-consuming, involving numerous intricate steps like rigging and in-betweening. Recent advancements in text-to-video generation hold great potential in resolving this problem. Nevertheless, direct application of text-to-video generation models often struggles to retain the visual identity of clipart images or generate iconic (i.e., cartoon-style) motions, resulting in unsatisfactory animation outcomes.
In this paper, we introduce AniClipart, a system that transforms static clipart images into high-quality motion sequences guided by text-to-video priors. To generate iconic and smooth motion, we first define Bézier curves over the keypoints of the clipart image as a form of motion regularization. We then align the motion trajectories of the keypoints with the provided text prompt by optimizing the Video Score Distillation Sampling (VSDS) loss, which encodes adequate knowledge of natural motion within a pretrained text-to-video diffusion model. With a differentiable As-Rigid-As-Possible (ARAP) shape deformation algorithm, our method can be optimized end-to-end while maintaining deformation rigidity.
Experimental results show that the proposed AniClipart consistently outperforms existing image-to-video generation models, in terms of text-video alignment, visual identity preservation, and motion consistency. Furthermore, we showcase the versatility of AniClipart by adapting it to generate a broader array of animation formats, such as layered animation, which allows topological changes.

How does it work?



We define Bézier curves as the motion trajectories of keypoints.
We select the number of frames Q, define the corresponding timesteps, and sample points along the adjusted Bézier paths to determine the new positions of the keypoints. These new keypoints guide the ARAP algorithm to warp the triangular mesh, deforming the clipart into new poses.
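For concreteness, each keypoint's trajectory can be a cubic Bézier curve whose control points are the variables being optimized. Below is a minimal PyTorch sketch of this sampling step; the function name and tensor shapes are illustrative, not the authors' code:

```python
import torch

def sample_bezier_trajectories(ctrl_pts: torch.Tensor, num_frames: int) -> torch.Tensor:
    """Evaluate one cubic Bezier curve per keypoint at Q evenly spaced timesteps.

    ctrl_pts: (K, 4, 2) control points, one cubic curve per keypoint.
    Returns:  (Q, K, 2) keypoint positions, one set per frame.
    """
    t = torch.linspace(0.0, 1.0, num_frames).view(-1, 1, 1)  # (Q, 1, 1)
    p0, p1, p2, p3 = ctrl_pts[:, 0], ctrl_pts[:, 1], ctrl_pts[:, 2], ctrl_pts[:, 3]
    # Bernstein form of a cubic Bezier curve; broadcasting yields (Q, K, 2).
    return ((1 - t) ** 3 * p0
            + 3 * (1 - t) ** 2 * t * p1
            + 3 * (1 - t) * t ** 2 * p2
            + t ** 3 * p3)
```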
Subsequently, we use a differentiable renderer to convert the updated shapes into a video V and feed it to a pretrained video diffusion model to compute the VSDS loss. We also incorporate a skeleton loss to preserve the original shape.
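Putting the pieces together, one optimization step might look like the following sketch. Here `arap_deform`, `render_video`, `video_sds_loss`, and `skeleton_loss` are hypothetical placeholder names for the differentiable ARAP solver, renderer, and losses described above, not a published API:

```python
# Hedged sketch of the optimization loop; rest_mesh, rest_keypoints, bones,
# text_prompt, lam, Q, and num_steps are assumed inputs.
ctrl_pts.requires_grad_(True)
optimizer = torch.optim.Adam([ctrl_pts], lr=1e-2)

for step in range(num_steps):
    keypoints = sample_bezier_trajectories(ctrl_pts, num_frames=Q)  # (Q, K, 2)
    meshes = arap_deform(rest_mesh, keypoints)   # ARAP warp of the triangular mesh
    video = render_video(meshes)                 # differentiable rasterization -> video V
    loss = (video_sds_loss(video, text_prompt)
            + lam * skeleton_loss(keypoints, rest_keypoints, bones))
    optimizer.zero_grad()
    loss.backward()   # gradients flow back to the Bezier control points
    optimizer.step()
```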

Varying the Prompts


[Video: woman_dance]
We can alter the prompts to generate different movements.

Multi-Layer Animation


High-Order Bézier Trajectory
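The trajectory model is not limited to cubic curves; higher-order Bézier curves can encode more complex motion. A minimal sketch of a generic evaluator via de Casteljau's algorithm, where the curve order is set by the number of control points (an illustrative generalization, not the authors' code):

```python
import torch

def de_casteljau(ctrl_pts: torch.Tensor, t: float) -> torch.Tensor:
    """Evaluate a Bezier curve of arbitrary order at parameter t in [0, 1].

    ctrl_pts: (N, 2) control points; the curve order is N - 1.
    """
    pts = ctrl_pts
    while pts.shape[0] > 1:
        # Repeated linear interpolation between consecutive control points.
        pts = (1 - t) * pts[:-1] + t * pts[1:]
    return pts[0]  # (2,) point on the curve
```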


Comparisons to T2V Models & Prior Work


[Videos: man_fencing, crab]
We compare our method with five baselines: four text-to-video (T2V) diffusion models (ModelScope, VideoCrafter, DynamiCrafter, and I2VGen-XL) and LiveSketch.

Ablation Study


Eliminating key components from our system leads to animations with restricted movement (the "Linear Interpolation" and "Image SDS Loss" variants), shape distortion ("Linear Blend Skinning" and "w/o Skeleton Loss"), and inconsistencies across frames ("w/o Trajectory").
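For concreteness, the skeleton loss referenced above can be realized as a bone-length-preservation term; a minimal sketch under that assumption (the bone list and weighting are illustrative, and the paper's exact formulation may differ):

```python
import torch

def skeleton_loss(keypoints: torch.Tensor, rest_keypoints: torch.Tensor,
                  bones: list[tuple[int, int]]) -> torch.Tensor:
    """Penalize deviation of bone lengths from the rest pose.

    keypoints:      (Q, K, 2) animated keypoint positions.
    rest_keypoints: (K, 2) keypoints of the input clipart.
    bones:          index pairs (i, j) defining skeleton edges.
    """
    loss = keypoints.new_zeros(())
    for i, j in bones:
        rest_len = (rest_keypoints[i] - rest_keypoints[j]).norm()
        cur_len = (keypoints[:, i] - keypoints[:, j]).norm(dim=-1)  # (Q,)
        loss = loss + ((cur_len - rest_len) ** 2).mean()
    return loss / len(bones)
```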