Recent video editing advancements rely on accurate pose sequences to animate human actors. However, these efforts are not suitable for cross-species animation due to pose misalignment between species (for example, the poses of a cat differ greatly from those of a pig due to their distinct body structures). In this paper, we present AnimateZoo, a zero-shot diffusion-based video generator that addresses this issue, aiming to accurately animate various animals while preserving the background. The key technique involves two-fold subject alignment. First, we improve appearance feature extraction by integrating a Laplacian detail booster and a prompt-tuning identity extractor. They capture essential appearance information, including identity and fine details. Second, we align shape features and address conflicts between differing animals by introducing a scale-information remover and an adaptive rescaling module. They both enhance subject alignment for accurate cross-species animation. Additionally, we introduce two high-quality animal video datasets with diverse species to benchmark cross-species animation. Trained on these extensive datasets, our model directly generates videos with accurate movements, consistent appearances, and high-fidelity frames, eliminating the need for test-time training. Extensive experiments demonstrate our method's superiority in cross-species animation, showcasing robust adaptability and generality.
@article{xu2024animatezoo,
title={AnimateZoo: Zero-shot Video Generation of Cross-Species Animation via Subject Alignment},
author={Xu, Yuanfeng and Chen, Yuhao and Huang, Zhongzhan and He, Zijian and Wang, Guangrun and Lin, Liang},
journal={arXiv preprint arXiv:2404.04946},
year={2024}
}