DiffusionAct: Controllable Diffusion Autoencoder for One-shot Face Reenactment

1Kingston University, London, UK
2City University of London, UK
3Queen Mary University of London, UK

DiffusionAct is a diffusion-based method that performs one-shot, self- and cross-subject neural face reenactment without any subject-specific fine-tuning. We demonstrate that, compared to current state-of-the-art methods, our approach produces realistic, artifact-free images, accurately transfers the target head pose and expression, and faithfully reconstructs the source identity and appearance under challenging conditions, e.g., large head pose movements.

Abstract

Video-driven neural face reenactment aims to synthesize realistic facial images that successfully preserve the identity and appearance of a source face, while transferring the target head pose and facial expressions. Existing GAN-based methods suffer from either distortions and visual artifacts or poor reconstruction quality, i.e., the background and several important appearance details, such as hair style/color, glasses and accessories, are not faithfully reconstructed. Recent advances in Diffusion Probabilistic Models (DPMs) enable the generation of high-quality realistic images. To this end, in this paper we present DiffusionAct, a novel method that leverages the photo-realistic image generation of diffusion models to perform neural face reenactment. Specifically, we propose to control the semantic space of a Diffusion Autoencoder (DiffAE), in order to edit the facial pose of the input images, defined as the head pose orientation and the facial expressions. Our method allows one-shot, self, and cross-subject reenactment, without requiring subject-specific fine-tuning. We compare against state-of-the-art GAN-, StyleGAN2-, and diffusion-based methods, showing better or on-par reenactment performance.
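For background, the semantic code of a DiffAE conditions a deterministic DDIM decoder. In standard DDIM notation (a textbook recap, not a verbatim excerpt from the paper), with $\alpha_t$ the cumulative noise schedule and $\boldsymbol{\epsilon}_\theta$ the noise-prediction network conditioned on a semantic code $\mathbf{z}$, one reverse step reads

$$\mathbf{x}_{t-1} = \sqrt{\alpha_{t-1}}\,\frac{\mathbf{x}_t - \sqrt{1-\alpha_t}\,\boldsymbol{\epsilon}_\theta(\mathbf{x}_t, t, \mathbf{z})}{\sqrt{\alpha_t}} + \sqrt{1-\alpha_{t-1}}\,\boldsymbol{\epsilon}_\theta(\mathbf{x}_t, t, \mathbf{z}),$$

so editing $\mathbf{z}$ steers the semantics of the decoded image, while the noise map $\mathbf{x}_T$ fixes the remaining low-level details.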

Method


We present a method for neural face reenactment based on a pre-trained Diffusion Probabilistic Model (DPM). Specifically, given a source image $\mathbf{x}_0^s$ and a target image $\mathbf{x}_0^t$, we propose to condition the pre-trained semantic encoder of a Diffusion Autoencoder (DiffAE) on the target facial landmarks $\mathbf{y}_t$. Our reenactment encoder $\mathcal{E}_r$ learns to predict the semantic code $\mathbf{z}_r$ that, when decoded by the pre-trained DDIM sampler, generates the reenacted image $\mathbf{x}_0^r$, which captures the source identity and appearance together with the target head pose and facial expressions.
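Below is a minimal sketch of how such a pipeline could be wired together. The module names and interfaces (the frozen DiffAE semantic encoder, the landmark branch, the late fusion into $\mathbf{z}_r$, and DDIMSampler.decode) are hypothetical placeholders chosen for illustration under stated assumptions; they are one possible way to realize the conditioning described above, not the released implementation.

    # Hypothetical sketch of the inference pipeline described above (PyTorch).
    # The frozen DiffAE semantic encoder, the landmark branch, and the
    # ddim_sampler.decode interface are illustrative assumptions, not the
    # official DiffusionAct code.
    import torch
    import torch.nn as nn


    class ReenactmentEncoder(nn.Module):
        """Predicts the reenacted semantic code z_r from a source image and
        target facial landmarks y_t (illustrative architecture only)."""

        def __init__(self, semantic_encoder: nn.Module,
                     landmark_dim: int = 68 * 2, z_dim: int = 512):
            super().__init__()
            self.semantic_encoder = semantic_encoder      # pre-trained DiffAE encoder
            for p in self.semantic_encoder.parameters():  # kept frozen
                p.requires_grad_(False)
            self.landmark_mlp = nn.Sequential(
                nn.Linear(landmark_dim, z_dim), nn.ReLU(), nn.Linear(z_dim, z_dim))
            self.fuse = nn.Sequential(
                nn.Linear(2 * z_dim, z_dim), nn.ReLU(), nn.Linear(z_dim, z_dim))

        def forward(self, x_source: torch.Tensor, y_target: torch.Tensor) -> torch.Tensor:
            z_src = self.semantic_encoder(x_source)           # identity / appearance
            z_pose = self.landmark_mlp(y_target.flatten(1))   # target pose / expression
            return self.fuse(torch.cat([z_src, z_pose], dim=1))  # reenacted code z_r


    @torch.no_grad()
    def reenact(x_source, y_target, x_T, reenactment_encoder, ddim_sampler):
        """Decode the reenacted code with the frozen DDIM sampler of DiffAE."""
        z_r = reenactment_encoder(x_source, y_target)
        # decode() is assumed to run the deterministic reverse process conditioned
        # on the semantic code, starting from the noise map x_T of the source.
        return ddim_sampler.decode(x_T=x_T, z_sem=z_r)

In a sketch like this, only the reenactment encoder would be trained while the DDIM sampler stays frozen, which leaves the photo-realistic DiffAE decoder untouched.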


Generated Videos





Comparisons with face reenactment methods



Self Reenactment

Cross-subject Reenactment

BibTeX

 @article{bounareli2024diffusionact,
        author    = {Bounareli, Stella and Tzelepis, Christos and Argyriou, Vasileios and Patras, Ioannis and Tzimiropoulos, Georgios},
        title     = {DiffusionAct: Controllable Diffusion Autoencoder for One-shot Face Reenactment},
        journal   = {arXiv},
        year      = {2024},
    }