Diffusion-driven GAN Inversion for Multi-Modal Face Image Generation

Jihyun Kim¹,²

Changjae Oh³

Hoseok Do¹

Soohyun Kim¹

Kwanghoon Sohn²,*

¹ AI Lab, CTO Division, LG Electronics
² Yonsei University
³ Queen Mary University of London
* Corresponding author

[Paper]

CVPR 2024 Video

Abstract

We present a new multi-modal face image generation method that converts a text prompt and a visual input, such as a semantic mask or scribble map, into a photo-realistic face image. To do this, we combine the strengths of GANs and diffusion models by mapping the multi-modal features of the diffusion model into the latent space of pre-trained GANs. We present a simple mapping network and a style modulation network that link the two models and convert meaningful representations in feature maps and attention maps into latent codes. With GAN inversion, the estimated latent codes can be used to generate 2D or 3D-aware facial images. We further present a multi-step training strategy that reflects textual and structural representations in the generated image. With the proposed networks, our method produces realistic 2D, 3D, and stylized face images that align well with the inputs. We validate our method using pre-trained 2D and 3D GANs, and our results outperform existing methods.

Overview of our method

Our method uses a mapping network and a diffusion-based encoder. The encoder, which consists of the middle and decoder blocks of a denoising U-Net, extracts semantic features, intermediate features, and cross-attention maps.
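As a rough illustration of this data flow, the toy sketch below shows how the three encoder outputs could be turned into a single latent code. It is not the authors' implementation. The function names, the tiny dimensions, the random weights, and the scale-and-shift form of the modulation are all assumptions made for illustration, and a real system would use a deep-learning framework and a pre-trained GAN rather than plain Python lists.

```python
import random

LATENT_DIM = 8  # hypothetical size; real GAN latent codes are much larger


def mean_pool(feature_map):
    """Average a C x N feature map over its N spatial positions."""
    return [sum(channel) / len(channel) for channel in feature_map]


def attention_pool(feature_map, attention):
    """Average a C x N feature map, weighting each position by attention."""
    total = sum(attention)
    return [sum(v * a for v, a in zip(channel, attention)) / total
            for channel in feature_map]


def linear(x, weight, bias):
    """Apply a dense layer: weight is out_dim x in_dim, bias is out_dim."""
    return [sum(w * v for w, v in zip(row, x)) + b
            for row, b in zip(weight, bias)]


def init_linear(rng, out_dim, in_dim):
    """Create small random weights and a zero bias (stand-in for training)."""
    weight = [[rng.gauss(0.0, 0.1) for _ in range(in_dim)]
              for _ in range(out_dim)]
    return weight, [0.0] * out_dim


def estimate_latent(semantic, intermediate, attention, params):
    # Assumed mapping network: semantic features -> base latent code.
    w_mapped = linear(mean_pool(semantic), *params["map"])
    # Assumed style modulation: attention-weighted intermediate features
    # produce a per-dimension scale and shift for the base latent code.
    pooled = attention_pool(intermediate, attention)
    scale = linear(pooled, *params["scale"])
    shift = linear(pooled, *params["shift"])
    return [w * (1.0 + s) + b for w, s, b in zip(w_mapped, scale, shift)]


rng = random.Random(0)
C, N = 4, 16  # hypothetical channel count and number of spatial positions
semantic = [[rng.random() for _ in range(N)] for _ in range(C)]
intermediate = [[rng.random() for _ in range(N)] for _ in range(C)]
attention = [rng.random() for _ in range(N)]
params = {name: init_linear(rng, LATENT_DIM, C)
          for name in ("map", "scale", "shift")}

latent = estimate_latent(semantic, intermediate, attention, params)
print(len(latent))  # 8
```

In the described method, a latent code of this kind would then be passed to a pre-trained 2D or 3D GAN generator, which the sketch leaves out.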

Quantitative results

Quantitative results of multi-modal face image generation on CelebAMask-HQ with annotated text prompts.

Visual examples

Visual examples of the 2D face image generation using a text prompt and a semantic mask.

Visual examples of the 3D-aware face image generation using a text prompt and a semantic mask. We show images generated from the inputs and rendered at arbitrary viewpoints.

Visual examples of multi-view face image generation using text prompts and scribble maps. Using (1-4) the text prompts and their corresponding (a) scribble maps, we compare the results of (b) ControlNet with (c) the multi-view images generated by our method.

The results of 3D face style transfer using semantic masks and style text prompts.

Results verifying semantic consistency. We keep the text prompts fixed but change components of the visual inputs from Figure 5 of the main paper, such as hair, glasses, and eyes.

CVPR 2024 Poster

Acknowledgements

This template was originally made by Phillip Isola and Richard Zhang for a colorful ECCV project.