One-Step Diffusion Model for Face Restoration

1Shanghai Jiao Tong University, 2vivo Mobile Communication Co., Ltd
*Indicates Equal Contribution.
Indicates Corresponding Authors.

Face Restoration

Abstract

Diffusion models have demonstrated impressive performance in face restoration. Yet, their multi-step inference process remains computationally intensive, limiting their applicability in real-world scenarios. Moreover, existing methods often struggle to generate face images that are harmonious, realistic, and consistent with the subject’s identity. In this work, we propose OSDFace, a novel one-step diffusion model for face restoration. Specifically, we propose a visual representation embedder (VRE) to better capture prior information and understand the input face. In VRE, low-quality faces are processed by a visual tokenizer and subsequently embedded with a vector-quantized dictionary to generate visual prompts. Additionally, we incorporate a facial identity loss derived from face recognition to further ensure identity consistency. We further employ a generative adversarial network (GAN) as a guidance model to encourage distribution alignment between the restored face and the ground truth. Experimental results demonstrate that OSDFace surpasses current state-of-the-art (SOTA) methods in both visual quality and quantitative metrics, generating high-fidelity, natural face images with high identity consistency.
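To make the identity term concrete: a common way to implement a face-recognition-based identity loss is to compare embeddings of the restored and ground-truth faces produced by a frozen recognition network. The sketch below is a minimal version of that idea; the embedder interface and the exact loss form are assumptions for illustration, not the precise OSDFace formulation.

import torch
import torch.nn.functional as F

def identity_loss(restored, ground_truth, face_embedder):
    """Penalize identity drift between the restored face and the ground truth.

    `face_embedder` is assumed to be a frozen, pretrained face recognition
    network (e.g., an ArcFace-style model) mapping a face image to an
    identity embedding; the loss used by OSDFace may differ in detail.
    """
    with torch.no_grad():
        gt_emb = face_embedder(ground_truth)   # target identity embedding (no gradient)
    restored_emb = face_embedder(restored)     # embedding of the restored face
    # 1 - cosine similarity: 0 when the two identities match perfectly.
    return 1.0 - F.cosine_similarity(restored_emb, gt_emb, dim=-1).mean()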

Method

Overview of OSDFace

We propose OSDFace, a novel one-step diffusion model for face restoration. First, to establish the visual representation embedder (VRE), we train the autoencoder and VQ dictionary for the HQ and LQ face domains using self-reconstruction and a feature association loss \(\mathcal{L}_{\text{assoc}}\). Then, we use the VRE, which comprises the LQ encoder and dictionary, to embed the LQ face \(I_L\), producing the visual prompt embedding \(p_L\). Next, the LQ image \(I_L\) and \(p_L\) are fed into the generator \(\mathcal{G}_\theta\) to yield the predicted HQ face \(\hat{I}_H\): \(\hat{I}_H=\mathcal{G}_\theta(I_L; \operatorname{VRE}(I_L))\). The generator \(\mathcal{G}_\theta\) incorporates the pretrained VAE and UNet from Stable Diffusion, with only the UNet fine-tuned via LoRA. Additionally, a series of feature alignment losses is applied to ensure the generation of harmonious and coherent face images. The generator and discriminator are trained alternately.
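To trace the one-step inference path \(\hat{I}_H=\mathcal{G}_\theta(I_L; \operatorname{VRE}(I_L))\) at a high level, a minimal sketch is given below. The module interfaces (vre, vae, unet) and the fixed-timestep call are illustrative assumptions, not the authors' implementation or the exact Stable Diffusion API.

import torch

@torch.no_grad()
def restore_face(I_L, vre, vae, unet, timestep=999):
    # One-step restoration sketch: hat_I_H = G_theta(I_L; VRE(I_L)).
    # All module interfaces here (vre, vae, unet) are hypothetical stand-ins
    # for the components described above, not the authors' implementation.

    # 1. Visual representation embedder: the LQ encoder plus VQ-dictionary
    #    lookup turn the LQ face into the visual prompt embedding p_L.
    p_L = vre(I_L)

    # 2. Encode the LQ face into the Stable Diffusion latent space.
    z_L = vae.encode(I_L)

    # 3. A single pass through the LoRA-adapted UNet, conditioned on p_L,
    #    with the LQ latent treated as the noisy input at a fixed timestep.
    #    (In practice the UNet output is mapped to a clean latent via the
    #    diffusion schedule; that step is folded into this call for brevity.)
    z_H = unet(z_L, timestep, p_L)

    # 4. Decode the predicted HQ latent back to image space.
    return vae.decode(z_H)

Because the UNet is queried only once at a fixed timestep, inference reduces to a single forward pass through the VAE encoder, UNet, and VAE decoder, which is what removes the multi-step sampling loop of standard diffusion restoration.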

Results

Quantitative Comparisons
  • Results in Table 1 on the synthetic dataset (CelebA-Test) from the main paper.

  • Results in Table 2 on the real-world datasets (Wider-Test, LFW-Test, WebPhoto-Test) from the main paper.

Visual Comparisons
  • Results in Figure 5 on the synthetic dataset (CelebA-Test) from the main paper.

  • Results in Figure 6 on the real-world datasets (Wider-Test, LFW-Test, WebPhoto-Test) from the main paper.

More Comparisons on Synthetic Dataset...

More Comparisons on Real-World Dataset...

BibTeX

@article{wang2024osdface,
    title={One-Step Diffusion Model for Face Restoration},
    author={Wang, Jingkai and Gong, Jue and Zhang, Lin and Chen, Zheng and Liu, Xing and Gu, Hong and Liu, Yutong and Zhang, Yulun and Yang, Xiaokang},
    journal={arXiv preprint arXiv:2411.17163},
    year={2024}
}