OSDFace: One-Step Diffusion Model for Face Restoration

1Shanghai Jiao Tong University, 2vivo Mobile Communication Co., Ltd
*Indicates equal contribution.
†Indicates corresponding authors.


Abstract

Diffusion models have demonstrated impressive performance in face restoration. Yet, their multi-step inference process remains computationally intensive, limiting their applicability in real-world scenarios. Moreover, existing methods often struggle to generate face images that are harmonious, realistic, and consistent with the subject’s identity. In this work, we propose OSDFace, a novel one-step diffusion model for face restoration. Specifically, we propose a visual representation embedder (VRE) to better capture prior information and understand the input face. In VRE, low-quality faces are processed by a visual tokenizer and subsequently embedded with a vector-quantized dictionary to generate visual prompts. Additionally, we incorporate a facial identity loss derived from face recognition to further ensure identity consistency. We further employ a generative adversarial network (GAN) as a guidance model to encourage distribution alignment between the restored face and the ground truth. Experimental results demonstrate that OSDFace surpasses current state-of-the-art (SOTA) methods in both visual quality and quantitative metrics, generating high-fidelity, natural face images with high identity consistency.
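
The facial identity loss mentioned above can be illustrated with a short PyTorch sketch: it penalizes the cosine distance between face-recognition embeddings of the restored face and the ground truth. The function name and the embedder interface here are hypothetical; any frozen face-recognition network (e.g., ArcFace) that maps an image batch to (N, D) embeddings would fit this role.

import torch
import torch.nn.functional as F

def facial_identity_loss(restored, target, embedder):
    # Identity-consistency term (sketch): 1 - cosine similarity between
    # face-recognition embeddings of the restored and ground-truth faces.
    # `embedder` is an assumed frozen face-recognition network (e.g., ArcFace).
    with torch.no_grad():
        e_gt = embedder(target)    # ground-truth identity embedding (no grad)
    e_hat = embedder(restored)     # gradients flow through the restored branch
    return (1.0 - F.cosine_similarity(e_hat, e_gt, dim=-1)).mean()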

Method

Overview of OSDFace

We propose OSDFace, a novel one-step diffusion model for face restoration. First, to establish the visual representation embedder (VRE), we train autoencoders and VQ dictionaries for the HQ and LQ face domains using self-reconstruction and a feature association loss $\mathcal{L}_{\mathrm{assoc}}$. Then, the VRE, comprising the LQ encoder and its dictionary, embeds the LQ face $I_L$ to produce the visual prompt embedding $p_L$. Next, the LQ image $I_L$ and $p_L$ are fed into the generator $G_\theta$ to yield the predicted HQ face $\hat{I}_H = G_\theta(I_L; \mathrm{VRE}(I_L))$. The generator $G_\theta$ incorporates the pretrained VAE and UNet from Stable Diffusion, with only the UNet fine-tuned via LoRA. Additionally, a series of feature alignment losses is applied to ensure the generation of harmonious and coherent face images. The generator and discriminator are trained alternately.
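
To make the pipeline concrete, here is a minimal PyTorch sketch of the inference path. Class names, layer shapes, the codebook size, and the fixed timestep are illustrative assumptions, not the paper's exact configuration: the VRE snaps LQ-face tokens to their nearest entries in a learned VQ dictionary to form the visual prompt $p_L$, and the generator runs a single UNet pass at one fixed timestep before decoding. The vae_encode, unet, and vae_decode callables stand in for the pretrained Stable Diffusion components (with LoRA adapters assumed on the UNet).

import torch
import torch.nn as nn

class VRE(nn.Module):
    # Visual representation embedder (sketch): tokenize the LQ face, snap
    # each token to its nearest VQ-dictionary entry, return the prompt p_L.
    def __init__(self, codebook_size=1024, dim=512):
        super().__init__()
        self.tokenizer = nn.Sequential(              # stand-in visual tokenizer
            nn.Conv2d(3, dim, kernel_size=8, stride=8), nn.SiLU())
        self.dictionary = nn.Embedding(codebook_size, dim)  # VQ dictionary

    def forward(self, lq_face):                      # lq_face: (N, 3, H, W)
        tokens = self.tokenizer(lq_face)             # (N, D, h, w)
        tokens = tokens.flatten(2).transpose(1, 2)   # (N, h*w, D)
        codes = self.dictionary.weight               # (K, D)
        dists = torch.cdist(tokens, codes.unsqueeze(0).expand(tokens.size(0), -1, -1))
        return self.dictionary(dists.argmin(dim=-1)) # visual prompt p_L

class OSDFaceGenerator(nn.Module):
    # One-step generator (sketch): I_hat_H = G_theta(I_L; VRE(I_L)).
    def __init__(self, vae_encode, unet, vae_decode, vre, timestep=999):
        super().__init__()
        self.vae_encode, self.unet, self.vae_decode = vae_encode, unet, vae_decode
        self.vre = vre
        self.timestep = timestep                     # single fixed timestep

    def forward(self, lq_face):
        prompt = self.vre(lq_face)                   # visual prompt p_L
        z = self.vae_encode(lq_face)                 # LQ latent
        z_hat = self.unet(z, self.timestep, prompt)  # one denoising pass
        return self.vae_decode(z_hat)                # predicted HQ face

Because the UNet is evaluated only once at a fixed timestep, inference reduces to a single forward pass rather than an iterative sampling loop, which is what separates this design from multi-step diffusion restorers.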

Results

Quantitative Comparisons
  • Results in Table 1 on the synthetic dataset (CelebA-Test) from the main paper.

  • Results in Table 2 on the real-world datasets (Wider-Test, LFW-Test, WebPhoto-Test) from the main paper.

Visual Comparisons
  • Results in Figure 5 on the synthetic dataset (CelebA-Test) from the main paper.

  • Results in Figure 6 on the real-world datasets (Wider-Test, LFW-Test, WebPhoto-Test) from the main paper.


BibTeX

@inproceedings{wang2025osdface,
    title={{OSDFace}: One-Step Diffusion Model for Face Restoration},
    author={Wang, Jingkai and Gong, Jue and Zhang, Lin and Chen, Zheng and Liu, Xing and Gu, Hong and Liu, Yutong and Zhang, Yulun and Yang, Xiaokang},
    booktitle={CVPR},
    year={2025}
}