- Results in Table 1 of the main paper, on the synthetic dataset (CelebA-Test).
- Results in Table 2 of the main paper, on the real-world datasets (Wider-Test, LFW-Test, WebPhoto-Test).
Diffusion models have demonstrated impressive performance in face restoration. Yet, their multi-step inference process remains computationally intensive, limiting their applicability in real-world scenarios. Moreover, existing methods often struggle to generate face images that are harmonious, realistic, and consistent with the subject’s identity. In this work, we propose OSDFace, a novel one-step diffusion model for face restoration. Specifically, we propose a visual representation embedder (VRE) to better capture prior information and understand the input face. In VRE, low-quality faces are processed by a visual tokenizer and subsequently embedded with a vector-quantized dictionary to generate visual prompts. Additionally, we incorporate a facial identity loss derived from face recognition to further ensure identity consistency. We further employ a generative adversarial network (GAN) as a guidance model to encourage distribution alignment between the restored face and the ground truth. Experimental results demonstrate that OSDFace surpasses current state-of-the-art (SOTA) methods in both visual quality and quantitative metrics, generating high-fidelity, natural face images with high identity consistency.
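As a rough illustration of the facial identity loss mentioned above, the sketch below uses a common formulation: one minus the cosine similarity between face-recognition embeddings of the restored and ground-truth faces. This form is an assumption, since the text here does not give the exact definition, and the function names and toy embedding vectors are hypothetical stand-ins for the output of a real face recognition network.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def identity_loss(emb_restored, emb_target):
    """Assumed form of an identity loss: 1 - cos(e_restored, e_target).

    Returns 0 when the two embeddings point in the same direction and grows
    (up to 2) as they diverge.
    """
    return 1.0 - cosine_similarity(emb_restored, emb_target)


# Hypothetical embeddings; a real pipeline would obtain these from a
# pretrained face recognition model applied to each image.
same_identity = identity_loss([0.6, 0.8], [0.6, 0.8])
different_identity = identity_loss([1.0, 0.0], [0.0, 1.0])
```

In this toy setup, identical embeddings give a loss of zero and orthogonal embeddings give a loss of one, so minimizing the loss pulls the restored face toward the identity of the ground truth.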
We propose OSDFace, a novel one-step diffusion model for face restoration. First, to establish a visual representation embedder (VRE), we train the autoencoder and VQ dictionary for the HQ and LQ face domains using self-reconstruction and a feature association loss \(\mathcal{L}_{\text{assoc}}\). Then, we use the VRE, which contains the LQ encoder and the dictionary, to embed the LQ face \(I_L\), producing the visual prompt embedding \(p_L\). Next, the LQ image \(I_L\) and \(p_L\) are fed into the generator \(\mathcal{G}_\theta\) to yield the predicted HQ face \(\hat{I}_H\): \(\hat{I}_H=\mathcal{G}_\theta(I_L; \operatorname{VRE}(I_L))\). The generator \(\mathcal{G}_\theta\) incorporates the pretrained VAE and UNet from Stable Diffusion, with only the UNet fine-tuned via LoRA. Additionally, a series of feature alignment losses is applied to ensure the generation of harmonious and coherent face images. The generator and discriminator are trained alternately.
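The dictionary step of the VRE can be pictured as a standard vector-quantization lookup: each token produced by the LQ encoder is replaced by its nearest dictionary entry, and the selected entries form the visual prompt \(p_L\). The sketch below assumes a nearest-neighbour lookup under squared Euclidean distance, which is the usual VQ rule but is not confirmed by the text above; the function names, the tiny 2-D codebook, and the token values are hypothetical and chosen only to keep the example readable.

```python
def squared_distance(u, v):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def quantize(tokens, codebook):
    """Map each token to its nearest codebook entry.

    tokens:   list of feature vectors (stand-ins for LQ encoder outputs).
    codebook: list of dictionary vectors (stand-in for the VQ dictionary).
    Returns the chosen indices and the corresponding dictionary vectors,
    which play the role of the visual prompt embedding in this sketch.
    """
    indices = []
    for token in tokens:
        nearest = min(
            range(len(codebook)),
            key=lambda k: squared_distance(token, codebook[k]),
        )
        indices.append(nearest)
    embeddings = [list(codebook[i]) for i in indices]
    return indices, embeddings


# Hypothetical 3-entry dictionary and three encoder tokens.
codebook = [[0.0, 0.0], [1.0, 1.0], [-1.0, 2.0]]
tokens = [[0.9, 1.2], [0.1, -0.2], [-0.8, 1.7]]

indices, prompt = quantize(tokens, codebook)
print(indices)  # [1, 0, 2]
print(prompt)   # [[1.0, 1.0], [0.0, 0.0], [-1.0, 2.0]]
```

In the full model, the resulting prompt would be passed to the generator together with \(I_L\); here it is just a list of vectors, which is enough to show how a degraded input is mapped onto entries of a learned dictionary.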
@article{wang2024osdface,
title={One-Step Diffusion Model for Face Restoration},
author={Wang, Jingkai and Gong, Jue and Zhang, Lin and Chen, Zheng and Liu, Xing and Gu, Hong and Liu, Yutong and Zhang, Yulun and Yang, Xiaokang},
journal={arXiv preprint arXiv:2411.17163},
year={2024}
}