Frequency-Controlled Diffusion Model for Versatile Text-Guided Image-to-Image Translation

AAAI 2024


Xiang Gao,     Zhengbo Xu,     Junhan Zhao,     Jiaying Liu

Wangxuan Institute of Computer Technology, Peking University
{gaoxiang1102, icey.x, liujiaying}@pku.edu.cn

Abstract


Recently, large-scale text-to-image (T2I) diffusion models have emerged as a powerful tool for image-to-image translation (I2I), allowing open-domain image translation via user-provided text prompts. This paper proposes the frequency-controlled diffusion model (FCDiffusion), an end-to-end diffusion-based framework that contributes a novel solution to text-guided I2I from a frequency-domain perspective. At the heart of our framework is a feature-space frequency-domain filtering module based on the Discrete Cosine Transform, which filters the latent features of the source image in the DCT domain, yielding filtered image features bearing different DCT spectral bands as different control signals to the pre-trained Latent Diffusion Model. We reveal that control signals of different DCT spectral bands bridge the source image and the T2I generated image in different correlations (e.g., style, structure, layout, contour, etc.), and thus enable versatile I2I applications emphasizing different I2I correlations, including style-guided content creation, image semantic manipulation, image scene translation, and image style translation. Different from related approaches, FCDiffusion establishes a unified text-guided I2I framework suitable for diverse image translation tasks simply by switching among different frequency control branches at inference time. The effectiveness and superiority of our method for text-guided I2I are demonstrated with extensive experiments both qualitatively and quantitatively. Our project is publicly available at: https://xianggao1102.github.io/FCDiffusion/.


Architecture


[Figure: method architecture]

Overall architecture of FCDiffusion, as well as details of important model components.


Key Ideas


(1) Instruct image-to-image translation with natural language:
Large-scale text-to-image diffusion models have revolutionized the field of image generation. We propose to harness their immense generative power and adapt them from text-to-image generation to the realm of text-guided image-to-image translation (I2I), providing intelligent tools for image manipulation tasks.

(2) Versatile image-to-image translation with a unified framework:
Observing that I2I has diverse application scenarios emphasizing different I2I correlations (e.g., style, structure, layout, contour, etc.) between the source image and the translated image, it is difficult for a single existing method to suit all scenarios well. This inspires us to design a unified framework that enables flexible control over diverse I2I correlations and thus applies to diverse I2I application scenarios.

(3) Realizing versatile I2I translation with different modes of frequency control:
We propose to realize versatile text-guided I2I translation from a novel frequency-domain perspective: the I2I correlation emphasized by each I2I task is modeled with a corresponding frequency band of the diffusion features. Specifically, we filter image features in the Discrete Cosine Transform (DCT) spectrum space and extract the filtered image features carrying a specific DCT frequency band as the control signal that governs the corresponding I2I correlation. Accordingly, we realize the I2I applications of style-guided content creation, image semantic manipulation, image scene translation, and image style translation under mini-frequency control, low-frequency control, mid-frequency control, and high-frequency control, respectively; a code sketch of the band filtering follows below.
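For concreteness, the PyTorch sketch below illustrates the kind of DCT-domain band filtering described above. It is a minimal illustration under our own assumptions, not the released implementation: the function names, the band thresholds, and the diagonal-distance definition of the spectral bands are illustrative choices.

```python
import math
import torch

def dct_matrix(n: int) -> torch.Tensor:
    """Orthonormal DCT-II basis matrix of size n x n."""
    k = torch.arange(n).unsqueeze(1).float()   # frequency index
    i = torch.arange(n).unsqueeze(0).float()   # spatial index
    d = torch.cos(math.pi * (2 * i + 1) * k / (2 * n)) * math.sqrt(2.0 / n)
    d[0] /= math.sqrt(2.0)                     # rescale DC row for orthonormality
    return d

def dct_band_filter(z: torch.Tensor, band: str, t_low: int = 10, t_high: int = 30) -> torch.Tensor:
    """Filter latent features z of shape (B, C, H, W) in the 2D DCT domain and
    return the band-limited features. The thresholds t_low / t_high and the
    band definitions are illustrative hyper-parameters, not the paper's values."""
    b, c, h, w = z.shape
    d_h, d_w = dct_matrix(h).to(z), dct_matrix(w).to(z)

    # forward 2D DCT (separable): F = D_h @ z @ D_w^T
    freq = d_h @ z @ d_w.t()

    # distance of each coefficient from the DC corner, used to define spectral bands
    yy, xx = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
    radius = (yy + xx).to(z.device)

    if band == "mini":                       # keep only coefficients closest to DC
        mask = radius < t_low // 2
    elif band == "low":
        mask = radius < t_low
    elif band == "mid":
        mask = (radius >= t_low) & (radius < t_high)
    elif band == "high":
        mask = radius >= t_high
    else:
        raise ValueError(f"unknown band: {band}")

    freq = freq * mask.to(z.dtype)

    # inverse 2D DCT back to the latent feature space
    return d_h.t() @ freq @ d_w
```

Feeding the resulting band-limited features to the diffusion model as the control signal determines which correlation (style, layout, contour, etc.) is carried over from the source image.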

(4) Frequency spectrum reconstruction learning:
Our FCDiffusion extracts image features carrying different DCT spectral bands as control signals to control the denoising process of the Latent Diffusion Model (LDM). Conditioned on the control signal, the model is trained to reconstruct the filtered-out frequency components of the image features from the textual information in the paired text prompt. At inference time, text-driven I2I translation is thus enabled by feeding in an arbitrary text prompt to guide the completion of the filtered-out DCT spectral components, as sketched below.
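A rough sketch of this reconstruction-style training is given below, reusing the hypothetical dct_band_filter from the previous sketch. The interfaces ldm_unet, control_net, and noise_scheduler are placeholders for the frozen LDM U-Net, the frequency control branch, and a standard DDPM noise scheduler; they do not reflect the actual repository API.

```python
import torch
import torch.nn.functional as F

def training_step(ldm_unet, control_net, z0, text_emb, band, noise_scheduler):
    # Hypothetical sketch of one training step: the frozen LDM U-Net denoises a
    # noisy latent, conditioned on the text embedding and on the DCT-band-filtered
    # control signal. All module and scheduler interfaces are illustrative placeholders.

    # control signal: source-image latent filtered to a single DCT band
    control = dct_band_filter(z0, band=band)

    # standard diffusion forward process: add noise at a random timestep
    t = torch.randint(0, noise_scheduler.num_timesteps, (z0.shape[0],), device=z0.device)
    noise = torch.randn_like(z0)
    z_t = noise_scheduler.add_noise(z0, noise, t)

    # the control branch injects band-limited features into the frozen U-Net;
    # the model must recover the filtered-out spectral components from the text prompt
    residuals = control_net(z_t, t, text_emb, control)
    noise_pred = ldm_unet(z_t, t, text_emb, extra_residuals=residuals)

    # simple noise-prediction (epsilon) loss, as in standard LDM training
    return F.mse_loss(noise_pred, noise)
```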


Results


Below are example I2I translation results, including style-guided content creation realized by mini-frequency control, image semantic manipulation realized by low-frequency control, image style translation realized by high-frequency control, and image scene translation realized by mid-frequency control.

[Figure: results for style-guided content creation]

[Figure: results for image semantic manipulation]

[Figure: results for image style translation]

[Figure: results for image scene translation]