
Memorizing Swin-Transformer Denoising Network for Diffusion Model

Abstract

Diffusion models have attracted significant attention in image generation. However, existing denoising architectures face complementary limitations: U-Net struggles to capture global context, while Vision Transformers (ViTs) lack the local inductive bias of convolutional receptive fields. To address these challenges, we propose a novel Swin-Transformer-based denoising network that combines the strengths of U-Net and ViT. Our approach integrates a k-Nearest-Neighbor (kNN) based memorizing attention module into the Swin-Transformer, enabling it to retrieve crucial contextual information from feature maps and enhancing its representational capacity. We further introduce a hierarchical time stream embedding scheme that improves how temporal information is injected during denoising, going beyond simple addition or concatenation of fixed time embeddings. Extensive experiments on four benchmark datasets demonstrate that our model outperforms U-Net and ViT denoising networks, achieving FID scores of 14.39 on CRC-VAL-HE-7K and 4.96 on CelebA and surpassing DiT and UViT under our experimental setting. The Memorizing Swin-Transformer architecture, coupled with hierarchical time stream embedding, sets a new state of the art in denoising diffusion models for image generation.
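The abstract gives only a high-level description of the kNN memorizing attention; a minimal PyTorch sketch of such a module, in the spirit of Memorizing Transformers, might look as follows. The ring-buffer memory, the mem_size and k values, and the additive fusion of local and retrieved context are illustrative assumptions, not the paper's exact design.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class MemorizingAttention(nn.Module):
    """Self-attention augmented with a kNN lookup into a non-learned
    ring-buffer memory of past key/value pairs (illustrative sketch)."""

    def __init__(self, dim: int, mem_size: int = 2048, k: int = 32):
        super().__init__()
        self.to_qkv = nn.Linear(dim, dim * 3, bias=False)
        self.scale = dim ** -0.5
        self.k = k
        self.mem_size = mem_size
        # Hypothetical memory buffers; the paper's layout may differ.
        self.register_buffer("mem_k", torch.zeros(mem_size, dim))
        self.register_buffer("mem_v", torch.zeros(mem_size, dim))
        self.register_buffer("mem_ptr", torch.zeros((), dtype=torch.long))

    @torch.no_grad()
    def _write_memory(self, k: torch.Tensor, v: torch.Tensor) -> None:
        # Flatten (batch, tokens, dim) and append to the ring buffer.
        k, v = k.reshape(-1, k.shape[-1]), v.reshape(-1, v.shape[-1])
        idx = (self.mem_ptr + torch.arange(k.shape[0], device=k.device)) % self.mem_size
        self.mem_k[idx], self.mem_v[idx] = k, v
        self.mem_ptr.copy_((self.mem_ptr + k.shape[0]) % self.mem_size)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, tokens, dim), e.g. flattened Swin window features.
        q, k, v = self.to_qkv(x).chunk(3, dim=-1)

        # Ordinary attention over the current tokens.
        local = F.scaled_dot_product_attention(q, k, v)

        # kNN retrieval: score every memory slot, keep the top-k per query.
        sims = torch.einsum("btd,md->btm", q, self.mem_k)   # (B, T, M)
        top = sims.topk(self.k, dim=-1)                     # (B, T, k)
        weights = (top.values * self.scale).softmax(dim=-1)
        retrieved = torch.einsum("btk,btkd->btd", weights, self.mem_v[top.indices])

        self._write_memory(k.detach(), v.detach())
        # Simple additive fusion; a learned gate is another plausible choice.
        return local + retrieved
```

In a Swin block, a layer like this would sit alongside the shifted-window attention, with queries drawn from the current window's features; calling `MemorizingAttention(dim=128)` on a `(2, 64, 128)` tensor returns a tensor of the same shape.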

Article · 2024 · English
Featured Keywords

diffusion models
denoising network
swin-transformer
memorizing attention mechanism