Introduction
This project explores the implementation and deployment of diffusion models for image generation.
Part A: The Power of Diffusion Models
Setup
For part A of the project, I used the DeepFloyd/IF-I-XL-v1.0
text-to-image diffusion model to produce images of size 64*64 using various prompts and techniques. The random seed I used is 360.
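For reference, below is a minimal sketch of this setup using the Hugging Face diffusers library; the loading code and variable names here are assumptions for illustration, not the project's verbatim code.

```python
import torch
from diffusers import DiffusionPipeline

# Sketch: load stage 1 of DeepFloyd IF (the 64x64 base model) and fix the random seed.
stage_1 = DiffusionPipeline.from_pretrained(
    "DeepFloyd/IF-I-XL-v1.0", variant="fp16", torch_dtype=torch.float16
).to("cuda")

torch.manual_seed(360)  # seed used throughout part A

# Generate one 64x64 image for a prompt with a chosen number of inference steps.
out = stage_1("an oil painting of a snowy mountain village", num_inference_steps=20)
image = out.images[0]
```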
Below are example outputs of the model using different numbers of inference steps, denoted num_inference_steps:
an oil painting of a snowy mountain village (num_inference_steps = 10)
a man wearing a hat (num_inference_steps = 10)
a rocket ship (num_inference_steps = 10)
an oil painting of a snowy mountain village (num_inference_steps = 20)
a man wearing a hat (num_inference_steps = 20)
a rocket ship (num_inference_steps = 20)
an oil painting of a snowy mountain village (num_inference_steps = 30)
a man wearing a hat (num_inference_steps = 30)
a rocket ship (num_inference_steps = 30)
The quality of the output images depends on both the text prompt and the value of num_inference_steps. We can see that this model generates relatively realistic images for human-related prompts, but becomes more abstract when generating images of objects, as seen in the "a rocket ship" images. A higher num_inference_steps adds more detail to the output, increasing the overall quality.
Sampling Loops
Implementing the Forward Process
The forward process of diffusion takes a clean image and progressively adds noise to it at each timestep t. t = 0 corresponds to the clean image, and larger t corresponds to more noise added. Following the forward-process formula, I implemented this process and generated noisy versions of the test image at t = 250, 500, and 750.
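A minimal sketch of this forward process, following the standard DDPM formulation x_t = √(ᾱ_t)·x_0 + √(1 − ᾱ_t)·ε with ε ~ N(0, I); alphas_cumprod stands for the model's ᾱ schedule and the variable names are illustrative:

```python
import torch

def forward(x0, t, alphas_cumprod):
    """Noise a clean image x0 to timestep t:
    x_t = sqrt(alpha_bar_t) * x0 + sqrt(1 - alpha_bar_t) * eps, eps ~ N(0, I).
    """
    alpha_bar_t = alphas_cumprod[t]
    eps = torch.randn_like(x0)
    return alpha_bar_t.sqrt() * x0 + (1 - alpha_bar_t).sqrt() * eps
```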
Berkeley Campanile
Noisy Campanile at t = 250
Noisy Campanile at t = 500
Noisy Campanile at t = 750
Classical Denoising
A classical method of denoising is to apply a Gaussian blur to the noisy images, trying to filter out the high-frequency noise. Below are the results of applying this method to the previously noised images with kernel_size = 5 and sigma = 1.
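This baseline can be reproduced in one line with torchvision's Gaussian blur (a sketch; im_noisy stands for any of the noisy images above):

```python
from torchvision.transforms.functional import gaussian_blur

# Classical denoising: low-pass filter the noisy image to suppress high-frequency noise.
im_denoised = gaussian_blur(im_noisy, kernel_size=5, sigma=1.0)
```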
Noisy Campanile at t = 250
Noisy Campanile at t = 500
Noisy Campanile at t = 750
Gaussian Blur Denoising at t = 250
Gaussian Blur Denoising at t = 500
Gaussian Blur Denoising at t = 750
It is clear that classical denoising does not work well on these noisy images, and we need a better approach.
One-Step Denoising
The stage 1 UNet of the DeepFloyd diffusion model is a pretrained denoiser that predicts the Gaussian noise added to the image. Thus, removing the predicted noise from the noisy image yields an estimate that is close to the original clean image.
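Concretely, the clean-image estimate comes from inverting the forward-process formula with the predicted noise. A sketch, reusing the names from the forward-process code above; eps_hat is the UNet's noise prediction for the noisy image x_t at timestep t:

```python
# One-step denoising sketch: recover an estimate of the clean image x0
# from the noisy image x_t and the predicted noise eps_hat.
alpha_bar_t = alphas_cumprod[t]
x0_hat = (x_t - (1 - alpha_bar_t).sqrt() * eps_hat) / alpha_bar_t.sqrt()
```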
As the UNet is conditioned on the amount of Gaussian noise by timestep t, I passed in t = [250, 500, 750]
with the corresponding noisy images, and got the following results:
Noisy Campanile at t = 250
Noisy Campanile at t = 500
Noisy Campanile at t = 750
One-Step Denoised Campanile at t = 250
One-Step Denoised Campanile at t = 500
One-Step Denoised Campanile at t = 750
The result of this one-step denoising is already much better than the results of classical denoising. However, quality drops off when t is higher and the image is very noisy.
Iterative Denoising
The reason of quality dropoff is that diffusion models are designed to denoise iteratively instead of only one step.
Following the formula below, at each monotonically decreasing timestep in the list strided_timesteps
, the model estimates a less noisy
image by removing partial predicted noise from the noisy image, until t
is close to 0 and the image is clean.
In this case, I started at t = 990
with a stride of 30.
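A sketch of this loop, assuming the standard DDPM posterior-mean update between strided timesteps (prompt conditioning, CFG, and the per-step noise term are omitted for brevity; variable names are illustrative):

```python
def iterative_denoise(x, strided_timesteps, unet, alphas_cumprod):
    """Denoise x from strided_timesteps[0] (very noisy) down to a clean image."""
    for i in range(len(strided_timesteps) - 1):
        t, t_prev = strided_timesteps[i], strided_timesteps[i + 1]
        alpha_bar_t, alpha_bar_prev = alphas_cumprod[t], alphas_cumprod[t_prev]
        alpha_t = alpha_bar_t / alpha_bar_prev          # per-stride alpha
        beta_t = 1 - alpha_t

        eps_hat = unet(x, t)                            # predicted noise at timestep t
        x0_hat = (x - (1 - alpha_bar_t).sqrt() * eps_hat) / alpha_bar_t.sqrt()

        # Blend the clean estimate and the current image toward the less noisy timestep.
        x = (alpha_bar_prev.sqrt() * beta_t / (1 - alpha_bar_t)) * x0_hat \
            + (alpha_t.sqrt() * (1 - alpha_bar_prev) / (1 - alpha_bar_t)) * x
    return x
```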
Noisy Campanile at t = 90
Noisy Campanile at t = 240
Noisy Campanile at t = 390
Noisy Campanile at t = 540
Noisy Campanile at t = 690
Original
Iteratively Denoised Campanile
One-Step Denoised Campanile
Gaussian Blurred Campanile
Comparing the outputs of the three denoising methods, iterative denoising produces the clearest result that is also the most similar to the original image.
Diffusion Model Sampling
Using the same iterative denoising as above, we can also generate images from scratch. Starting from pure noise at timestep strided_timesteps[0], with the prompt "a high quality photo", the diffusion model denoises the pure noise while following the prompt.
Sample 1
Sample 2
Sample 3
Sample 4
Sample 5
Classifier-Free Guidance (CFG)
The 5 generated images in the last part are reasonable "real photos", but look bland and have relatively low color saturation. Classifier-Free Guidance (CFG) improves the output quality by computing both a conditional and an unconditional noise estimate, denoted ε_c and ε_u. By following the formula below and setting γ = 7, CFG pushes the actual noise estimate toward the direction of the conditional noise estimate, essentially increasing the effectiveness of prompt conditioning.
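In code, the combined estimate is a single line (eps_u and eps_c are the unconditional and conditional UNet outputs):

```python
# Classifier-free guidance: extrapolate from the unconditional estimate
# toward the conditional one by the guidance scale gamma.
gamma = 7
eps = eps_u + gamma * (eps_c - eps_u)
```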
Sample 1 with CFG
Sample 2 with CFG
Sample 3 with CFG
Sample 4 with CFG
Sample 5 with CFG
Following the same steps as in the last part, the resulting images with CFG and γ = 7 have much higher quality.
Image-to-image Translation
With iterative denoising and CFG, we can use a clean test image as the base image, add various amount noise to it, and then denoise without any conditioning.
The result of this SDEdit algorithm will be images similar to the test image, with higher value of i_start
in strided_timesteps[i_start]
producing more
similar images.
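A sketch of this SDEdit procedure, reusing the forward and iterative_denoise sketches from above (the i_start values are illustrative):

```python
# SDEdit sketch: partially noise the clean test image, then denoise it back.
edits = []
for i_start in [1, 3, 5, 7, 10, 20]:
    t = strided_timesteps[i_start]
    x_t = forward(test_image, t, alphas_cumprod)       # add noise up to timestep t
    edits.append(iterative_denoise(x_t, strided_timesteps[i_start:], unet, alphas_cumprod))
```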
Editing Hand-Drawn and Web Images
The procedure in the last part works very well if we want to project a nonrealistic image, such as a painting or a sketch, onto the manifold of natural images. I experimented with some hand-drawn images, as well as realistic images with scribbles on them, and the results are quite interesting.
Inpainting
Another way to use this procedure is inpainting, where we define a mask on a clean image such that, at each step of the diffusion denoising loop, the pixels where the mask equals 0 are replaced with the corresponding pixels of the original image. As a result, only the pixels with mask value 1 are filled in with newly generated content. The results are really interesting!
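In code, the mask step is one extra line inside the denoising loop (a sketch; m is the binary mask and x_orig the original clean image):

```python
# Inpainting sketch: after each denoising step at timestep t, force the pixels
# outside the mask (m == 0) back to a freshly noised copy of the original image,
# so only the masked region (m == 1) is generated.
x = m * x + (1 - m) * forward(x_orig, t, alphas_cumprod)
```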
Text-Conditional Image-to-image Translation
The image-to-image translation procedure can also be conditioned on text prompts other than the baseline "a high quality photo". With this added text control, the output images gradually look more like the original image when i_start is higher, and more like the text prompt when i_start is lower. When i_start is around 10, the output image is a nice hybrid between the original image and the text prompt.
"a rocket ship"
"a photo of a dog"
"a man wearing a hat"
Visual Anagrams
By manipulating the noise estimate at each timestep, diffusion models can produce many interesting results. To generate a visual anagram that looks like one thing, but reveals another when flipped upside down, I averaged the noise estimates for two prompts at each timestep, where one estimate is computed on the flipped image and then flipped back before averaging. The results are shown below.
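A sketch of the per-timestep noise combination (prompt_embeds_1 and prompt_embeds_2 stand for the two prompt conditionings; the flip is along the image height axis):

```python
import torch

# Visual anagram sketch: one prompt is denoised on the upright image, the other on
# the flipped image; the second estimate is flipped back before averaging.
eps_1 = unet(x_t, t, prompt_embeds_1)
eps_2 = torch.flip(unet(torch.flip(x_t, dims=[-2]), t, prompt_embeds_2), dims=[-2])
eps = (eps_1 + eps_2) / 2
```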
"an oil painting of people around a campfire"
"an oil painting of an old man"
"an oil painting of a snowy mountain village"
"a man wearing a hat"
"a photo of the amalfi cost"
"a photo of a hipster barista"
Hybrid Images
Another fun thing to do with diffusion models is generating hybrid images, which display one image when viewed from far away and another when viewed up close. Instead of averaging the noise estimates for two prompts at each timestep, I created a composite noise estimate by combining the low frequencies of one noise estimate with the high frequencies of the other. For both the high-pass and the low-pass filters, I set kernel_size = 33 and sigma = 2. The resulting images indeed show the hybrid image behavior!
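A sketch of the composite noise estimate, using a Gaussian blur as the low-pass filter and its residual as the high-pass filter (eps_1 and eps_2 are the noise estimates for the two prompts):

```python
from torchvision.transforms.functional import gaussian_blur

# Hybrid image sketch: low frequencies from one prompt's noise estimate,
# high frequencies (the blur residual) from the other's.
eps_low = gaussian_blur(eps_1, kernel_size=33, sigma=2.0)
eps_high = eps_2 - gaussian_blur(eps_2, kernel_size=33, sigma=2.0)
eps = eps_low + eps_high
```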
Low-pass: "a lithograph of a skull"
High-pass: "a lithograph of waterfalls"
Low-pass: "a photo of the amalfi cost"
High-pass: "a photo of a dog"
Low-pass: "an oil painting of an old man"
High-pass: "an oil painting of people around a campfire"
Part B: Diffusion Models from Scratch
In part B, I trained diffusion models on MNIST from scratch and added various forms of conditioning to the UNet.
Training a Single-Step Denoising UNet
I started by implementing a simple one-step denoiser, trained to minimize the following L2 loss: L = E‖D_θ(z) − x‖², where D_θ(z) is the output of the denoiser and x is the clean image.
Implementing the UNet
I implemented the denoiser as a UNet built from downsampling and upsampling blocks with skip connections, as shown in the architecture diagram below, together with the definitions of the operations used in it.
Using the UNet to Train a Denoiser
In the forward noising process, I passed in training pairs (z, x) for each training batch, where the noisy image z = x + σε is created by adding Gaussian noise ε ~ N(0, I) to the clean training image x. σ is a hyperparameter indicating the amount of noise added.
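In code, the noising operation is a single line per batch (a sketch):

```python
import torch

# Noising sketch for the single-step denoiser: z = x + sigma * eps, eps ~ N(0, I).
sigma = 0.5
z = x + sigma * torch.randn_like(x)
```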
For the actual training process, I used the following parameters:
sigma = 0.5
batch_size = 256
num_epochs = 5
Adam optimizer with learning_rate = 1e-4
UNet with hidden dimension num_hidden = 128
Some sample results after the 1st and 5th epoch are displayed below:
Results After 1 Epoch of Training
Results After 5 Epochs of Training
Out-of-Distribution Testing
While this denoiser was trained on MNIST digits noised with σ = 0.5, I also tested it on test-set digits noised with varying levels of noise.
Results with Varying Noise Levels
The results show that this denoiser does not generalize well and performs poorly at higher σ.
Training a Time-Conditioned Diffusion Model
To train a full diffusion model similar to DeepFloyd, I changed the UNet to predict the added noise ε instead of the clean image. The new loss function is the following: L = E‖ε_θ(z) − ε‖², where ε_θ(z) is the noise estimate of the diffusion UNet.
As in part A, iteratively denoising the image generally yields better results than the one-step denoising used in the last part. Therefore, we need to inject the scalar t into the UNet to make it time-conditioned.
Since we only need to generate MNIST digits of size 28×28, I set T = 300 instead of 1000, and defined the α and β schedules in ddpm_schedule following a procedure similar to part A, sketched below.
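A sketch of ddpm_schedule, assuming a linear β schedule; the endpoints 1e-4 and 0.02 are assumptions, not values confirmed by this writeup:

```python
import torch

def ddpm_schedule(beta1: float = 1e-4, beta2: float = 0.02, T: int = 300):
    """Return the beta, alpha, and cumulative alpha_bar schedules for T steps."""
    betas = torch.linspace(beta1, beta2, T)   # linearly spaced noise variances
    alphas = 1.0 - betas
    alpha_bars = torch.cumprod(alphas, dim=0)
    return betas, alphas, alpha_bars
```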
The new UNet architecture is shown below; I inject the time-conditioning signal with a new FCBlock operator.
Training the UNet
I trained the UNet using the following training algorithm and parameters; a sketch of the training loop follows the list.
batch_size = 128
num_epochs = 20
Adam optimizer with initial learning_rate = 1e-3
Exponential learning rate decay scheduler scheduler = torch.optim.lr_scheduler.ExponentialLR(optimizer, 0.1 ** (1.0 / num_epochs))
UNet with hidden dimension num_hidden = 64
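A sketch of the training loop under the parameters above; passing the normalized timestep t / T to the UNet is an assumption about the conditioning convention, and the variable names are illustrative:

```python
import torch
import torch.nn.functional as F

optimizer = torch.optim.Adam(unet.parameters(), lr=1e-3)
scheduler = torch.optim.lr_scheduler.ExponentialLR(optimizer, 0.1 ** (1.0 / num_epochs))

for epoch in range(num_epochs):
    for x, _ in train_loader:                          # MNIST images; labels unused here
        t = torch.randint(0, T, (x.shape[0],))         # one random timestep per image
        eps = torch.randn_like(x)
        a_bar = alpha_bars[t].view(-1, 1, 1, 1)
        x_t = a_bar.sqrt() * x + (1 - a_bar).sqrt() * eps

        loss = F.mse_loss(unet(x_t, t / T), eps)       # predict the added noise
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    scheduler.step()
```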
Sampling from the UNet
Sampling with the following algorithm after epoch 5 and epoch 20, we can see the progress of training.
Results After 5 Epochs of Training
Results After 20 Epochs of Training
Notice that the results have clear and distinguishable digit shapes, but we have no way of controlling which digit the model outputs.
Adding Class-Conditioning to UNet
To add more control over which digit the model generates, I also conditioned the UNet on the digit class 0-9 by adding 2 more FCBlocks to the UNet. The new FCBlocks take as input a one-hot vector encoding the digit class. To retain the ability to generate without the class condition, I set the dropout rate to 0.1, so that 10% of the time the class-conditioning vector is set to all zeros, effectively removing the conditioning.
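A sketch of the conditioning-vector construction and dropout (labels is the batch of digit labels):

```python
import torch
import torch.nn.functional as F

# One-hot encode the digit class, then zero the conditioning vector for ~10% of
# samples so the model also learns the unconditional distribution.
c = F.one_hot(labels, num_classes=10).float()
drop = (torch.rand(c.shape[0], 1) < 0.1).float()
c = c * (1 - drop)
```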
Following the below training algorithm with the conditioning vector c
, I trained the new class-conditioned UNet with the same parameters
as above.
Sampling from the Class-Conditioned UNet
For the sampling process, besides using the conditioning vector c to control the digit generated, we also need CFG, since part A showed that conditioning needs classifier-free guidance to be effective. I used CFG with γ = 5.0.
Results After 5 Epochs of Training
Results After 20 Epochs of Training
As we can see, the results this time, especially after 20 epochs of training, are much better in both quality and control.
This project is easily the most interesting one out of all the projects. The fact that CFG simply "just works", even though the reason it improves conditioned results so much is still under vigorous debate, is particularly fascinating to me!