Fun With Diffusion Models

CS 180 Project 5

Jiayang Wang | jiayang.wang@berkeley.edu

Introduction

This project explores the implementation and deployment of diffusion models for image generation.

Part A: The Power of Diffusion Models

Setup

For part A of the project, I used the DeepFloyd/IF-I-XL-v1.0 text-to-image diffusion model to produce 64x64 images using various prompts and techniques. The random seed I used is 360.
Below are example outputs of the model at different numbers of inference steps, denoted num_inference_steps.

step10_1.png

an oil painting of a snowy mountain village
num_inference_steps = 10

step10_2.png

a man wearing a hat
num_inference_steps = 10

step10_3.png

a rocket ship
num_inference_steps = 10

step20_1.png

an oil painting of a snowy mountain village
num_inference_steps = 20

step20_2.png

a man wearing a hat
num_inference_steps = 20

step20_3.png

a rocket ship
num_inference_steps = 20

step30_1.png

an oil painting of a snowy mountain village
num_inference_steps = 30

step30_2.png

a man wearing a hat
num_inference_steps = 30

step30_3.png

a rocket ship
num_inference_steps = 30

The quality of the output images depends on both the text prompt and the value of num_inference_steps. The model generates relatively realistic images for human-related prompts, but becomes more abstract when generating objects, as seen in the "a rocket ship" images. A higher num_inference_steps adds more detail to the output, increasing the overall quality.

Sampling Loops

Implementing the Forward Process

The forward process of diffusion takes a clean image and progressively adds noise to it at each timestep t, as shown in the formula.

forward.png

t = 0 corresponds to the clean image, and larger t corresponds to more noise added. Following the formula, I implemented the forward process and generated noisy versions of the test image at t = [250, 500, 750].
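
As a reference for the sections below, here is a minimal sketch of the forward process in code, assuming alphas_cumprod holds the ᾱ schedule from the DeepFloyd stage-1 scheduler (the function and variable names are my own):

    import torch

    def forward(im, t, alphas_cumprod):
        """Noise a clean image: x_t = sqrt(abar_t) * x_0 + sqrt(1 - abar_t) * eps."""
        abar_t = alphas_cumprod[t]
        eps = torch.randn_like(im)          # eps ~ N(0, I)
        return abar_t.sqrt() * im + (1 - abar_t).sqrt() * eps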

0.png

Berkeley Campanile

250.png

Noisy Campanile at t = 250

500.png

Noisy Campanile at t = 500

750.png

Noisy Campanile at t = 750

Classical Denoising

A classical denoising method is to apply a Gaussian blur to the noisy images and try to filter out the high-frequency noise. Below are the results of applying this method to the previously noised images with kernel_size = 5 and sigma = 1.
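
For reference, this classical baseline is just a low-pass filter; a minimal sketch using torchvision, where x_t stands in for one of the noisy images above:

    import torch
    from torchvision.transforms.functional import gaussian_blur

    x_t = torch.rand(3, 64, 64)   # placeholder for a noisy image tensor
    # Blur the noisy image, hoping to suppress the high-frequency noise.
    blurred = gaussian_blur(x_t, kernel_size=5, sigma=1.0)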

250.png

Noisy Campanile at t = 250

500.png

Noisy Campanile at t = 500

750.png

Noisy Campanile at t = 750

denoise_250.png

Gaussian Blur Denoising at t = 250

denoise_500.png

Gaussian Blur Denoising at t = 500

denoise_750.png

Gaussian Blur Denoising at t = 750

It is clear that classical denoising does not work well on these noisy images, and we need a better approach.

One-Step Denoising

The stage 1 UNet of the DeepFloyd diffusion model is a pretrained denoiser that predicts the Gaussian noise added to the image. Thus, removing the predicted noise from the noisy image yields an estimate that is close to the original clean image. Since the UNet is conditioned on the amount of Gaussian noise via the timestep t, I passed in t = [250, 500, 750] with the corresponding noisy images, and got the following results:
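
Given the UNet's noise estimate eps for a noisy image x_t, the clean image can be estimated by inverting the forward equation. A minimal sketch, assuming eps has already been obtained from the stage-1 UNet and alphas_cumprod is the same ᾱ schedule as before:

    def estimate_x0(x_t, eps, t, alphas_cumprod):
        """Invert x_t = sqrt(abar_t) * x_0 + sqrt(1 - abar_t) * eps for x_0."""
        abar_t = alphas_cumprod[t]
        return (x_t - (1 - abar_t).sqrt() * eps) / abar_t.sqrt()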

250.png

Noisy Campanile at t = 250

500.png

Noisy Campanile at t = 500

750.png

Noisy Campanile at t = 750

denoised_250.png

One-Step Denoised Campanile at t = 250

denoised_500.png

One-Step Denoised Campanile at t = 500

denoised_750.png

One-Step Denoised Campanile at t = 750

The results of this one-step denoising are already much better than the results of classical denoising. However, quality drops off when t is higher and the image is very noisy.

Iterative Denoising

The reason for the quality dropoff is that diffusion models are designed to denoise iteratively rather than in a single step.
Following the iterative denoising formula, at each timestep in the monotonically decreasing list strided_timesteps, the model produces a less noisy image by removing part of the predicted noise from the current image, until t is close to 0 and the image is clean. In this case, I started at t = 990 with a stride of 30.
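
A single update of this loop, going from the current timestep t to the next (smaller) timestep t_prev in strided_timesteps, can be sketched as follows. The blending coefficients follow the standard DDPM posterior mean, which is my reading of the formula; x0 is the current clean-image estimate obtained from the noise prediction, and the extra variance term v_σ is omitted:

    def denoise_step(x_t, x0, t, t_prev, alphas_cumprod):
        """Blend the clean-image estimate x0 with the current noisy image x_t
        to produce a slightly less noisy image at timestep t_prev."""
        abar_t, abar_prev = alphas_cumprod[t], alphas_cumprod[t_prev]
        alpha = abar_t / abar_prev        # effective alpha for this stride
        beta = 1 - alpha
        coef_x0 = abar_prev.sqrt() * beta / (1 - abar_t)
        coef_xt = alpha.sqrt() * (1 - abar_prev) / (1 - abar_t)
        return coef_x0 * x0 + coef_xt * x_t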

noisy_90.png

Noisy Campanile at t = 90

noisy_240.png

Noisy Campanile at t = 240

noisy_390.png

Noisy Campanile at t = 390

noisy_540.png

Noisy Campanile at t = 540

noisy_690.png

Noisy Campanile at t = 690

original.png

Original

iterative.png

Iteratively Denoised Campanile

onestep.png

One-Step Denoised Campanile

gaussian.png

Gaussian Blurred Campanile

Comparing the outputs of the three denoising methods, iterative denoising produces the clearest result that is also the most similar to the original image.

Diffusion Model Sampling

Using the same iterative denoising as above, we can also generate images from scratch. Starting from pure noise at timestep strided_timesteps[0], with the prompt "a high quality photo", the diffusion model denoises the pure noise while following the prompt.
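
In code this amounts to handing pure Gaussian noise to the same iterative loop; a minimal sketch, where iterative_denoise stands for the loop described above (its signature here is hypothetical):

    import torch

    x_T = torch.randn(1, 3, 64, 64)             # pure noise at strided_timesteps[0]
    sample = iterative_denoise(x_T, i_start=0)  # prompt: "a high quality photo"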

1.png

Sample 1

2.png

Sample 2

3.png

Sample 3

4.png

Sample 4

5.png

Sample 5

Classifier-Free Guidance (CFG)

The 5 images generated in the last part are plausible "real photos", but look bland and have relatively low color saturation.
Classifier-Free Guidance (CFG) improves the output quality by computing both a conditional and an unconditional noise estimate, denoted ε_c and ε_u. Following ε = ε_u + γ(ε_c - ε_u) with γ = 7, CFG pushes the final noise estimate toward the conditional estimate, essentially amplifying the effect of prompt conditioning.
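
The guidance step itself is a one-liner; a minimal sketch, assuming eps_u and eps_c are the unconditional and conditional noise estimates from the UNet:

    def cfg_noise(eps_u, eps_c, gamma=7.0):
        """Classifier-free guidance: push the estimate toward the conditional direction."""
        return eps_u + gamma * (eps_c - eps_u)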

1.png

Sample 1 with CFG

2.png

Sample 2 with CFG

3.png

Sample 3 with CFG

4.png

Sample 4 with CFG

5.png

Sample 5 with CFG

Following the same steps as in the last part, the resulting images with CFG and γ = 7 have much higher quality.

Image-to-image Translation

With iterative denoising and CFG, we can use a clean test image as the base image, add varying amounts of noise to it, and then denoise it without any additional conditioning. The result of this SDEdit algorithm is a set of images similar to the test image, with higher values of i_start in strided_timesteps[i_start] producing more similar images.
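
A minimal sketch of this SDEdit loop, reusing the forward and iterative_denoise helpers sketched earlier; test_im is the clean test image, and the list of starting indices is only an example:

    strided_timesteps = list(range(990, -1, -30))   # t = 990 down to 0, stride 30
    for i_start in [1, 3, 5, 7, 10, 20]:            # example starting indices
        t = strided_timesteps[i_start]
        x_t = forward(test_im, t, alphas_cumprod)        # re-noise the clean test image
        edit = iterative_denoise(x_t, i_start=i_start)   # denoise back to a clean image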

1.png
2.png
3.png

Editing Hand-Drawn and Web Images

The procedure in the last part also works very well if we want to project a nonrealistic image, such as a painting or a sketch, onto the manifold of natural images. I experimented with hand-drawn images and with realistic images that have scribbles on them, and the results are quite interesting.

1.png
2.png
3.png

Inpainting

Another way to use this procedure is inpainting: we define a mask on a clean image, and at each step of the diffusion denoising loop, the pixels where the mask equals 0 are replaced with the corresponding pixels of the original image (noised to the current timestep). As a result, only the pixels where the mask equals 1 are filled with newly generated content. The results are really interesting!
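
The key modification is a single line inside the denoising loop; a sketch, assuming mask is 1 where new content should be generated, orig_im is the clean original, and denoise_step_with_cfg is a hypothetical name for one CFG denoising step:

    # After each denoising step, force the pixels outside the mask back to a
    # correctly-noised copy of the original image.
    x_t = denoise_step_with_cfg(x_t, t)
    x_t = mask * x_t + (1 - mask) * forward(orig_im, t, alphas_cumprod)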

1.png
2.png
3.png

Text-Conditional Image-to-image Translation

The image-to-image translation procedure can also be conditioned on text prompts other than the baseline "a high quality photo". With this text control, the output looks more like the original image when i_start is higher, and more like the text prompt when i_start is lower. When i_start is around 10, the output is a nice hybrid of the original image and the text prompt.

1.png

"a rocket ship"

2.png

"a photo of a dog"

3.png

"a man wearing a hat"

Visual Anagrams

By manipulating the noise estimate at each timestep, diffusion models can produce many interesting results. To generate a visual anagram that looks like one thing but reveals another when flipped upside down, I combined the noise estimates for two prompts at each timestep: one estimate is computed on the image for the first prompt, the other is computed on the upside-down image for the second prompt and then flipped back, and the two are averaged. The results are shown below.
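
A sketch of the combined noise estimate at one timestep, where unet(x, t, emb) stands for a hypothetical call that returns the noise estimate for prompt embedding emb:

    import torch

    def anagram_noise(unet, x_t, t, emb1, emb2):
        """Average the estimate for prompt 1 on the image with the estimate for
        prompt 2 on the upside-down image (flipped back before averaging)."""
        eps1 = unet(x_t, t, emb1)
        eps2 = torch.flip(unet(torch.flip(x_t, dims=[-2]), t, emb2), dims=[-2])
        return (eps1 + eps2) / 2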

1.png

"an oil painting of people around a campfire"

1_flip.png

"an oil painting of an old man"

2.png

"an oil painting of a snowy mountain village"

2_flip.png

"a man wearing a hat"

3.png

"a photo of the amalfi cost"

3_flip.png

"a photo of a hipster barista"

Hybrid Images

Another fun thing to do with diffusion models is generating hybrid images, which show one thing when viewed from far away and another when viewed up close. Instead of averaging the noise estimates for two prompts at each timestep, I created a composite noise estimate by combining the low frequencies of one estimate with the high frequencies of the other. For both the high-pass and low-pass filters, I set kernel_size = 33 and sigma = 2. The resulting images indeed show the hybrid-image behavior!
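
A sketch of the composite noise estimate, using the same Gaussian blur as the low-pass filter; eps_low and eps_high are the noise estimates for the far-away and close-up prompts:

    from torchvision.transforms.functional import gaussian_blur

    def hybrid_noise(eps_low, eps_high, kernel_size=33, sigma=2.0):
        """Low frequencies from one noise estimate plus high frequencies from the other."""
        low = gaussian_blur(eps_low, kernel_size=kernel_size, sigma=sigma)
        high = eps_high - gaussian_blur(eps_high, kernel_size=kernel_size, sigma=sigma)
        return low + high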

1.png

Low-pass: "a lithograph of a skull"
High-pass: "a lithograph of waterfalls"

2.png

Low-pass: "a photo of the amalfi cost"
High-pass: "a photo of a dog"

3.png

Low-pass: "an oil painting of an old man"
High-pass: "an oil painting of people around a campfire"

Part B: Diffusion Models from Scratch

In part B, I trained diffusion models on MNIST from scratch, and added various kinds of conditioning to the UNets.

Training a Single-Step Denoising UNet

I started by implementing a simple one-step denoiser, trained by minimizing the following L2 loss:

denoising_loss.png

D_θ(z) is the output of the denoiser, and x is the clean image.

Implementing the UNet

I implemented the denoiser as a UNet with downsampling and upsampling blocks connected by skip connections, as shown below.

unconditional_arch.png

The operations in the diagram above are the following:

atomic_ops_new.png

Using the UNet to Train a Denoiser

For the forward noising process, I generated training pairs (z, x), where the noisy image z is created by adding Gaussian noise to the clean training image x in each training batch; σ is a hyperparameter controlling the amount of noise added.

denoising_forward.png
noise_1.png
noise_2.png
noise_3.png

For the actual training process, I used the following parameters:
sigma = 0.5
batch_size = 256
num_epochs = 5
Adam optimizer with learning_rate = 1e-4
UNet with hidden dimension num_hidden = 128
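
A sketch of this training loop, assuming unet is the UNet defined above and train_loader yields MNIST batches of size 256:

    import torch
    import torch.nn.functional as F

    sigma = 0.5
    optimizer = torch.optim.Adam(unet.parameters(), lr=1e-4)

    for epoch in range(5):
        for x, _ in train_loader:
            z = x + sigma * torch.randn_like(x)   # noisy input z = x + sigma * eps
            loss = F.mse_loss(unet(z), x)         # L2 loss against the clean image
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()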

loss.png

Some sample results after the 1st and 5th epoch are displayed below:

output_1.png

Results After 1 Epoch of Training

output_5.png

Results After 5 Epochs of Training

Out-of-Distribution Testing

While this denoiser was trained on MNIST digits noised with σ = 0.5, I also tested it on test-set digits with varying levels of noise.

distribution_1.png
distribution_2.png

Results with Varying Noise Levels

The results show that this denoiser does not generalize well and performs poorly at higher σ.

Training a Time-Conditioned Diffusion Model

To train the full diffusion model similar to DeepFloyd, I changed the UNet to predict the added noise ε instead of the clean image. The new loss function is the following:

diffusion_loss.png

ε_θ(z) is the noise estimate of the diffusion UNet.

Instead of one-step denoising like the denoiser in the last part, iteratively denoising the image generally yields better results, matching the conclusion from part A. Therefore, we need to inject the scalar timestep t into the UNet to make it time-conditioned.

diffusion_time_loss.png

Since we only need to generate 28x28 MNIST digits, I set T = 300 instead of 1000, and defined α and β in ddpm_schedule following the procedure below, similar to part A.

ddpm_schedule.png
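
A sketch of ddpm_schedule; the linear β endpoints below (1e-4 to 0.02) are common DDPM defaults and are an assumption on my part:

    import torch

    def ddpm_schedule(beta1=1e-4, beta2=0.02, T=300):
        """Linear variance schedule: betas, alphas, and their cumulative product abar."""
        betas = torch.linspace(beta1, beta2, T + 1)
        alphas = 1.0 - betas
        alphas_cumprod = torch.cumprod(alphas, dim=0)
        return {"betas": betas, "alphas": alphas, "alphas_cumprod": alphas_cumprod}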

The new UNet architecture is shown below. I inject the time-conditioning signal with a new operator FCBlock:

conditional_arch.png
fc_long.png
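
One plausible way to implement the FCBlock is a small MLP that embeds the normalized timestep so it can be broadcast onto intermediate feature maps; this is only a sketch, and the exact layer sizes should follow the diagram above:

    import torch.nn as nn

    class FCBlock(nn.Module):
        """Embed a scalar conditioning signal (e.g. t / T) for injection into the UNet."""
        def __init__(self, in_ch, out_ch):
            super().__init__()
            self.net = nn.Sequential(
                nn.Linear(in_ch, out_ch),
                nn.GELU(),
                nn.Linear(out_ch, out_ch),
            )

        def forward(self, t):
            return self.net(t)

The output can then be reshaped to [B, C, 1, 1] and added to or multiplied into the corresponding feature map inside the UNet.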

Training the UNet

I trained the UNet using the following training algorithm and parameters.

algo1_t_only.png

batch_size = 128
num_epochs = 20
Adam optimizer with initial learning_rate = 1e-3
Exponential learning rate decay scheduler scheduler = torch.optim.lr_scheduler.ExponentialLR(optimizer, 0.1 ** (1.0 / num_epochs))
UNet with hidden dimension num_hidden = 64
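
A sketch of this time-conditioned training loop, assuming unet(x_t, t_norm) takes the noisy batch and the normalized timestep (the call signature is an assumption) and alphas_cumprod comes from ddpm_schedule:

    import torch
    import torch.nn.functional as F

    T, num_epochs = 300, 20
    optimizer = torch.optim.Adam(unet.parameters(), lr=1e-3)
    scheduler = torch.optim.lr_scheduler.ExponentialLR(optimizer, 0.1 ** (1.0 / num_epochs))

    for epoch in range(num_epochs):
        for x0, _ in train_loader:                               # MNIST batches of size 128
            t = torch.randint(1, T + 1, (x0.shape[0],))          # random timestep per image
            abar = alphas_cumprod[t].view(-1, 1, 1, 1)
            eps = torch.randn_like(x0)
            x_t = abar.sqrt() * x0 + (1 - abar).sqrt() * eps      # forward process
            loss = F.mse_loss(unet(x_t, t.view(-1, 1) / T), eps)  # predict the added noise
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
        scheduler.step()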

loss.png

Sampling from the UNet

algo2_t_only.png

Sampling with the algorithm above after epoch 5 and epoch 20, we can see the progress of training.
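
A sketch of the sampling loop (standard DDPM ancestral sampling; betas, alphas, and alphas_cumprod come from ddpm_schedule, and the unet call signature matches the training sketch above):

    import torch

    @torch.no_grad()
    def sample(unet, n=40, T=300):
        x = torch.randn(n, 1, 28, 28)                            # start from pure noise
        for t in range(T, 0, -1):
            z = torch.randn_like(x) if t > 1 else torch.zeros_like(x)
            eps = unet(x, torch.full((n, 1), t / T))             # time-conditioned estimate
            mean = (x - (1 - alphas[t]) / (1 - alphas_cumprod[t]).sqrt() * eps) / alphas[t].sqrt()
            x = mean + betas[t].sqrt() * z
        return x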

output_5.png

Results After 5 Epochs of Training

output_20.png

Results After 20 Epochs of Training

Notice that the results have clear and distinguishable digit shapes, but we have no way to control which digit the model outputs.

Adding Class-Conditioning to UNet

To gain control over which digit the model generates, I also conditioned the UNet on the digit class 0-9 by adding 2 more FCBlocks to the UNet. The new FCBlocks take a one-hot vector encoding the digit class as input. To retain the ability to generate without class conditioning, I used a dropout rate of 0.1, so that 10% of the time the class-conditioning vector is set to all zeros, effectively removing the conditioning.
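
A sketch of how the conditioning vector can be built with this dropout (the helper name is my own):

    import torch
    import torch.nn.functional as F

    def class_condition(labels, num_classes=10, p_uncond=0.1):
        """One-hot encode digit labels, then zero the whole vector 10% of the time
        so the model also learns unconditional generation."""
        c = F.one_hot(labels, num_classes).float()
        drop = (torch.rand(labels.shape[0], 1) < p_uncond).float()
        return c * (1 - drop)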

Following the below training algorithm with the conditioning vector c, I trained the new class-conditioned UNet with the same parameters as above.

algo3_c.png
loss.png

Sampling from the Class-Conditioned UNet

algo4_c.png

For the sampling process, besides using the conditioning vector c to control the generated digit, we also need CFG, since part A showed that conditioning needs classifier-free guidance to be effective. I used CFG with γ = 5.0.
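
Inside the sampling loop, this amounts to running the UNet twice per step; a sketch, assuming unet(x, t_norm, c) also accepts the class vector (the signature is an assumption):

    eps_c = unet(x, t_norm, c)                      # conditioned on the one-hot digit class
    eps_u = unet(x, t_norm, torch.zeros_like(c))    # zeroed class vector = unconditional
    eps = eps_u + 5.0 * (eps_c - eps_u)             # CFG with gamma = 5.0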

output_5.png

Results After 5 Epochs of Training

output_20.png

Results After 20 Epochs of Training

As we can see, the results this time, especially after 20 epochs of training, are much better in both quality and control.

This project is easily the most interesting one of all the projects. The fact that CFG simply "just works", even though the reason it improves conditioned results so much is still actively debated, is particularly fascinating to me!