Introduction
This project explores the implementation and deployment of diffusion models for image generation.
Part A: The Power of Diffusion Models
Setup
For part A of the project, I used the DeepFloyd/IF-I-XL-v1.0
text-to-image diffusion model to produce images of size 64*64 using various prompts and techniques. The random seed I used is 360.
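For reference, below is a minimal sketch of this setup using the Hugging Face diffusers library; the loading code and variable names here are assumptions for illustration, not the project's verbatim code.

```python
import torch
from diffusers import DiffusionPipeline

# Sketch: load stage 1 of DeepFloyd IF (the 64x64 base model) and fix the random seed.
stage_1 = DiffusionPipeline.from_pretrained(
    "DeepFloyd/IF-I-XL-v1.0", variant="fp16", torch_dtype=torch.float16
).to("cuda")

torch.manual_seed(360)  # seed used throughout part A

# Generate one 64x64 image for a prompt with a chosen number of inference steps.
out = stage_1("an oil painting of a snowy mountain village", num_inference_steps=20)
image = out.images[0]
```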
Below are example outputs of the model using different numbers of inference steps, denoted num_inference_steps:
an oil painting of a snowy mountain village (num_inference_steps = 10)
a man wearing a hat (num_inference_steps = 10)
a rocket ship (num_inference_steps = 10)
an oil painting of a snowy mountain village (num_inference_steps = 20)
a man wearing a hat (num_inference_steps = 20)
a rocket ship (num_inference_steps = 20)
an oil painting of a snowy mountain village (num_inference_steps = 30)
a man wearing a hat (num_inference_steps = 30)
a rocket ship (num_inference_steps = 30)
The quality of the output images depends on both the text prompt and the value of num_inference_steps. We can see that this model generates relatively realistic images for human-related prompts, but becomes more abstract when generating images of objects, as seen in the "a rocket ship" images. A higher num_inference_steps adds more detail to the output, increasing the overall quality.
Sampling Loops
Implementing the Forward Process
The forward process of diffusion takes a clean image and progressively adds noise to it at each timestep t. t = 0 corresponds to the clean image, and larger t corresponds to more noise added. Following the forward-process formula, I implemented this process and generated noisy versions of the test image at t = 250, 500, and 750.
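A minimal sketch of this forward process, following the standard DDPM formulation x_t = √(ᾱ_t)·x_0 + √(1 − ᾱ_t)·ε with ε ~ N(0, I); alphas_cumprod stands for the model's ᾱ schedule and the variable names are illustrative:

```python
import torch

def forward(x0, t, alphas_cumprod):
    """Noise a clean image x0 to timestep t:
    x_t = sqrt(alpha_bar_t) * x0 + sqrt(1 - alpha_bar_t) * eps, eps ~ N(0, I).
    """
    alpha_bar_t = alphas_cumprod[t]
    eps = torch.randn_like(x0)
    return alpha_bar_t.sqrt() * x0 + (1 - alpha_bar_t).sqrt() * eps
```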
Berkeley Campanile
Noisy Campanile at t = 250
Noisy Campanile at t = 500
Noisy Campanile at t = 750
Classical Denoising
A classical method of denoising is to apply a Gaussian blur to the noisy images, trying to filter out the high-frequency noise. Below are the results of applying this method to the previously noised images with kernel_size = 5 and sigma = 1.
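This baseline can be reproduced in one line with torchvision's Gaussian blur (a sketch; im_noisy stands for any of the noisy images above):

```python
from torchvision.transforms.functional import gaussian_blur

# Classical denoising: low-pass filter the noisy image to suppress high-frequency noise.
im_denoised = gaussian_blur(im_noisy, kernel_size=5, sigma=1.0)
```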
Noisy Campanile at t = 250
Noisy Campanile at t = 500
Noisy Campanile at t = 750
Gaussian Blur Denoising at t = 250
Gaussian Blur Denoising at t = 500
Gaussian Blur Denoising at t = 750
It is clear that classical denoising does not work well on these noisy images, and we need a better approach.
One-Step Denoising
The stage 1 UNet of the DeepFloyd diffusion model is a pretrained denoiser that predicts the Gaussian noise added to the image. Thus, removing the predicted noise from the noisy image yields an estimate that is close to the original clean image.
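Concretely, the clean-image estimate comes from inverting the forward-process formula with the predicted noise. A sketch, reusing the names from the forward-process code above; eps_hat is the UNet's noise prediction for the noisy image x_t at timestep t:

```python
# One-step denoising sketch: recover an estimate of the clean image x0
# from the noisy image x_t and the predicted noise eps_hat.
alpha_bar_t = alphas_cumprod[t]
x0_hat = (x_t - (1 - alpha_bar_t).sqrt() * eps_hat) / alpha_bar_t.sqrt()
```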
As the UNet is conditioned on the amount of Gaussian noise by timestep t, I passed in t = [250, 500, 750]
with the corresponding noisy images, and got the following results:
Noisy Campanile at t = 250
Noisy Campanile at t = 500
Noisy Campanile at t = 750
One-Step Denoised Campanile at t = 250
One-Step Denoised Campanile at t = 500
One-Step Denoised Campanile at t = 750
The result of this one-step denoising is already much better than the results of classical denoising. However, quality drops off when t is higher and the image is very noisy.
Iterative Denoising
The reason of quality dropoff is that diffusion models are designed to denoise iteratively instead of only one step.
Following the formula below, at each monotonically decreasing timestep in the list strided_timesteps
, the model estimates a less noisy
image by removing partial predicted noise from the noisy image, until t
is close to 0 and the image is clean.
In this case, I started at t = 990
with a stride of 30.
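A sketch of this loop, assuming the standard DDPM posterior-mean update between strided timesteps (prompt conditioning, CFG, and the per-step noise term are omitted for brevity; variable names are illustrative):

```python
def iterative_denoise(x, strided_timesteps, unet, alphas_cumprod):
    """Denoise x from strided_timesteps[0] (very noisy) down to a clean image."""
    for i in range(len(strided_timesteps) - 1):
        t, t_prev = strided_timesteps[i], strided_timesteps[i + 1]
        alpha_bar_t, alpha_bar_prev = alphas_cumprod[t], alphas_cumprod[t_prev]
        alpha_t = alpha_bar_t / alpha_bar_prev          # per-stride alpha
        beta_t = 1 - alpha_t

        eps_hat = unet(x, t)                            # predicted noise at timestep t
        x0_hat = (x - (1 - alpha_bar_t).sqrt() * eps_hat) / alpha_bar_t.sqrt()

        # Blend the clean estimate and the current image toward the less noisy timestep.
        x = (alpha_bar_prev.sqrt() * beta_t / (1 - alpha_bar_t)) * x0_hat \
            + (alpha_t.sqrt() * (1 - alpha_bar_prev) / (1 - alpha_bar_t)) * x
    return x
```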
Noisy Campanile at t = 90
Noisy Campanile at t = 240
Noisy Campanile at t = 390
Noisy Campanile at t = 540
Noisy Campanile at t = 690
Original
Iteratively Denoised Campanile
One-Step Denoised Campanile
Gaussian Blurred Campanile
Comparing the outputs of the three denoising methods, iterative denoising produces the clearest result that is also the most similar to the original image.
Diffusion Model Sampling
Using the same iterative denoising as above, we can also generate images from scratch. Starting from pure noise at timestep strided_timesteps[0], with the prompt "a high quality photo", the diffusion model denoises the pure noise while following the prompt.
Sample 1
Sample 2
Sample 3
Sample 4
Sample 5
Classifier-Free Guidance (CFG)
The 5 generated images in the last part are reasonable "real photos", but look bland and have relatively low color saturation. Classifier-Free Guidance (CFG) improves the output quality by computing both a conditional and an unconditional noise estimate, denoted ε_c and ε_u. By following the formula below and setting γ = 7, CFG pushes the actual noise estimate toward the direction of the conditional noise estimate, essentially increasing the effectiveness of prompt conditioning.
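In code, the combined estimate is a single line (eps_u and eps_c are the unconditional and conditional UNet outputs):

```python
# Classifier-free guidance: extrapolate from the unconditional estimate
# toward the conditional one by the guidance scale gamma.
gamma = 7
eps = eps_u + gamma * (eps_c - eps_u)
```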
Sample 1 with CFG
Sample 2 with CFG
Sample 3 with CFG
Sample 4 with CFG
Sample 5 with CFG
Following the same steps as in the last part, the resulting images with CFG and γ = 7 have much higher quality.
Image-to-image Translation
With iterative denoising and CFG, we can use a clean test image as the base image, add various amount noise to it, and then denoise without any conditioning.
The result of this SDEdit algorithm will be images similar to the test image, with higher value of i_start
in strided_timesteps[i_start]
producing more
similar images.
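A sketch of this SDEdit procedure, reusing the forward and iterative_denoise sketches from above (the i_start values are illustrative):

```python
# SDEdit sketch: partially noise the clean test image, then denoise it back.
edits = []
for i_start in [1, 3, 5, 7, 10, 20]:
    t = strided_timesteps[i_start]
    x_t = forward(test_image, t, alphas_cumprod)       # add noise up to timestep t
    edits.append(iterative_denoise(x_t, strided_timesteps[i_start:], unet, alphas_cumprod))
```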
Editing Hand-Drawn and Web Images
The procedure in the last part works very well if we want to project a nonrealistic image, such as a painting or a sketch, onto the manifold of natural images. I experimented with some hand-drawn images, as well as realistic images with scribbles on them, and the results are quite interesting.
Inpainting
Another way to use this procedure is inpainting, where we define a mask on a clean image such that, at each step of the diffusion denoising loop, the pixels where the mask equals 0 are replaced with the corresponding pixels of the original image. As a result, only the pixels with mask value 1 are filled in with newly generated content. The results are really interesting!
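In code, the mask step is one extra line inside the denoising loop (a sketch; m is the binary mask and x_orig the original clean image):

```python
# Inpainting sketch: after each denoising step at timestep t, force the pixels
# outside the mask (m == 0) back to a freshly noised copy of the original image,
# so only the masked region (m == 1) is generated.
x = m * x + (1 - m) * forward(x_orig, t, alphas_cumprod)
```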
Text-Conditional Image-to-image Translation
The image-to-image translation procedure can also be conditioned on text prompts other than the baseline "a high quality photo". With this added text control, the output images gradually look more like the original image when i_start is higher, and more like the text prompt when i_start is lower. When i_start is around 10, the output image is a nice hybrid between the original image and the text prompt.
"a rocket ship"
"a photo of a dog"
"a man wearing a hat"
Visual Anagrams
By manipulating the noise estimate at each timestep, diffusion models can produce many interesting results. To generate a visual anagram that looks like one thing, but reveals another when flipped upside down, I averaged the noise estimates for two prompts at each timestep, where one estimate is computed on the flipped image and then flipped back before averaging. The results are shown below.
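A sketch of the per-timestep noise combination (prompt_embeds_1 and prompt_embeds_2 stand for the two prompt conditionings; the flip is along the image height axis):

```python
import torch

# Visual anagram sketch: one prompt is denoised on the upright image, the other on
# the flipped image; the second estimate is flipped back before averaging.
eps_1 = unet(x_t, t, prompt_embeds_1)
eps_2 = torch.flip(unet(torch.flip(x_t, dims=[-2]), t, prompt_embeds_2), dims=[-2])
eps = (eps_1 + eps_2) / 2
```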
"an oil painting of people around a campfire"
"an oil painting of an old man"
"an oil painting of a snowy mountain village"
"a man wearing a hat"
"a photo of the amalfi cost"
"a photo of a hipster barista"
Hybrid Images
Another fun thing to do with diffusion models is generating hybrid images, which display one image when viewed from far away and another when viewed up close. Instead of averaging the noise estimates for two prompts at each timestep, I created a composite noise estimate by combining the low frequencies of one noise estimate with the high frequencies of the other. For both the high-pass and the low-pass filters, I set kernel_size = 33 and sigma = 2. The resulting images indeed show the hybrid image behavior!
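A sketch of the composite noise estimate, using a Gaussian blur as the low-pass filter and its residual as the high-pass filter (eps_1 and eps_2 are the noise estimates for the two prompts):

```python
from torchvision.transforms.functional import gaussian_blur

# Hybrid image sketch: low frequencies from one prompt's noise estimate,
# high frequencies (the blur residual) from the other's.
eps_low = gaussian_blur(eps_1, kernel_size=33, sigma=2.0)
eps_high = eps_2 - gaussian_blur(eps_2, kernel_size=33, sigma=2.0)
eps = eps_low + eps_high
```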
Low-pass: "a lithograph of a skull"
High-pass: "a lithograph of waterfalls"
Low-pass: "a photo of the amalfi cost"
High-pass: "a photo of a dog"
Low-pass: "an oil painting of an old man"
High-pass: "an oil painting of people around a campfire"
Part B: Diffusion Models from Scratch
In part B, I trained diffusion models on MNIST from scratch and added various forms of conditioning to the UNet.
Training a Single-Step Denoising UNet
I started by implementing a simple one-step denoiser, trained to minimize the following L2 loss: L = E‖D_θ(z) − x‖², where D_θ(z) is the output of the denoiser and x is the clean image.
Implementing the UNet
I implemented the denoiser as a UNet built from downsampling and upsampling blocks with skip connections, as shown in the architecture diagram below, together with the definitions of the operations used in it.
Using the UNet to Train a Denoiser
In the forward noising process, I passed in training pairs (z, x) for each training batch, where the noisy image z = x + σε is created by adding Gaussian noise ε ~ N(0, I) to the clean training image x. σ is a hyperparameter indicating the amount of noise added.
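In code, the noising operation is a single line per batch (a sketch):

```python
import torch

# Noising sketch for the single-step denoiser: z = x + sigma * eps, eps ~ N(0, I).
sigma = 0.5
z = x + sigma * torch.randn_like(x)
```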
For the actual training process, I used the following parameters:
sigma = 0.5
batch_size = 256
num_epochs = 5
Adam optimizer with learning_rate = 1e-4
UNet with hidden dimension num_hidden = 128
Some sample results after the 1st and 5th epoch are displayed below:
Results After 1 Epoch of Training
Results After 5 Epochs of Training
Out-of-Distribution Testing
While this denoiser was trained on MNIST digits noised with σ = 0.5, I also tested it on test-set digits noised with varying levels of noise.
Results with Varying Noise Levels
The results show that this denoiser does not generalize well and performs poorly at higher σ.
Training a Time-Conditioned Diffusion Model
To train a full diffusion model similar to DeepFloyd, I changed the UNet to predict the added noise ε instead of the clean image. The new loss function is the following: L = E‖ε_θ(z) − ε‖², where ε_θ(z) is the noise estimate of the diffusion UNet.
As in part A, iteratively denoising the image generally yields better results than the one-step denoising used in the last part. Therefore, we need to inject the scalar t into the UNet to make it time-conditioned.
Since we only need to generate MNIST digits of size 28×28, I set T = 300 instead of 1000, and defined the α and β schedules in ddpm_schedule following a procedure similar to part A, sketched below.
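A sketch of ddpm_schedule, assuming a linear β schedule; the endpoints 1e-4 and 0.02 are assumptions, not values confirmed by this writeup:

```python
import torch

def ddpm_schedule(beta1: float = 1e-4, beta2: float = 0.02, T: int = 300):
    """Return the beta, alpha, and cumulative alpha_bar schedules for T steps."""
    betas = torch.linspace(beta1, beta2, T)   # linearly spaced noise variances
    alphas = 1.0 - betas
    alpha_bars = torch.cumprod(alphas, dim=0)
    return betas, alphas, alpha_bars
```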
The new UNet architecture is shown below; I inject the time-conditioning signal with a new FCBlock operator.
Training the UNet
I trained the UNet using the following training algorithm and parameters; a sketch of the training loop follows the list.
batch_size = 128
num_epochs = 20
Adam optimizer with initial learning_rate = 1e-3
Exponential learning rate decay scheduler scheduler = torch.optim.lr_scheduler.ExponentialLR(optimizer, 0.1 ** (1.0 / num_epochs))
UNet with hidden dimension num_hidden = 64
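A sketch of the training loop under the parameters above; passing the normalized timestep t / T to the UNet is an assumption about the conditioning convention, and the variable names are illustrative:

```python
import torch
import torch.nn.functional as F

optimizer = torch.optim.Adam(unet.parameters(), lr=1e-3)
scheduler = torch.optim.lr_scheduler.ExponentialLR(optimizer, 0.1 ** (1.0 / num_epochs))

for epoch in range(num_epochs):
    for x, _ in train_loader:                          # MNIST images; labels unused here
        t = torch.randint(0, T, (x.shape[0],))         # one random timestep per image
        eps = torch.randn_like(x)
        a_bar = alpha_bars[t].view(-1, 1, 1, 1)
        x_t = a_bar.sqrt() * x + (1 - a_bar).sqrt() * eps

        loss = F.mse_loss(unet(x_t, t / T), eps)       # predict the added noise
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    scheduler.step()
```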
Sampling from the UNet
Sampling with the following algorithm after epoch 5 and epoch 20, we can see the progress of training.
Results After 5 Epochs of Training
Results After 20 Epochs of Training
Notice that the results have clear and distinguishable digit shapes, but we have no way of controlling which digit the model outputs.
Adding Class-Conditioning to UNet
To add more control over which digit the model generates, I also conditioned the UNet on the digit class 0-9 by adding 2 more FCBlocks to the UNet. The new FCBlocks take as input a one-hot vector encoding the digit class. To retain the ability to generate without the class condition, I set the dropout rate to 0.1, so that 10% of the time the class-conditioning vector is set to all zeros, effectively removing the conditioning.
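A sketch of the conditioning-vector construction and dropout (labels is the batch of digit labels):

```python
import torch
import torch.nn.functional as F

# One-hot encode the digit class, then zero the conditioning vector for ~10% of
# samples so the model also learns the unconditional distribution.
c = F.one_hot(labels, num_classes=10).float()
drop = (torch.rand(c.shape[0], 1) < 0.1).float()
c = c * (1 - drop)
```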
Following the below training algorithm with the conditioning vector c
, I trained the new class-conditioned UNet with the same parameters
as above.
Sampling from the Class-Conditioned UNet
For the sampling process, besides using the conditioning vector c to control the digit generated, we also need CFG, since part A showed that conditioning needs classifier-free guidance to be effective. I used CFG with γ = 5.0.
Results After 5 Epochs of Training
Results After 20 Epochs of Training
As we can see, the results this time, especially after 20 epochs of training, are much better in both quality and control.
This project is easily the most interesting one out of all the projects. The fact that CFG simply "just works", even though the reason it improves conditioned results so much is still under vigorous debate, is particularly fascinating to me!