
Diffusion Models

Part A: Exploring the Power of Diffusion!

A.0: Trying it Out

I used seed=28 throughout this project.

The following are two sets of images produced by DeepFloyd's image diffusion model! The descriptors on which the images were conditioned as well as the number of inference steps I had the model execute are both given.

DeepFloyd Generated Images

poster
7 Inference Steps at Each Stage
poster
50 Inference Steps at Each Stage

Notably, increasing the number of inference steps improved the quality, specificity, and signal-to-noise ratio of the generated images.

A.1: Sampling Loops

1.1) Forward Process Implementation (Noising)

The first thing I did was experiment with noising. The following are my base image (only 64x64 for runtime purposes) and noised versions of it at varying noise levels.

Formula used for noising

Base Image

Noised Base Image

The various noise levels are indices into "alpha_cumprod", the cumulative product of the alphas used to transition an image from clean to noisy.
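To make the noising formula concrete, here is a minimal NumPy sketch of the forward process, x_t = sqrt(abar_t) * x_0 + sqrt(1 - abar_t) * eps. The linear beta schedule in make_alphas_cumprod is a hypothetical stand-in (the project itself reads the schedule from the DeepFloyd scheduler), and the function names are illustrative.

```python
import numpy as np

def make_alphas_cumprod(num_steps=1000, beta_start=1e-4, beta_end=0.02):
    """Hypothetical linear beta schedule; the real DeepFloyd schedule may differ."""
    betas = np.linspace(beta_start, beta_end, num_steps)
    return np.cumprod(1.0 - betas)

def forward(x0, t, alphas_cumprod, rng):
    """Noise a clean image x0 to noise level t and return (x_t, eps)."""
    abar_t = alphas_cumprod[t]
    eps = rng.standard_normal(x0.shape)
    x_t = np.sqrt(abar_t) * x0 + np.sqrt(1.0 - abar_t) * eps
    return x_t, eps

rng = np.random.default_rng(28)
alphas_cumprod = make_alphas_cumprod()
x0 = rng.uniform(-1.0, 1.0, size=(3, 64, 64))  # stand-in for the 64x64 base image
noised = {t: forward(x0, t, alphas_cumprod, rng) for t in (250, 500, 750)}
```

Because alpha_cumprod shrinks as t grows, a larger index keeps less of the image and adds more noise.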

1.2) Classical/Gaussian Denoising

The naive approach to denoising is Gaussian blur filtering. The results of denoising from each of the above noise levels via this method are shown below.

Noise Level 250

Noise Level 500

Noise Level 750

1.3) One-Step Denoising

In this section, the denoising process is performed using a UNet model. The steps include:

  1. Estimating noise in the noisy image using stage_1.unet.
  2. Removing the estimated noise to recover an approximation of the original image.
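The second step above amounts to inverting the noising formula. The sketch below assumes the forward process x_t = sqrt(abar_t) * x_0 + sqrt(1 - abar_t) * eps; estimate_x0 is an illustrative name, and the true noise stands in for the UNet's estimate so the example runs on its own.

```python
import numpy as np

def estimate_x0(x_t, eps_hat, abar_t):
    """Solve x_t = sqrt(abar_t) * x0 + sqrt(1 - abar_t) * eps for x0."""
    return (x_t - np.sqrt(1.0 - abar_t) * eps_hat) / np.sqrt(abar_t)

rng = np.random.default_rng(0)
x0 = rng.uniform(-1.0, 1.0, size=(3, 64, 64))
eps = rng.standard_normal(x0.shape)
abar_t = 0.3  # illustrative value, not taken from the real schedule
x_t = np.sqrt(abar_t) * x0 + np.sqrt(1.0 - abar_t) * eps

# With a perfect noise estimate the clean image is recovered exactly;
# in practice eps_hat comes from the UNet and is only approximate.
x0_hat = estimate_x0(x_t, eps, abar_t)
```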

Noise Level 250

Below are the results for noise level 250:

Original Image
Original image
Noisy Image
Noisy image for noise level 250
Denoised Image (UNet Estimate)
Denoised image for noise level 250

Noise Level 500

Below are the results for noise level 500:

Original Image
Original image
Noisy Image
Noisy image for noise level 500
Denoised Image (UNet Estimate)
Denoised image for noise level 500

Noise Level 750

Below are the results for noise level 750:

Original Image
Original image
Noisy Image
Noisy image for noise level 750
Denoised Image (UNet Estimate)
Denoised image for noise level 750

1.4) Iterative Denoising

In this section, iterative denoising is applied. The images were first noised to a high level (t = 750) and then iteratively denoised at decreasing noise levels. This process demonstrates the improvement at various stages of denoising, showing how the model progressively reconstructs the original image.
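A single step of this loop can be sketched as follows. This assumes the standard DDPM posterior-mean update between two noise levels, which may not match my iterative_denoise function line for line; denoise_step and its arguments are illustrative names.

```python
import numpy as np

def denoise_step(x_t, x0_hat, abar_t, abar_prev, noise=None):
    """Move from noise level t to a less noisy level t' (abar_prev > abar_t).

    x0_hat is the current clean-image estimate (from the one-step formula).
    noise is optional added variance; it is left out when None.
    """
    alpha = abar_t / abar_prev  # effective alpha for this (possibly strided) step
    beta = 1.0 - alpha
    coef_x0 = np.sqrt(abar_prev) * beta / (1.0 - abar_t)
    coef_xt = np.sqrt(alpha) * (1.0 - abar_prev) / (1.0 - abar_t)
    x_prev = coef_x0 * x0_hat + coef_xt * x_t
    if noise is not None:
        x_prev = x_prev + noise
    return x_prev
```

The step is an interpolation between the current noisy image and the clean estimate, which is why each iteration only needs to remove part of the noise.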

Iterative Denoising Steps

Noisy Campanile at t=690
Noisy Campanile at t=540
Noisy Campanile at t=390
Noisy Campanile at t=240
Noisy Campanile at t=90

Comparison of Final Results

Original Image
Iteratively Denoised Campanile
One-Step Denoised Campanile
Gaussian Blurred Campanile

Comparing the results of iterative denoising, Gaussian denoising, and one-step denoising reveals key differences:

Iterative denoising demonstrates the model's ability to "hallucinate" and reconstruct fine details by leveraging multiple steps, making it the most effective approach in this comparison.

1.5) Diffusion Model Sampling

In this section, the diffusion model is used to generate images from scratch. With i_start = 0 and pure random noise as the input, the model iteratively denoises that noise into an image. Below are five sampled results:

Generated Images

Sampled Image 1
Sampled Image 2
Sampled Image 3
Sampled Image 4
Sampled Image 5

1.6) Classifier-Free Guidance (CFG)

In the previous section, the generated images were often of poor quality, and some were nonsensical. To address this, we employ Classifier-Free Guidance (CFG), a technique that can significantly improve image quality at the cost of some image diversity.

CFG works by combining a conditional and an unconditional noise estimate. The new noise estimate is computed as:

CFG Equation

The parameter γ determines the strength of CFG. For γ = 0, the result is an unconditional noise estimate. For γ = 1, the result is a conditional noise estimate. When γ > 1, the model amplifies the differences between these two estimates, often leading to much higher quality images.
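The behavior described for γ = 0 and γ = 1 corresponds to the combination eps = eps_uncond + γ * (eps_cond - eps_uncond). A minimal sketch, with random arrays standing in for the two UNet outputs (cfg_noise is an illustrative name):

```python
import numpy as np

def cfg_noise(eps_uncond, eps_cond, gamma):
    """Classifier-free guidance: extrapolate from the unconditional
    estimate toward (and, for gamma > 1, past) the conditional one."""
    return eps_uncond + gamma * (eps_cond - eps_uncond)

rng = np.random.default_rng(0)
eps_u = rng.standard_normal((3, 64, 64))  # stand-in for the unconditional estimate
eps_c = rng.standard_normal((3, 64, 64))  # stand-in for the conditional estimate
eps_guided = cfg_noise(eps_u, eps_c, gamma=7.0)
```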

To implement this, I modified the iterative_denoise function to include CFG, using a CFG scale of γ = 7. I used the prompt "a high quality photo" to guide the diffusion model and generated the following results:

Generated Images with Classifier-Free Guidance

Sampled Image 1
Sampled Image 1 with CFG
Sampled Image 2
Sampled Image 2 with CFG
Sampled Image 3
Sampled Image 3 with CFG
Sampled Image 4
Sampled Image 4 with CFG
Sampled Image 5
Sampled Image 5 with CFG

1.7) Image-to-image Translation

In this section, I used the diffusion model with Classifier-Free Guidance (CFG) to perform image-to-image translation. Starting from a test image, noise was added at different levels, and the iterative denoising process gradually restored the image while conditioning on the prompt "a high quality photo". Both the noised and denoised versions of each image are displayed together in a single combined image for better comparison.

Test Image 1: Gradual Denoising

Original Image
Test Image 1 Original
Noised and Denoised at t=960
Test Image 1 Noised and Denoised at t=960
Noised and Denoised at t=900
Test Image 1 Noised and Denoised at t=900
Noised and Denoised at t=840
Test Image 1 Noised and Denoised at t=840
Noised and Denoised at t=780
Test Image 1 Noised and Denoised at t=780
Noised and Denoised at t=690
Test Image 1 Noised and Denoised at t=690
Noised and Denoised at t=390
Test Image 1 Noised and Denoised at t=390

Test Image 2: Gradual Denoising

Original Image
Test Image 2 Original
Noised and Denoised at t=960
Test Image 2 Noised and Denoised at t=960
Noised and Denoised at t=900
Test Image 2 Noised and Denoised at t=900
Noised and Denoised at t=840
Test Image 2 Noised and Denoised at t=840
Noised and Denoised at t=780
Test Image 2 Noised and Denoised at t=780
Noised and Denoised at t=690
Test Image 2 Noised and Denoised at t=690
Noised and Denoised at t=390
Test Image 2 Noised and Denoised at t=390

Test Image 3: Gradual Denoising

Original Image
Test Image 3 Original
Noised and Denoised at t=960
Test Image 3 Noised and Denoised at t=960
Noised and Denoised at t=900
Test Image 3 Noised and Denoised at t=900
Noised and Denoised at t=840
Test Image 3 Noised and Denoised at t=840
Noised and Denoised at t=780
Test Image 3 Noised and Denoised at t=780
Noised and Denoised at t=690
Test Image 3 Noised and Denoised at t=690
Noised and Denoised at t=390
Test Image 3 Noised and Denoised at t=390

1.7.1) Editing Hand-Drawn and Web Images

In this section, I experimented with applying edits to non-realistic images, such as hand-drawn sketches and web images, using the iterative denoising process with Classifier-Free Guidance (CFG). Starting from the noisy versions, the images were progressively denoised at different levels of noise (i_start = 1, 3, 5, 7, 10, 20). Note that a higher i_start value corresponds to a lower noise level or a later time step in the denoising process, leading to outputs that are closer to the original image.
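The relationship between i_start and the noise level can be sketched as an index into a strided list of timesteps. The schedule below (990 down to 0 in steps of 30) is an assumption inferred from the t values reported elsewhere on this page, not something read from the code.

```python
# Assumed strided schedule: 990, 960, ..., 30, 0
strided_timesteps = list(range(990, -1, -30))

# i_start picks the timestep the image is noised to before denoising begins.
i_starts = [1, 3, 5, 7, 10, 20]
start_t = {i: strided_timesteps[i] for i in i_starts}
```

Under this assumption, i_start = 1 starts from t = 960 (almost pure noise) and i_start = 20 starts from t = 390 (much of the original preserved).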

Web Image: Gradual Denoising

Original Image
Web Image Original
Noise Level i_start = 1
Web Image i_start=1
Noise Level i_start = 3
Web Image i_start=3
Noise Level i_start = 5
Web Image i_start=5
Noise Level i_start = 7
Web Image i_start=7
Noise Level i_start = 10
Web Image i_start=10
Noise Level i_start = 20
Web Image i_start=20

Hand-Drawn Image 1: Gradual Denoising

Original Image
Hand-Drawn Image 1 Original
Noise Level i_start = 1
Hand-Drawn Image 1 i_start=1
Noise Level i_start = 3
Hand-Drawn Image 1 i_start=3
Noise Level i_start = 5
Hand-Drawn Image 1 i_start=5
Noise Level i_start = 7
Hand-Drawn Image 1 i_start=7
Noise Level i_start = 10
Hand-Drawn Image 1 i_start=10
Noise Level i_start = 20
Hand-Drawn Image 1 i_start=20

Hand-Drawn Image 2: Gradual Denoising

Original Image
Hand-Drawn Image 2 Original
Noise Level i_start = 1
Hand-Drawn Image 2 i_start=1
Noise Level i_start = 3
Hand-Drawn Image 2 i_start=3
Noise Level i_start = 5
Hand-Drawn Image 2 i_start=5
Noise Level i_start = 7
Hand-Drawn Image 2 i_start=7
Noise Level i_start = 10
Hand-Drawn Image 2 i_start=10
Noise Level i_start = 20
Hand-Drawn Image 2 i_start=20

Two key observations:

  1. Noising the image too much prior to diffusion results in a completely new image.
  2. Not noising the image enough yields (close to) the original image.

When the image is noised just right, the model retains enough structure to follow the original composition, but not enough to reproduce the original exactly. It therefore fills in the gaps using what it learned from its training data, producing a "realistic" version of the original image!

1.7.2) Inpainting

In this section, I applied inpainting to modify specific parts of an image while preserving other regions. Using a binary mask, areas with a value of 0 remain unchanged, while areas with a value of 1 are edited through the iterative diffusion process. This allows new content to be generated in selected regions, following the methodology described in the RePaint paper.
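A rough sketch of the masking step applied after every denoising iteration, under the assumption that the kept region is replaced by a freshly noised copy of the original at the current noise level (as in RePaint). The name inpaint_project and its arguments are illustrative.

```python
import numpy as np

def inpaint_project(x_t, x_orig, mask, abar_t, rng):
    """Keep generated content where mask == 1; elsewhere force the image
    back to the original, noised to the current level abar_t."""
    eps = rng.standard_normal(x_orig.shape)
    x_orig_t = np.sqrt(abar_t) * x_orig + np.sqrt(1.0 - abar_t) * eps
    return mask * x_t + (1.0 - mask) * x_orig_t
```

Because the unmasked region is reset at every step, only the masked region is ever free to change.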

Setup

Original Image

Inpainting Process

The inpainting process progressively restores the masked area using decreasing noise levels. Below are the intermediate steps and the final result:

Noise Level t=990
Inpainting at t=990
Noise Level t=840
Inpainting at t=840
Noise Level t=690
Inpainting at t=690
Noise Level t=540
Inpainting at t=540
Noise Level t=390
Inpainting at t=390
Noise Level t=240
Inpainting at t=240
Noise Level t=90
Inpainting at t=90
Final Result
Final Inpainting Result

The inpainting process shows how the masked area is gradually modified through successive iterations, leading to a realistic reconstruction that blends seamlessly with the unaltered parts of the image.

1.7.3) Text-Conditional Image-to-image Translation

In this section, I used the same process as SDEdit but added a text prompt to guide the projection. Instead of only projecting to the natural image manifold, the model incorporates a textual description to further influence the output. This process modifies the image to align both with the original content and the prompt. Below are the results for three test images, progressively denoised at different noise levels (t = 960, 900, 840, 780, 690, 390).

Note: Each image combines both the noised and denoised versions for easier comparison.

Test Image 1: Campanile with the Prompt "a rocket ship"

This test starts with an image of the Campanile and applies the text prompt "a rocket ship". The model transforms the image to incorporate elements of a rocket ship while gradually removing noise.

Original Image
Test Image 1 Original
Noise Level t = 960
Test Image 1 t=960
Noise Level t = 900
Test Image 1 t=900
Noise Level t = 840
Test Image 1 t=840
Noise Level t = 780
Test Image 1 t=780
Noise Level t = 690
Test Image 1 t=690
Noise Level t = 390
Test Image 1 t=390

Test Image 2: Minecraft Village with the Prompt "an oil painting of a snowy mountain village"

This test starts with an image of a Minecraft village and applies the text prompt "an oil painting of a snowy mountain village". The model transforms the blocky landscape into a more painterly and snowy aesthetic while gradually removing noise.

Original Image
Test Image 2 Original
Noise Level t = 960
Test Image 2 t=960
Noise Level t = 900
Test Image 2 t=900
Noise Level t = 840
Test Image 2 t=840
Noise Level t = 780
Test Image 2 t=780
Noise Level t = 690
Test Image 2 t=690
Noise Level t = 390
Test Image 2 t=390

Test Image 3: Woman Wearing a Hat with the Prompt "a man wearing a hat"

This test starts with an image of a woman wearing a hat and applies the text prompt "a man wearing a hat". The model adjusts the facial features and appearance to align with the prompt while preserving the overall hat element.

Original Image
Test Image 3 Original
Noise Level t = 960
Test Image 3 t=960
Noise Level t = 900
Test Image 3 t=900
Noise Level t = 840
Test Image 3 t=840
Noise Level t = 780
Test Image 3 t=780
Noise Level t = 690
Test Image 3 t=690
Noise Level t = 390
Test Image 3 t=390

Interestingly, the final results have a hybrid quality, combining elements of the original image with the influence of the text prompt. The last image (t = 390) is the best blend for the Campanile/rocket ship and Minecraft village/snowy village pairs, while the second-to-last image (t = 690) is the best blend for the woman/man pair. This behavior closely resembles what I aim to achieve explicitly in Section 1.9, where the model generates hybrid images directly from scratch using specific textual guidance.

1.8) Visual Anagrams

In this section, I implemented visual anagrams using the diffusion model. By averaging the noise estimates for two different prompts and orientations (right-side-up and upside-down), the model generates images that look different depending on how they are viewed.
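The averaging can be sketched as below. Here noise_fn is a hypothetical callable standing in for the UNet (it takes an image and a prompt and returns a noise estimate), and images are assumed to be laid out as (channels, height, width) so that flipping axis -2 turns them upside-down.

```python
import numpy as np

def anagram_noise(x_t, noise_fn, prompt_up, prompt_down):
    """Average the upright estimate with the estimate made on the flipped
    image, flipped back so both live in the upright frame."""
    eps_up = noise_fn(x_t, prompt_up)
    flipped = np.flip(x_t, axis=-2)
    eps_down = np.flip(noise_fn(flipped, prompt_down), axis=-2)
    return 0.5 * (eps_up + eps_down)
```

Flipping the second estimate back is what lets one image satisfy both prompts: each prompt only ever sees the orientation it is meant for.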

Below are the results for three visual anagrams, each generated with two prompts. The same image is displayed twice: once upright and once flipped upside-down to highlight the illusion.

Image 1

Prompts: "an oil painting of people around a campfire" (right-side-up) and "an oil painting of an old man" (upside-down).

Right-Side-Up: "an oil painting of people around a campfire"
Image 1 Upright
Upside-Down: "an oil painting of an old man"
Image 1 Upside Down

Image 2

Prompts: "a photo of a hipster barista" (right-side-up) and "a rocket ship" (upside-down).

Right-Side-Up: "a photo of a hipster barista"
Image 2 Upright
Upside-Down: "a rocket ship"
Image 2 Upside Down

Image 3

Prompts: "a man wearing a hat" (right-side-up) and "a photo of a man" (upside-down).

Right-Side-Up: "a man wearing a hat"
Image 3 Upright
Upside-Down: "a photo of a man"
Image 3 Upside Down

1.9) Hybrid Images

In this section, I implemented hybrid image generation using the Factorized Diffusion technique. By combining low-frequency components from one noise estimate with high-frequency components from another, the model creates images that appear different depending on the viewing distance.

The total noise used in generating the hybrid images is calculated using the following equation:

Equation for Total Noise

This equation shows how the low-frequency and high-frequency noise components are combined to form the total noise, allowing the model to blend two prompts effectively.
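In code, the combination is roughly eps = lowpass(eps_1) + highpass(eps_2), with highpass(x) = x - lowpass(x). The sketch below uses a separable Gaussian blur as the low-pass filter; the kernel size and sigma are placeholder values rather than the settings I used, and all function names are illustrative.

```python
import numpy as np

def gaussian_kernel(size=9, sigma=2.0):
    """1D Gaussian kernel normalized to sum to 1 (placeholder parameters)."""
    ax = np.arange(size) - (size - 1) / 2.0
    k = np.exp(-0.5 * (ax / sigma) ** 2)
    return k / k.sum()

def low_pass(img, size=9, sigma=2.0):
    """Separable Gaussian blur over the last two axes (zero-padded borders)."""
    k = gaussian_kernel(size, sigma)

    def blur_1d(v):
        return np.convolve(v, k, mode="same")

    out = np.apply_along_axis(blur_1d, -1, img)
    return np.apply_along_axis(blur_1d, -2, out)

def hybrid_noise(eps_low_prompt, eps_high_prompt):
    """Low frequencies from the first estimate, high frequencies from the second."""
    high = eps_high_prompt - low_pass(eps_high_prompt)
    return low_pass(eps_low_prompt) + high
```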

Below are the results for three hybrid images, each generated with two prompts.

Image 1: Skull and Waterfalls

Prompts: "a lithograph of a skull" (low frequencies) and "a lithograph of waterfalls" (high frequencies).

Hybrid Image 1

Image 2: Old Man and People Around a Campfire

Prompts: "an oil painting of an old man" (low frequencies) and "an oil painting of people around a campfire" (high frequencies).

Hybrid Image 2

Image 3: Rocket Ship and Pencil

Prompts: "a rocket ship" (low frequencies) and "a pencil" (high frequencies).

Hybrid Image 3

Similar to Project 2, the low-pass filtered content is visible from a distance, while the high-pass filtered content is more obvious up close. The diffusion model integrates the two seamlessly!

Part B: Exploring the Power of Diffusion!

This part of the project was extremely interesting and ultimately rewarding! I got to write and train my own diffusion model on MNIST digit images.

B.1: Training a Single-Step Denoiser UNET

B.1.1: Visualizing the Noising Process

To begin, we visualized the noising process by applying Gaussian noise to the MNIST dataset at varying levels of σ. This process demonstrates how images are progressively corrupted as the noise level increases. The visualization below shows the results for σ values ranging from 0.0 (no noise) to 1.0 (maximum noise):
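The noising used for this visualization can be sketched as z = x + σ * eps with eps drawn from a standard normal. That additive form is an assumption here, and a blank 28x28 array stands in for an MNIST digit.

```python
import numpy as np

def add_noise(x, sigma, rng):
    """Add zero-mean Gaussian noise with standard deviation sigma."""
    return x + sigma * rng.standard_normal(x.shape)

rng = np.random.default_rng(0)
x = np.zeros((28, 28))  # stand-in for one MNIST image
sigmas = [0.0, 0.2, 0.4, 0.5, 0.6, 0.8, 1.0]
noisy = [add_noise(x, s, rng) for s in sigmas]
```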

Visualization of the Noising Process (σ = [0.0, 0.2, 0.4, 0.5, 0.6, 0.8, 1.0])
Visualization of Noising Process

B.1.2: Training the Single-Step Denoiser

The UNet model was trained to denoise images at a fixed noise level of σ = 0.5. This approach, referred to as "single-step diffusion," does not involve iterative refinement. Instead, the model directly learns to map a noised image to its corresponding clean version in a single step.

During training, the Mean Squared Error (MSE) loss was minimized between the model’s prediction and the original (clean) image. This method is computationally efficient, as the model only needs to handle a single noise level, but it limits its generalizability to unseen noise distributions.
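The objective can be written down in a few lines. In this sketch, denoiser is a hypothetical callable standing in for the UNet, and the noise is assumed to be additive Gaussian with standard deviation σ; no actual training happens here.

```python
import numpy as np

def denoising_mse(denoiser, x_clean, sigma, rng):
    """MSE between the denoiser's output on a noised batch and the clean batch."""
    z = x_clean + sigma * rng.standard_normal(x_clean.shape)
    return float(np.mean((denoiser(z) - x_clean) ** 2))

rng = np.random.default_rng(0)
x_clean = np.zeros((1000, 28, 28))  # stand-in batch of clean images

# Baseline: a "denoiser" that returns its input unchanged.
identity_loss = denoising_mse(lambda z: z, x_clean, 0.5, rng)
```

A denoiser that does nothing scores about σ² on this loss, so a trained model at σ = 0.5 should land well below 0.25.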

Training Loss Curve Over Iterations
Training Loss Curve

The graph above shows the training loss decreasing steadily over iterations. This indicates that the model is successfully learning to denoise images at σ = 0.5, reducing the error between its predictions and the original images.

Sample Results After 1 Epoch

After the first epoch, the model begins to produce results that are recognizable but still noisy. This early stage demonstrates the initial progress made by the model as it starts to understand the mapping from noisy to clean images.

Sample Results After 5 Epochs

By the fifth epoch, the model produces significantly improved outputs. The reconstructed images closely resemble the original clean images, indicating that the single-step denoiser has effectively learned to handle noise at σ = 0.5.

B.1.3: Testing on Out-of-Distribution Noise Levels

Finally, we evaluated the model on images noised at different levels of σ, ranging from 0.0 to 1.0. This is referred to as out-of-distribution testing because the model was only trained at σ = 0.5. The results highlight the limitations of the single-step approach, as the model struggles to generalize to noise levels outside its training distribution.

Sample Results on Out-of-Distribution Noise Levels (σ = [0.0, 0.2, 0.4, 0.5, 0.6, 0.8, 1.0])
Sample Results with Out-of-Distribution Noise

While the model performs well at σ = 0.5, its ability to denoise images deteriorates at higher noise levels. This highlights a key limitation of single-step denoising and motivates the exploration of multi-step approaches in future work.

In summary, this section demonstrated the training and evaluation of a single-step denoiser. While effective for its intended noise level, this method is constrained in its generalizability, prompting further exploration of iterative denoising techniques.

B.2: Training a Diffusion Model

This was the most fascinating portion of the project for me! I had the opportunity to code diffusion from scratch, transforming noisy images into MNIST digits.

Part 1:

In this section, we extend the single-step denoising UNet by introducing time-conditioning. The model is now trained to denoise images at various noise levels by using a time-conditioning parameter. This modification enables the UNet to generalize across multiple noise levels rather than being restricted to a single fixed noise level like before.

However, the MNIST digit labels themselves are not used in this model. As a result, while the denoised images are clean digits, they lack specificity or clarity since the model is not conditioned on any class-based information. This lack of conditioning leads to results that are somewhat blurry, similar to the issues observed in Part A of this project, where insufficient conditioning limited the sharpness of the generated outputs.

Training Process

During training, the UNet was exposed to images corrupted at various noise levels and was tasked with reconstructing the original images. Below are the key results from this process:
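One way to picture how a training example is built for the time-conditioned model: pick a random timestep, noise the clean image to that level, and hand the model the noisy image together with a normalized time value. The schedule length (300), the beta range, and the normalization t / T below are all assumptions for illustration. The sketch returns the added noise as well, since whether the regression target is the noise or the clean image depends on the implementation.

```python
import numpy as np

def make_training_example(x0, alphas_cumprod, rng):
    """Return (x_t, t_norm, eps) for a uniformly sampled timestep."""
    num_steps = len(alphas_cumprod)
    t = int(rng.integers(0, num_steps))
    eps = rng.standard_normal(x0.shape)
    abar_t = alphas_cumprod[t]
    x_t = np.sqrt(abar_t) * x0 + np.sqrt(1.0 - abar_t) * eps
    t_norm = t / num_steps  # scalar in [0, 1) fed to the time-conditioning branch
    return x_t, t_norm, eps

rng = np.random.default_rng(0)
betas = np.linspace(1e-4, 0.02, 300)  # hypothetical schedule
alphas_cumprod = np.cumprod(1.0 - betas)
x0 = np.zeros((28, 28))  # stand-in for one MNIST image
x_t, t_norm, eps = make_training_example(x0, alphas_cumprod, rng)
```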

Training Loss Curve Over Epochs
Training Loss Curve for Time-Conditioned UNet
Sample Results After 5 Epochs
Sample Results After 20 Epochs

As seen from the results, the training loss decreases steadily, indicating the model is learning to reconstruct clean images from noisy inputs. However, the lack of digit label information in this implementation means the outputs are clean but not sharply defined. This reinforces the importance of conditioning the model on additional context to improve its generative capabilities.

In the next part, we will introduce class-conditioning, which is expected to significantly improve the results by incorporating label-based guidance to refine the outputs further.

Part 2:

To address the limitations observed in the time-conditioned model, we introduced class conditioning in this section. By utilizing the labels of the MNIST dataset, the model is now trained to not only denoise the images but also to produce specific digit classes. Class conditioning allows the model to utilize label information effectively, leading to sharper and more coherent outputs compared to the unconditioned model.

Class conditioning was incorporated into the UNet architecture by adding two additional fully connected blocks (FCBlocks) to process the class-conditioning vector c. This vector is encoded as a one-hot representation of the digit labels (0-9). To ensure the model can still handle unconditioned scenarios, we applied dropout, where the class-conditioning vector is set to zero 10% of the time (p_uncond = 0.1).
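The one-hot encoding with conditioning dropout can be sketched as follows. The function name class_condition is illustrative, and the FCBlocks that consume the vector are not shown.

```python
import numpy as np

def class_condition(labels, rng, num_classes=10, p_uncond=0.1):
    """One-hot encode digit labels, zeroing each row with probability p_uncond."""
    one_hot = np.eye(num_classes)[labels]
    keep = rng.random(len(labels)) >= p_uncond
    return one_hot * keep[:, None]

rng = np.random.default_rng(0)
labels = rng.integers(0, 10, size=10000)
c = class_condition(labels, rng)
```

Rows that were zeroed out train the unconditional path, which is what makes classifier-free guidance possible at sampling time.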

The final model is conditioned on both the time step t and the class c, enabling it to reconstruct images at specific noise levels while also generating digits corresponding to the provided class label. This enabled us to generate, from an arbitrary noise level (and in particular, iteratively from pure noise), a handwritten digit of an optionally specified class.

Training Loss Curve Over Epochs
Training Loss Curve for Class-Conditioned UNet
Sample Results After 5 Epochs
Sample Results After 20 Epochs

The training loss curve shows a steady decrease, indicating the model effectively learns to denoise and reconstruct specific digit classes from noisy inputs. The results after 5 and 20 epochs demonstrate a significant improvement in the sharpness and clarity of the generated digits compared to the unconditioned model. By explicitly conditioning on both time and class, the outputs become not only clean but also class-specific, addressing the limitations observed in the earlier implementations.

Conclusion

This project provided a comprehensive exploration of diffusion models, and I really enjoyed it!