CS180 Project 5 — Part A

Diffusion Models & Image Generation

Part 0: Setup – Playing with DeepFloyd IF

For Part 0, I explored the DeepFloyd IF text-to-image model using three custom prompts. Stage 1 generates low-resolution (64×64) images, and Stage 2 upsamples them to 256×256 for higher quality.

Prompts

I used a fixed random seed for all images so the results are reproducible. Random seed: SEED_HERE

Stage 1 Outputs (64×64)

Stage 1: a king and a queen on top of the world
Prompt: a king and a queen on top of the world
Stage 1 (64×64) – rough silhouettes and colors, but composition is visible.
Stage 1: a rat cooking ratatouille in a detailed kitchen
Prompt: a rat cooking ratatouille in a detailed kitchen
Stage 1 (64×64) – the rat + pot are recognizable but very blurry.
Stage 1: a mouse holding a tiny umbrella
Prompt: a mouse holding a tiny umbrella
Stage 1 (64×64) – basic mouse + umbrella shapes with coarse colors.

Stage 2 Outputs (256×256)

Stage 2: a king and a queen on top of the world
Prompt: a king and a queen on top of the world
Stage 2 (256×256) – much sharper details and lighting; characters and background are clearly defined.
Stage 2: a rat cooking ratatouille in a detailed kitchen
Prompt: a rat cooking ratatouille in a detailed kitchen
Stage 2 (256×256) – the rat, pot, and kitchen props become crisp and cartoon-like.
Stage 2: a mouse holding a tiny umbrella
Prompt: a mouse holding a tiny umbrella
Stage 2 (256×256) – very clear mouse character with a colorful umbrella and bokeh background.

Observations

Stage 1 roughly captures the global structure and colors but is extremely blurry. Stage 2 preserves that structure while adding sharp edges, textures, and small details. Across different prompts, the model consistently understands the main objects and scene layout, but sometimes makes stylistic choices (e.g., cartoon vs. realistic) that aren't explicitly specified in the text.

Part 1 — Sampling and Denoising

1.1 Forward Process: Adding Noise

In this part I implemented the forward diffusion process noisy_im = forward(im, t), which corrupts a clean image by mixing it with Gaussian noise according to a schedule ā_t: x_t = √(ā_t)·x₀ + √(1 − ā_t)·ε, where ε ~ N(0, I). The Campanile image below shows what the same picture looks like at different noise levels.

Original Campanile
Original Campanile
Campanile noisy t=250
Noisy at t = 250
Campanile noisy t=500
Noisy at t = 500
Campanile noisy t=750
Noisy at t = 750

As t increases the signal slowly disappears and the image approaches pure noise. This is the process that the denoising model will later have to invert.
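The forward process above can be sketched in a few lines. This is a minimal NumPy illustration of the math; the linear beta schedule values here are assumptions for illustration, since the real implementation reads ā_t from DeepFloyd IF's own scheduler and works on PyTorch tensors:

```python
import numpy as np

# Assumed linear beta schedule (illustrative values; the actual project
# uses the alphas_cumprod shipped with DeepFloyd IF's scheduler).
betas = np.linspace(1e-4, 0.02, 1000)
alphas_cumprod = np.cumprod(1.0 - betas)

def forward(im, t, rng=np.random.default_rng(0)):
    """Forward process: x_t = sqrt(a_bar_t) * x_0 + sqrt(1 - a_bar_t) * eps."""
    a_bar = alphas_cumprod[t]
    eps = rng.standard_normal(im.shape)
    return np.sqrt(a_bar) * im + np.sqrt(1.0 - a_bar) * eps
```

Note that the noise is added in one closed-form step per t, so producing the t = 250, 500, and 750 images requires no iteration.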


1.2 Classical Denoising: Gaussian Blur

Before using any learned model, I tried a purely classical baseline: Gaussian blur. I blurred the noisy Campanile images at different timesteps and compared them side-by-side with the input noise.

Noisy t=250
Noisy t = 250
Blurred t=250
Gaussian blur (t = 250)
Noisy t=500
Noisy t = 500
Blurred t=500
Gaussian blur (t = 500)
Noisy t=750
Noisy t = 750
Blurred t=750
Gaussian blur (t = 750)

Blur can smooth out some grainy noise, but it also destroys edges and structure. It never actually recovers the original image, which is why we need a learned denoiser.
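The baseline can be sketched as a separable Gaussian blur; the kernel size and sigma here are assumptions (the project likely used a library blur such as torchvision's), and the point is only to show why averaging neighbors reduces noise while also erasing edges:

```python
import numpy as np

def gaussian_kernel(size=5, sigma=1.0):
    """1D Gaussian kernel, normalized to sum to 1."""
    ax = np.arange(size) - size // 2
    k = np.exp(-ax**2 / (2.0 * sigma**2))
    return k / k.sum()

def blur_denoise(noisy, size=5, sigma=1.0):
    """Classical baseline: separable Gaussian blur along rows, then columns."""
    k = gaussian_kernel(size, sigma)
    out = np.apply_along_axis(lambda r: np.convolve(r, k, mode="same"), 0, noisy)
    return np.apply_along_axis(lambda r: np.convolve(r, k, mode="same"), 1, out)
```

Because the kernel weights sum to 1 but their squares sum to less than 1, the blur shrinks the variance of white noise, which is exactly the smoothing visible above, at the cost of also attenuating high-frequency image content.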


1.3 One-Step Denoising with a Pretrained UNet

Next I used the Stage 1 DeepFloyd IF UNet (stage_1.unet) as a learned denoiser. For each timestep, the UNet predicts the noise ε̂; I then invert the forward process in a single step to reconstruct an estimate of the clean image: x̂₀ = (x_t − √(1 − ā_t)·ε̂) / √(ā_t).

Original
Original Campanile
Noisy t=250
Noisy t = 250
Denoised t=250
One-step denoise (t = 250)
Noisy t=500
Noisy t = 500
Denoised t=500
One-step denoise (t = 500)
Noisy t=750
Noisy t = 750
Denoised t=750
One-step denoise (t = 750)

Even a single reverse step with the UNet is already much sharper than the Gaussian blur, but fine details are still missing, and the result is not perfectly faithful to the original. This motivates the use of the full multi-step reverse process.
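The single reverse step is just the forward equation solved for x₀. A minimal sketch, again with an assumed beta schedule standing in for DeepFloyd's real one:

```python
import numpy as np

betas = np.linspace(1e-4, 0.02, 1000)   # assumed schedule for illustration
alphas_cumprod = np.cumprod(1.0 - betas)

def one_step_denoise(x_t, t, eps_hat):
    """Solve x_t = sqrt(a_bar)*x_0 + sqrt(1-a_bar)*eps for x_0,
    given the UNet's noise estimate eps_hat."""
    a_bar = alphas_cumprod[t]
    return (x_t - np.sqrt(1.0 - a_bar) * eps_hat) / np.sqrt(a_bar)
```

If the noise estimate were exact, this would recover x₀ perfectly; in practice ε̂ is imperfect at high t, which is why the t = 750 reconstruction above loses fine detail.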


1.4 Iterative Denoising (DDPM Reverse Process)

In this part, I implemented an iterative DDPM-style reverse process using a strided schedule of timesteps (starting at 990 with stride 30 down to 0). Starting from a noisy Campanile (i_start = 10), the denoising loop gradually removes noise.
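The loop can be sketched as follows. The beta schedule values and the predict_eps stand-in are assumptions for illustration; the real loop calls stage_1.unet for the noise estimate and uses DeepFloyd's own ā_t values:

```python
import numpy as np

betas = np.linspace(1e-4, 0.02, 1000)   # assumed schedule for illustration
alphas_cumprod = np.cumprod(1.0 - betas)

def iterative_denoise(x, strided_t, predict_eps, rng=np.random.default_rng(0)):
    """DDPM-style reverse loop over a strided schedule, e.g. [990, 960, ..., 0].

    predict_eps(x, t) is a stand-in for the UNet's noise prediction.
    """
    for t, t_next in zip(strided_t[:-1], strided_t[1:]):
        a_bar, a_bar_next = alphas_cumprod[t], alphas_cumprod[t_next]
        alpha = a_bar / a_bar_next              # effective alpha for this stride
        beta = 1.0 - alpha
        eps = predict_eps(x, t)
        x0_hat = (x - np.sqrt(1.0 - a_bar) * eps) / np.sqrt(a_bar)  # clean estimate
        # Posterior mean: weighted blend of the clean estimate and the current image
        x = (np.sqrt(a_bar_next) * beta / (1.0 - a_bar)) * x0_hat \
            + (np.sqrt(alpha) * (1.0 - a_bar_next) / (1.0 - a_bar)) * x
        if t_next > 0:
            x = x + np.sqrt(beta) * rng.standard_normal(x.shape)    # variance term
    return x
```

Each step only removes a small slice of noise, which is why the intermediate images below sharpen gradually rather than all at once.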

Progress During Iterative Denoising (shown every 5 steps)

Iterative step 5 (t=840)
Step 5 (t = 840)
Iterative step 10 (t=690)
Step 10 (t = 690)
Iterative step 15 (t=540)
Step 15 (t = 540)
Iterative step 20 (t=390)
Step 20 (t = 390)
Iterative step 25 (t=240)
Step 25 (t = 240)

Final Comparison

Below is a comparison between iterative denoising (multi-step), one-step denoising, and Gaussian blur. The iterative method produces the cleanest and most coherent reconstruction.

Gaussian blur baseline
Gaussian blur baseline
One-step denoising
One-step UNet denoising
Final iterative denoising result
Iterative denoising (final)

1.5 Sampling: Generating Images from Pure Noise

Once the iterative reverse process worked on real images, I turned it into a generator. I started from pure Gaussian noise x_T ~ N(0, I) and ran the same denoising loop all the way to t = 0 (with the prompt "a high quality photo"). This produces completely new images drawn from the model’s learned distribution.

Sample 1
Sample 1
Sample 2
Sample 2
Sample 3
Sample 3
Sample 4
Sample 4
Sample 5
Sample 5

Because this Stage 1 model operates at low resolution and is later upsampled by Stage 2, these samples look like abstract, low-frequency silhouettes and landscapes, but they are generated purely from noise.


1.6 Classifier-Free Guidance (CFG)

The basic samples above are often blurry or off-prompt. To steer the generation more strongly toward a text prompt, I implemented classifier-free guidance (CFG). The idea is to run the UNet twice per step: once conditioned on the text prompt, giving εc, and once on the empty (null) prompt, giving εu.

The two predicted noises εc, εu are combined as:

ε = εu + γ (εc - εu)

where γ = 7 controls guidance strength. I plugged this guided noise estimate into the same iterative sampler as before. With CFG, the images become much sharper and more aligned with the text "a high quality photo".
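The guided estimate is a one-line combination of the two UNet outputs; this sketch just encodes the formula above, with the two noise arrays standing in for the conditional and unconditional UNet passes:

```python
import numpy as np

def cfg_eps(eps_cond, eps_uncond, gamma=7.0):
    """Classifier-free guidance: extrapolate from the unconditional prediction
    toward the conditional one. gamma = 1 recovers eps_cond; gamma > 1
    exaggerates the direction the prompt pulls the estimate in."""
    return eps_uncond + gamma * (eps_cond - eps_uncond)
```

The guided ε then replaces the plain UNet output inside the same iterative sampler, so CFG changes nothing else about the loop.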

CFG sample 1
CFG Sample 1
CFG sample 2
CFG Sample 2
CFG sample 3
CFG Sample 3
CFG sample 4
CFG Sample 4
CFG sample 5
CFG Sample 5

Compared to the unguided samples in Part 1.5, classifier-free guidance produces images with stronger structure, contrast, and recognizable silhouettes, showing how much control a simple guidance trick can add to diffusion sampling.

Part 1.7 — Image-to-Image & Inpainting

1.7.1 Editing Hand-Drawn & Web Images

Web Image
Hand-Drawn Image 1 (denoised set)
Hand-Drawn Image 2 (denoised set)
Original drawings

1.7.2 Inpainting

1.7.3 Text-Conditional Image-to-Image

Campanile
Personalized Prompt — Rat (Campanile edit)

Part 1.8 — Visual Anagrams

Two illusions that change appearance when flipped upside down.
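The standard construction for such illusions averages two noise estimates, one per orientation; this sketch assumes a predict_eps(x, t, prompt) stand-in for the prompt-conditioned UNet and follows the usual visual-anagram formulation (denoise the upright image under one prompt and the flipped image under the other, un-flipping the second estimate before averaging):

```python
import numpy as np

def anagram_eps(x, t, predict_eps, prompt_a, prompt_b):
    """Noise estimate that denoises x toward prompt_a upright and toward
    prompt_b when the image is viewed upside down."""
    eps_a = predict_eps(x, t, prompt_a)                      # upright estimate
    eps_b = np.flipud(predict_eps(np.flipud(x), t, prompt_b))  # flipped estimate, un-flipped
    return 0.5 * (eps_a + eps_b)
```

Plugging this combined ε into the CFG sampler produces an image that satisfies both prompts, one per orientation.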

Part 1.9 — Hybrid Images

Hybrid images generated using low-pass and high-pass diffusion guidance.
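The low-/high-pass guidance can be sketched by splitting the two prompts' noise estimates by frequency; the box low-pass below is an assumption to keep the sketch short (a Gaussian filter is the more typical choice), and the two arrays stand in for the two prompt-conditioned UNet outputs:

```python
import numpy as np

def lowpass(im, k=9):
    """Crude separable box low-pass filter (stand-in for a Gaussian)."""
    kern = np.ones(k) / k
    out = np.apply_along_axis(lambda r: np.convolve(r, kern, mode="same"), 0, im)
    return np.apply_along_axis(lambda r: np.convolve(r, kern, mode="same"), 1, out)

def hybrid_eps(eps_a, eps_b, k=9):
    """Low frequencies follow prompt A, high frequencies follow prompt B."""
    return lowpass(eps_a, k) + (eps_b - lowpass(eps_b, k))
```

Sampling with this composite ε yields an image whose coarse structure matches one prompt and whose fine detail matches the other, which is what makes it read differently up close versus far away.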