For Part 0, I explored the DeepFloyd IF text-to-image model using three custom prompts. Stage 1 generates low-resolution (64×64) images, and Stage 2 upsamples them to 256×256 for higher quality.
My three prompts:
a king and a queen on top of the world
a rat cooking ratatouille in a detailed kitchen
a mouse holding a tiny umbrella
I used a fixed random seed for all images so the results are reproducible.
Random seed: SEED_HERE
a king and a queen on top of the world
a rat cooking ratatouille in a detailed kitchen
a mouse holding a tiny umbrella
a king and a queen on top of the world
a rat cooking ratatouille in a detailed kitchen
a mouse holding a tiny umbrella

Stage 1 roughly captures the global structure and colors but is extremely blurry. Stage 2 preserves that structure while adding sharp edges, textures, and small details. Across different prompts, the model consistently understands the main objects and scene layout, but sometimes makes stylistic choices (e.g., cartoon vs. realistic) that aren't explicitly specified in the text.
In this part I implemented the forward diffusion process
noisy_im = forward(im, t), which gradually corrupts a clean image
by adding Gaussian noise whose strength is controlled by the schedule ᾱ_t.
The Campanile image below shows what the same picture looks like at different
noise levels.
As t increases the signal slowly disappears and the image approaches pure noise.
This is the process that the denoising model will later have to invert.
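The corruption step itself is one line of math. Below is a minimal NumPy sketch, assuming the standard forward equation x_t = √ᾱ_t · x_0 + √(1 − ᾱ_t) · ε; the linear-β schedule values are illustrative, not DeepFloyd's exact ones, and I pass the schedule explicitly rather than keeping it global:

```python
import numpy as np

# Illustrative linear-beta schedule (not the model's exact values).
betas = np.linspace(1e-4, 0.02, 1000)
alpha_bar = np.cumprod(1.0 - betas)   # cumulative product: the a-bar_t schedule

def forward(im, t, alpha_bar):
    """Forward diffusion: x_t = sqrt(abar_t) * x_0 + sqrt(1 - abar_t) * eps."""
    eps = np.random.randn(*im.shape)  # fresh Gaussian noise
    ab = alpha_bar[t]
    return np.sqrt(ab) * im + np.sqrt(1.0 - ab) * eps, eps
```

Because ᾱ_t shrinks toward 0 as t grows, the signal term fades and the noise term dominates, which is exactly the progression the Campanile series shows.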
Before using any learned model, I tried a purely classical baseline: Gaussian blur. I blurred the noisy Campanile images at different timesteps and compared them side-by-side with the input noise.
Blur can smooth out some grainy noise, but it also destroys edges and structure. It never actually recovers the original image, which is why we need a learned denoiser.
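The baseline can be sketched as a separable Gaussian low-pass filter; this is a generic NumPy version (function name mine), not the exact filtering call I used:

```python
import numpy as np

def gaussian_blur(im, sigma=2.0):
    """Classical baseline: separable Gaussian low-pass filter.
    Smooths grain but also destroys edges; it cannot recover lost detail."""
    radius = int(3 * sigma)
    x = np.arange(-radius, radius + 1)
    k = np.exp(-x**2 / (2 * sigma**2))
    k /= k.sum()                      # normalize the 1-D kernel
    # Convolve rows, then columns (the 2-D Gaussian is separable).
    rows = np.apply_along_axis(lambda r: np.convolve(r, k, mode="same"), 1, im)
    return np.apply_along_axis(lambda c: np.convolve(c, k, mode="same"), 0, rows)
```

Averaging suppresses the high-frequency noise, but it suppresses the image's own high frequencies just as indiscriminately.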
Next I used the Stage-1 DeepFloyd IF UNet (stage_1.unet) as a
learned denoiser. For each timestep, the UNet predicts the noise
ε̂; I then reconstruct an estimate of the clean image
x̂₀ in a single reverse step.
Even a single reverse step with the UNet is already much sharper than the Gaussian blur, but fine details are still missing, and the result is not perfectly faithful to the original. This motivates the use of the full multi-step reverse process.
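The single reverse step just solves the forward equation for x_0 given the predicted noise. A minimal sketch, assuming the same ᾱ_t schedule notation as above (function name mine):

```python
import numpy as np

def one_step_denoise(x_t, eps_hat, t, alpha_bar):
    """Invert the forward equation for the clean image:
    x0_hat = (x_t - sqrt(1 - abar_t) * eps_hat) / sqrt(abar_t)."""
    ab = alpha_bar[t]
    return (x_t - np.sqrt(1.0 - ab) * eps_hat) / np.sqrt(ab)
```

With the true ε this recovers x_0 exactly; with the UNet's imperfect estimate ε̂ at large t, the division by the small √ᾱ_t amplifies the prediction error, which is why one-step results stay soft.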
In this part, I implemented an iterative DDPM-style reverse process using a strided schedule of timesteps (starting at 990 with stride 30 down to 0). Starting from a noisy Campanile (i_start = 10), the denoising loop gradually removes noise.
Below is a comparison between iterative denoising (multi-step), one-step denoising, and Gaussian blur. The iterative method produces the cleanest and most coherent reconstruction.
Once the iterative reverse process worked on real images, I turned it into a
generator. I started from pure Gaussian noise
x_T ~ N(0, I) and ran the same denoising loop all the way to t = 0
(with the prompt "a high quality photo").
This produces completely new images drawn from the model’s learned distribution.
Because this Stage-1 model operates at low resolution and is later upsampled by Stage-2, these samples look like abstract, low-frequency silhouettes and landscapes, but they are generated purely from noise.
The basic samples above are often blurry or off-prompt. To steer generation more strongly toward a text prompt, I implemented classifier-free guidance (CFG). The idea is to run the UNet twice: once conditioned on the text prompt (giving ε_c) and once unconditioned (giving ε_u, using the empty prompt). The two predicted noises are combined as

ε = ε_u + γ (ε_c − ε_u)

where γ = 7 controls the guidance strength.
I plugged this guided noise estimate into the same iterative sampler as before.
With CFG, the images become much sharper and more aligned with the text
"a high quality photo".
Compared to the unguided samples in Part 1.5, classifier-free guidance produces images with stronger structure, contrast, and recognizable silhouettes, showing how much control a simple guidance trick can add to diffusion sampling.
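The guidance step itself is a one-line extrapolation; a minimal sketch of the combination formula above (function name mine):

```python
import numpy as np

def cfg_noise(eps_cond, eps_uncond, gamma=7.0):
    """Classifier-free guidance: extrapolate past the conditional prediction.
    gamma = 0 gives the unconditional estimate, gamma = 1 the plain
    conditional estimate, and gamma > 1 pushes further toward the prompt."""
    return eps_uncond + gamma * (eps_cond - eps_uncond)
```

Geometrically, γ > 1 moves the noise estimate along the direction (ε_c − ε_u) that the text conditioning contributes, which is why the guided samples commit so much more strongly to the prompt.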
Two illusions that change appearance when flipped upside down.
Hybrid images generated using low-pass and high-pass diffusion guidance.