For Part 0, I explored the DeepFloyd IF text-to-image model using three custom prompts. Stage 1 generates low-resolution (64×64) images, and Stage 2 upsamples them to 256×256 for higher quality.
My three prompts:
a king and a queen on top of the world
a rat cooking ratatouille in a detailed kitchen
a mouse holding a tiny umbrella
I used a fixed random seed for all images so the results are reproducible.
Random seed: SEED_HERE
a king and a queen on top of the world
a rat cooking ratatouille in a detailed kitchen
a mouse holding a tiny umbrella
a king and a queen on top of the world
a rat cooking ratatouille in a detailed kitchen
a mouse holding a tiny umbrella

Stage 1 roughly captures the global structure and colors but is extremely blurry. Stage 2 preserves that structure while adding sharp edges, textures, and small details. Across different prompts, the model consistently understands the main objects and scene layout, but sometimes makes stylistic choices (e.g., cartoon vs. realistic) that aren't explicitly specified in the text.
In this part I implemented the forward diffusion process
noisy_im = forward(im, t), which gradually corrupts a clean image
by adding Gaussian noise whose strength is controlled by the schedule ᾱ_t.
The Campanile image below shows what the same picture looks like at different
noise levels.
As t increases the signal slowly disappears and the image approaches pure noise.
This is the process that the denoising model will later have to invert.
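The corruption step itself is one line of math. Below is a minimal NumPy sketch, assuming the standard forward equation x_t = √ᾱ_t · x_0 + √(1 − ᾱ_t) · ε; the linear-β schedule values are illustrative, not DeepFloyd's exact ones, and I pass the schedule explicitly rather than keeping it global:

```python
import numpy as np

# Illustrative linear-beta schedule (not the model's exact values).
betas = np.linspace(1e-4, 0.02, 1000)
alpha_bar = np.cumprod(1.0 - betas)   # cumulative product: the a-bar_t schedule

def forward(im, t, alpha_bar):
    """Forward diffusion: x_t = sqrt(abar_t) * x_0 + sqrt(1 - abar_t) * eps."""
    eps = np.random.randn(*im.shape)  # fresh Gaussian noise
    ab = alpha_bar[t]
    return np.sqrt(ab) * im + np.sqrt(1.0 - ab) * eps, eps
```

Because ᾱ_t shrinks toward 0 as t grows, the signal term fades and the noise term dominates, which is exactly the progression the Campanile series shows.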
Before using any learned model, I tried a purely classical baseline: Gaussian blur. I blurred the noisy Campanile images at different timesteps and compared them side-by-side with the input noise.
Blur can smooth out some grainy noise, but it also destroys edges and structure. It never actually recovers the original image, which is why we need a learned denoiser.
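The baseline can be sketched as a separable Gaussian low-pass filter; this is a generic NumPy version (function name mine), not the exact filtering call I used:

```python
import numpy as np

def gaussian_blur(im, sigma=2.0):
    """Classical baseline: separable Gaussian low-pass filter.
    Smooths grain but also destroys edges; it cannot recover lost detail."""
    radius = int(3 * sigma)
    x = np.arange(-radius, radius + 1)
    k = np.exp(-x**2 / (2 * sigma**2))
    k /= k.sum()                      # normalize the 1-D kernel
    # Convolve rows, then columns (the 2-D Gaussian is separable).
    rows = np.apply_along_axis(lambda r: np.convolve(r, k, mode="same"), 1, im)
    return np.apply_along_axis(lambda c: np.convolve(c, k, mode="same"), 0, rows)
```

Averaging suppresses the high-frequency noise, but it suppresses the image's own high frequencies just as indiscriminately.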
Next I used the Stage-1 DeepFloyd IF UNet (stage_1.unet) as a
learned denoiser. For each timestep, the UNet predicts the noise
ε̂; I then reconstruct an estimate of the clean image
x̂₀ in a single reverse step.
Even a single reverse step with the UNet is already much sharper than the Gaussian blur, but fine details are still missing, and the result is not perfectly faithful to the original. This motivates the use of the full multi-step reverse process.
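The single reverse step just solves the forward equation for x_0 given the predicted noise. A minimal sketch, assuming the same ᾱ_t schedule notation as above (function name mine):

```python
import numpy as np

def one_step_denoise(x_t, eps_hat, t, alpha_bar):
    """Invert the forward equation for the clean image:
    x0_hat = (x_t - sqrt(1 - abar_t) * eps_hat) / sqrt(abar_t)."""
    ab = alpha_bar[t]
    return (x_t - np.sqrt(1.0 - ab) * eps_hat) / np.sqrt(ab)
```

With the true ε this recovers x_0 exactly; with the UNet's imperfect estimate ε̂ at large t, the division by the small √ᾱ_t amplifies the prediction error, which is why one-step results stay soft.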
In this part, I implemented an iterative DDPM-style reverse process using a strided schedule of timesteps (starting at 990 with stride 30 down to 0). Starting from a noisy Campanile (i_start = 10), the denoising loop gradually removes noise.
Below is a comparison between iterative denoising (multi-step), one-step denoising, and Gaussian blur. The iterative method produces the cleanest and most coherent reconstruction.
Once the iterative reverse process worked on real images, I turned it into a
generator. I started from pure Gaussian noise
x_T ~ N(0, I) and ran the same denoising loop all the way to t = 0
(with the prompt "a high quality photo").
This produces completely new images drawn from the model’s learned distribution.
Because this Stage-1 model operates at low resolution and is later upsampled by Stage-2, these samples look like abstract, low-frequency silhouettes and landscapes, but they are generated purely from noise.
The basic samples above are often blurry or off-prompt. To steer generation more strongly toward a text prompt, I implemented classifier-free guidance (CFG). The idea is to run the UNet twice: once conditioned on the text prompt (giving ε_c) and once unconditioned (giving ε_u, using the empty prompt). The two predicted noises are combined as

ε = ε_u + γ (ε_c − ε_u)

where γ = 7 controls the guidance strength.
I plugged this guided noise estimate into the same iterative sampler as before.
With CFG, the images become much sharper and more aligned with the text
"a high quality photo".
Compared to the unguided samples in Part 1.5, classifier-free guidance produces images with stronger structure, contrast, and recognizable silhouettes, showing how much control a simple guidance trick can add to diffusion sampling.
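The guidance step itself is a one-line extrapolation; a minimal sketch of the combination formula above (function name mine):

```python
import numpy as np

def cfg_noise(eps_cond, eps_uncond, gamma=7.0):
    """Classifier-free guidance: extrapolate past the conditional prediction.
    gamma = 0 gives the unconditional estimate, gamma = 1 the plain
    conditional estimate, and gamma > 1 pushes further toward the prompt."""
    return eps_uncond + gamma * (eps_cond - eps_uncond)
```

Geometrically, γ > 1 moves the noise estimate along the direction (ε_c − ε_u) that the text conditioning contributes, which is why the guided samples commit so much more strongly to the prompt.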
Two illusions that change appearance when flipped upside down.
Hybrid images generated using low-pass and high-pass diffusion guidance.