CS180 – Project 4 by Leia

Part 0.1: Calibrating Your Camera

I captured two sets of image pairs with projective transformations by fixing the center of projection and rotating the camera. Each pair has 40-70% overlap for robust registration.

Part 0.2: Capturing a 3D Object Scan (Lafufu)

Lafufu checkerboard calibration — Lafufu – checkerboard calibration view.

Lafufu camera poses visualization — Lafufu – recovered camera poses for the calibration sequence.

Part 0.3: Estimating Camera Pose

Fixed COP, handheld rotation (~60% overlap). Hearst Mining Building facade.

Part 0.4: Undistorting images and creating a dataset


Part 1: Fitting a Neural Field to a 2D Image

I fit an MLP to a single 2D image of a fox by treating each pixel coordinate as input and predicting its RGB value.

Fox image reconstructed from neural field — Neural field fit to the fox image. The network learns a continuous mapping from (x, y) pixel coordinates to RGB colors.
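As a rough illustration of the idea, here is a minimal sketch of a coordinate MLP in PyTorch. The sinusoidal positional encoding, the names `positional_encoding` and `Field2D`, the layer widths, and the number of frequencies are all assumptions made for this example; the write-up does not state the exact configuration used for the fox image.

```python
import math
import torch
import torch.nn as nn

def positional_encoding(x, num_freqs):
    """Concatenate x with sin/cos features at frequencies 2^k * pi, k < num_freqs."""
    feats = [x]
    for k in range(num_freqs):
        feats.append(torch.sin((2.0 ** k) * math.pi * x))
        feats.append(torch.cos((2.0 ** k) * math.pi * x))
    return torch.cat(feats, dim=-1)

class Field2D(nn.Module):
    """Hypothetical MLP mapping normalized (x, y) in [0, 1] to RGB in [0, 1]."""

    def __init__(self, num_freqs=10, hidden=256):
        super().__init__()
        self.num_freqs = num_freqs
        in_dim = 2 * (1 + 2 * num_freqs)  # raw coords plus sin/cos per frequency
        self.net = nn.Sequential(
            nn.Linear(in_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 3), nn.Sigmoid(),  # Sigmoid keeps colors in [0, 1]
        )

    def forward(self, xy):
        return self.net(positional_encoding(xy, self.num_freqs))
```

Training such a model would amount to sampling random pixel coordinates, normalizing them by the image width and height, and minimizing MSE against the true pixel colors.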

Part 2.1: Creating Rays from Cameras

For each training image I convert camera intrinsics and extrinsics into a dense grid of rays. A ray is defined by an origin o (camera center in world space) and a direction d (unprojected pixel through the camera). This lets the network operate directly in 3D, independent of image resolution.

# get_rays: K is the 3x3 intrinsics matrix, c2w is the 4x4 camera-to-world transform
import torch

def get_rays(H, W, K, c2w):
    # Pixel grid: i indexes columns (x), j indexes rows (y); both have shape (H, W)
    i, j = torch.meshgrid(
        torch.arange(W, dtype=K.dtype, device=K.device),
        torch.arange(H, dtype=K.dtype, device=K.device),
        indexing="xy",
    )
    # Unproject pixels to camera-space directions at depth z = 1
    pixels = torch.stack([(i - K[0, 2]) / K[0, 0],
                          (j - K[1, 2]) / K[1, 1],
                          torch.ones_like(i)], dim=-1)  # (H, W, 3)

    # Rotate into world space (row vectors times R^T) and normalize
    dirs = pixels @ c2w[:3, :3].T  # (H, W, 3)
    dirs = dirs / torch.norm(dirs, dim=-1, keepdim=True)

    # Same origin for all pixels: camera center in world coords
    origins = c2w[:3, 3].expand_as(dirs)
    return origins, dirs

Lego training image — Example training image from the Lego multi-view dataset.

3D rays visualization — 3D visualization of rays emitted from one camera into the scene.

Each pixel now corresponds to a ray in world space. These rays drive all later steps: sampling points along them, querying the NeRF network, and volume rendering back into pixel colors.

Part 2.2: Sampling Points Along Rays

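Along each ray I sample N=64 points between the near and far bounds (t ∈ [2.0, 6.0]). The sampling is stratified: I place N evenly spaced depths, use the midpoints between neighboring depths as bin edges, and draw one uniformly jittered sample inside each bin. Because the jitter is redrawn on every call, the network sees the whole interval over the course of training rather than a fixed set of depths, which reduces aliasing and gives smoother reconstructions.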

def sample_points(rays_o, rays_d, N_samples, near=2.0, far=6.0):
    R = rays_o.shape[0]
    t_vals = torch.linspace(near, far, N_samples, device=rays_o.device)  # (N,)
    # Stratified jitter within each bin
    mids = 0.5 * (t_vals[:-1] + t_vals[1:])
    upper = torch.cat([mids, t_vals[-1:]], dim=0)
    lower = torch.cat([t_vals[:1], mids], dim=0)
    t_rand = torch.rand((R, N_samples, 1), device=rays_o.device)
    t = (lower[None, :, None] + (upper - lower)[None, :, None] * t_rand)
    # 3D positions along the ray
    pts = rays_o[:, None, :] + rays_d[:, None, :] * t
    step_size = (far - near) / N_samples
    return pts, t, step_size

Rays and samples visualization — Visualization of up to 100 rays and their sampled 3D points.

Depth along rays — Color-coded depth values for sampled points along each ray.

Sampling points turns each ray into a small 1D volume. The NeRF network predicts color and density at these points, which we then integrate with volume rendering.

Part 2.3: Visualizing Cameras, Rays, and Samples

Here I visualize the camera frustums, a subset of rays, and the sampled points used to train the NeRF on the Lego scene.

Camera frustums and rays visualization — Camera frustums and a subset of rays in the Lego scene.

Ray samples close-up view — Close-up of sampled points along rays.

Alternative view of rays and samples — Alternative viewpoint of cameras and ray geometry.

Part 2.4: NeRF Network Architecture

Here I visualize the NeRF MLP that takes in positional encodings of 3D points (and viewing directions) and predicts density and RGB color. The architecture uses several fully connected layers with ReLU activations and skip connections.

NeRF MLP architecture diagram — NeRF architecture used in my implementation. The shared trunk predicts density and features which are combined with viewing directions to predict RGB.
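The structure described above could be sketched roughly as follows. This is an illustrative sketch, not the exact network from the diagram: the class name `TinyNeRF`, the helper `encode`, the layer counts, the hidden width, and the frequency counts are assumptions chosen for the example.

```python
import math
import torch
import torch.nn as nn

def encode(x, num_freqs):
    """Sinusoidal positional encoding: x plus sin/cos at frequencies 2^k * pi."""
    feats = [x]
    for k in range(num_freqs):
        feats.append(torch.sin((2.0 ** k) * math.pi * x))
        feats.append(torch.cos((2.0 ** k) * math.pi * x))
    return torch.cat(feats, dim=-1)

class TinyNeRF(nn.Module):
    """Hypothetical NeRF MLP: shared trunk with one skip connection,
    a density head, and a view-dependent color head."""

    def __init__(self, pos_freqs=10, dir_freqs=4, hidden=256):
        super().__init__()
        self.pos_freqs, self.dir_freqs = pos_freqs, dir_freqs
        pos_dim = 3 * (1 + 2 * pos_freqs)
        dir_dim = 3 * (1 + 2 * dir_freqs)
        self.trunk_a = nn.Sequential(
            nn.Linear(pos_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
        )
        # Skip connection: the encoded position is concatenated back in here
        self.trunk_b = nn.Sequential(
            nn.Linear(hidden + pos_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
        )
        self.sigma_head = nn.Sequential(nn.Linear(hidden, 1), nn.ReLU())  # density >= 0
        self.feature = nn.Linear(hidden, hidden)
        self.rgb_head = nn.Sequential(
            nn.Linear(hidden + dir_dim, hidden // 2), nn.ReLU(),
            nn.Linear(hidden // 2, 3), nn.Sigmoid(),  # colors in [0, 1]
        )

    def forward(self, pts, dirs):
        x = encode(pts, self.pos_freqs)
        d = encode(dirs, self.dir_freqs)
        h = self.trunk_a(x)
        h = self.trunk_b(torch.cat([h, x], dim=-1))
        sigma = self.sigma_head(h)  # depends on position only
        rgb = self.rgb_head(torch.cat([self.feature(h), d], dim=-1))
        return sigma, rgb
```

Keeping density a function of position alone, and letting only the color head see the viewing direction, is what encourages the geometry to stay consistent across views while still allowing view-dependent shading.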

Part 2.5: Volume Rendering

Given per-point densities σᵢ and colors cᵢ along each ray, I implement the discrete volume rendering equation in PyTorch. The key idea is to treat the ray as a semi-transparent volume and compute how much light is absorbed and emitted at each step.

def volrend(sigmas, rgbs, step_size):
    """
    sigmas: (B, N, 1) densities along each ray
    rgbs:   (B, N, 3) colors at those samples
    step_size: scalar distance between samples
    returns: (B, 3) rendered colors
    """
    sigma_delta = sigmas * step_size                 # (B, N, 1)
    alphas = 1.0 - torch.exp(-sigma_delta)          # αᵢ = 1 - exp(-σᵢ δ)

    cumsum_sigma_delta = torch.cumsum(sigma_delta, dim=1)
    accum_before = cumsum_sigma_delta - sigma_delta # ∑_{j<i} σⱼ δ

    T = torch.exp(-accum_before)                    # Tᵢ = exp(-∑_{j<i} σⱼ δ)
    weights = T * alphas                            # wᵢ = Tᵢ αᵢ

    return torch.sum(weights * rgbs, dim=1)         # ∑ wᵢ cᵢ

Intuitively, Tᵢ is the probability the ray has not terminated before sample i, and αᵢ is the probability it terminates at i. Their product gives a weight for each sample, and summing the weighted colors yields the final pixel color. This function is fully differentiable and passes the provided assertion test.
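One way to sanity-check this probabilistic reading is to verify that the weights along a ray telescope: since Tᵢ αᵢ = Tᵢ − Tᵢ₊₁, their sum equals 1 − exp(−∑ σⱼ δ), the probability that the ray terminates anywhere in the sampled interval. The sketch below recomputes the weights in a small helper (named `render_weights` here purely for illustration, using the same formulas as `volrend`) and checks that identity on random densities.

```python
import torch

def render_weights(sigmas, step_size):
    """Per-sample weights w_i = T_i * alpha_i for densities of shape (B, N, 1)."""
    sigma_delta = sigmas * step_size
    alphas = 1.0 - torch.exp(-sigma_delta)
    accum_before = torch.cumsum(sigma_delta, dim=1) - sigma_delta
    T = torch.exp(-accum_before)
    return T * alphas

torch.manual_seed(0)
sigmas = torch.rand(4, 64, 1, dtype=torch.float64) * 5.0  # arbitrary test densities
step_size = (6.0 - 2.0) / 64

weights = render_weights(sigmas, step_size)
total = weights.sum(dim=1)                                   # (4, 1)
expected = 1.0 - torch.exp(-(sigmas * step_size).sum(dim=1))

# The summed weights equal the total termination probability and never exceed 1
assert torch.allclose(total, expected)
assert bool((total <= 1.0).all())
```

Whatever probability is left over, 1 minus the summed weights, corresponds to rays that pass through the volume without terminating, which is why empty regions render as black unless a background color is added.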

Volume rendering diagram — Diagram of transmittance and alpha along a single ray.

Unit test for volrend — Unit test comparison: my implementation matches the reference values.

Debug render of randomly initialized network using the volume renderer.

Part 2.6: Training NeRF on My Own Captured Object

For this part, I captured my own small scene: a drink placed on a table. I ran COLMAP to recover camera intrinsics and extrinsics, converted them into the NeRF coordinate system, and generated rays exactly as in Parts 2.1–2.3. I trained the same NeRF architecture as before, but tuned a few hyperparameters for this dataset:

Batch size: 10,000 random rays per iteration
Samples per ray: 64 stratified samples
Learning rate: 5e-4 with Adam
Near / far bounds: 2.0 and 6.0 (chosen to tightly bound the object)

The model is optimized with MSE loss between rendered colors and ground-truth pixels. Below I show the loss curve, some intermediate training renders, and a GIF of a camera circling the object to visualize novel views.
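For reference, drawing a random batch of rays can be sketched as below. This assumes the rays and ground-truth colors from all training images have already been flattened into (M, 3) tensors; the function name `sample_ray_batch` and its arguments are hypothetical and not taken from my training code.

```python
import torch

def sample_ray_batch(rays_o, rays_d, pixels, batch_size, generator=None):
    """Pick batch_size random rays (with replacement) and their target colors.

    rays_o, rays_d, pixels: (M, 3) tensors flattened over all images and pixels.
    """
    num_rays = rays_o.shape[0]
    idx = torch.randint(0, num_rays, (batch_size,), generator=generator)
    return rays_o[idx], rays_d[idx], pixels[idx]

def mse_loss(rendered, target):
    """Mean squared error between rendered and ground-truth colors."""
    return torch.mean((rendered - target) ** 2)
```

Each iteration would then call `sample_ray_batch(..., batch_size=10_000)`, sample points along those rays, render them, and backpropagate the MSE through the renderer into the network.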

Training Loss

Intermediate Renders During Training

NeRF output at iteration 0 — Iteration 0 – random initialization, nearly uniform output.

NeRF output at iteration 500 — Iteration 500 – silhouette and rough colors begin to appear.

NeRF output at iteration 3000 — Iteration 3000 – sharp geometry and textures recovered.

Novel Views: Camera Circling the Object

After training, I rendered a sequence of images by moving a virtual camera in a circular path around the object. These frames are compiled into the GIF below.
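A circular path like this can be generated with a simple look-at construction. The sketch below is an assumption-laden illustration rather than my exact rendering code: it assumes the object sits at the world origin, the world up axis is +z, and the camera convention matches `get_rays` (x right, y down, z forward). The names `look_at_c2w` and `circular_path` are made up for this example.

```python
import math
import torch

def look_at_c2w(position, target, world_up):
    """Build a 4x4 camera-to-world matrix with columns (right, down, forward)."""
    forward = target - position
    forward = forward / torch.linalg.norm(forward)
    right = torch.cross(forward, world_up, dim=-1)
    right = right / torch.linalg.norm(right)
    down = torch.cross(forward, right, dim=-1)
    c2w = torch.eye(4, dtype=position.dtype)
    c2w[:3, 0] = right
    c2w[:3, 1] = down
    c2w[:3, 2] = forward
    c2w[:3, 3] = position
    return c2w

def circular_path(num_frames, radius, height):
    """Camera poses evenly spaced on a circle around the origin, all facing it."""
    target = torch.zeros(3)
    world_up = torch.tensor([0.0, 0.0, 1.0])  # assumed up axis
    poses = []
    for k in range(num_frames):
        theta = 2.0 * math.pi * k / num_frames
        position = torch.tensor(
            [radius * math.cos(theta), radius * math.sin(theta), height]
        )
        poses.append(look_at_c2w(position, target, world_up))
    return torch.stack(poses)  # (num_frames, 4, 4)
```

Each pose can be passed to `get_rays` as `c2w`, rendered with the trained network, and the resulting frames written out as a GIF. If the reconstructed scene uses a different up axis or camera convention, the `world_up` vector and column order would need to change accordingly.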

Camera circling the drink object (NeRF novel views) — GIF of a camera orbiting the reconstructed drink. The scene remains consistent from all angles, showing that the NeRF has learned a coherent 3D radiance field.

With the settings listed above, the model reaches high PSNR after several thousand iterations and produces smooth, consistent novel-view renderings.
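For context, PSNR is a monotone function of the MSE loss, so it can be tracked directly during training. The helper below is a small sketch that assumes pixel values are scaled to [0, 1]; the name `psnr_from_mse` is chosen for this example.

```python
import math

def psnr_from_mse(mse, max_val=1.0):
    """Peak signal-to-noise ratio in dB for a given mean squared error."""
    return 10.0 * math.log10(max_val ** 2 / mse)
```

Under this definition, every tenfold reduction in MSE adds 10 dB of PSNR.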