Assignment #3 - Cats Generator Playground
Download: [attachment]
Late Policy
- You have 5 free late days.
- You can use late days for assignments. A late day extends the deadline by 24 hours.
- Once you have used all 5 late days, the penalty is 10% for each additional late day.
(Figure: example results generated by the models in this assignment.)
Introduction
In this assignment, you will get hands-on experience coding and training GANs and diffusion models. The assignment includes three parts: in the first part, we will implement a specific type of GAN designed to process images, called a Deep Convolutional GAN (DCGAN). We will train the DCGAN to generate grumpy cats from samples of random noise. In the second part, we will implement and train a Denoising Diffusion Probabilistic Model (DDPM), which generates images by iteratively denoising pure noise. In the third part, we will implement a more complex GAN architecture called CycleGAN for the task of image-to-image translation (described in more detail in Part 3). We will train the CycleGAN to convert between two types of cats (Grumpy and Russian Blue), and between apples and oranges. Throughout, you will gain experience implementing these models by writing code for their architectures and training loops. Code and data can be found here.
Part 1: Deep Convolutional GAN
For the first part of this assignment, we will implement a slightly modified version of the Deep Convolutional GAN (DCGAN). A DCGAN is simply a GAN that uses a convolutional neural network as the discriminator, and a network composed of transposed convolutions as the generator. In this assignment, instead of using transposed convolutions, we will use a combination of an upsampling layer and a convolution layer to replace each transposed convolution. To implement the DCGAN, we need to specify three things: 1) the generator, 2) the discriminator, and 3) the training procedure. We will develop each of these three components in the following subsections.
Implement Data Augmentation
DCGAN will perform poorly without data augmentation on a small dataset because the discriminator can easily overfit to the real data. To mitigate this, we add data augmentation such as random crops and random horizontal flips.
You need to fill in the vanilla version of the data augmentation in data_loader.py. We provide some commented-out transforms for you to begin with; compose them into a transform object that is passed to CustomDataset. A sketch of one possible composition follows the snippet below.
elif opts.data_preprocess == 'vanilla':
# add additional data augmentation here
# load_size = int(1.1 * opts.image_size)
# osize = [load_size, load_size]
# transforms.Resize(osize, Image.BICUBIC)
# transforms.RandomCrop(opts.image_size)
# transforms.RandomHorizontalFlip()
pass
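For reference, here is a minimal sketch (under the assumption that transforms is torchvision.transforms, as the hints suggest) of how these could be composed into a single object and handed to CustomDataset; the helper name build_vanilla_transform is illustrative and not part of the starter code:

import torchvision.transforms as transforms

def build_vanilla_transform(image_size):
    # Resize slightly larger than the target size, then randomly crop and flip,
    # mirroring the commented hints above (which use Image.BICUBIC; the enum below
    # is the newer torchvision spelling of the same interpolation).
    load_size = int(1.1 * image_size)
    return transforms.Compose([
        transforms.Resize([load_size, load_size], transforms.InterpolationMode.BICUBIC),
        transforms.RandomCrop(image_size),
        transforms.RandomHorizontalFlip(),
    ])

# e.g., transform = build_vanilla_transform(opts.image_size), then pass it to CustomDataset.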
Implement the Discriminator of the DCGAN
The discriminator in this DCGAN is a convolutional neural network with the following architecture:
- Padding: In each of the convolutional layers shown above, we downsample the spatial dimension of the input volume by a factor of 2. Given that we use kernel size K = 4 and stride S = 2, what should the padding be? Write your answer on your website, and show your work (e.g., the formula you used to derive the padding).
- Implementation: Implement this architecture by filling in the __init__ and forward methods of the DCDiscriminator class in models.py, shown below. The conv_dim argument does not need to be changed unless you are using larger images, as it should specify the initial image size.

def __init__(self, conv_dim=64):
    super(DCDiscriminator, self).__init__()

    ###########################################
    ##  FILL THIS IN: CREATE ARCHITECTURE    ##
    ###########################################

    # self.conv1 = conv(...)
    # self.conv2 = conv(...)
    # self.conv3 = conv(...)
    # self.conv4 = conv(...)
    # self.conv5 = conv(...)

def forward(self, x):
    """Outputs the discriminator score given an image.

    Input
    -----
        x: BS x 3 x 64 x 64

    Output
    ------
        out: BS x 1 x 1 x 1
    """

    ###########################################
    ##  FILL THIS IN: FORWARD PASS           ##
    ###########################################

    pass
Note: The function conv in models.py has an optional argument norm: if norm is none, then conv simply returns a torch.nn.Conv2d layer; if norm is instance/batch, then conv returns a network block that consists of a Conv2d layer followed by a torch.nn.InstanceNorm2d/BatchNorm2d layer. Use the conv function in your implementation.
Generator
Now, we will implement the generator of the DCGAN, which consists of a sequence of upsample+convolutional layers that progressively upsample the input noise sample to generate a fake image. The generator in this DCGAN has the following architecture:
- Implementation: Implement this architecture by filling in the __init__ and forward methods of the DCGenerator class in models.py. Note: Use the up_conv function (analogous to the conv function used for the discriminator above) in your generator implementation. We find that for the first layer (up_conv1) it is better to directly apply a convolution layer, without any upsampling, to get a 4x4 output. To do so, you'll need to think about what your kernel and padding sizes should be in this case. Feel free to use up_conv for the rest of the layers.
Training Loop
Next, you will implement the training loop for the DCGAN. A DCGAN is simply a GAN with a specific type of generator and discriminator; thus, we train it in exactly the same way as a standard GAN. The pseudo-code for the training procedure is shown below. The actual implementation is simpler than it may seem from the pseudo-code: this will give you practice in translating math to code.
- Implementation: Open up the file vanilla_gan.py and fill in the indicated parts of the training_loop function, starting at the line where it says:

# FILL THIS IN
# 1. Compute the discriminator loss on real images
# D_real_loss = ...
There are 5 numbered bullets in the code to fill in for the discriminator and 3 bullets for the generator. Each of these can be done in a single line of code, although you will not lose marks for using multiple lines.
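As a reference for translating the losses into code, here is a hedged sketch using the least-squares GAN formulation, one common choice for DCGAN-style training; the pseudo-code in the handout defines the exact losses you should implement, so treat this only as an illustration (D, G, real_images, and noise are assumed to already exist in the training loop):

import torch

def lsgan_losses(D, G, real_images, noise):
    """Sketch of least-squares GAN losses (one common choice, not necessarily the handout's)."""
    fake_images = G(noise)

    # Discriminator losses: push D(real) toward 1 and D(fake) toward 0.
    D_real_loss = torch.mean((D(real_images) - 1) ** 2)
    D_fake_loss = torch.mean(D(fake_images.detach()) ** 2)

    # Generator loss: push D(G(z)) toward 1.
    G_loss = torch.mean((D(fake_images) - 1) ** 2)
    return D_real_loss + D_fake_loss, G_loss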
Differentiable Augmentation
To further improve the data efficiency of GANs, one can apply the differentiable augmentations discussed in this paper. Similar to the previous data augmentation scheme, the idea is to reduce overfitting in the discriminator by applying augmentation, but this time we apply it to both the real and fake images at training time. The differentiable augmentation code is provided in the file diff_augment.py, and you will apply it to your DCGAN training using the flag --use_diffaug. In the write-up, please show results with and without differentiable augmentation, and discuss the differences between the two augmentation schemes we discussed, in terms of implementation and effects.
Experiment with DCGANs [30 points]
- Train the DCGAN with the command:

python vanilla_gan.py

The script saves the output of the generator for a fixed noise sample every 200 iterations throughout training; this allows you to see how the generator improves over time. Include the following in your website:
- Screenshots of the discriminator and generator training loss with --data_preprocess=vanilla. Also show results trained both with and without differentiable augmentation (--use_diffaug), so you will show 4 curves in total ([Generator, Discriminator] x [vanilla, vanilla & use_diffaug]). Briefly explain what the curves should look like if the GAN manages to train.
- With --data_preprocess=vanilla and differentiable augmentation enabled, show one of the samples from early in training (e.g., iteration 200) and one of the samples from later in training, and give the iteration number for those samples. Briefly comment on the quality of the samples, and in what way they improve through training.
Part 2: Diffusion Model
In this part of the assignment, you will implement and train a Denoising Diffusion Probabilistic Model (DDPM), a type of generative model that iteratively denoises images from pure noise. Unlike GANs, diffusion models do not rely on adversarial training but instead learn a noise-based generative process. This part will give you hands-on experience with key components such as UNet, noise schedules, and sampling.
Overview
Your task will be to complete the provided diffusion model code, train the model, and generate samples. Specifically, you will:
- Implement the UNet model for diffusion in diffusion_model.py.
- Fill in noise scheduling and diffusion utilities in diffusion_utils.py.
- Complete the training loop in train_ddpm.py.
- Test the results using the sampling function in test_ddpm.py.
Understanding the Diffusion Model
The model consists of two key processes:
- Forward Diffusion Process: Gaussian noise progressively corrupts a clean image.
- Reverse Process: A trained neural network (UNet) learns to denoise the corrupted image step by step.
Your implementation will follow these steps:
- Define a noise schedule (variance across timesteps).
- Train a UNet model to predict noise added at each timestep.
- Sample images by reversing the noise process.
DDPM Training Objective
A Denoising Diffusion Probabilistic Model (DDPM) proceeds in discrete timesteps $t = 1, 2, \ldots, T$. The forward diffusion process gradually adds Gaussian noise to a clean image $x_0$, producing $x_1, x_2, \ldots, x_T$. One can directly sample:
\[x_t = \sqrt{\overline{\alpha}_t}\,x_0 + \sqrt{1 - \overline{\alpha}_t}\,\epsilon,\]
where $\epsilon \sim \mathcal{N}(0, I)$ and
\[\overline{\alpha}_t = \prod_{s=1}^t (1 - \beta_s).\]
Here, $\beta_s$ is the noise variance at step $s$, and $\overline{\alpha}_t$ is the cumulative product of $(1 - \beta_s)$ up to time $t$.
The reverse (denoising) process trains a UNet $\epsilon_\theta$ to predict the noise $\epsilon$ added at each timestep. A common simplified training objective is given by the mean-squared error (MSE) between the true noise $\epsilon$ and the predicted noise $\epsilon_\theta$. Formally, we minimize:
\[\mathcal{L}_{\text{simple}}(\theta) = \mathbb{E}_{t \sim \{1,\dots,T\},\, x_0 \sim q(x_0),\, \epsilon \sim \mathcal{N}(0, I)} \bigl[\|\epsilon - \epsilon_\theta(x_t, t)\|^2 \bigr],\]
where
\[x_t = \sqrt{\overline{\alpha}_t}\,x_0 + \sqrt{1 - \overline{\alpha}_t}\,\epsilon.\]
At each training step:
- Sample a clean image $x_0$.
- Choose a random timestep $t$.
- Add noise $\epsilon$ to get $x_t$.
- Predict the noise $\epsilon_\theta(x_t, t)$ using the UNet.
- Compute the loss $\|\epsilon - \epsilon_\theta(x_t, t)\|^2$.
By iterating this procedure, the model learns how to remove noise step-by-step and can generate images starting from pure noise by reversing the diffusion process.
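To make the closed-form sampling above concrete, here is a minimal sketch of a q_sample helper and the corresponding training-step loss; tensor names such as sqrt_alphas_cumprod are illustrative assumptions, not necessarily the variables used in the starter code:

import torch
import torch.nn.functional as F

def q_sample(x_0, t, sqrt_alphas_cumprod, sqrt_one_minus_alphas_cumprod, noise=None):
    # x_t = sqrt(alpha_bar_t) * x_0 + sqrt(1 - alpha_bar_t) * eps
    if noise is None:
        noise = torch.randn_like(x_0)
    a = sqrt_alphas_cumprod[t].view(-1, 1, 1, 1)            # per-example coefficient sqrt(alpha_bar_t)
    b = sqrt_one_minus_alphas_cumprod[t].view(-1, 1, 1, 1)  # per-example coefficient sqrt(1 - alpha_bar_t)
    return a * x_0 + b * noise

# One training step of the simplified objective (sketch):
# noise = torch.randn_like(x_0)
# x_t = q_sample(x_0, t, sqrt_alphas_cumprod, sqrt_one_minus_alphas_cumprod, noise)
# loss = F.mse_loss(model(x_t, t), noise)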
Implementing the UNet (diffusion_model.py)
The UNet architecture is essential for the reverse diffusion process. You will implement missing parts of the UNet model:
- We have provided the layers in the __init__ method of Unet.
- Implement the forward pass to take in noisy images and predict noise.
Where to modify:
class Unet(nn.Module):
    def forward(self, x, time, x_self_cond=None):
        # TODO: Implement the forward pass for the downsampling blocks
        for block1, block2, attn, downsample in self.downs:
            x = ...

        # TODO: Implement the forward pass for the middle block
        x = ...

        # TODO: Implement the forward pass for the upsampling blocks
        for block1, block2, attn, upsample in self.ups:
            x = ...
Noise Scheduling and Diffusion Process (diffusion_model.py)
In this section, you will implement the beta schedule and noise scheduling functions used in the diffusion process. The beta schedule determines how much noise is added at each timestep, and the noise schedule helps define the forward and reverse diffusion processes.
Where to modify:
# TODO: Implement beta schedule function
# Hint: you can make use of torch.linspace
def beta_schedule(timesteps):
    beta_start = 0.0001
    beta_end = 0.02
    return ...
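Beyond the schedule itself, the forward process relies on a few tensors derived from the betas (the alphas, their cumulative product $\overline{\alpha}_t$, and its square-root terms). A hedged sketch of how these might be precomputed for a linear schedule is shown below; the function and variable names are illustrative and need not match the starter code:

import torch

def linear_beta_schedule(timesteps, beta_start=0.0001, beta_end=0.02):
    # Linearly spaced noise variances beta_1 ... beta_T (a common DDPM default).
    return torch.linspace(beta_start, beta_end, timesteps)

betas = linear_beta_schedule(1000)
alphas = 1.0 - betas
alphas_cumprod = torch.cumprod(alphas, dim=0)                  # alpha_bar_t
sqrt_alphas_cumprod = torch.sqrt(alphas_cumprod)               # sqrt(alpha_bar_t)
sqrt_one_minus_alphas_cumprod = torch.sqrt(1.0 - alphas_cumprod)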
Training Loop (diffusion_model.py & train_ddpm.py)
Now, you will implement the training loop:
- Compute the loss function (difference between predicted noise and true noise).
- Optimize the UNet to denoise images progressively.
Where to modify:
# TODO: Implement loss function
def p_losses(denoise_model, x_start, t, noise=None):
    # TODO: get the predicted noise from the denoise model
    predicted_noise = ...

    # TODO: calculate the loss using the predicted noise and the noise
    loss = ...
You will then use this loss in the training process in train_ddpm.py:
Where to modify:
#######################################
### TRAIN THE UNET ####
#######################################
# 1. Sample t uniformly for every example in the batch
t = torch.randint(low=0, high=opts.denoising_steps, size=(real_images.shape[0],), device=device).long()
# 2. Get the loss between the true noise and the predicted noise
# TODO: calculate the loss using p_losses
loss = ...
Training the Diffusion Model
Once you have implemented the forward and reverse processes, train the model using train_ddpm.py
.
Steps:
- Run the following command to start training:

python train_ddpm.py --epochs 2000 --batch_size 4 --data_preprocess vanilla

You only need to train the model with --data_preprocess vanilla, which is set as the default.
- The script will save model checkpoints in checkpoints_ddpm.
- Monitor the loss curve; you can visualize it in TensorBoard.
Diffusion Model Experiments [30 points]
After training, generate images using the trained diffusion model with test_ddpm.py.
Steps:
- Run the following command to generate images:

python test_ddpm.py

- The generated images will be saved in diffusion_outputs/.
- Include example results in your final submission.
Comparisons with GAN model [10 points]
For your report, compare the performance of DDPM vs. DCGAN:
Key Questions:
- How does the quality of diffusion-generated images compare to GAN-generated images?
- What are the trade-offs in training efficiency and sample diversity?
- Discuss the strengths and weaknesses of each model in your report.
Part 3: CycleGAN
Now we are going to implement the CycleGAN architecture.
Data Augmentation
Remember to set the --data_preprocess flag to vanilla; feel free to also add differentiable augmentation or additional data augmentation of your own.
Generator
The generator in the CycleGAN has layers that implement three stages of computation: 1) the first stage encodes the input via a series of convolutional layers that extract the image features; 2) the second stage then transforms the features by passing them through one or more residual blocks; and 3) the third stage decodes the transformed features using a series of transposed convolutional layers, to build an output image of the same size as the input.
The residual block used in the transformation stage consists of a convolutional layer, where the input is added to the output of the convolution. This is done so that the characteristics of the output image (e.g., the shapes of objects) do not differ too much from the input.
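The ResnetBlock class provided in models.py implements this idea. Purely as an illustration of the residual connection described above (not the provided class itself), a minimal residual block might look like the following sketch:

import torch.nn as nn

class TinyResidualBlock(nn.Module):
    """Illustrative residual block: the input is added to the output of a convolution."""
    def __init__(self, channels):
        super().__init__()
        # 3x3 convolution that preserves the spatial size and channel count.
        self.conv = nn.Conv2d(channels, channels, kernel_size=3, padding=1)

    def forward(self, x):
        # Skip connection keeps the output close to the input.
        return x + self.conv(x)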
Implement the following generator architecture by completing the __init__ method of the CycleGenerator class in models.py.
def __init__(self, conv_dim=64, init_zero_weights=False):
    super(CycleGenerator, self).__init__()

    ###########################################
    # 1. Define the encoder part of the generator
    # self.conv1 = ...
    # self.conv2 = ...

    # 2. Define the transformation part of the generator
    # self.resnet_block = ...

    # 3. Define the decoder part of the generator
    # self.up_conv1 = ...
    # self.up_conv2 = ...
To do this, you will need to use the conv and up_conv functions, as well as the ResnetBlock class, all provided in models.py.
Note: There are two generators in the CycleGAN model, \(G_{X\to Y}\) and \(G_{Y\to X}\), but their implementations are identical. Thus, in the code, \(G_{X\to Y}\) and \(G_{Y\to X}\) are simply different instantiations of the same class.
PatchDiscriminator
CycleGAN adopts a patch-based discriminator. Instead of classifying an entire image as real or fake, it classifies patches of the image, allowing CycleGAN to better model local structure. To achieve this effect, you will want the discriminator to produce spatial outputs (e.g., 4x4) instead of a scalar (1x1). We ask you to implement this discriminator architecture by completing the PatchDiscriminator class in models.py. It turns out this can be done by slightly modifying the DCDiscriminator class. (Hint: You can implement a PatchDiscriminator essentially by removing a layer from the DCDiscriminator; see the output-size sketch below.)
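To see why removing a layer produces patch-level outputs, it helps to track the spatial size through the stack of stride-2 convolutions. A small sketch of the arithmetic, assuming a 64x64 input as in Part 1:

# Each stride-2 conv in the DCDiscriminator (K=4, S=2, with the padding you
# derived in Part 1) halves the spatial size, so for a 64x64 input:
size = 64
for _ in range(4):
    size //= 2        # 64 -> 32 -> 16 -> 8 -> 4
print(size)           # 4: dropping the final layer that maps 4x4 -> 1x1 leaves a
                      # 4x4 grid of patch scores instead of a single scalar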
CycleGAN Training Loop
Finally, we will implement the CycleGAN training procedure, which is more involved than the procedure in Part 1.
Similarly to Part 1, this training loop is not as difficult to implement as it may seem. There is a lot of symmetry in the training procedure, because all operations are done for both the X → Y and Y → X directions. Complete the training_loop function in cycle_gan.py, starting from the following section:
# ============================================
# TRAIN THE DISCRIMINATORS
# ============================================
#########################################
## FILL THIS IN ##
#########################################
# 1. Compute the discriminator losses on real images
# D_X_loss = ...
# D_Y_loss = ...
There are 5 bullet points in the code for training the discriminators, and 6 bullet points in total for training the generators. Due to the symmetry between domains, several parts of the code you fill in will be identical except for swapping X and Y; this is normal and expected.
Cycle Consistency
The most interesting idea behind CycleGANs (and the one from which they get their name) is the idea of introducing a cycle consistency loss to constrain the model. The idea is that when we translate an image from domain \(X\) to domain \(Y\), and then translate the generated image back to domain \(X\), the result should look like the original image that we started with. The cycle consistency component of the loss is the mean squared error between the input images and their reconstructions obtained by passing through both generators in sequence (i.e., from domain \(X\) to \(Y\) via the \(X \to Y\) generator, and then from domain \(Y\) back to \(X\) via the \(Y \to X\) generator). The cycle consistency loss for the \(Y \to X \to Y\) cycle is expressed as follows:
\[\frac{1}{m}\sum_{i=1}^m ||y^{(i)} - G_{X\to Y}(G_{Y\to X}(y^{(i)}))||_p\]
The loss for the \(X \to Y \to X\) cycle is analogous. Here the traditional choice of \(p\) is 1, but you can try 2 as well if you vary your \(\lambda_{\text{cycle}}\).
Implement the cycle consistency loss by filling in the following section in cycle_gan.py
. Note that there are two such sections, and their implementations are identical except for swapping \(X\) and \(Y\). You must implement both of them.
if opts.use_cycle_consistency_loss:
    # 3. Compute the cycle consistency loss (the reconstruction loss)
    # cycle_consistency_loss = ...
    g_loss += cycle_consistency_loss
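As an illustration of how the formula above maps to code, here is a hedged sketch of the \(Y \to X \to Y\) term with \(p = 1\); the generator and variable names are illustrative and should be matched to those used in cycle_gan.py:

import torch

def cycle_consistency_loss_y(G_YtoX, G_XtoY, images_Y, lambda_cycle=10.0):
    # Y -> X -> Y reconstruction with an L1 penalty (p = 1 in the formula above).
    fake_X = G_YtoX(images_Y)
    rec_Y = G_XtoY(fake_X)
    return lambda_cycle * torch.mean(torch.abs(images_Y - rec_Y))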
CycleGAN Experiments [30 points]
Training the CycleGAN from scratch can be time-consuming if you do not have a GPU. In this part, you will train your models from scratch for just 1000 iterations, to check the results.
- Train the CycleGAN without the cycle-consistency loss from scratch using the command:

python cycle_gan.py --disc patch --train_iters 1000

This runs for 1000 iterations and saves generated samples in the output/cyclegan folder. In each sample, images from the source domain are shown with their translations to the right. Include in your website the samples from both generators at either iteration 800 or 1000, e.g., sample-001000-X-Y.png and sample-001000-Y-X.png.
- Train the CycleGAN with the cycle-consistency loss from scratch using the command:

python cycle_gan.py --disc patch --use_cycle_consistency_loss --train_iters 1000

Similarly, this runs for 1000 iterations and saves generated samples in the output/cyclegan folder. Include in your website the samples from both generators at either iteration 800 or 1000, as above.
- If the previous results look reasonable, it is time to train for longer. Please show results after training for 10000 iterations, and include the sampled output from your model.
- Also train on the apple2orange dataset. You can do this by setting the flags --X apple2orange/apple and --Y apple2orange/orange.
- Do you notice a difference between the results with and without the cycle consistency loss? Write down your observations (positive or negative) in your website. Can you explain these results, i.e., why there is or isn't a difference between the two?
Perform the comparison in the previous bullet (with vs. without the cycle consistency loss) on both the grumpifyCat and the apple2orange datasets.
What you need to submit
- Four code files: models.py, vanilla_gan.py, data_loader.py, and cycle_gan.py, plus the 4 diffusion model files diffusion_model.py, diffusion_utils.py, train_ddpm.py, and test_ddpm.py.
- A website, submitted like the previous two assignments following the instructions here, containing samples generated by your DCGAN, CycleGAN, and diffusion models, and your answers to the written questions as specified in the previous sections.
Bells & Whistles (Extra Points)
Max of 12 points from the bells and whistles.
- Implement and train a diffusion model on our datasets using your own UNet architecture, or even try out some architectures you developed for the questions above! (do they work well?) (4 pts)
- Generate samples using a pre-trained diffusion model. (5 pts)
- Come up with a way of using differentiable augmentation for diffusion model training. (4 pts)
- Get your GAN and/or CycleGAN to work on a dataset of your choice. You can curate your own dataset, or collect other datasets from repos like (1, 2, 3). (2pts)
- Apply spectral normalization to your GANs for stability. (2 pts)
- Do something cool with your model: generate a GIF or create a meme using your model (you can add the text manually), or find directions in the latent space that change the image in a meaningful way. (up to 4 pts)
- Train your GAN to generate higher-resolution images; higher-resolution data are available at this link. (up to 2 pts)
- Find an improvement to the loss for DCGAN or CycleGAN and implement it. (4 pts)
- Use a different type of generative model (like a VAE or autoregressive model) for the same task. (up to 8 pts)
- Train the CycleGAN with the DCDiscriminator for comparison:

python cycle_gan.py --disc dc --use_cycle_consistency_loss

Compare the results obtained using the DCDiscriminator and the PatchDiscriminator, and report your observations. Can you explain the results? (4 pts)
- Your own ideas that you have cleared with the TAs.
Further Resources
- Generative Adversarial Nets (Goodfellow et al., 2014)
- Generative Models Blog Post from OpenAI
- Denoising Diffusion Probabilistic Models (Ho et al., 2020)
- Unpaired image-to-image translation using cycle-consistent adversarial networks (Zhu et al., 2017)
- Official PyTorch Implementations of Pix2Pix and CycleGAN
Acknowledgement: This assignment is adapted from Roger Grosse's Toronto CSC 321 Assignment 4.