Week of June 14th
What I did
- I figured out what was wrong with Colab’s GPU performance.
- I did a lab report on EPMA, which involved writing a Python script to automate a bunch of conversions from weight percent to atomic percent (a sketch of the conversion follows this list).
- Learned about batch normalization
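For reference, the weight-percent to atomic-percent conversion divides each weight percent by the element’s atomic mass to get relative moles, then renormalizes to 100. A sketch (the function name and element values here are my own illustration, not the actual script):
# Hypothetical sketch of the wt% -> at% conversion
ATOMIC_MASSES = {'Fe': 55.845, 'Ni': 58.693, 'Cr': 51.996}  # g/mol

def weight_to_atomic_percent(wt_percents):
    # divide each wt% by the atomic mass to get relative moles, then renormalize to 100
    moles = {el: w / ATOMIC_MASSES[el] for el, w in wt_percents.items()}
    total = sum(moles.values())
    return {el: 100 * n / total for el, n in moles.items()}

print(weight_to_atomic_percent({'Fe': 70.0, 'Ni': 20.0, 'Cr': 10.0}))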
What I learned
- Turns out trying to read data from a mounted Google Drive cripples performance. Additionally, when using a model with PyTorch Lightning, you don’t need to call model.to(device); Lightning takes care of that for you. Calling that command myself was screwing things up. Instead, you need to download the dataset onto the notebook’s local disk with something like this:
import torch
import torchvision

trainset = torchvision.datasets.CIFAR10(root='./data', train=True, download=True, transform=my_transform)
trainloader = torch.utils.data.DataLoader(trainset, batch_size=64, shuffle=True, num_workers=2)  # batch_size was missing; 64 is an arbitrary choice
or, better yet, download the prepared dataset directly as follows:
!mkdir -p /datasets/cifar10
!cd /datasets/cifar10 && wget https://cdn3.vision.in.tum.de/~dl4cv/cifar10.zip --no-check-certificate
!cd /datasets/cifar10 && unzip -q cifar10.zip
- It’s best to put dropout layers after a ReLU activation and before an affine (nn.Linear()) layer (see the sketch after this list).
- You can achieve high scores on CIFAR-10 without convolutions if you use bottleneck layers.
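A sketch combining both points (the layer sizes are made up and untuned): a fully connected CIFAR-10 classifier with a bottleneck, where each dropout layer sits after a ReLU and before the next affine layer:
import torch.nn as nn

# hypothetical fully connected CIFAR-10 classifier with a 128-unit bottleneck;
# dropout goes after each ReLU and before the next affine (nn.Linear) layer
model = nn.Sequential(
    nn.Flatten(),              # 3x32x32 image -> 3072 features
    nn.Linear(3072, 1024),
    nn.ReLU(),
    nn.Dropout(0.5),
    nn.Linear(1024, 128),      # bottleneck layer
    nn.ReLU(),
    nn.Dropout(0.5),
    nn.Linear(128, 10),        # 10 CIFAR-10 classes
)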
Here’s the process for building a NN for CIFAR-10:
Start with a simple model first. Pick some hyperparameters and see if you can overfit a tiny training set (say, 1 to 10 images). If you can overfit it, your code is working and your architecture makes sense. Next, increase the training set to, say, 1,000 images and run a broad hyperparameter search. Afterwards, run maybe 20 epochs and watch how the training and validation losses move. If they’re both going down, your architecture looks good. If not, start over.
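A minimal version of that overfitting sanity check might look like this (assuming the trainset and model sketched above; the optimizer, learning rate, and epoch count are arbitrary choices):
import torch
from torch.utils.data import DataLoader, Subset

# hypothetical sanity check: train on ~10 images and make sure the loss approaches zero
tiny_set = Subset(trainset, range(10))          # first 10 CIFAR-10 images
tiny_loader = DataLoader(tiny_set, batch_size=10, shuffle=True)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = torch.nn.CrossEntropyLoss()

for epoch in range(200):
    for x, y in tiny_loader:
        optimizer.zero_grad()
        loss = loss_fn(model(x), y)
        loss.backward()
        optimizer.step()
# if the training loss doesn't collapse to ~0 here, something is broken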
Once your training loss keeps decreasing but your validation loss plateaus, it’s time to start introducing regularization (say, nn.Dropout(0.5)) and data augmentation. This way you won’t continue to overfit the data and can keep driving the validation loss down.
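For CIFAR-10, the augmentation step might just be a couple of extra transforms on the training images (these particular transforms are my own guess at a reasonable starting point):
import torchvision.transforms as transforms

# hypothetical augmentation pipeline: random crops and flips on training images
train_transform = transforms.Compose([
    transforms.RandomCrop(32, padding=4),   # random 32x32 crop from a padded image
    transforms.RandomHorizontalFlip(),      # flip half the images left-to-right
    transforms.ToTensor(),
])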
I haven’t done it yet with the CIFAR-10 set, but I suspect you should build up the more complex architecture without regularization and make sure you can overfit it to death before tuning hyperparameters and introducing regularization.
If you start with a complex model with data augmentation and regularization, you’ll never be able to tune the hyperparameters and find a good solution.
Batch Normalization
Batch normalization is a way to make your network more robust to covariate shift.* For example, if you train your cat vs. not-cat classifier with only images of black cats, your network won’t make good predictions when it comes to orange cats.
The idea is to normalize each hidden layer, similar to how you normalize the input dataset. However, whereas normalizing the training data centers it around a mean of zero and a standard deviation of one, the mean and variance after batch normalization are set by trainable/learnable parameters.
Normalize the input data \(x^{(i)}\) like:
\[Z_{norm}^{(i)} = \frac{x^{(i)}-\mu}{\sqrt{\sigma^2+\epsilon}}\]
Normalize the values for hidden layer \(l\) like:
\[\widetilde{Z}^{(i)} = \gamma Z_{norm}^{(i)} + \beta\]
You can see that if \(\gamma = 1\) and \(\beta = 0\), you get the first equation back, i.e. you’re normalizing the hidden layer in the same way as the input data. You generally don’t want to do this, however, because if everything is centered around zero, your activation function will mostly operate in the linear regime of the sigmoid.
The structure of implementing batch normalization looks something like this:
- first pass: \(x \cdot \theta^T \rightarrow z\)
- batch normalize: \(z \rightarrow \widetilde{z}\)
- apply activation function: \(g(\widetilde{z}) = a\)
- second pass: \(a \cdot \theta^T\)
- etc.
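In PyTorch, one layer of that structure might look like this (the layer sizes are made up):
import torch.nn as nn

# hypothetical block matching the order above: affine -> batch norm -> activation
layer = nn.Sequential(
    nn.Linear(256, 128),   # x . theta^T -> z
    nn.BatchNorm1d(128),   # z -> z~, with learnable gamma and beta
    nn.ReLU(),             # g(z~) = a
)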
Note, this is for one mini-batch, so \(x\) is the \(i^{th}\) mini-batch. Normalizing the hidden layers for each batch means that later hidden layers don’t have to adapt as much to changes in the earlier ones. The later layers can tune themselves a little more independently of the other layers, which speeds up learning. Note also that because you’re scaling each mini-batch \((z \rightarrow \widetilde{z})\), you add a little bit of noise, which has a slight regularizing effect. But don’t use batch norm as a form of regularization; that’s not its purpose.
*Covariate shift is a type of dataset shift where the distribution of the training data differs from that of the test data, or, in this context, from batch to batch.
What I will do next
- I need to finish that Codecademy course on chatbots ASAP, before my free Pro account expires.