Model Training Tricks (1)

Normalization

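from fastai.vision.all import *

# `path` is assumed to point at the dataset root, with one folder per class.
def get_dls(bs, size):
    dblock = DataBlock(blocks=(ImageBlock, CategoryBlock),
                       get_items=get_image_files,
                       get_y=parent_label,
                       # Resize each item to a large size first, then augment per batch.
                       item_tfms=Resize(460),
                       batch_tfms=[*aug_transforms(size=size, min_scale=0.75),
                                   # Normalize with the ImageNet channel means and stds.
                                   Normalize.from_stats(*imagenet_stats)])
    return dblock.dataloaders(path, bs=bs)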

Normalization becomes especially important when using pretrained models. The pretrained model only knows how to work with data of the type that it has seen before. If the average pixel was zero in the data it was trained with, but your data has zero as the minimum possible value of a pixel, then the model is going to be seeing something very different to what is intended.

This means that when you distribute a model, you need to also distribute the statistics used for normalization, since anyone using it for inference, or transfer learning, will need to use the same statistics. By the same token, if you’re using a model that someone else has trained, make sure you find out what normalization statistics they used, and match them.
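
To make the role of the statistics concrete, here is a minimal sketch of per-channel normalization in plain Python. The helper names `normalize_pixel` and `denormalize_pixel` are hypothetical, and the constants are the commonly published ImageNet channel statistics for RGB pixels scaled to [0, 1]; a real pipeline would apply the same arithmetic to whole tensors.

```python
# Commonly published ImageNet per-channel statistics (RGB, pixels scaled to [0, 1]).
IMAGENET_MEAN = (0.485, 0.456, 0.406)
IMAGENET_STD = (0.229, 0.224, 0.225)

def normalize_pixel(rgb, mean=IMAGENET_MEAN, std=IMAGENET_STD):
    """Return (x - mean) / std for each channel of one RGB pixel."""
    return tuple((x - m) / s for x, m, s in zip(rgb, mean, std))

def denormalize_pixel(rgb, mean=IMAGENET_MEAN, std=IMAGENET_STD):
    """Invert normalize_pixel; this only works with the same statistics."""
    return tuple(x * s + m for x, m, s in zip(rgb, mean, std))

# A pixel equal to the channel means maps to zero in every channel.
print(normalize_pixel(IMAGENET_MEAN))

# A black pixel, the minimum possible value, maps to a clearly negative value.
print(normalize_pixel((0.0, 0.0, 0.0)))
```

If a model was trained on inputs normalized this way and is then fed raw [0, 1] pixels, every input is shifted and scaled relative to what the weights expect, which is why the statistics have to travel with the model.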

Progressive resizing

Gradually using larger and larger images as you train

Start training using small images, and end training using large images. Spending most of the epochs training with small images makes training complete much faster. Completing training using large images makes the final accuracy much higher. We call this approach progressive resizing.

Note that for transfer learning, progressive resizing may actually hurt performance. This would happen if your pretrained model was quite similar to your transfer learning task and dataset, and was trained on similar-sized images, so the weights don’t need to be changed much. In that case, training on smaller images may damage the pretrained weights. On the other hand, if the transfer learning task uses images that differ in size, shape, or style from those used in the pretraining task, progressive resizing will probably help. As always, the answer to “does it help?” is “try it!”.

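# Stage 1: train from scratch on small 128px images with a large batch size.
dls = get_dls(128, 128)
# n_out=dls.c sizes the final layer to the number of classes in the data.
learn = Learner(dls, xresnet50(n_out=dls.c), loss_func=CrossEntropyLossFlat(),
                metrics=accuracy)
learn.fit_one_cycle(4, 3e-3)

# Stage 2: swap in larger 224px images (smaller batch size) and keep training.
learn.dls = get_dls(64, 224)
learn.fine_tune(5, 1e-3)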
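
The two stages above generalize to any number of sizes. The following is a library-agnostic sketch of that loop, not fastai code: `progressive_resize`, `make_loaders`, and `train` are hypothetical names, and the two callables stand in for something like `get_dls` and a call that trains the existing learner.

```python
def progressive_resize(stages, make_loaders, train):
    """Train through `stages`, a list of (batch_size, image_size, epochs)
    tuples ordered from small images to large ones.

    `make_loaders(bs, size)` builds the data at one resolution and
    `train(loaders, epochs)` continues training the same model on it.
    Both are supplied by the caller.
    """
    history = []
    for bs, size, epochs in stages:
        loaders = make_loaders(bs, size)  # rebuild the data at the new resolution
        train(loaders, epochs)            # weights carry over between stages
        history.append((size, epochs))
    return history

# Stand-in callables so the sketch runs on its own.
def fake_loaders(bs, size):
    return {"bs": bs, "size": size}

def fake_train(loaders, epochs):
    print(f"training {epochs} epochs at {loaders['size']}px, bs={loaders['bs']}")

# Mirrors the two stages used above: 128px first, then 224px.
progressive_resize([(128, 128, 4), (64, 224, 5)], fake_loaders, fake_train)
```

The batch size usually shrinks as the image size grows, since larger images take more memory per item.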

Test time augmentation

During inference or validation, creating multiple versions of each image, using data augmentation, and then taking the average or maximum of the predictions for each augmented version of the image

Select a number of areas to crop from the original rectangular image, pass each of them through our model, and take the maximum or average of the predictions. In fact, we could do this not just for different crops, but for different values across all of our test time augmentation parameters.

preds,targs = learn.tta()
accuracy(preds, targs).item()
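
The combining step can be illustrated without any library. This is a minimal sketch of averaging (or taking the maximum of) the class probabilities predicted for several augmented copies of one image; `combine_predictions` and its `use_max` flag are names chosen for this illustration, and the sketch does not claim to reproduce what `learn.tta()` does internally.

```python
def combine_predictions(preds_per_aug, use_max=False):
    """Combine class probabilities predicted for augmented copies of one image.

    preds_per_aug: list of equal-length probability lists, one per copy.
    Returns one list with the per-class average (default) or maximum.
    """
    if not preds_per_aug:
        raise ValueError("need at least one prediction")
    per_class = list(zip(*preds_per_aug))  # regroup the values by class
    if use_max:
        return [max(vals) for vals in per_class]
    return [sum(vals) / len(vals) for vals in per_class]

# Three augmented copies of one image, two classes.
preds = [[0.6, 0.4], [0.2, 0.8], [0.4, 0.6]]
print(combine_predictions(preds))                # per-class average
print(combine_predictions(preds, use_max=True))  # per-class maximum
```

Averaging smooths out a copy where the crop happened to miss the object, while the maximum keeps the most confident view of each class.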

Mixup

Mixup works as follows, for each image:

  1. Select another image from your dataset at random
  2. Pick a weight at random
  3. Take a weighted average (using the weight from step 2) of the selected image with your image; this will be your independent variable
  4. Take a weighted average (with the same weight) of this image’s labels with your image’s labels; this will be your dependent variable

In pseudo-code, we’re doing (where t is the weight for our weighted average):

image2,target2 = dataset[randint(0,len(dataset))]
t = random_float(0.5,1.0)
new_image = t * image1 + (1-t) * image2
new_target = t * target1 + (1-t) * target2

Suppose a mixed image is built by adding 0.3 times a church image and 0.7 times a gas station image. Should the model predict church or gas station? The right answer is 30% church and 70% gas station, since that’s what we get if we take the linear combination of the one-hot-encoded targets. For instance, if church has index 2 and gas station has index 7, the one-hot-encoded representations (with 10 classes) are

[0, 0, 1, 0, 0, 0, 0, 0, 0, 0] and [0, 0, 0, 0, 0, 0, 0, 1, 0, 0]

so the mixed target is

[0, 0, 0.3, 0, 0, 0, 0, 0.7, 0, 0]
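
The pseudo-code and the church and gas station example can be checked with a small runnable sketch in plain Python. The helper names (`mixup_pair`, `one_hot`, `random_mixup`) are hypothetical, images are represented as flat lists of numbers for brevity, and the weight is drawn uniformly from [0.5, 1.0] only because the pseudo-code above does so.

```python
import random

def one_hot(index, n_classes):
    """One-hot encode a class index as a list of floats."""
    return [1.0 if i == index else 0.0 for i in range(n_classes)]

def mixup_pair(image1, target1, image2, target2, t):
    """Blend two images and their targets, with weight t on the first."""
    new_image = [t * a + (1 - t) * b for a, b in zip(image1, image2)]
    new_target = [t * a + (1 - t) * b for a, b in zip(target1, target2)]
    return new_image, new_target

def random_mixup(dataset, i, rng=random):
    """Mix item i with a randomly chosen item, following the four steps."""
    image1, target1 = dataset[i]
    image2, target2 = dataset[rng.randrange(len(dataset))]  # step 1
    t = rng.uniform(0.5, 1.0)                                # step 2
    return mixup_pair(image1, target1, image2, target2, t)   # steps 3 and 4

# 30% church (index 2) and 70% gas station (index 7), 10 classes.
church, gas_station = one_hot(2, 10), one_hot(7, 10)
_, target = mixup_pair([0.0], church, [1.0], gas_station, t=0.3)
print([round(v, 2) for v in target])
```

Because the same weight is used for the image and the label, the mixed target always sums to 1, just like a one-hot target.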

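# n_out=dls.c sizes the final layer to the number of classes in the data.
model = xresnet50(n_out=dls.c)
# The fastai callback class is spelled MixUp and is passed as an instance.
learn = Learner(dls, model, loss_func=CrossEntropyLossFlat(),
                metrics=accuracy, cbs=MixUp())
learn.fit_one_cycle(5, 3e-3)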

Label smoothing

In the theoretical expression of the loss, in classification problems, our targets are one-hot encoded (in practice we tend to avoid doing it to save memory, but what we compute is the same loss as if we had used one-hot encoding). That means the model is trained to return 0 for all categories but one, for which it is trained to return 1. Even 0.999 is not good enough: the model will still get gradients and learn to predict activations that are even more confident. This encourages overfitting and gives you, at inference time, a model that does not produce meaningful probabilities: it will always say 1 for the predicted category even if it’s not too sure, just because it was trained this way. This can become very harmful if your data is not perfectly labeled.

This is how label smoothing works in practice: we start with one-hot encoded labels, then replace all zeros by $\frac{\epsilon}{N}$, where $N$ is the number of classes and $\epsilon$ is a parameter (usually 0.1, which would mean we are 10% unsure of our labels). Since you want the labels to add up to 1, replace the 1 by $1-\epsilon+\frac{\epsilon}{N}$.

This way, we don’t encourage the model to predict something overconfident: in our Imagenette example where we have 10 classes, the targets become something like:

[0.01, 0.01, 0.01, 0.91, 0.01, 0.01, 0.01, 0.01, 0.01, 0.01]
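
As a quick check of the arithmetic, here is a minimal sketch that builds smoothed targets in plain Python, assuming every zero becomes $\frac{\epsilon}{N}$ and the 1 becomes $1-\epsilon+\frac{\epsilon}{N}$, which is what reproduces the 0.01 and 0.91 values above. The function name `smooth_labels` is hypothetical.

```python
def smooth_labels(true_index, n_classes, eps=0.1):
    """Return label-smoothed targets for one example.

    Every class gets eps / n_classes, and the true class additionally
    gets 1 - eps, so the targets still sum to 1.
    """
    off_value = eps / n_classes
    targets = [off_value] * n_classes
    targets[true_index] = 1 - eps + off_value
    return targets

# 10 classes, true class at index 3, eps = 0.1.
targets = smooth_labels(true_index=3, n_classes=10, eps=0.1)
print([round(t, 2) for t in targets])
print(round(sum(targets), 6))
```

With these targets, a prediction of exactly 1 for the true class is no longer the optimum, so the model is not pushed toward ever larger activations.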