Model Training Tricks (2)

If you have unlimited data, unlimited memory, and unlimited time, then the advice is easy: train a huge model on all of your data for a really long time. The reason that deep learning is not straightforward is because your data, memory, and time is limited. If you are running out of memory or time, then the solution is to train a smaller model. If you are not able to train for long enough to overfit, then you are not taking advantage of the capacity of your model.

So step one is to get to the point that you can overfit. Then, the question is how to reduce that overfitting.

Many practitioners when faced with an overfitting model start at exactly the wrong end of this diagram. Their starting point is to use a smaller model, or more regularisation. Using a smaller model should be absolutely the last step you take, unless your model is taking up too much time or memory. Reducing the size of your model as reducing the ability of your model to learn subtle relationships in your data.
Instead, your first step should be to seek to create more data. That could involve adding more labels to data that you already have in your organisation, finding additional tasks that your model could be asked to solve (or to think of it another way, identifying different kinds of labels that you could model), or creating additional synthetic data via using more or different data augmentation. Thanks to the development of mixup and similar approaches, effective data augmentation is now available for nearly all kinds of data.
Once you’ve got as much data as you think you can reasonably get a hold of, and are using it as effectively as possible by taking advantage of all of the labels that you can find, and all of the augmentation that make sense, if you are still overfitting and you should think about using more generalisable architectures. For instance, adding batch normalisation may improve generalisation.
If you are still overfitting after doing the best you can at using your data and tuning your architecture, then you can take a look at regularisation. Generally speaking, adding dropout to the last layer or two will do a good job of regularising your model. However, as we learnt from the story of the development of AWD-LSTM, it is often the case that adding dropout of different types throughout your model can help regularise even better. Generally speaking, a larger model with more regularisation is more flexible, and can therefore be more accurate than a smaller model with less regularisation.
Only after considering all of these options would be recommend that you try using smaller versions of your architectures.