Dropout: A simple way to prevent Neural Networks from Overfitting

This post was originally published by Annanya Vedala at Medium [AI]


RESEARCH PAPER OVERVIEW

The purpose of the paper is to explain what dropout layers are and how they contribute to improving the performance of a neural network. The paper shows that dropout improves the performance of neural networks on supervised learning tasks in speech recognition, document classification and vision.

Neural networks, being highly complex models, tend to overfit easily. Moreover, combining two or more models is often the best way to reach an optimal solution, but an ensemble of large neural networks is extremely expensive to train and very difficult to apply at test time. This created the need for models that require less computation and are less prone to overfitting, which was the motivation behind the development of dropout layers.

Dropout essentially refers to literally dropping out units from the neural network, i.e., temporarily removing a unit along with all its incoming and outgoing connections. The units to be removed are chosen at random based on a parameter p that assigns each unit its probability of retention. This probability is usually chosen using a cross-validation set, or simply set to 0.5, which works well in many cases; for input units, p is usually 1 or at least closer to 1 than to 0.5.

So we are essentially sampling a thinned sub-network from the main neural network. A neural net with n units can be seen as a collection of 2^n possible thinned networks. For each training case, a new thinned network is sampled and trained. Training a neural network with dropout can therefore be seen as training a collection of 2^n thinned networks with extensive weight sharing, where each individual thinned network is trained rarely, if ever. At test time, the weights of each unit are multiplied by its retention probability, so that the expected output of a unit matches its actual output at training time.
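As an illustration (not code from the paper), a minimal NumPy sketch of training-time dropout and the test-time weight-scaling rule described above might look like this:

```python
import numpy as np

rng = np.random.default_rng(0)

def dropout_forward(x, p, train=True):
    """Apply dropout with retention probability p.

    During training, each unit is kept with probability p and zeroed
    otherwise, which samples one thinned network. At test time the
    full network is used and activations are scaled by p so their
    expected value matches the training-time expectation.
    """
    if train:
        mask = (rng.random(x.shape) < p).astype(x.dtype)  # Bernoulli(p) mask
        return x * mask
    return x * p  # test-time scaling

x = np.ones(10000)
y_train = dropout_forward(x, p=0.5, train=True)   # about half the units survive
y_test = dropout_forward(x, p=0.5, train=False)   # every unit scaled by 0.5
print(y_train.mean(), y_test.mean())
```

The mean of the training-time output hovers around 0.5, matching the constant test-time output, which is exactly the expectation-matching property the scaling rule is designed to give.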


UNDERSTANDING DROPOUT NETWORKS

Backpropagation:

Like any other neural network, a dropout neural network is trained by minimising a loss function with stochastic gradient descent. The difference lies in how each mini-batch is processed: for each training case, a thinned network is sampled by dropping out units, and the gradients for that case are computed on the thinned network. Dropout alone gives evident improvements, but combining it with max-norm regularisation, a large decaying learning rate and high momentum improves the final result considerably. The noise introduced by dropout lets the optimisation explore regions of the weight space that would otherwise be hard to reach. As the learning rate decays, the optimisation takes shorter steps, does less exploration, and finally settles into a minimum.
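The max-norm constraint mentioned above can be sketched in a few lines of NumPy. This is an illustrative implementation, not the paper's code: after each gradient update, the incoming weight vector of every unit is rescaled so its L2 norm never exceeds a fixed constant c.

```python
import numpy as np

def max_norm_constrain(W, c=3.0):
    """Clip the L2 norm of each unit's incoming weight vector to at most c.

    W has shape (fan_in, fan_out), so each column holds the incoming
    weights of one unit. Applied after every gradient step, this is the
    max-norm regularisation typically paired with dropout.
    """
    norms = np.linalg.norm(W, axis=0, keepdims=True)        # one norm per unit
    scale = np.minimum(1.0, c / np.maximum(norms, 1e-12))   # shrink only if too big
    return W * scale

W = np.random.default_rng(1).normal(size=(256, 128)) * 2.0  # deliberately large
W_c = max_norm_constrain(W, c=3.0)
print(np.linalg.norm(W_c, axis=0).max())  # no column norm exceeds 3.0
```

Because the constraint caps how large any unit's weights can grow, it lets training use aggressive learning rates without weights blowing up, which is why it combines well with the noisy dropout updates.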

Unsupervised Pre-training:

Dropout can also be applied when fine-tuning pre-trained networks. The pre-training procedure stays the same, but the weights obtained from it should be scaled up by a factor of 1/p. With this scaling, the expected output of a unit under dropout is the same as its output at the end of pre-training.
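The 1/p rescaling is easy to verify numerically. In this small sketch (an illustration, not the paper's code), dropout with retention probability p multiplies the input by p in expectation, so dividing the pre-trained weights by p makes the expected output match the pre-training output exactly:

```python
import numpy as np

p = 0.5  # retention probability to be used during dropout fine-tuning
rng = np.random.default_rng(2)
W_pretrained = rng.normal(size=(100, 50))

# Scale pre-trained weights up by 1/p before dropout fine-tuning:
# E[mask * x] = p * x, and (W / p).T @ (p * x) = W.T @ x.
W_init = W_pretrained / p

x = rng.normal(size=100)
expected_dropout_out = W_init.T @ (p * x)   # expected output under dropout
pretrain_out = W_pretrained.T @ x           # output at the end of pre-training
print(np.allclose(expected_dropout_out, pretrain_out))
```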

In the paper, datasets such as MNIST, TIMIT, CIFAR-10 and CIFAR-100 all show much better results with a dropout neural network than with a standard one. On MNIST, a dropout neural network achieved an error rate of about 1.35%. The largest networks in the paper have more than 65 million parameters, and training networks of that size is very hard with standard regularisation methods. The dropout nets typically use a retention probability p of 0.5 in the hidden layers and 0.8 in the input layer. Dropout gives a large improvement across all architectures, without hyperparameters specifically tuned for each architecture. This shows that dropout acts as a very good regularisation technique.


EFFECTS OF DROPOUT

1. Effect on features

In a standard neural network, i.e., one without dropout, each unit can rely on all the other units being present in order to reduce the loss. This leads to complex co-adaptations between units, which produce overly complex functions that do not generalise. With dropout, a unit cannot depend on specific other units being present, so each unit must reduce its share of the loss on its own. As a result, the features learned are individually more useful and more robust.

2. Effect on Sparsity

Dropout networks automatically produce sparse activations. In neural networks where sparsity is explicitly induced, the units that are not zeroed out tend to be the most useful ones and contribute most of the final output. The same happens with dropout: since some units are removed for each training case, the burden of representing the function falls on a smaller number of active units, and those units automatically become bigger contributors to the final output.

3. Effect of Dropout rate

The retention probability p is an extra hyperparameter of dropout neural networks, and there are two ways its value can be tuned:

A) Keeping p*n constant, where n is the number of units in a layer, so that the expected number of retained units stays the same. This means that extremely large layers get a very small value of p and vice versa.

B) Keeping n fixed and tuning p on a cross-validation set. Here we start off with a very small p; the model will usually underfit, so we progressively increase p until we reach the required behaviour in terms of loss and accuracy.
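Strategy A above is simple enough to sketch directly. In this illustrative snippet (the layer sizes and target are made up for the example), the per-layer retention probability is derived from a fixed expected number of retained units:

```python
# Strategy A: keep p * n constant across layers.
layer_sizes = [2048, 1024, 512]   # hypothetical hidden-layer widths
target_retained = 512.0           # desired expected number of surviving units

# Larger layers get smaller retention probabilities, capped at 1.0.
retention = [min(1.0, target_retained / n) for n in layer_sizes]
print(retention)  # [0.25, 0.5, 1.0]
```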

4. Effect of data size

Like any regulariser, dropout interacts with the size of the dataset. When a dropout network was trained on datasets of different sizes, it was observed that for extremely small datasets the noise from dropout overwhelms the signal and dropout does not help. For moderately sized datasets, dropout provides enough noise to prevent overfitting and gives a clear improvement. As the dataset becomes very large, overfitting stops being a problem and the gain from dropout shrinks. Thus, dropout is most useful in the regime where the network has enough data to learn but would otherwise overfit.

5. Monte-Carlo Model Averaging vs. Weight Scaling

The efficient test-time procedure proposed in the paper is an approximate model combination: scale down the weights of the trained neural network by the retention probabilities and make a single forward pass. The more accurate but expensive alternative is to sample k thinned networks by applying dropout for every test case and average their predictions. As k → ∞, this Monte-Carlo average approaches the true model average. Computing the error for different values of k shows that the error rate of the finite-sample average quickly approaches that of the true model average, and that the cheap weight-scaled network matches it closely.
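The two test-time procedures can be compared on a toy network. This is an illustrative sketch (the one-hidden-layer architecture and random weights are made up for the example): one function averages k sampled thinned networks, the other makes a single weight-scaled pass.

```python
import numpy as np

rng = np.random.default_rng(4)

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

# Toy one-hidden-layer net with dropout on the 50-unit hidden layer.
W1 = rng.normal(size=(20, 50))
W2 = rng.normal(size=(50, 10))
x = rng.normal(size=20)
p = 0.5  # retention probability

def mc_average(k):
    """Sample k thinned networks and average their predictions."""
    preds = []
    for _ in range(k):
        mask = (rng.random(50) < p).astype(float)     # one thinned network
        h = np.maximum(0.0, x @ W1) * mask            # ReLU then dropout
        preds.append(softmax(h @ W2))
    return np.mean(preds, axis=0)

def weight_scaled():
    """Single forward pass with hidden activations scaled by p."""
    h = np.maximum(0.0, x @ W1) * p
    return softmax(h @ W2)

# Both procedures return a distribution over the 10 classes; as k grows,
# the Monte-Carlo average stabilises and the weight-scaled pass
# approximates it at a fraction of the cost.
print(mc_average(2000).round(3))
print(weight_scaled().round(3))
```

The weight-scaled pass costs one forward propagation regardless of k, which is why the paper adopts it as the standard test-time procedure.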


Conclusion

Dropout is essentially meant for improving neural networks by mitigating the problem of overfitting. Standard backpropagation learning builds up brittle co-adaptations that work for the training data but do not generalise to new data. Random dropout breaks these co-adaptations by making the presence of any particular hidden unit unreliable. Methods that use dropout achieve highly efficient results on datasets like CIFAR-100 and MNIST, and dropout considerably improves the performance of standard neural nets on other datasets too. The basic idea behind dropout is to take a large model that overfits easily and repeatedly sample and train smaller partial models from it. One drawback is that it increases training time: a dropout network takes, on average, 2-3 times longer to train than a standard neural network of the same architecture. A major reason is the noise in the parameter updates; every update trains a different random architecture, so the gradients being computed are not gradients of the final architecture that will be used at test time. There is a clear trade-off between overfitting and training time: by compromising on training time, a neural network can use more dropout and experience less overfitting.

To delve into further details, you can find the research paper here: https://jmlr.org/papers/volume15/srivastava14a.old/srivastava14a.pdf

This article was simply a summary of what I picked up from the research paper. I hope you got something out of it!
