# Introduction to Deep Learning for Self Driving Cars (Part — 2)

## Foundational Concepts in the field of Deep Learning and Machine Learning

Let’s take our neural networks one level deeper and learn about concepts that every expert knows: activation functions, normalization, regularization, and even dropout, which makes training more robust. These will help us become much more proficient at training neural networks.

## Introduction

In the last Medium article, referenced below, we trained a simple logistic classifier on images. Now we’re going to take this classifier and turn it into a deep network. It’s going to be just a few lines of code, so make sure you understand well what was going on in the previous model.

In the later part, we are going to take a small peek into how our optimizer does all the hard work for us, computing gradients for arbitrary functions. Then we are going to look together at the important topic of regularization, which will enable us to train much, much larger models.

## ReLU

Let me introduce the lazy engineer’s favorite non-linear function: the rectified linear unit, or ReLU for short. ReLUs are literally the simplest non-linear functions one can think of: they’re linear when x is greater than 0, and they’re 0 everywhere else. ReLUs have nice derivatives as well.

When x is less than zero, the value is 0. So, the derivative is 0 as well. When x is greater than 0, the value is equal to x. So, the derivative is equal to 1.
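The definition and its derivative above can be sketched in a few lines of NumPy (the function names here are my own, not from the article):

```python
import numpy as np

def relu(x):
    # ReLU: linear for x > 0, zero everywhere else
    return np.maximum(0.0, x)

def relu_grad(x):
    # Derivative: 1 where x > 0, 0 where x < 0
    return (x > 0).astype(np.float64)

x = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
print(relu(x))       # negatives (and zero) map to 0, positives pass through
print(relu_grad(x))  # 0 for the negatives, 1 for the positives
```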

## Network of ReLU’s

Because we’re lazy engineers, we’re going to take something that works, our logistic classifier, and make the minimal amount of change to turn it nonlinear. We’re going to construct our new function in the simplest way that we can think of.

Instead of having a single matrix multiply as our classifier, we’re going to insert a ReLU right in the middle. We now have two matrices, one going from the inputs to the ReLUs and another one connecting the ReLUs to the classifier. We’ve solved two of our problems. Our function is now nonlinear, thanks to the ReLU in the middle, and we now have a new knob that we can tune: this number H, which corresponds to the number of ReLU units that we have in the classifier. We can make it as big as we want. Congratulations, you’ve built your first neural network.
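Here is a minimal NumPy sketch of that two-matrix network with a ReLU in the middle. The sizes are illustrative assumptions (784 inputs as if for 28×28 images, 10 classes), not values from the article:

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes: 784 inputs, H hidden ReLU units, 10 classes
n_inputs, H, n_classes = 784, 128, 10

W1 = rng.normal(0, 0.01, (n_inputs, H))   # matrix 1: inputs -> ReLUs
b1 = np.zeros(H)
W2 = rng.normal(0, 0.01, (H, n_classes))  # matrix 2: ReLUs -> classifier
b2 = np.zeros(n_classes)

def forward(x):
    hidden = np.maximum(0.0, x @ W1 + b1)  # the ReLU in the middle
    return hidden @ W2 + b2                # logits for the classifier

x = rng.normal(size=(5, n_inputs))  # a batch of 5 fake "images"
print(forward(x).shape)             # one row of 10 logits per example
```

Making H bigger is just a matter of changing one number; the two weight matrices resize accordingly.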

## Backpropagation

Imagine our network is a stack of simple operations, like linear transforms, ReLUs, whatever we want. Some have parameters, like the matrix transforms; some don’t, like the ReLUs. When we apply our data to some input x, we have data flowing through the stack up to our predictions y. To compute the derivatives, we create another graph. The data in that new graph flows backwards through the network, gets combined using the chain rule that we saw before, and produces gradients. That graph can be derived completely automatically from the individual operations in our network, so most deep learning frameworks will just do it for us. This is called backpropagation, and it’s a very powerful concept.

It makes computing derivatives of complex functions very efficient, as long as the function is made up of simple blocks with simple derivatives. Running the model up to the predictions is often called forward propagation, and the model that goes backwards is called backpropagation. For every single little batch of data in our training set, we’re going to run the forward propagation and then the backpropagation. That will give us a gradient for each of the weights in our model. Then we’re going to apply those gradients, scaled by the learning rate, to our original weights and update them. And we’re going to repeat that all over again many, many times. This is how our entire model gets optimized. I’m not going to go through more of the math of what’s going on in each of those blocks, because again we don’t typically have to worry about that, and it’s essentially the chain rule.

In particular, each block of the backpropagation often takes about twice the memory needed for the forward propagation, and about twice the compute. That’s important when we want to size our model and fit it in memory, for example.
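The forward-backward-update loop described above can be sketched end to end on a toy problem. This is a hand-written sketch with a single linear block (fitting y = 2x + 1), just to make the three steps concrete; real frameworks derive the backward pass automatically:

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy data: the target function is y = 2x + 1
X = rng.normal(size=(64, 1))
Y = 2.0 * X + 1.0

w, b = np.zeros((1, 1)), np.zeros(1)
lr = 0.1  # the learning rate

for step in range(200):
    # Forward propagation: run the model up to the predictions
    pred = X @ w + b
    loss = np.mean((pred - Y) ** 2)

    # Backpropagation: chain rule through each simple block
    dpred = 2.0 * (pred - Y) / len(X)  # d(loss)/d(pred)
    dw = X.T @ dpred                   # d(loss)/d(w)
    db = dpred.sum(axis=0)             # d(loss)/d(b)

    # Apply the gradients with the learning rate, then repeat
    w -= lr * dw
    b -= lr * db

print(w.item(), b.item())  # both converge close to 2.0 and 1.0
```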

## Training a Deep Learning Network

So now we have a small neural network. It’s not particularly deep, just two layers. We can make it bigger and more complex by increasing the size of the hidden layer in the middle, but it turns out that going wider is not particularly efficient in general. We’d need to make it very, very big, and then it gets really hard to train. This is where the central idea of deep learning comes into play: instead, we can add more layers and make our model deeper. There are lots of good reasons to do that.

One is parameter efficiency. We can typically get much more performance with fewer parameters by going deeper rather than wider. Another one is that a lot of natural phenomena that we might be interested in tend to have a hierarchical structure, which deep models naturally capture. If we poke at a model for images, for example, and visualize what the model learns, we’ll often find very simple things at the lowest layers, like lines or edges. Once we move up, we tend to see more complicated things like geometric shapes. Going further up, we start seeing things like objects and faces. This is very powerful, because the model structure matches the kind of abstractions that we might expect to see in our data, and as a result the model has an easier time learning them.

## Regularization

The first way we prevent overfitting is by looking at the performance on our validation set and stopping training as soon as we stop improving. It’s called early termination, and it’s still the best way to prevent our network from over-optimizing on the training set.
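A minimal sketch of that idea, where the `val_losses` list stands in for one measured validation loss per epoch (the values and the patience threshold are made up for illustration):

```python
# Early termination: stop as soon as the validation loss stops improving.
val_losses = [0.90, 0.71, 0.55, 0.48, 0.47, 0.49, 0.52, 0.58]

best, best_epoch = float("inf"), 0
patience, bad_epochs = 2, 0  # tolerate a couple of noisy epochs

for epoch, loss in enumerate(val_losses):
    if loss < best:
        best, best_epoch, bad_epochs = loss, epoch, 0
    else:
        bad_epochs += 1
        if bad_epochs >= patience:  # no improvement for `patience` epochs
            break  # stop training here, keep the weights from best_epoch

print(best_epoch, best)  # epoch 4, loss 0.47
```

In practice we would also checkpoint the weights at `best_epoch` and restore them after stopping.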

Another way is to apply regularization.

Regularizing means applying artificial constraints on our network that implicitly reduce the number of free parameters, without making it more difficult to optimize. In the skinny jeans analogy, think stretch pants. They fit just as well, but because they’re flexible, they don’t make things harder to fit in 😃 The stretch pants of deep learning are called L2 regularization. The idea is to add another term to the loss that penalizes large weights. It’s typically achieved by adding the L2 norm of our weights to the loss, multiplied by a small constant. And yes, yet another hyperparameter to tune, sorry about that 😇
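In code, the extra term is one line. A NumPy sketch, where `beta` is that small constant (the name is my choice; squaring the L2 norm, as is common, keeps the gradient simple):

```python
import numpy as np

def loss_with_l2(data_loss, weights, beta=1e-3):
    # Total loss = original loss + beta * squared L2 norm of all weights.
    # Large weights get penalized; beta controls how strongly.
    penalty = sum(np.sum(w ** 2) for w in weights)
    return data_loss + beta * penalty

W1 = np.ones((3, 2))  # toy weight matrices
W2 = np.ones((2, 1))
print(round(loss_with_l2(0.5, [W1, W2], beta=0.01), 2))  # 0.5 + 0.01 * (6 + 2) = 0.58
```

A nice side effect: the gradient of the penalty with respect to each weight is just `2 * beta * w`, so the update simply shrinks every weight a little each step.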

## Dropout

There’s another important technique for regularization that only emerged relatively recently and works amazingly well. It also looks insane the first time we see it, so bear with me. It’s called Dropout, and it works like this.

Imagine that we have one layer that connects to another layer. The values that go from one layer to the next are often called activations. Now take those activations and, randomly, for every example we train our network on, set half of them to 0. Completely at random, we take half of the data that’s flowing through our network and just destroy it, and then we pick a different random half for the next example. So what happens with dropout? Our network can never rely on any given activation to be present, because it might be squashed at any given moment. So it is forced to learn a redundant representation for everything, to make sure that at least some of the information remains. Some activations get smashed, but there are always one or more that do the same job and don’t get killed, so everything remains fine in the end. Forcing our network to learn redundant representations might sound very inefficient, but in practice it makes things more robust and prevents overfitting. It also makes our network act as if it’s taking the consensus over an ensemble of networks, which is always a good way to improve performance. Dropout is one of the most important techniques to emerge in the last few years. If Dropout doesn’t work for us, we should probably be using a bigger network.
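A minimal sketch of the training-time mask, using the common "inverted dropout" variant (an assumption on my part; the article doesn’t name it): surviving activations are scaled up by 1/(1-p) so their expected value is unchanged, which lets us pass activations straight through at evaluation time.

```python
import numpy as np

rng = np.random.default_rng(2)

def dropout(activations, p=0.5, train=True):
    # During training, zero a random fraction p of the activations and
    # scale the survivors by 1/(1-p) to keep the expected value the same.
    if not train:
        return activations  # evaluation: no masking, no scaling
    mask = rng.random(activations.shape) >= p
    return activations * mask / (1.0 - p)

a = np.ones((4, 8))
out = dropout(a, p=0.5)
print(out)  # roughly half the entries are 0, the survivors become 2.0
```

A fresh random mask is drawn on every call, which is exactly the "destroy half, then pick a different half next time" behavior described above.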

With this, we have come to the end of this article. Thanks for reading and following along. Hope you loved it!