No code introduction to neural networks


This post was originally published by Philip Wilkinson at Towards Data Science

The simple architecture explained

Neural networks have been around for a long time, being developed in the 1960s as a way to simulate neural activity for the development of artificial intelligence systems. Since then, however, they have developed into a useful analytical tool, often used in place of, or in conjunction with, standard statistical models such as regression or classification, as they can be used to predict or model a specific output. The main difference, and advantage, in this regard is that neural networks make no initial assumptions about the form of the relationship or distribution that underlies the data, meaning they can be more flexible and capture non-standard and non-linear relationships between input and output variables, making them incredibly valuable in today's data rich environment.

In this sense, their use has taken off over the past decade or so, with the fall in cost and increase in capability of general computing power, the rise of large datasets on which these models can be trained, and the development of frameworks such as TensorFlow and Keras. These frameworks have allowed anyone with sufficient hardware (in some cases no longer even a requirement, thanks to cloud computing), the right data and an understanding of a given coding language to implement them. This article therefore seeks to provide a no code introduction to their architecture and how they work so that their implementation and benefits can be better understood.

Firstly, the way these models work is that there is an input layer, one or more hidden layers and an output layer, each of which is connected to the next by a layer of synaptic weights¹. The input layer (X) is used to take in scaled values of the input, usually within a standardised range of 0–1. The hidden layers (Z) are then used to define the relationship between the input and output using weights and activation functions. The output layer (Y) then transforms the results from the hidden layers into the predicted values, often also scaled to be within 0–1. The synaptic weights (W) connecting these layers are adjusted during model training to determine the weight assigned to each input and prediction in order to get the best model fit. Visually, this is represented as:

The architecture means that each input, xᵢ, to hidden node, z𝒹, is multiplied by a weight wⱼᵢ and then summed, along with an additional bias introduced into the weights matrix, w₀ᵢ, and this sum forms the value entering the hidden node. In the diagram above this is represented by the flow of values from each xᵢ and W₀𝒹 to each z𝒹.
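In symbols (one reading of the notation above, with the weight from input xᵢ to hidden node z𝒹 written as w𝒹ᵢ and the corresponding bias as w𝒹₀), the value a𝒹 entering hidden node z𝒹 is:

$$a_d = \sum_{i} w_{di}\, x_i + w_{d0}$$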

The value at the node is then transformed through a non-linear transfer function, g, which often takes the form of a sigmoid function but can take other well behaved (monotonically increasing and differentiable) forms, including the tanh and ReLU functions. The purpose of this is to introduce non-linearities into the network, which allows us to model non-linear relationships within the data.
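For reference, the standard forms of these two alternatives are:

$$\tanh(a) = \frac{e^{a} - e^{-a}}{e^{a} + e^{-a}}, \qquad \mathrm{ReLU}(a) = \max(0,\, a)$$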

Then, depending on the number of output nodes, and assuming a single hidden layer, the values from the hidden nodes are multiplied by weights and summed, again along with an additional bias, and passed through an output transfer function to convert them into the final estimated output value². This single layer neural network can then be written as:
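(assuming n inputs, a single hidden layer of M hidden nodes, and an index k over the output nodes; the other symbols follow those used above)

$$y_k = g\!\left( \sum_{j=1}^{M} w_{kj}\, g\!\left( \sum_{i=1}^{n} w_{ji}\, x_i + w_{j0} \right) + w_{k0} \right)$$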

where the non-linear activation function here is given by the sigmoid function³:
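(in its standard form, writing a for the weighted sum entering a node)

$$g(a) = \frac{1}{1 + e^{-a}}$$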

In this case, g is the same function used as the hidden layer transfer function (the inner g) and as the output unit transfer function (the outer g), but in some cases these may be different, depending on the type of model and the output desired. Also, while we have added biases into the model through additional weights, W₀, these can be absorbed into the weights matrix so that the additional bias terms are dropped, making the equations above simpler:
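(using the common convention of a constant extra input x₀ = 1, and similarly a constant extra hidden output of 1, to carry the biases)

$$y_k = g\!\left( \sum_{j=0}^{M} w_{kj}\, g\!\left( \sum_{i=0}^{n} w_{ji}\, x_i \right) \right)$$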

Referring to this as a single layer neural network is a reference to the single hidden layer, although it is sometimes called a three layer neural network (counting the input, hidden and output layers) or a single hidden layer neural network. The convention is to count only the hidden layers, as you will always have an input and an output layer as part of the network.

These models are trained through backpropagation, whereby the weights in the hidden layers are adjusted to reflect the error of the model, as determined by the relationship between the output and the target value⁴. It is known as backpropagation because the error is propagated backwards through the network after each epoch (one full pass through the training data). This error can take a variety of forms, including Binary Cross Entropy loss and Mean Squared Error (MSE) loss, and is determined by the type of model you are working with. For example, regression is often associated with Mean Squared Error loss, while classification is often associated with Binary Cross Entropy⁵. The latter of these can be represented as:
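(in its standard form, summing over the data points i)

$$E = -\sum_{i} \left[ t_i \ln y_i + (1 - t_i) \ln (1 - y_i) \right]$$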

Where E is the error of the model, tᵢ is the target and yᵢ is the output value. This is translated back through the weights matrix in the model as the weights are updated by:
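(taking each weight wⱼᵢ to be adjusted by a step Δwⱼᵢ, with the notation reused from above)

$$w_{ji}^{\text{new}} = w_{ji}^{\text{old}} + \Delta w_{ji}$$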

where:
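(assuming a standard gradient descent step on the error E)

$$\Delta w_{ji} = -\,\delta\, \frac{\partial E}{\partial w_{ji}}$$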

with the change in the weights determined by the learning rate, δ, a constant set in advance that controls how sensitive the weight changes are to the error within the model. This adjustment is repeated until no further improvement is observed and hence a minimum is reached⁴. Setting the learning rate is therefore an important part of specifying the model: if the adjustment is too large then the results can bounce around erratically, never reaching a minimum, while if the adjustment is too small then the model can get stuck in a local rather than a global minimum. In reality, finding a global minimum is hard even with the right learning rate, which is why multiple trial runs are often undertaken with different starting values.

Developing these models then often follows three main stages: selecting an appropriate network architecture, network learning and determining the network parameters, and testing the generalisation of the model⁶. While the first and third parts of this process are determined by the exact data used and conventions within the field, the second part is often a matter of trial and error to select the right number of nodes, number of epochs to run and the learning rate⁴. This involves focusing on which set of parameters performs best on the test or validation set within an expected range of parameters. For example, the number of hidden units within each hidden layer is often set by finding the number that produces the best fit.

The advantage of these models is that they can be applied to a variety of problems to uncover relationships that traditional statistical tests cannot, and they can examine non-linear relationships that would otherwise not be discovered. However, these are data hungry models, often requiring thousands if not millions of data points to be trained accurately, and they are often described as black box models because we cannot tell how they reached their final results. Thus, in deciding which model to select, several key questions should be asked⁷:

  1. What are the requirements for accuracy and interpretability?
  2. What prior knowledge of the question is there?
  3. Can a traditional statistical model be used with sufficient accuracy?
  4. What are the requirements for the design and evaluation of a neural network?