Can Machines Dream?


This post was originally published by Jamie McGowan at Towards Data Science - Medium Tagged

A description and implementation of the theory and concepts that surround Style Transfer (featuring a couple of examples).

Check out my GitHub for a working style transfer codebase.

A question I’m sure you never thought of asking, unless of course you and a friend have just watched I, Robot at 4am.

Nevertheless, here we are… in a world where machines are dreaming away… kind of.

The good news is that you don’t need to be worried. Although the outputs of these algorithms can be very weird (I encourage you to Google search “deep dreams”), they are entirely dependent on the inputs, as is the norm in Machine Learning.

So we’re not developing any evil dreams unless we feed them some pretty nasty stuff to begin with.

The term we use to describe this behaviour is Neural Style Transfer [1].

It involves giving a Neural Network a content image and a style image, then asking it to recreate the content image in the style of the style image.

The Motivation

One of the things I find fascinating about any Deep Learning concept is the biological roots it has.

Photo by v2osk on Unsplash

To understand how neural style transfer works, we need to understand how convolutional neural networks (CNNs) work. And to understand CNNs, we need to understand how our own visual cortex works.

This sounds like a huge job, but honestly, it’s super simple to understand. And there are lots of words that sound clever and cool so you can impress the best!

The Eye & The Brain

Our eyes are incredible, and what’s even more incredible is our brain. Our brain is the epicentre of our consciousness and sub-consciousness.

As you may already know, the eyes feed information to the brain’s visual cortex, which then forms our vision.

But what is really interesting is how the brain builds up our vision.

Photo by Robina Weermeijer on Unsplash

There are countless neurons connecting the eye to our visual cortex. At one end we have the individual cells (rods & cones) that are each responsible for tiny fractions of our vision, while at the other end we have a fully formed and seamless (in most cases) visual perception of the world.

A clue to how our vision is built up was discovered back in the 1960s and 70s by Hubel and Wiesel [2]. They performed experiments with cats where they showed the cat a bar of light oriented at different angles. When analysing the firing rate (or activation) of a V1 neuron (a neuron in the primary visual cortex), they found that there was a correlation between the firing rate of this neuron and the rotation of the light bar.

This is such an amazing discovery because it shows that different neurons are responsible for detecting different patterns from the visual input we receive from our eyes.

Fig. 1: Diagram of the visual inputs into the brain.

Fig. 1 shows a crude hand-drawn diagram of what I’m trying to explain. Here we have 4 different neurons connected to the brain, each of which responds to a particular orientation of a line.

Before we move on, let’s put some Machine Learning terminology into this… These very simple features that the neurons have found are what we call low-level features, as opposed to high-level features, which are more complicated.
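To make the idea of orientation-selective neurons a bit more concrete, here is a minimal NumPy sketch. The two 3x3 kernels, the 7x7 "visual field" and the `filter_response` helper are all made up for illustration; they are not taken from Hubel and Wiesel or from any particular CNN. The point is only that a detector tuned to one orientation responds strongly to a matching bar of light and not at all to a mismatched one.

```python
import numpy as np

def filter_response(image, kernel):
    """Slide the kernel over the image (no padding) and return the response map."""
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

# A 7x7 "visual field" containing a single vertical bar of light.
image = np.zeros((7, 7))
image[:, 3] = 1.0

# Hypothetical hand-written detectors, one per orientation.
detectors = {
    "vertical": np.array([[-1, 2, -1], [-1, 2, -1], [-1, 2, -1]], dtype=float),
    "horizontal": np.array([[-1, -1, -1], [2, 2, 2], [-1, -1, -1]], dtype=float),
}

# The strongest activation plays the role of each neuron's firing rate.
firing = {name: filter_response(image, k).max() for name, k in detectors.items()}
print(firing)  # the vertical detector fires, the horizontal one stays silent
```

In a real CNN these kernels are not written by hand; they are learnt during training, but many first-layer filters end up looking a lot like these oriented line detectors.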

So we have some low-level features, which are the basic building blocks of everything we see. You can imagine that there are millions of these features.

Photo by Xavi Cabrera on Unsplash

A bit like when you open a new Lego set and you have what seems like millions of different types of bricks to play with and put together, to build high-level features.

And that’s exactly what our brains do. These neurons are connected to more neurons which have learnt to recognise features built up from their inputs, as shown in Fig. 2.

Fig. 2: Diagram showing 2 layers of neurons building up higher level features.

Fig. 2 shows us a very simple example of some of the structures we can build with the low level features.

In terms of our Lego set, we are putting together the first couple of bricks and imagining what we can build!

Fig. 3: Diagram showing how a simple house or boat can be recognised by higher level neurons in the visual pathway.

Pushing this further, we can see how more complicated (or higher level) structures can be built up with each new layer of neurons.

Here we have a couple of rather excellent drawings of a house and a boat, which our three-level visual pathway is now able to recognise.

So when the brain receives an impulse (or action potential) through both of these neurons, it means that our visual perception contains both a house and a boat. If it receives just one impulse, then there is either a house or a boat, depending on which neuron fired. You get the idea.

With each new layer then, we are allowing our brain to start building up our full vision. Or finish building our Millennium Falcon Lego set, if we go back to the Lego analogy.

From eyes to… CNNs?

Yes, and there is more good news here!

If you followed that last section (hopefully you did) then you already know the basics of how CNNs work!

Go back and have a look at Fig. 3… All we need to do to this figure is add some mathematical notation and reframe it as a neural network in machine learning and voilà!

The ins and outs of CNNs are the subject of some extensive research, which we will not go into here. But there are several excellent Medium articles and published papers centred around this subject so I encourage you to search around if you are interested.

For this article, you only need to know that CNNs are usually built up of several convolutional layers, which are responsible for building the higher level features that we explored above, followed by some pooling layers, which reduce the size of the image and force the CNN to look for even higher level features.

Style and Content of an Image

Again from Fig. 3, where do you think the Content and the Style reside inside this very basic neural net?

The content is the more obvious one to think about here, since the higher level features such as the house and the boat are most likely to be representing the content of an image, as opposed to a single line or some very low level details.

Photo by Steve Johnson on Unsplash

On the flip side of this then, let’s think about what the style of an image is… Imagine I just painted an absolute masterpiece, much like the one seen here…

One of the large contributors to the style of this image is the brush strokes.

Brush strokes can be large features of the image; however, they are relatively low level features when compared to the content of the image.

So we can imagine that more of the style representation of the image will reside in the left side of Fig. 3, or in the lower levels of each convolutional layer in our CNN.

Now let’s take a look at an actual CNN architecture.

Fig. 4: VGG-19 architecture from [3].

Fig. 4 shows the architecture of the famous VGG-19 model developed in [3]. The inputs are fed in from the left and the output given back to us on the right. In the middle we have all our neurons that combine together to recognise low to high-level features.

At each convolutional layer (shaded green in Fig. 4) we repeat the process described earlier, searching for simple features in the first dimension of the convolutional layer (low-level) up to more complex features in the last dimension of the convolutional layer (high-level). Then we have a max pool layer which reduces the size of the image, and we repeat this process again.

Photo by Siora Photography on Unsplash

These max pooling layers are very important, because as they reduce the size of the image, we lose the finer details of the image. So at each step we are naturally looking at larger features.
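As a rough illustration of what a max pooling layer does, here is a small NumPy sketch. It assumes non-overlapping 2x2 windows (the usual choice in VGG-style networks), and the `max_pool_2x2` helper and the 4x4 feature map are invented for this example.

```python
import numpy as np

def max_pool_2x2(feature_map):
    """Keep the largest value in each non-overlapping 2x2 window."""
    h, w = feature_map.shape
    # Trim odd edges so the map divides evenly into 2x2 windows.
    trimmed = feature_map[:h - h % 2, :w - w % 2]
    # Axes become (row block, row in block, column block, column in block).
    windows = trimmed.reshape(h // 2, 2, w // 2, 2)
    return windows.max(axis=(1, 3))

feature_map = np.array([
    [1, 3, 0, 0],
    [2, 4, 0, 1],
    [0, 0, 5, 6],
    [0, 2, 7, 8],
], dtype=float)

pooled = max_pool_2x2(feature_map)
print(pooled.shape)  # (2, 2)
print(pooled)        # [[4. 1.] [2. 8.]]
```

The 4x4 map shrinks to 2x2, and only the strongest response in each window survives. The exact position of that response inside the window is thrown away, which is why the finer details are lost.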

We also notice that the depth of each convolutional layer increases from left to right. This is because we are finding more complex features on the larger scale which are more important to the image content than in the lower levels. For example, a really complex brush stroke or pencil mark is less important than the overall shape that has been drawn.

Now obviously every image is different, and so is each layer, so it is not easy to define a specific point in a CNN that best defines the content of an image. However, as a rule of thumb we don’t usually expect the content to be on the very small scale, or on the very large scale. Somewhere in between is usually a good starting point.

With regards to the style, this can be manifested at any scale and so it is a good idea to have some weighting at each convolutional layer. However, as we have hopefully justified, it is also a good idea to choose your style layers from earlier on in the convolutional layers, before they start to become more complex and more content-based representations.

Generating an image

To generate an image, we must first choose the layers which will contain our content and style. We will choose ours as:

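# Layer whose feature maps represent the content
content = "conv4_2"

# Style layers and the weight given to each one
style = {
    "conv1_1": 1.0,
    "conv2_1": 0.75,
    "conv3_1": 0.2,
    "conv4_1": 0.2,
    "conv5_1": 0.2,
}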

Note that these correspond to the labelling in Fig. 4.

The numbers associated with the style layers are the weights that we wish to give to each layer. So we want the style to be determined more from the conv1_1 layer, with the lowest level features, than any other layer. But again, this is something to play around with depending on your inputs and preferences.

Photo by Clément Hélardot on Unsplash

Now we will pass our content and style images separately into our CNN. As we do this, we can find the base representations for the content layer conv4_2 and each of the style layers for the content and style image respectively.

Now we have something to aim for!

Basically we want an image that, when passed through the CNN, reproduces the same style representations we found for our chosen style layers when passing through the style image, and also the same content representations we found for our chosen content layer when passing through the content image.

This is where the fun begins…

First, we need a starting point for our image. Since we are most likely trying to have a final product that looks more like our content image but with a different style, we will go ahead and make a copy of our content image for our initial input.

At this point we are diverging slightly from the usual Machine Learning set up you may be used to. We are not actually training the model here; we are training the image.

So instead of having our backpropagation graph linked to the weights of the CNN, we are actually linking it to the weights of the input image (the individual pixels).

This way, when we find the content and style loss between the chosen features found from the input image and the base features we prepared earlier, performing backpropagation will cause the input image to change slightly, not the CNN.
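Here is a toy NumPy sketch of "training the image, not the model". It is not the code from my GitHub: the "network" is just a fixed random matrix standing in for a frozen CNN layer (an assumption made so the gradient can be written by hand), and names such as `frozen_weights` and `target_features` are hypothetical. A real implementation would let an autograd framework compute the gradient through the actual CNN.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for a frozen CNN layer: a fixed linear map from 16 pixels to 8 features.
frozen_weights = rng.normal(size=(8, 16))
weights_before = frozen_weights.copy()

def features(image):
    return frozen_weights @ image

# Base representation we want to match, computed once from a "target" image.
target_features = features(rng.normal(size=16))

# The image is the thing being trained.
image = rng.normal(size=16)
initial_loss = np.mean((features(image) - target_features) ** 2)

learning_rate = 0.05
for _ in range(200):
    error = features(image) - target_features
    # Gradient of the mean squared error with respect to the pixels.
    grad_image = 2.0 * frozen_weights.T @ error / error.size
    image -= learning_rate * grad_image  # update the pixels, never the weights

final_loss = np.mean((features(image) - target_features) ** 2)
print(initial_loss, final_loss)
```

The loss drops because the pixels move towards an image whose features match the target, while `frozen_weights` is exactly the same at the end as it was at the start.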

This is not an easy thing to explain so I encourage you to take a few minutes and look through the code example on my GitHub, specifically in the algorithms/ file.

What have we got to lose? — Bit more mathematical…

All we need now is a loss function.

We’ll start with the content loss since this is slightly easier to explain as it is just the mean squared error between the feature maps from “conv4_2” with the base content image (C) and the new image (X).

Fig 5: Content Loss Equation

This simply penalises any divergence away from the features found in the base representation for the content image.
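Written out, a mean squared error of this kind could look like the following. The symbols are my own shorthand for this sketch: F^X and F^C are the flattened "conv4_2" feature maps of the new image X and the content image C, and N is the number of elements in them. The normalisation constant is an assumption; [1] uses a factor of 1/2 on the summed squared error instead, which only rescales the loss.

```latex
\mathcal{L}_{\text{content}}(C, X) = \frac{1}{N} \sum_{i=1}^{N} \left( F_i^{X} - F_i^{C} \right)^2
```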

Now for the style of the image, we are going to need to briefly introduce Gram Matrices. These sound a lot more complicated than they actually are so for the purposes of this article, I will simply describe them as higher dimensional dot products.

The intuition is that for style, we don’t really want to recreate the exact same brush stroke in the exact same place on our image. We simply want to have a similar style throughout the image. It is therefore better to compute the loss in terms of how similar the style representations are, just like a cosine similarity or dot product does.

Of course we must remember that we are using feature maps, which are at a higher dimension than a usual dot product. So we actually need to use a Gram Matrix to encapsulate the correlations between the style of the input X and the base style image S.
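A quick NumPy sketch shows why the Gram Matrix captures style without caring about position. The `gram_matrix` helper and the random feature maps are made up for illustration, and I am assuming feature maps of shape (channels, height, width). Shifting every feature sideways changes where things are, but leaves the Gram Matrix untouched.

```python
import numpy as np

def gram_matrix(feature_maps):
    """feature_maps has shape (channels, height, width)."""
    channels = feature_maps.shape[0]
    flat = feature_maps.reshape(channels, -1)  # one row per channel
    return flat @ flat.T                       # dot product between every pair of channels

rng = np.random.default_rng(0)
feature_maps = rng.normal(size=(3, 4, 4))

# Move every "brush stroke" two pixels to the right (wrapping around the edge).
shifted = np.roll(feature_maps, shift=2, axis=2)

gram_original = gram_matrix(feature_maps)
gram_shifted = gram_matrix(shifted)

print(gram_original.shape)                       # (3, 3)
print(np.allclose(gram_original, gram_shifted))  # True
```

Each entry of the Gram Matrix tells us how strongly two channels fire together across the whole image, which is exactly the "similar style throughout the image" idea, with the locations summed away.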

Fig. 6: Style Loss Equation.

Fig. 6 shows this encapsulated into one formula, where the MSE losses of the Gram Matrices for each chosen style layer are multiplied by their weights (defined in the code block above) and then summed up.
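In symbols, a style loss of this form could be sketched as follows. The notation is my own for this sketch: F^l is the set of feature maps at style layer l reshaped to (channels × positions), G^l is its Gram Matrix, and w_l is the weight chosen for that layer. Any extra normalisation by the size of the feature maps is left out here and may differ between implementations.

```latex
G^{l}(X) = F^{l}(X)\, F^{l}(X)^{\top}, \qquad
\mathcal{L}_{\text{style}}(S, X) = \sum_{l} w_l \, \operatorname{MSE}\!\left( G^{l}(X),\, G^{l}(S) \right)
```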

Finally we arrive at a total loss function,

L_total = α · L_content + β · L_style,

which is characterised by the alpha and beta hyperparameters multiplying the content and style losses respectively, following [1]. Their ratio determines how strongly the style is imposed on the final image, and they are left for us to define as we wish.


Once we have all these features, we are ready to train and the results are incredible!

Here are a couple of space themed examples…

The code on GitHub should be pretty easy to use if you wish to try this out. Simply add your own images into the content and style folder and then run the code with the new file names.

As always, let me know of any issues or comments!


[1] — L. A. Gatys, A. S. Ecker and M. Bethge, A Neural Algorithm of Artistic Style, (2015)

[2] — D. H. Hubel and T. N. Wiesel, Receptive fields, binocular interaction and functional architecture in the cat’s visual cortex, (1962)

[3] — K. Simonyan, A. Zisserman, Very Deep Convolutional Networks for Large-Scale Image Recognition, (2015)
