# Policy Optimization (PPO)

## In this tutorial, we’ll dive into understanding the PPO architecture and implement a Proximal Policy Optimization (PPO) agent that learns to play Pong-v0

In 2018, OpenAI made a breakthrough in Deep Reinforcement Learning. This was possible partly thanks to solid hardware architecture and partly thanks to a state-of-the-art algorithm: Proximal Policy Optimization.

The main idea of Proximal Policy Optimization is to avoid having too large a policy update. To do that, we use a ratio that tells us the difference between our new and old policy and clip this ratio to the range 0.8 to 1.2. Doing that ensures that the policy update will not be too large.


However, to understand PPO, you first need to check my previous tutorials. In this tutorial, I will use the A3C tutorial code as a backbone.

L^PG(θ) = Ê_t [ log π_θ(a_t | s_t) · Â_t ]

Here:

– L^PG(θ) — policy loss; Ê_t — expectation over timesteps; log π_θ(a_t|s_t) — log probability of taking that action at that state; Â_t — advantage estimate.

The idea of this function is to take a gradient ascent step (which is equivalent to gradient descent on the negated objective); this way, our agent is pushed to take actions that lead to higher rewards and to avoid harmful actions.
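As a minimal NumPy sketch of this objective (the function name and the sample values are hypothetical, not from the tutorial code):

```python
import numpy as np

def policy_gradient_loss(log_probs, advantages):
    # Vanilla policy gradient objective E[log pi(a|s) * A];
    # we return its negative so that minimizing it performs gradient ascent
    return -np.mean(log_probs * advantages)

# Two hypothetical sampled actions with their log-probabilities and advantages
log_probs = np.log(np.array([0.6, 0.3]))
advantages = np.array([1.0, -0.5])
loss = policy_gradient_loss(log_probs, advantages)
```

Actions with a positive advantage pull the loss down when their probability rises, which is exactly the push toward higher-reward actions described above.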

However, the problem comes from the step size:

• Too small, and the training process is too slow;
• Too large, and there is too much variability in training.

That’s where PPO is helpful: it improves the stability of the Actor training by limiting the policy update at each training step.

To do that, PPO introduces a new objective function called the “Clipped Surrogate Objective function” that constrains the policy change to a small range using a clip.

First, as explained in the PPO paper, instead of using log π to trace the impact of the actions, PPO uses the ratio between the probability of the action under the current policy and the probability of the action under the previous policy:

r_t(θ) = π_θ(a_t | s_t) / π_θold(a_t | s_t)

As we can see, r_t(θ) is the probability ratio between the new and old policy:

• If r_t(θ) > 1, the action is more probable under the current policy than under the old one.
• If r_t(θ) is between 0 and 1, the action is less probable under the current policy than under the old one.
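A tiny numeric illustration of this ratio, with hypothetical action probabilities (in practice it is usually computed as the exponential of a log-probability difference for numerical stability):

```python
def probability_ratio(new_prob, old_prob):
    # r_t(theta) = pi_new(a|s) / pi_old(a|s)
    return new_prob / old_prob

more_probable = probability_ratio(0.6, 0.4)   # ≈ 1.5 -> action favored by new policy
less_probable = probability_ratio(0.2, 0.4)   # ≈ 0.5 -> action discouraged by new policy
```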

As a consequence, we can write our new objective function:

L^CPI(θ) = Ê_t [ r_t(θ) · Â_t ]

However, if an action became much more probable under our current policy than under our former one, this would lead to a giant policy gradient step and an excessive policy update.

Consequently, we need to constrain this objective function by penalizing changes that push the ratio away from 1 (in the paper, the ratio is only allowed to vary from 0.8 to 1.2). To do that, PPO clips the probability ratio directly in the objective function with its Clipped Surrogate Objective function. By doing that, we ensure not to have too large a policy update, because the new policy can’t be too different from the old one.

With the Clipped Surrogate Objective function, we have two probability ratios, one non-clipped and one clipped to the range [1−ε, 1+ε], where epsilon is a hyperparameter that defines this clip range (in the paper, ε = 0.2):

L^CLIP(θ) = Ê_t [ min( r_t(θ) · Â_t , clip(r_t(θ), 1−ε, 1+ε) · Â_t ) ]

If we take the minimum of the clipped and non-clipped objectives, the final objective is a lower (pessimistic) bound on the unclipped objective. Consequently, we have two cases to consider:

• When the advantage is > 0:

If advantage > 0, the action is better than the average of all the actions in that state. Therefore, we should encourage our new policy to increase the probability of taking that action at that state.

This means increasing r_t, because we increase the probability under the new policy while the denominator (the old policy) stays constant:

However, because of the clip, r_t(θ) will only grow to as much as 1+ε, which means this action can’t become a hundred times more probable than under the old policy. This is done because we don’t want to update our policy too much: taking an action at a specific state is only one try, and it won’t always lead to a highly positive reward, so we don’t want to be too greedy, because that can also lead to a bad policy.

In summary, in the case of a positive advantage, we want to increase the probability of taking that action at that state, but not too much.

• When the advantage is < 0:

If the advantage < 0, the action should be discouraged because of its negative outcome. Consequently, r_t will decrease (because the action is less probable under the current policy than under the previous one), but because of the clip, r_t will only decrease to as little as 1−ε.

Also, we don’t want to make a big change in the policy by being too greedy and completely wiping out the probability of taking that action just because it led to a negative advantage once.

In summary, thanks to this clipped surrogate objective, the range in which the new policy can vary from the old one is restricted, because the incentive for the probability ratio to move outside the interval is removed. If the ratio is > 1+ε (with a positive advantage) or < 1−ε (with a negative advantage), the gradient is equal to 0 (no slope), because the clipped term is a constant. So, both of these clipping regions prevent us from getting too greedy and updating too much at once, outside of the region where this sample offers a good approximation.
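We can verify this flatness numerically for the positive-advantage case (hypothetical values):

```python
import numpy as np

eps, A = 0.2, 1.0
# Every ratio at or beyond 1+eps yields the identical objective value,
# so pushing the policy further out earns nothing (zero gradient there)
ratios = np.array([1.2, 1.5, 3.0])
objective = np.minimum(ratios * A, np.clip(ratios, 1 - eps, 1 + eps) * A)
```

All three entries of `objective` come out equal to 1+ε = 1.2: once the ratio leaves the clip region, the objective stops changing.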

The final Clipped Surrogate Objective Loss combines the clipped policy objective, a value-function term, and an entropy bonus:

L_t^(CLIP+VF+S)(θ) = Ê_t [ L_t^CLIP(θ) − c₁·L_t^VF(θ) + c₂·S[π_θ](s_t) ]

Here:

– c₁ and c₂ — coefficients; S — an entropy bonus; L_t^VF — squared-error value loss: (V_θ(s_t) − V_t^targ)².
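Putting the three terms together in a NumPy sketch (negated for minimization; the c₁, c₂ values here are illustrative assumptions, not the tutorial's hyperparameters):

```python
import numpy as np

def ppo_total_loss(ratio, advantage, value_pred, value_target, entropy,
                   epsilon=0.2, c1=0.5, c2=0.01):
    # Negated full objective (for gradient descent): -L_CLIP + c1*L_VF - c2*S
    clip_obj = np.minimum(ratio * advantage,
                          np.clip(ratio, 1 - epsilon, 1 + epsilon) * advantage)
    value_loss = (value_pred - value_target) ** 2
    return float(np.mean(-clip_obj + c1 * value_loss - c2 * entropy))

loss = ppo_total_loss(ratio=1.0, advantage=1.0,
                      value_pred=0.5, value_target=1.0, entropy=0.0)
```

The signs flip relative to the formula above because deep-learning frameworks minimize a loss, whereas the paper's objective is maximized.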

So now, we’re ready to implement a PPO agent in A3C style. This means that I will use the A3C code as a backbone, with the same processes explained in the A3C tutorial.

So, from the theory side it looks challenging to implement, but in practice we already have prepared code that can be adapted to the PPO loss function; we mainly need to change the loss function we use and the defined replay function:

As you can see, we replace the `categorical_crossentropy` loss function with our `ppo_loss` function.
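As a sketch of what such a custom loss can look like — written here in plain NumPy for readability, whereas the actual tutorial code performs the same operations on Keras backend tensors; the exact column layout of `y_true` is an assumption:

```python
import numpy as np

def ppo_loss(y_true, y_pred, epsilon=0.2):
    # y_true packs [advantage | old policy prediction | one-hot action]
    # column-wise; y_pred is the current policy's action probabilities
    num_actions = y_pred.shape[1]
    advantages = y_true[:, :1]
    old_predictions = y_true[:, 1:1 + num_actions]
    actions = y_true[:, 1 + num_actions:]

    # Probability of the taken action under the new and old policies
    prob = np.sum(y_pred * actions, axis=1, keepdims=True)
    old_prob = np.sum(old_predictions * actions, axis=1, keepdims=True)
    ratio = prob / (old_prob + 1e-10)

    clipped = np.clip(ratio, 1 - epsilon, 1 + epsilon)
    # Negative sign: the framework minimizes, we want to maximize the objective
    return float(-np.mean(np.minimum(ratio * advantages, clipped * advantages)))

# One sample, two actions: the taken action got more probable (0.5 -> 0.6)
y_pred = np.array([[0.6, 0.4]])
y_true = np.hstack([np.array([[1.0]]),          # advantage
                    np.array([[0.5, 0.5]]),     # old prediction
                    np.array([[1.0, 0.0]])])    # one-hot action
loss = ppo_loss(y_true, y_pred)
```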

From the `act` function, we will need the prediction:

The prediction from the above function is used in the replay function:

You may ask: what is `np.hstack` used for? Because we use a custom loss function, we must send the data in the right shape. So we pack all advantages, predictions, and actions into y_true, and when they are received in the custom loss function, we unpack them.
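The packing and unpacking can be sketched like this (batch size, action count, and values are hypothetical):

```python
import numpy as np

# Hypothetical batch: 2 samples, 3 possible actions
advantages = np.array([[0.5], [-0.2]])       # shape (2, 1)
predictions = np.array([[0.2, 0.5, 0.3],
                        [0.1, 0.6, 0.3]])    # shape (2, 3)
actions = np.array([[0., 1., 0.],
                    [1., 0., 0.]])           # one-hot, shape (2, 3)

# Pack everything side by side into one y_true array for the custom loss...
y_true = np.hstack([advantages, predictions, actions])  # shape (2, 7)

# ...and slice the columns back apart inside the loss function
adv = y_true[:, :1]
pred = y_true[:, 1:4]
act = y_true[:, 4:]
```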

The last step to do is adding prediction memory to our run function:

We add the same lines to the defined `train_threading` function. That’s all! We have just created an agent that can learn to play any Atari game. That’s awesome! You’ll need about 12 to 24 hours of training on one GPU to get a good agent with Pong-v0.

So this was a pretty long and exciting tutorial, and here is the complete code:

So, same as before, I trained `PongDeterministic-v4` and `Pong-v0`, and of all my previous RL tutorials, these results were the best. In `PongDeterministic-v4`, we reached the best score in around 400 steps (while A3C needed around 800 steps):

And here are the results for the `Pong-v0` environment:

The results are pretty nice compared with other RL algorithms, and I am delighted with them! My goal wasn’t to make the best possible agent, but to compare the algorithms with the same parameters.

After training it, I thought that this algorithm should be able to learn to play Pong with CNN layers; then my agent would be smaller, and I would be able to upload it to GitHub. So, I changed the model architecture in the following way:

And I was right! I trained this agent for 10k steps, and the results were also quite impressive:

Here is a gif of how my agent plays Pong:

Don’t forget to implement each part of the code by yourself. It’s essential to understand how everything works. Try to modify the hyperparameters, use another gym environment. Experimenting is the best way to learn, so have fun!

I hope this tutorial could give you a good idea of the basic PPO algorithm. You can now build upon this by executing multiple environments in parallel to collect more training samples and solve more complicated game scenarios.

For now, this is my last Reinforcement Learning tutorial. This was quite a long journey where I learned a lot. I plan to return to reinforcement learning later this year, and this will be a more practical and more exciting tutorial; we will use it in finance!