Pendulum with PPO

In this notebook we solve the Pendulum-v0 environment using a TD actor-critic algorithm with PPO policy updates.
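The heart of a PPO policy update is the clipped surrogate objective. The sketch below is a minimal NumPy illustration of that objective, not the notebook's actual code; the function name `ppo_clip_objective` and the default `epsilon=0.2` are assumptions for this sketch.

```python
import numpy as np

def ppo_clip_objective(ratio, advantage, epsilon=0.2):
    """PPO clipped surrogate objective (to be maximized).

    ratio     -- pi_new(a|s) / pi_old(a|s) for each sampled transition
    advantage -- advantage estimates, e.g. derived from TD errors
    epsilon   -- clipping parameter (0.2 is a common default)
    """
    unclipped = ratio * advantage
    clipped = np.clip(ratio, 1 - epsilon, 1 + epsilon) * advantage
    # taking the minimum removes the incentive to move the policy
    # further than the clipping range in a single update
    return np.minimum(unclipped, clipped).mean()

# a large ratio with positive advantage is clipped at 1 + epsilon
print(ppo_clip_objective(np.array([2.0]), np.array([1.0])))  # 1.2
```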

We use simple multi-layer perceptrons as function approximators for the state value function \(v(s)\) and the policy \(\pi(a|s)\); the policy is implemented as a GaussianPolicy.
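A Gaussian policy of this kind can be sketched as follows. This is a minimal NumPy illustration, not the library's GaussianPolicy implementation; the helper names (`mlp_forward`, `gaussian_policy_sample`) and network sizes are assumptions for this sketch.

```python
import numpy as np

rng = np.random.default_rng(0)

def mlp_forward(s, W1, b1, W2, b2):
    # single hidden layer with tanh activation
    h = np.tanh(s @ W1 + b1)
    return h @ W2 + b2

def gaussian_policy_sample(s, params):
    # the network outputs the mean and log-std of a diagonal
    # Gaussian over the (1-dim) action space
    out = mlp_forward(s, *params)
    mu, log_sigma = out[:1], out[1:]
    sigma = np.exp(log_sigma)
    a = mu + sigma * rng.standard_normal(mu.shape)
    # log-probability of the sampled action under the Gaussian
    logp = -0.5 * np.sum(((a - mu) / sigma) ** 2 + 2 * log_sigma + np.log(2 * np.pi))
    return a, logp

# Pendulum: 3-dim observation, 1-dim action
params = (rng.standard_normal((3, 16)) * 0.1, np.zeros(16),
          rng.standard_normal((16, 2)) * 0.1, np.zeros(2))
a, logp = gaussian_policy_sample(np.zeros(3), params)
```

The log-probabilities from the old and new policy parameters are what feed the probability ratio in the PPO objective.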

This algorithm is slow to converge (if it converges at all). You should start to see improvement in the average return after about 150k timesteps. Below you'll see a particularly successful episode:

A particularly successful episode of Pendulum.

To view the notebook in a new tab, click here. To interact with the notebook in Google Colab, hit the “Open in Colab” button below.

Open in Colab