Pendulum with PPO
In this notebook we solve the Pendulum-v0 environment using a TD actor-critic algorithm with PPO policy updates.
We use a simple multi-layer perceptron as the function
approximator for the state value function \(v(s)\) and the
policy \(\pi(a|s)\), the latter implemented by GaussianPolicy.
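To make the ingredients concrete, here is a minimal NumPy sketch of the three quantities such a setup typically combines: the log-density of a Gaussian policy, a one-step TD advantage estimate, and the PPO clipped surrogate objective. It is not the notebook's actual implementation. The function names, the discount factor `gamma=0.9`, and the clipping parameter `epsilon=0.2` are assumptions chosen for illustration.

```python
import numpy as np


def gaussian_log_prob(a, mu, log_sigma):
    """Log-density of a diagonal Gaussian policy, summed over action dims."""
    z = (a - mu) / np.exp(log_sigma)
    return np.sum(-0.5 * z**2 - log_sigma - 0.5 * np.log(2 * np.pi), axis=-1)


def td_advantage(r, v_s, v_next, done, gamma=0.9):
    """One-step TD error r + gamma * v(s') - v(s), used as the advantage.

    `done` is 1.0 for terminal transitions (no bootstrapping), else 0.0.
    gamma=0.9 is an assumed value, not taken from the notebook.
    """
    return r + gamma * (1.0 - done) * v_next - v_s


def ppo_clip_objective(logp_new, logp_old, adv, epsilon=0.2):
    """PPO clipped surrogate objective (to be maximized).

    epsilon=0.2 is an assumed clipping parameter.
    """
    ratio = np.exp(logp_new - logp_old)  # pi_new(a|s) / pi_old(a|s)
    clipped = np.clip(ratio, 1.0 - epsilon, 1.0 + epsilon)
    return np.mean(np.minimum(ratio * adv, clipped * adv))
```

The clipping only removes the incentive to move the probability ratio further in the direction the advantage favors. When the advantage is negative and the ratio is already large, the unclipped term is the smaller one, so the penalty is not capped.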
This algorithm is slow to converge (if it converges at all). You should start to see improvement in the average return after about 150k timesteps. Below you’ll see a particularly successful episode: