keras-gym

Plug-n-play Reinforcement Learning in Python

Create simple, reproducible RL solutions with OpenAI gym environments and Keras function approximators.

Documentation

Example Notebooks

Here we list a selection of Jupyter notebooks that help you get started by learning from examples.

Cartpole

In these notebooks we solve the CartPole environment.

Cartpole with SARSA

In this notebook we solve the CartPole-v0 environment using the SARSA algorithm. We’ll use a linear function approximator for our Q-function.

To view the notebook in a new tab, click here. To interact with the notebook in Google Colab, hit the “Open in Colab” button below.

Open in Colab

Atari 2600: Pong

These notebooks solve the Pong environment.

Atari 2600: Pong with DQN

In this notebook we solve the PongDeterministic-v4 environment using deep Q-learning (DQN). We’ll use a convolutional neural net (without pooling) as our function approximator for the Q-function, see AtariQ.

This notebook periodically generates GIFs, so that we can inspect how the training is progressing.

After a few hundred episodes, this is what you can expect:

Beating Atari 2600 Pong after a few hundred episodes.

To view the notebook in a new tab, click here. To interact with the notebook in Google Colab, hit the “Open in Colab” button below.

Open in Colab

Atari 2600: Pong with PPO

In this notebook we solve the PongDeterministic-v4 environment using a TD actor-critic algorithm with PPO policy updates.

We use convolutional neural nets (without pooling) as our function approximators for the state value function \(v(s)\) and policy \(\pi(a|s)\), see AtariFunctionApproximator.

This notebook periodically generates GIFs, so that we can inspect how the training is progressing.

After a few hundred episodes, this is what you can expect:

Beating Atari 2600 Pong after a few hundred episodes.

To view the notebook in a new tab, click here. To interact with the notebook in Google Colab, hit the “Open in Colab” button below.

Open in Colab

Non-Slippery Frozen Lake

In these notebooks we solve a non-slippery version of the FrozenLake-v0 environment.
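
One common way to obtain such a non-slippery variant (a hedged sketch, not necessarily what the notebooks themselves use; it assumes a gym version in which gym.make forwards keyword arguments to the underlying FrozenLakeEnv) is:

import gym

# deterministic ("non-slippery") variant of FrozenLake
env = gym.make('FrozenLake-v0', is_slippery=False)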

Non-Slippery Frozen Lake with REINFORCE

In this notebook we solve a non-slippery version of the FrozenLake-v0 environment using the REINFORCE algorithm (Monte Carlo policy gradient). We’ll use a linear function approximator for our policy.

To view the notebook in a new tab, click here. To interact with the notebook in Google Colab, hit the “Open in Colab” button below.

Open in Colab

Non-Slippery Frozen Lake with Actor-Critic

In this notebook we solve a non-slippery version of the FrozenLake-v0 environment using the TD actor critic algorithm with PPO policy updates. We’ll use a linear function approximator for our policy and our state value function.

To view the notebook in a new tab, click here. To interact with the notebook in Google Colab, hit the “Open in Colab” button below.

Open in Colab

Non-Slippery Frozen Lake with Soft Actor-Critic (SAC)

In this notebook we solve a non-slippery version of the FrozenLake-v0 environment using the Soft Actor-Critic algorithm (SAC). We’ll use a linear function approximator for our policy and our value functions.

To view the notebook in a new tab, click here. To interact with the notebook in Google Colab, hit the “Open in Colab” button below.

Open in Colab

Pendulum

These notebooks solve the Pendulum environment.

Pendulum with PPO

In this notebook we solve the Pendulum-v0 environment using a TD actor-critic algorithm with PPO policy updates.

We use a simple multi-layer perceptron as our function approximator for both the state value function \(v(s)\) and the policy \(\pi(a|s)\), which is implemented by GaussianPolicy.

This algorithm is slow to converge (if it does at all). You should start to see improvement in the average return after about 150k timesteps. Below you’ll see a particularly successful episode:

A particularly successful episode of Pendulum.

To view the notebook in a new tab, click here. To interact with the notebook in Google Colab, hit the “Open in Colab” button below.

Open in Colab

Function Approximators

The central object in this package is the keras_gym.FunctionApproximator, which provides an interface between a gym-type environment and function approximators like value functions and updateable policies.

FunctionApproximator class

We define a function approximator by specifying a body. For instance, the example below specifies a simple multi-layer perceptron:

import gym
import keras_gym as km
from tensorflow import keras


class MLP(km.FunctionApproximator):
    """ multi-layer perceptron with one hidden layer """
    def body(self, S):
        X = keras.layers.Flatten()(S)
        X = keras.layers.Dense(units=4)(X)
        return X


# environment
env = gym.make(...)

# function approximator
function_approximator = MLP(env, lr=0.01)

This function_approximator can now be used to construct a value function or updateable policy, which we cover in the remainder of this page.

Predefined Function Approximators

Although it’s pretty easy to create a custom function approximator, keras-gym also provides some predefined function approximators. They are listed here.
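
For example, a predefined approximator is constructed directly from the environment. A minimal sketch (the choice of environment, interaction and learning rate is purely illustrative):

import gym
import keras_gym as km

env = gym.make('CartPole-v0')

# linear model on element-wise quadratic features
function_approximator = km.predefined.LinearFunctionApproximator(
    env, interaction='elementwise_quadratic', lr=0.05)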

Value Functions

Value functions estimate the expected (discounted) sum of future rewards. For instance, state value functions are defined as:

\[v(s)\ =\ \mathbb{E}_t\left\{ R_t + \gamma R_{t+1} + \gamma^2 R_{t+2} + \dots\ \Big|\ S_t=s \right\}\]

Here, the \(R\) are the individual rewards we receive from the Markov Decision Process (MDP) at each time step.

In keras-gym we define a state value function as follows:

v = km.V(function_approximator, gamma=0.9, bootstrap_n=1)

The function_approximator is discussed above. The other arguments set the discount factor \(\gamma\in[0,1]\) and the number of steps over which to bootstrap.
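
Once constructed, the value function can be evaluated and updated online. Here is a minimal sketch (it assumes the env and function_approximator from the examples above, and uses a random behavior policy purely for illustration):

s = env.reset()
done = False

while not done:
    a = env.action_space.sample()        # any behavior policy will do for this sketch
    s_next, r, done, info = env.step(a)

    v.update(s, r, done)                 # bootstrapped update of v
    s = s_next

# point estimate for a single state observation
print(v(s))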

Similar to state value functions, we can also define state-action value functions:

\[q(s, a)\ =\ \mathbb{E}_t\left\{ R_t + \gamma R_{t+1} + \gamma^2 R_{t+2} + \dots\ \Big|\ S_t=s, A_t=a \right\}\]

keras-gym provides two distinct ways to define such a Q-function, which are referred to as type-I and type-II Q-functions. The difference between the two is in how the function approximator models the Q-function. A type-I Q-function models the Q-function as \((s, a)\mapsto q(s, a)\in\mathbb{R}\), whereas a type-II Q-function models it as \(s\mapsto q(s,.)\in\mathbb{R}^n\). Here, \(n\) is the number of actions, which means that this is only well-defined for discrete action spaces.

In keras-gym we define a type-I Q-function as follows:

q = km.QTypeI(function_approximator, update_strategy='sarsa')

and similarly for type-II:

q = km.QTypeII(function_approximator, update_strategy='sarsa')

The update_strategy argument specifies our bootstrapping target. Available choices are 'sarsa', 'expected_sarsa', 'q_learning' and 'double_q_learning'.

The main reason for using a Q-function is for value-based control. In other words, we typically want to derive a policy from the Q-function. This is pretty straightforward too:

pi = km.EpsilonGreedy(q, epsilon=0.1)

# the epsilon parameter may be updated dynamically
pi.set_epsilon(0.25)
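
Putting these pieces together, a value-based control loop might look like the following minimal sketch (it assumes the env, q and pi defined above, and that the epsilon-greedy policy is callable as pi(s), like the other policies in this package):

for episode in range(200):
    s = env.reset()
    done = False

    while not done:
        a = pi(s)                        # epsilon-greedy action
        s_next, r, done, info = env.step(a)

        q.update(s, a, r, done)          # bootstrapped update, e.g. SARSA
        s = s_next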

Updateable Policies

Besides value-based control in which we derive a policy from a Q-function, we can also do policy-based control. In policy-based methods we learn a policy directly as a probability distribution over the space of actions \(\pi(a|s)\).

The updateable policies for discrete action spaces are known as softmax policies:

\[\pi(a|s)\ =\ \frac{\exp z(s,a)}{\sum_{a'}\exp z(s,a')}\]

where the logits are defined over the real line \(z(s,a)\in\mathbb{R}\).

In keras-gym we define a softmax policy as follows:

pi = km.SoftmaxPolicy(function_approximator, update_strategy='vanilla')

Similar to Q-functions, we can pick different update strategies. Available options for policies are 'vanilla', 'ppo' and 'cross_entropy'. These specify the objective function used in our policy updates.
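
As with value functions, the policy can be used and updated online. The minimal sketch below collects one episode and then performs REINFORCE-style updates, using the (undiscounted) episode return as a crude stand-in for the advantage; it assumes the env and the pi defined above:

s = env.reset()
done = False
transitions, G = [], 0.0

while not done:
    a = pi(s)                            # sample a ~ pi(a|s)
    s_next, r, done, info = env.step(a)
    transitions.append((s, a))
    G += r                               # undiscounted return, for simplicity
    s = s_next

for s, a in transitions:
    pi.update(s, a, advantage=G)         # one policy-gradient update per time step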

Actor-Critics

It’s often useful to combine a policy with a value function into what is called an actor-critic. The value function (critic) can be used to aid the update procedure for the policy (actor). The keras-gym package provides a simple way of constructing an actor-critic using the ActorCritic class:

# separate policy and value function
pi = km.SoftmaxPolicy(function_approximator, update_strategy='vanilla')
v = km.V(function_approximator, gamma=0.9, bootstrap_n=1)

# combine them into a single actor-critic
actor_critic = km.ActorCritic(pi, v)
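
The combined actor-critic exposes the same simple online interface, so a training loop needs only a single update call per time step. A minimal sketch (assuming the env and the pi and actor_critic defined above):

for episode in range(200):
    s = env.reset()
    done = False

    while not done:
        a = pi(s)                            # sample an action from the actor
        s_next, r, done, info = env.step(a)

        actor_critic.update(s, a, r, done)   # updates both actor and critic
        s = s_next

Alternatively, ActorCritic.from_func(function_approximator) constructs the policy and value function for you directly from a single function approximator; see the API reference below.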

Objects

FunctionApproximator class
keras_gym.FunctionApproximator A generic function approximator.
class keras_gym.FunctionApproximator(env, optimizer=None, **optimizer_kwargs)[source]

A generic function approximator.

This is the central object that provides an interface between a gym-type environment and function approximators like value functions and updateable policies.

In order to create a valid function approximator, you need to implement the body method. For example, to implement a simple multi-layer perceptron function approximator you would do something like:

import gym
import keras_gym as km
from tensorflow.keras.layers import Flatten, Dense

class MLP(km.FunctionApproximator):
    """ multi-layer perceptron with one hidden layer """
    def body(self, S):
        X = Flatten()(S)
        X = Dense(units=4)(X)
        return X

# environment
env = gym.make(...)

# generic function approximator
mlp = MLP(env, lr=0.001)

# policy and value function
pi, v = km.SoftmaxPolicy(mlp), km.V(mlp)

The default heads are simple (multi) linear regression layers, which can be overridden by your own implementation.

Parameters:
env : environment

A gym-style environment.

optimizer : keras.optimizers.Optimizer, optional

If left unspecified (optimizer=None), the function approximator’s DEFAULT_OPTIMIZER is used. See keras documentation for more details.

**optimizer_kwargs : keyword arguments

Keyword arguments for the optimizer. This is useful when you want to use the default optimizer with a different setting, e.g. changing the learning rate.

DEFAULT_OPTIMIZER

alias of tensorflow.python.keras.optimizer_v2.adam.Adam
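
For example, you could keep the default Adam optimizer but change its learning rate, or pass a fully configured keras optimizer instance instead. A minimal sketch (assuming the MLP subclass and env from the example above; the particular optimizer and learning rates are illustrative):

from tensorflow import keras

# change only the learning rate of the default optimizer
mlp = MLP(env, lr=0.001)

# or pass a pre-configured optimizer instance
mlp = MLP(env, optimizer=keras.optimizers.RMSprop(learning_rate=0.001))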

body(S)[source]

This is the part of the computation graph that may be shared between e.g. policy (actor) and value function (critic). It is typically the part of a neural net that does most of the heavy lifting. One may think of the body() as an elaborate automatic feature extractor.

Parameters:
S : nd Tensor: shape: [batch_size, …]

The input state observation.

Returns:
X : nd Tensor, shape: [batch_size, …]

The intermediate keras tensor.

body_q1(S, A)[source]

This is similar to body(), except that it takes a state-action pair as input instead of only state observations.

Parameters:
S : nd Tensor: shape: [batch_size, …]

The input state observation.

A : nd Tensor: shape: [batch_size, …]

The input actions.

Returns:
X : nd Tensor, shape: [batch_size, …]

The intermediate keras tensor.

head_pi(X)[source]

This is the policy head. It returns logits, i.e. not probabilities. Use a softmax to turn the output into probabilities.

Parameters:
X : nd Tensor, shape: [batch_size, …]

X is an intermediate tensor in the full forward-pass of the computation graph; it’s the output of the last layer of the body() method.

Returns:
*params : Tensor or tuple of Tensors, shape: [batch_size, …]

These constitute the raw policy distribution parameters.

head_q1(X)[source]

This is the type-I Q-value head. It returns a scalar Q-value \(q(s,a)\in\mathbb{R}\).

Parameters:
X : nd Tensor, shape: [batch_size, …]

X is an intermediate tensor in the full forward-pass of the computation graph; it’s the output of the last layer of the body() method.

Returns:
Q_sa : 2d Tensor, shape: [batch_size, 1]

The output type-I Q-values \(q(s,a)\in\mathbb{R}\).

head_q2(X)[source]

This is the type-II Q-value head. It returns a vector of Q-values \(q(s,.)\in\mathbb{R}^n\).

Parameters:
X : nd Tensor, shape: [batch_size, …]

X is an intermediate tensor in the full forward-pass of the computation graph; it’s the output of the last layer of the body() method.

Returns:
Q_s : 2d Tensor, shape: [batch_size, num_actions]

The output type-II Q-values \(q(s,.)\in\mathbb{R}^n\).

head_v(X)[source]

This is the state value head. It returns a scalar V-value \(v(s)\in\mathbb{R}\).

Parameters:
X : nd Tensor, shape: [batch_size, …]

X is an intermediate tensor in the full forward-pass of the computation graph; it’s the output of the last layer of the body() method.

Returns:
V : 2d Tensor, shape: [batch_size, 1]

The output state values \(v(s)\in\mathbb{R}\).
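
As mentioned above, the default heads can be overridden. A minimal sketch of a FunctionApproximator with a custom body and a custom state value head (the layer sizes and activations are illustrative choices, not keras-gym defaults):

from tensorflow import keras
import keras_gym as km


class CustomMLP(km.FunctionApproximator):
    """ MLP with a customized value head """
    def body(self, S):
        X = keras.layers.Flatten()(S)
        X = keras.layers.Dense(units=64, activation='relu')(X)
        return X

    def head_v(self, X):
        # replace the default linear value head by a slightly deeper one
        X = keras.layers.Dense(units=16, activation='relu')(X)
        return keras.layers.Dense(units=1)(X)   # shape: [batch_size, 1]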

Predefined Function Approximators
keras_gym.predefined.LinearFunctionApproximator A linear function approximator.
keras_gym.predefined.AtariFunctionApproximator A function approximator specifically designed for Atari 2600 environments.
keras_gym.predefined.ConnectFourFunctionApproximator A function approximator specifically designed for the ConnectFour environment.
class keras_gym.predefined.LinearFunctionApproximator(env, interaction=None, optimizer=None, **optimizer_kwargs)

A linear function approximator.

Parameters:
env : environment

A gym-style environment.

interaction : str or keras.layers.Layer, optional

The desired feature interactions that are fed to the linear regression model. Available predefined preprocessors can be chosen by passing a string, one of the following:

‘full_quadratic’

This option generates full-quadratic interactions, which include all linear, bilinear and quadratic terms. It does not include an intercept. Let \(b\) and \(n\) be the batch size and number of features. Then, the input shape is \((b, n)\) and the output shape is \((b, (n + 1)(n + 2)/2 - 1)\).

Note: This option requires the tensorflow backend.

‘elementwise_quadratic’

This option generates element-wise quadratic interactions, which only include linear and quadratic terms. It does not include bilinear terms or an intercept. Let \(b\) and \(n\) be the batch size and number of features. Then, the input shape is \((b, n)\) and the output shape is \((b, 2n)\).

Otherwise, a custom interaction layer can be passed as well. If left unspecified (interaction=None), the interaction layer is omitted altogether.

optimizer : keras.optimizers.Optimizer, optional

If left unspecified (optimizer=None), the function approximator’s DEFAULT_OPTIMIZER is used. See keras documentation for more details.

**optimizer_kwargs : keyword arguments

Keyword arguments for the optimizer. This is useful when you want to use the default optimizer with a different setting, e.g. changing the learning rate.

DEFAULT_OPTIMIZER

alias of tensorflow.python.keras.optimizer_v2.gradient_descent.SGD

body(S)

This is the part of the computation graph that may be shared between e.g. policy (actor) and value function (critic). It is typically the part of a neural net that does most of the heavy lifting. One may think of the body() as an elaborate automatic feature extractor.

Parameters:
S : nd Tensor: shape: [batch_size, …]

The input state observation.

Returns:
X : nd Tensor, shape: [batch_size, …]

The intermediate keras tensor.

body_q1(S, A)

This is similar to body(), except that it takes a state-action pair as input instead of only state observations.

Parameters:
S : nd Tensor: shape: [batch_size, …]

The input state observation.

A : nd Tensor: shape: [batch_size, …]

The input actions.

Returns:
X : nd Tensor, shape: [batch_size, …]

The intermediate keras tensor.

head_pi(X)

This is the policy head. It returns logits, i.e. not probabilities. Use a softmax to turn the output into probabilities.

Parameters:
X : nd Tensor, shape: [batch_size, …]

X is an intermediate tensor in the full forward-pass of the computation graph; it’s the output of the last layer of the body() method.

Returns:
*params : Tensor or tuple of Tensors, shape: [batch_size, …]

These constitute the raw policy distribution parameters.

head_q1(X)

This is the type-I Q-value head. It returns a scalar Q-value \(q(s,a)\in\mathbb{R}\).

Parameters:
X : nd Tensor, shape: [batch_size, …]

X is an intermediate tensor in the full forward-pass of the computation graph; it’s the output of the last layer of the body() method.

Returns:
Q_sa : 2d Tensor, shape: [batch_size, 1]

The output type-I Q-values \(q(s,a)\in\mathbb{R}\).

head_q2(X)

This is the type-II Q-value head. It returns a vector of Q-values \(q(s,.)\in\mathbb{R}^n\).

Parameters:
X : nd Tensor, shape: [batch_size, …]

X is an intermediate tensor in the full forward-pass of the computation graph; it’s the output of the last layer of the body() method.

Returns:
Q_s : 2d Tensor, shape: [batch_size, num_actions]

The output type-II Q-values \(q(s,.)\in\mathbb{R}^n\).

head_v(X)

This is the state value head. It returns a scalar V-value \(v(s)\in\mathbb{R}\).

Parameters:
X : nd Tensor, shape: [batch_size, …]

X is an intermediate tensor in the full forward-pass of the computation graph; it’s the output of the last layer of the body() method.

Returns:
V : 2d Tensor, shape: [batch_size, 1]

The output state values \(v(s)\in\mathbb{R}\).

class keras_gym.predefined.AtariFunctionApproximator(env, optimizer=None, **optimizer_kwargs)

A function approximator specifically designed for Atari 2600 environments.

Parameters:
env : environment

An Atari 2600 gym environment.

optimizer : keras.optimizers.Optimizer, optional

If left unspecified (optimizer=None), the function approximator’s DEFAULT_OPTIMIZER is used. See keras documentation for more details.

**optimizer_kwargs : keyword arguments

Keyword arguments for the optimizer. This is useful when you want to use the default optimizer with a different setting, e.g. changing the learning rate.

DEFAULT_OPTIMIZER

alias of tensorflow.python.keras.optimizer_v2.adam.Adam

body(S)

This is the part of the computation graph that may be shared between e.g. policy (actor) and value function (critic). It is typically the part of a neural net that does most of the heavy lifting. One may think of the body() as an elaborate automatic feature extractor.

Parameters:
S : nd Tensor: shape: [batch_size, …]

The input state observation.

Returns:
X : nd Tensor, shape: [batch_size, …]

The intermediate keras tensor.

body_q1(S, A)

This is similar to body(), except that it takes a state-action pair as input instead of only state observations.

Parameters:
S : nd Tensor: shape: [batch_size, …]

The input state observation.

A : nd Tensor: shape: [batch_size, …]

The input actions.

Returns:
X : nd Tensor, shape: [batch_size, …]

The intermediate keras tensor.

head_pi(X)

This is the policy head. It returns logits, i.e. not probabilities. Use a softmax to turn the output into probabilities.

Parameters:
X : nd Tensor, shape: [batch_size, …]

X is an intermediate tensor in the full forward-pass of the computation graph; it’s the output of the last layer of the body() method.

Returns:
*params : Tensor or tuple of Tensors, shape: [batch_size, …]

These constitute the raw policy distribution parameters.

head_q1(X)

This is the type-I Q-value head. It returns a scalar Q-value \(q(s,a)\in\mathbb{R}\).

Parameters:
X : nd Tensor, shape: [batch_size, …]

X is an intermediate tensor in the full forward-pass of the computation graph; it’s the output of the last layer of the body() method.

Returns:
Q_sa : 2d Tensor, shape: [batch_size, 1]

The output type-I Q-values \(q(s,a)\in\mathbb{R}\).

head_q2(X)

This is the type-II Q-value head. It returns a vector of Q-values \(q(s,.)\in\mathbb{R}^n\).

Parameters:
X : nd Tensor, shape: [batch_size, …]

X is an intermediate tensor in the full forward-pass of the computation graph; it’s the output of the last layer of the body() method.

Returns:
Q_s : 2d Tensor, shape: [batch_size, num_actions]

The output type-II Q-values \(q(s,.)\in\mathbb{R}^n\).

head_v(X)

This is the state value head. It returns a scalar V-value \(v(s)\in\mathbb{R}\).

Parameters:
X : nd Tensor, shape: [batch_size, …]

X is an intermediate tensor in the full forward-pass of the computation graph; it’s the output of the last layer of the body() method.

Returns:
V : 2d Tensor, shape: [batch_size, 1]

The output state values \(v(s)\in\mathbb{R}\).

class keras_gym.predefined.ConnectFourFunctionApproximator(env, optimizer=None, **optimizer_kwargs)

A function approximator specifically designed for the ConnectFour environment.

Parameters:
env : environment

A ConnectFour gym environment.

optimizer : keras.optimizers.Optimizer, optional

If left unspecified (optimizer=None), the function approximator’s DEFAULT_OPTIMIZER is used. See keras documentation for more details.

**optimizer_kwargs : keyword arguments

Keyword arguments for the optimizer. This is useful when you want to use the default optimizer with a different setting, e.g. changing the learning rate.

DEFAULT_OPTIMIZER

alias of tensorflow.python.keras.optimizer_v2.adam.Adam

body(S)

This is the part of the computation graph that may be shared between e.g. policy (actor) and value function (critic). It is typically the part of a neural net that does most of the heavy lifting. One may think of the body() as an elaborate automatic feature extractor.

Parameters:
S : nd Tensor: shape: [batch_size, …]

The input state observation.

Returns:
X : nd Tensor, shape: [batch_size, …]

The intermediate keras tensor.

body_q1(S, A)

This is similar to body(), except that it takes a state-action pair as input instead of only state observations.

Parameters:
S : nd Tensor: shape: [batch_size, …]

The input state observation.

A : nd Tensor: shape: [batch_size, …]

The input actions.

Returns:
X : nd Tensor, shape: [batch_size, …]

The intermediate keras tensor.

head_pi(X)

This is the policy head. It returns logits, i.e. not probabilities. Use a softmax to turn the output into probabilities.

Parameters:
X : nd Tensor, shape: [batch_size, …]

X is an intermediate tensor in the full forward-pass of the computation graph; it’s the output of the last layer of the body() method.

Returns:
*params : Tensor or tuple of Tensors, shape: [batch_size, …]

These constitute the raw policy distribution parameters.

head_q1(X)

This is the type-I Q-value head. It returns a scalar Q-value \(q(s,a)\in\mathbb{R}\).

Parameters:
X : nd Tensor, shape: [batch_size, …]

X is an intermediate tensor in the full forward-pass of the computation graph; it’s the output of the last layer of the body() method.

Returns:
Q_sa : 2d Tensor, shape: [batch_size, 1]

The output type-I Q-values \(q(s,a)\in\mathbb{R}\).

head_q2(X)

This is the type-II Q-value head. It returns a vector of Q-values \(q(s,.)\in\mathbb{R}^n\).

Parameters:
X : nd Tensor, shape: [batch_size, …]

X is an intermediate tensor in the full forward-pass of the computation graph; it’s the output of the last layer of the body() method.

Returns:
Q_s : 2d Tensor, shape: [batch_size, num_actions]

The output type-II Q-values \(q(s,.)\in\mathbb{R}^n\).

head_v(X)

This is the state value head. It returns a scalar V-value \(v(s)\in\mathbb{R}\).

Parameters:
X : nd Tensor, shape: [batch_size, …]

X is an intermediate tensor in the full forward-pass of the computation graph; it’s the output of the last layer of the body() method.

Returns:
V : 2d Tensor, shape: [batch_size, 1]

The output state values \(v(s)\in\mathbb{R}\).

Value Functions
keras_gym.V A state value function \(s\mapsto v(s)\).
keras_gym.QTypeI A type-I state-action value function \((s,a)\mapsto q(s,a)\).
keras_gym.QTypeII A type-II state-action value function \(s\mapsto q(s,.)\).
class keras_gym.V(function_approximator, gamma=0.9, bootstrap_n=1, bootstrap_with_target_model=False)[source]

A state value function \(s\mapsto v(s)\).

Parameters:
function_approximator : FunctionApproximator object

The main function approximator.

gamma : float, optional

The discount factor for discounting future rewards.

bootstrap_n : positive int, optional

The number of steps in n-step bootstrapping. It specifies the number of steps over which we’re willing to delay bootstrapping. Large \(n\) corresponds to Monte Carlo updates and \(n=1\) corresponds to TD(0).

bootstrap_with_target_model : bool, optional

Whether to use the target_model when constructing a bootstrapped target. If False (default), the primary predict_model is used.

__call__(s, use_target_model=False)[source]

Evaluate the state value function.

Parameters:
s : state observation

A single state observation.

use_target_model : bool, optional

Whether to use the target_model internally. If False (default), the predict_model is used.

Returns:
V : float or array of floats

The estimated value of the state \(v(s)\).

batch_eval(S, use_target_model=False)[source]

Evaluate the state value function on a batch of state observations.

Parameters:
S : nd array, shape: [batch_size, …]

A batch of state observations.

use_target_model : bool, optional

Whether to use the target_model internally. If False (default), the predict_model is used.

Returns:
V : 1d array, dtype: float, shape: [batch_size]

The predicted state values.

batch_update(S, Rn, In, S_next)[source]

Update the value function on a batch of transitions.

Parameters:
S : nd array, shape: [batch_size, …]

A batch of state observations.

Rn : 1d array, dtype: float, shape: [batch_size]

A batch of partial returns. For example, in n-step bootstrapping this is given by:

\[R^{(n)}_t\ =\ R_t + \gamma\,R_{t+1} + \dots \gamma^{n-1}\,R_{t+n-1}\]

In other words, it’s the non-bootstrapped part of the n-step return.

In : 1d array, dtype: float, shape: [batch_size]

A batch bootstrapping factor. For instance, in n-step bootstrapping this is given by \(I^{(n)}_t=\gamma^n\) if the episode is ongoing and \(I^{(n)}_t=0\) otherwise. This allows us to write the bootstrapped target as:

\[G^{(n)}_t=R^{(n)}_t+I^{(n)}_tQ(S_{t+n}, A_{t+n})\]
S_next : nd array, shape: [batch_size, …]

A batch of next-state observations.

Returns:
losses : dict

A dict of losses/metrics, of type {name <str>: value <float>}.

sync_target_model(tau=1.0)

Synchronize the target model with the primary model.

Parameters:
tau : float between 0 and 1, optional

The amount of exponential smoothing to apply in the target update:

\[w_\text{target}\ \leftarrow\ (1 - \tau)\,w_\text{target} + \tau\,w_\text{primary}\]
update(s, r, done)[source]

Update the state value function.

Parameters:
s : state observation

A single state observation.

r : float

A single observed reward.

done : bool

Whether the episode has finished.

class keras_gym.QTypeI(function_approximator, gamma=0.9, bootstrap_n=1, bootstrap_with_target_model=False, update_strategy='sarsa')[source]

A type-I state-action value function \((s,a)\mapsto q(s,a)\).

Parameters:
function_approximator : FunctionApproximator object

The main function approximator.

gamma : float, optional

The discount factor for discounting future rewards.

bootstrap_n : positive int, optional

The number of steps in n-step bootstrapping. It specifies the number of steps over which we’re willing to delay bootstrapping. Large \(n\) corresponds to Monte Carlo updates and \(n=1\) corresponds to TD(0).

bootstrap_with_target_model : bool, optional

Whether to use the target_model when constructing a bootstrapped target. If False (default), the primary predict_model is used.

update_strategy : str, optional

The update strategy that we use to select the (would-be) next-action \(A_{t+n}\) in the bootstrapped target:

\[G^{(n)}_t\ =\ R^{(n)}_t + \gamma^n Q(S_{t+n}, A_{t+n})\]

Options are:

‘sarsa’

Sample the next action, i.e. use the action that was actually taken.

‘q_learning’

Take the action with highest Q-value under the current estimate, i.e. \(A_{t+n} = \arg\max_aQ(S_{t+n}, a)\). This is an off-policy method.

‘double_q_learning’

Same as ‘q_learning’, \(A_{t+n} = \arg\max_aQ(S_{t+n}, a)\), except that the value itself is computed using the target_model rather than the primary model, i.e.

\[\begin{split}A_{t+n}\ &=\ \arg\max_aQ_\text{primary}(S_{t+n}, a)\\ G^{(n)}_t\ &=\ R^{(n)}_t + \gamma^n Q_\text{target}(S_{t+n}, A_{t+n})\end{split}\]
‘expected_sarsa’

Similar to SARSA in that it’s on-policy, except that we take the expected Q-value rather than a sample of it, i.e.

\[G^{(n)}_t\ =\ R^{(n)}_t + \gamma^n\sum_a\pi(a|s)\,Q(S_{t+n}, a)\]
__call__(s, a=None, use_target_model=False)

Evaluate the Q-function.

Parameters:
s : state observation

A single state observation.

a : action, optional

A single action.

use_target_model : bool, optional

Whether to use the target_model internally. If False (default), the predict_model is used.

Returns:
Q : float or array of floats

If action a is provided, a single float representing \(q(s,a)\) is returned. If, on the other hand, a is left unspecified, a vector representing \(q(s,.)\) is returned instead. The shape of the latter return value is [num_actions], which is only well-defined for discrete action spaces.

batch_eval(S, A=None, use_target_model=False)[source]

Evaluate the Q-function on a batch of state (or state-action) observations.

Parameters:
S : nd array, shape: [batch_size, …]

A batch of state observations.

A : 1d array, dtype: int, shape: [batch_size], optional

A batch of actions that were taken.

use_target_model : bool, optional

Whether to use the target_model internally. If False (default), the predict_model is used.

Returns:
Q : 1d or 2d array of floats

If action A is provided, a 1d array representing a batch of \(q(s,a)\) is returned. If, on the other hand, A is left unspecified, a vector representing a batch of \(q(s,.)\) is returned instead. The shape of the latter return value is [batch_size, num_actions], which is only well-defined for discrete action spaces.

batch_update(S, A, Rn, In, S_next, A_next=None)

Update the value function on a batch of transitions.

Parameters:
S : nd array, shape: [batch_size, …]

A batch of state observations.

A : nd Tensor, shape: [batch_size, …]

A batch of actions taken.

Rn : 1d array, dtype: float, shape: [batch_size]

A batch of partial returns. For example, in n-step bootstrapping this is given by:

\[R^{(n)}_t\ =\ R_t + \gamma\,R_{t+1} + \dots \gamma^{n-1}\,R_{t+n-1}\]

In other words, it’s the non-bootstrapped part of the n-step return.

In : 1d array, dtype: float, shape: [batch_size]

A batch bootstrapping factor. For instance, in n-step bootstrapping this is given by \(I^{(n)}_t=\gamma^n\) if the episode is ongoing and \(I^{(n)}_t=0\) otherwise. This allows us to write the bootstrapped target as:

\[G^{(n)}_t=R^{(n)}_t+I^{(n)}_tQ(S_{t+n}, A_{t+n})\]
S_next : nd array, shape: [batch_size, …]

A batch of next-state observations.

A_next : 2d Tensor, shape: [batch_size, …]

A batch of (potential) next actions A_next. This argument is only used if update_strategy='sarsa'.

Returns:
losses : dict

A dict of losses/metrics, of type {name <str>: value <float>}.

bootstrap_target(Rn, In, S_next, A_next=None)

Get the bootstrapped target \(G^{(n)}_t=R^{(n)}_t+\gamma^nQ(S_{t+n}, A_{t+n})\).

Parameters:
Rn : 1d array, dtype: float, shape: [batch_size]

A batch of partial returns. For example, in n-step bootstrapping this is given by:

\[R^{(n)}_t\ =\ R_t + \gamma\,R_{t+1} + \dots \gamma^{n-1}\,R_{t+n-1}\]

In other words, it’s the non-bootstrapped part of the n-step return.

In : 1d array, dtype: float, shape: [batch_size]

A batch bootstrapping factor. For instance, in n-step bootstrapping this is given by \(I^{(n)}_t=\gamma^n\) if the episode is ongoing and \(I^{(n)}_t=0\) otherwise. This allows us to write the bootstrapped target as:

\[G^{(n)}_t=R^{(n)}_t+I^{(n)}_tQ(S_{t+n},A_{t+n})\]
S_next : nd array, shape: [batch_size, …]

A batch of next-state observations.

A_next : 2d Tensor, dtype: int, shape: [batch_size, num_actions]

A batch of (potential) next actions A_next. This argument is only used if update_strategy='sarsa'.

Returns:
Gn : 1d array, dtype: float, shape: [batch_size]

A batch of bootstrap-estimated returns \(G^{(n)}_t=R^{(n)}_t+I^{(n)}_tQ(S_{t+n},A_{t+n})\), computed according to the given update_strategy.

sync_target_model(tau=1.0)

Synchronize the target model with the primary model.

Parameters:
tau : float between 0 and 1, optional

The amount of exponential smoothing to apply in the target update:

\[w_\text{target}\ \leftarrow\ (1 - \tau)\,w_\text{target} + \tau\,w_\text{primary}\]
update(s, a, r, done)

Update the Q-function.

Parameters:
s : state observation

A single state observation.

a : action

A single action.

r : float

A single observed reward.

done : bool

Whether the episode has finished.

class keras_gym.QTypeII(function_approximator, gamma=0.9, bootstrap_n=1, bootstrap_with_target_model=False, update_strategy='sarsa')[source]

A type-II state-action value function \(s\mapsto q(s,.)\).

Parameters:
function_approximator : FunctionApproximator object

The main function approximator.

gamma : float, optional

The discount factor for discounting future rewards.

bootstrap_n : positive int, optional

The number of steps in n-step bootstrapping. It specifies the number of steps over which we’re willing to delay bootstrapping. Large \(n\) corresponds to Monte Carlo updates and \(n=1\) corresponds to TD(0).

bootstrap_with_target_model : bool, optional

Whether to use the target_model when constructing a bootstrapped target. If False (default), the primary predict_model is used.

update_strategy : str, optional

The update strategy that we use to select the (would-be) next-action \(A_{t+n}\) in the bootstrapped target:

\[G^{(n)}_t\ =\ R^{(n)}_t + \gamma^n Q(S_{t+n}, A_{t+n})\]

Options are:

‘sarsa’

Sample the next action, i.e. use the action that was actually taken.

‘q_learning’

Take the action with highest Q-value under the current estimate, i.e. \(A_{t+n} = \arg\max_aQ(S_{t+n}, a)\). This is an off-policy method.

‘double_q_learning’

Same as ‘q_learning’, \(A_{t+n} = \arg\max_aQ(S_{t+n}, a)\), except that the value itself is computed using the target_model rather than the primary model, i.e.

\[\begin{split}A_{t+n}\ &=\ \arg\max_aQ_\text{primary}(S_{t+n}, a)\\ G^{(n)}_t\ &=\ R^{(n)}_t + \gamma^n Q_\text{target}(S_{t+n}, A_{t+n})\end{split}\]
‘expected_sarsa’

Similar to SARSA in that it’s on-policy, except that we take the expected Q-value rather than a sample of it, i.e.

\[G^{(n)}_t\ =\ R^{(n)}_t + \gamma^n\sum_a\pi(a|s)\,Q(S_{t+n}, a)\]
__call__(s, a=None, use_target_model=False)

Evaluate the Q-function.

Parameters:
s : state observation

A single state observation.

a : action, optional

A single action.

use_target_model : bool, optional

Whether to use the target_model internally. If False (default), the predict_model is used.

Returns:
Q : float or array of floats

If action a is provided, a single float representing \(q(s,a)\) is returned. If, on the other hand, a is left unspecified, a vector representing \(q(s,.)\) is returned instead. The shape of the latter return value is [num_actions], which is only well-defined for discrete action spaces.

batch_eval(S, A=None, use_target_model=False)[source]

Evaluate the Q-function on a batch of state (or state-action) observations.

Parameters:
S : nd array, shape: [batch_size, …]

A batch of state observations.

A : 1d array, dtype: int, shape: [batch_size], optional

A batch of actions that were taken.

use_target_model : bool, optional

Whether to use the target_model internally. If False (default), the predict_model is used.

Returns:
Q : 1d or 2d array of floats

If action A is provided, a 1d array representing a batch of \(q(s,a)\) is returned. If, on the other hand, A is left unspecified, a vector representing a batch of \(q(s,.)\) is returned instead. The shape of the latter return value is [batch_size, num_actions], which is only well-defined for discrete action spaces.

batch_update(S, A, Rn, In, S_next, A_next=None)

Update the value function on a batch of transitions.

Parameters:
S : nd array, shape: [batch_size, …]

A batch of state observations.

A : nd Tensor, shape: [batch_size, …]

A batch of actions taken.

Rn : 1d array, dtype: float, shape: [batch_size]

A batch of partial returns. For example, in n-step bootstrapping this is given by:

\[R^{(n)}_t\ =\ R_t + \gamma\,R_{t+1} + \dots \gamma^{n-1}\,R_{t+n-1}\]

In other words, it’s the non-bootstrapped part of the n-step return.

In : 1d array, dtype: float, shape: [batch_size]

A batch bootstrapping factor. For instance, in n-step bootstrapping this is given by \(I^{(n)}_t=\gamma^n\) if the episode is ongoing and \(I^{(n)}_t=0\) otherwise. This allows us to write the bootstrapped target as:

\[G^{(n)}_t=R^{(n)}_t+I^{(n)}_tQ(S_{t+n}, A_{t+n})\]
S_next : nd array, shape: [batch_size, …]

A batch of next-state observations.

A_next : 2d Tensor, shape: [batch_size, …]

A batch of (potential) next actions A_next. This argument is only used if update_strategy='sarsa'.

Returns:
losses : dict

A dict of losses/metrics, of type {name <str>: value <float>}.

bootstrap_target(Rn, In, S_next, A_next=None)

Get the bootstrapped target \(G^{(n)}_t=R^{(n)}_t+\gamma^nQ(S_{t+n}, A_{t+n})\).

Parameters:
Rn : 1d array, dtype: float, shape: [batch_size]

A batch of partial returns. For example, in n-step bootstrapping this is given by:

\[R^{(n)}_t\ =\ R_t + \gamma\,R_{t+1} + \dots \gamma^{n-1}\,R_{t+n-1}\]

In other words, it’s the non-bootstrapped part of the n-step return.

In : 1d array, dtype: float, shape: [batch_size]

A batch bootstrapping factor. For instance, in n-step bootstrapping this is given by \(I^{(n)}_t=\gamma^n\) if the episode is ongoing and \(I^{(n)}_t=0\) otherwise. This allows us to write the bootstrapped target as:

\[G^{(n)}_t=R^{(n)}_t+I^{(n)}_tQ(S_{t+n},A_{t+n})\]
S_next : nd array, shape: [batch_size, …]

A batch of next-state observations.

A_next : 2d Tensor, dtype: int, shape: [batch_size, num_actions]

A batch of (potential) next actions A_next. This argument is only used if update_strategy='sarsa'.

Returns:
Gn : 1d array, dtype: float, shape: [batch_size]

A batch of bootstrap-estimated returns \(G^{(n)}_t=R^{(n)}_t+I^{(n)}_tQ(S_{t+n},A_{t+n})\), computed according to the given update_strategy.

sync_target_model(tau=1.0)

Synchronize the target model with the primary model.

Parameters:
tau : float between 0 and 1, optional

The amount of exponential smoothing to apply in the target update:

\[w_\text{target}\ \leftarrow\ (1 - \tau)\,w_\text{target} + \tau\,w_\text{primary}\]
update(s, a, r, done)

Update the Q-function.

Parameters:
s : state observation

A single state observation.

a : action

A single action.

r : float

A single observed reward.

done : bool

Whether the episode has finished.

Updateable Policies
keras_gym.GaussianPolicy An updateable policy for environments with a continuous action space, i.e.
keras_gym.SoftmaxPolicy Updateable policy for discrete action spaces.
class keras_gym.GaussianPolicy(function_approximator, update_strategy='vanilla', ppo_clip_eps=0.2, entropy_beta=0.01, random_seed=None)[source]

An updateable policy for environments with a continuous action space, i.e. a Box. It models the policy \(\pi_\theta(a|s)\) as a normal distribution with conditional parameters \((\mu_\theta(s), \sigma_\theta(s))\).

Important

This policy requires that the env is wrapped with:

env = km.wrappers.BoxToReals(env)

This wrapper decompactifies the Box action space.
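
For example, a minimal sketch of constructing a Gaussian policy for Pendulum-v0 (it assumes an MLP-style FunctionApproximator subclass like the one shown earlier in these docs):

import gym
import keras_gym as km

env = gym.make('Pendulum-v0')
env = km.wrappers.BoxToReals(env)     # decompactify the Box action space

mlp = MLP(env, lr=0.01)               # any FunctionApproximator subclass
pi = km.GaussianPolicy(mlp, update_strategy='ppo')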

Parameters:
function_approximator : FunctionApproximator object

The main function approximator.

update_strategy : str, optional

The strategy for updating our policy. This typically determines the loss function that we use for our policy function approximator.

Options are:

‘vanilla’

Plain vanilla policy gradient. The corresponding (surrogate) loss function that we use is:

\[J(\theta)\ =\ \hat{\mathbb{E}}_t \left\{ -\mathcal{A}_t\,\log\pi_\theta(A_t|S_t) \right\}\]

where \(\mathcal{A}_t=\mathcal{A}(S_t,A_t)\) is the advantage at time step \(t\).

‘ppo’

Proximal policy optimization uses a clipped proximal loss:

\[J(\theta)\ =\ \hat{\mathbb{E}}_t \left\{ \min\Big( \rho_t(\theta)\,\mathcal{A}_t\,,\ \tilde{\rho}_t(\theta)\,\mathcal{A}_t \Big) \right\}\]

where \(\rho_t(\theta)\) is the probability ratio:

\[\rho_t(\theta)\ =\ \frac {\pi_\theta(A_t|S_t)} {\pi_{\theta_\text{old}}(A_t|S_t)}\]

and \(\tilde{\rho}_t(\theta)\) is its clipped version:

\[\tilde{\rho}_t(\theta)\ =\ \text{clip}\big( \rho_t(\theta), 1-\epsilon, 1+\epsilon\big)\]
‘cross_entropy’

Straightforward categorical cross-entropy (from logits). This loss function does not make use of the advantages Adv. Instead, it minimizes the cross entropy between the behavior policy \(\pi_b(a|s)\) and the learned policy \(\pi_\theta(a|s)\):

\[J(\theta)\ =\ \hat{\mathbb{E}}_t\left\{ -\sum_a \pi_b(a|S_t)\, \log \pi_\theta(a|S_t) \right\}\]
ppo_clip_eps : float, optional

The clipping parameter \(\epsilon\) in the PPO clipped surrogate loss. This option is only applicable if update_strategy='ppo'.

entropy_beta : float, optional

The coefficient of the entropy bonus term in the policy objective.

__call__(s, use_target_model=False)

Draw an action from the current policy \(\pi(a|s)\).

Parameters:
s : state observation

A single state observation.

use_target_model : bool, optional

Whether to use the target_model internally. If False (default), the predict_model is used.

Returns:
a : action

A single action proposed under the current policy.

batch_eval(S, use_target_model=False)

Evaluate the policy on a batch of state observations.

Parameters:
S : nd array, shape: [batch_size, …]

A batch of state observations.

use_target_model : bool, optional

Whether to use the target_model internally. If False (default), the predict_model is used.

Returns:
A : nd array, shape: [batch_size, …]

A batch of sampled actions.

batch_update(S, A, Adv)

Update the policy on a batch of transitions.

Parameters:
S : nd array, shape: [batch_size, …]

A batch of state observations.

A : nd array, shape: [batch_size, …]

A batch of actions taken by the behavior policy.

Adv : 1d array, dtype: float, shape: [batch_size]

A value for the advantage \(\mathcal{A}(s,a) = q(s,a) - v(s)\). This might be a sampled and/or estimated version of the true advantage.

Returns:
losses : dict

A dict of losses/metrics, of type {name <str>: value <float>}.

dist_params(s, use_target_model=False)

Get the parameters of the (conditional) probability distribution \(\pi(a|s)\).

Parameters:
s : state observation

A single state observation.

use_target_model : bool, optional

Whether to use the target_model internally. If False (default), the predict_model is used.

Returns:
*params : tuple of arrays

The raw distribution parameters.

greedy(s, use_target_model=False)

Draw the greedy action, i.e. \(\arg\max_a\pi(a|s)\).

Parameters:
s : state observation

A single state observation.

use_target_model : bool, optional

Whether to use the target_model internally. If False (default), the predict_model is used.

Returns:
a : action

A single action proposed under the current policy.

policy_loss_with_metrics(Adv, A=None)

This method constructs the policy loss as a scalar-valued Tensor, together with a dictionary of metrics (also scalars).

This method may be overridden to construct a custom policy loss and/or to change the accompanying metrics.

Parameters:
Adv : 1d Tensor, shape: [batch_size]

A batch of advantages.

A : nd Tensor, shape: [batch_size, …]

A batch of actions taken under the behavior policy. For some choices of policy loss, e.g. update_strategy='sac' this input is ignored.

Returns:
loss, metrics : (Tensor, dict of Tensors)

The policy loss along with some metrics, which is a dict of type {name <str>: metric <Tensor>}. The loss and each of the metrics (dict values) are scalar Tensors, i.e. Tensors with ndim=0.

The loss is passed to a keras Model using train_model.add_loss(loss). Similarly, each metric in the metric dict is passed to the model using train_model.add_metric(metric, name=name, aggregation='mean').

sync_target_model(tau=1.0)

Synchronize the target model with the primary model.

Parameters:
tau : float between 0 and 1, optional

The amount of exponential smoothing to apply in the target update:

\[w_\text{target}\ \leftarrow\ (1 - \tau)\,w_\text{target} + \tau\,w_\text{primary}\]
update(s, a, advantage)

Update the policy.

Parameters:
s : state observation

A single state observation.

a : action

A single action.

advantage : float

A value for the advantage \(\mathcal{A}(s,a) = q(s,a) - v(s)\). This might be a sampled and/or estimated version of the true advantage.

class keras_gym.SoftmaxPolicy(function_approximator, update_strategy='vanilla', ppo_clip_eps=0.2, entropy_beta=0.01, random_seed=None)[source]

Updateable policy for discrete action spaces.

Parameters:
function_approximator : FunctionApproximator object

The main function approximator.

update_strategy : str, callable, optional

The strategy for updating our policy. This determines the loss function that we use for our policy function approximator. If you wish to use a custom policy loss, you can override the policy_loss_with_metrics() method.

Provided options are:

‘vanilla’

Plain vanilla policy gradient. The corresponding (surrogate) loss function that we use is:

\[J(\theta)\ =\ -\mathcal{A}(s,a)\,\ln\pi(a|s,\theta)\]
‘ppo’

Proximal policy optimization uses a clipped proximal loss:

\[J(\theta)\ =\ \min\Big( r(\theta)\,\mathcal{A}(s,a)\,,\ \text{clip}\big( r(\theta), 1-\epsilon, 1+\epsilon\big) \,\mathcal{A}(s,a)\Big)\]

where \(r(\theta)\) is the probability ratio:

\[r(\theta)\ =\ \frac {\pi(a|s,\theta)} {\pi(a|s,\theta_\text{old})}\]
‘cross_entropy’

Straightforward categorical cross-entropy (from logits). This loss function does not make use of the advantages Adv. Instead, it minimizes the cross entropy between the behavior policy \(\pi_b(a|s)\) and the learned policy \(\pi_\theta(a|s)\):

\[J(\theta)\ =\ \hat{\mathbb{E}}_t\left\{ -\sum_a \pi_b(a|S_t)\, \log \pi_\theta(a|S_t) \right\}\]
ppo_clip_eps : float, optional

The clipping parameter \(\epsilon\) in the PPO clipped surrogate loss. This option is only applicable if update_strategy='ppo'.

entropy_beta : float, optional

The coefficient of the entropy bonus term in the policy objective.

random_seed : int, optional

Sets the random state to get reproducible results.

__call__(s, use_target_model=False)[source]

Draw an action from the current policy \(\pi(a|s)\).

Parameters:
s : state observation

A single state observation.

use_target_model : bool, optional

Whether to use the target_model internally. If False (default), the predict_model is used.

Returns:
a : action

A single action proposed under the current policy.

batch_eval(S, use_target_model=False)

Evaluate the policy on a batch of state observations.

Parameters:
S : nd array, shape: [batch_size, …]

A batch of state observations.

use_target_model : bool, optional

Whether to use the target_model internally. If False (default), the predict_model is used.

Returns:
A : nd array, shape: [batch_size, …]

A batch of sampled actions.

batch_update(S, A, Adv)

Update the policy on a batch of transitions.

Parameters:
S : nd array, shape: [batch_size, …]

A batch of state observations.

A : nd array, shape: [batch_size, …]

A batch of actions taken by the behavior policy.

Adv : 1d array, dtype: float, shape: [batch_size]

A value for the advantage \(\mathcal{A}(s,a) = q(s,a) - v(s)\). This might be a sampled and/or estimated version of the true advantage.

Returns:
losses : dict

A dict of losses/metrics, of type {name <str>: value <float>}.

dist_params(s, use_target_model=False)

Get the parameters of the (conditional) probability distribution \(\pi(a|s)\).

Parameters:
s : state observation

A single state observation.

use_target_model : bool, optional

Whether to use the target_model internally. If False (default), the predict_model is used.

Returns:
*params : tuple of arrays

The raw distribution parameters.

greedy(s, use_target_model=False)

Draw the greedy action, i.e. \(\arg\max_a\pi(a|s)\).

Parameters:
s : state observation

A single state observation.

use_target_model : bool, optional

Whether to use the target_model internally. If False (default), the predict_model is used.

Returns:
a : action

A single action proposed under the current policy.

policy_loss_with_metrics(Adv, A=None)

This method constructs the policy loss as a scalar-valued Tensor, together with a dictionary of metrics (also scalars).

This method may be overridden to construct a custom policy loss and/or to change the accompanying metrics.

Parameters:
Adv : 1d Tensor, shape: [batch_size]

A batch of advantages.

A : nd Tensor, shape: [batch_size, …]

A batch of actions taken under the behavior policy. For some choices of policy loss, e.g. update_strategy='sac' this input is ignored.

Returns:
loss, metrics : (Tensor, dict of Tensors)

The policy loss along with some metrics, which is a dict of type {name <str>: metric <Tensor>}. The loss and each of the metrics (dict values) are scalar Tensors, i.e. Tensors with ndim=0.

The loss is passed to a keras Model using train_model.add_loss(loss). Similarly, each metric in the metric dict is passed to the model using train_model.add_metric(metric, name=name, aggregation='mean').

sync_target_model(tau=1.0)

Synchronize the target model with the primary model.

Parameters:
tau : float between 0 and 1, optional

The amount of exponential smoothing to apply in the target update:

\[w_\text{target}\ \leftarrow\ (1 - \tau)\,w_\text{target} + \tau\,w_\text{primary}\]
update(s, a, advantage)

Update the policy.

Parameters:
s : state observation

A single state observation.

a : action

A single action.

advantage : float

A value for the advantage \(\mathcal{A}(s,a) = q(s,a) - v(s)\). This might be a sampled and/or estimated version of the true advantage.

Actor-Critics
keras_gym.ActorCritic A generic actor-critic, combining an updateable policy with a value function.
keras_gym.SoftActorCritic Implementation of a soft actor-critic (SAC), which uses entropy regularization in the value function as well as in its policy updates.
class keras_gym.ActorCritic(policy, v_func, value_loss_weight=1.0)[source]

A generic actor-critic, combining an updateable policy with a value function.

The added value of using an ActorCritic to combine a policy with a value function is that it avoids having to feed in S (potentially very large) three times at training time. Instead, it only feeds it in once.

Parameters:
policy : Policy object

An updateable policy.

v_func : value-function object

A state value function \(v(s)\).

value_loss_weight : float, optional

Relative weight to give to the value-function loss:

loss = policy_loss + value_loss_weight * value_loss
__call__(s)

Draw an action from the current policy \(\pi(a|s)\) and get the expected value \(v(s)\).

Parameters:
s : state observation

A single state observation.

Returns:
a, v : tuple (1d array of floats, float)

Returns a pair representing \((a, v(s))\).

batch_eval(S, use_target_model=False)

Evaluate the actor-critic on a batch of state observations.

Parameters:
S : nd array, shape: [batch_size, …]

A batch of state observations.

use_target_model : bool, optional

Whether to use the target_model internally. If False (default), the predict_model is used.

Returns:
A, V : arrays, shapes: [batch_size, …] and [batch_size]

A batch of sampled actions A and state values V.

batch_update(S, A, Rn, In, S_next, A_next=None)

Update both actor and critic on a batch of transitions.

Parameters:
S : nd array, shape: [batch_size, …]

A batch of state observations.

A : nd Tensor, shape: [batch_size, …]

A batch of actions taken.

Rn : 1d array, dtype: float, shape: [batch_size]

A batch of partial returns. For example, in n-step bootstrapping this is given by:

\[R^{(n)}_t\ =\ R_t + \gamma\,R_{t+1} + \dots \gamma^{n-1}\,R_{t+n-1}\]

In other words, it’s the non-bootstrapped part of the n-step return.

In : 1d array, dtype: float, shape: [batch_size]

A batch bootstrapping factor. For instance, in n-step bootstrapping this is given by \(I^{(n)}_t=\gamma^n\) if the episode is ongoing and \(I^{(n)}_t=0\) otherwise. This allows us to write the bootstrapped target as \(G^{(n)}_t=R^{(n)}_t+I^{(n)}_tQ(S_{t+n}, A_{t+n})\).

S_next : nd array, shape: [batch_size, …]

A batch of next-state observations.

A_next : 2d Tensor, shape: [batch_size, …]

A batch of (potential) next actions A_next. This argument is only used if update_strategy='sarsa'.

Returns:
losses : dict

A dict of losses/metrics, of type {name <str>: value <float>}.

dist_params(s)

Get the distribution parameters under the current policy \(\pi(a|s)\) and get the expected value \(v(s)\).

Parameters:
s : state observation

A single state observation.

Returns:
dist_params, v : tuple (1d array of floats, float)

Returns a pair representing the distribution parameters of \(\pi(a|s)\) and the estimated state value \(v(s)\).

classmethod from_func(function_approximator, gamma=0.9, bootstrap_n=1, bootstrap_with_target_model=False, entropy_beta=0.01, update_strategy='vanilla', random_seed=None)[source]

Create instance directly from a FunctionApproximator object.

Parameters:
function_approximator : FunctionApproximator object

The main function approximator.

gamma : float, optional

The discount factor for discounting future rewards.

bootstrap_n : positive int, optional

The number of steps in n-step bootstrapping. It specifies the number of steps over which we’re willing to delay bootstrapping. Large \(n\) corresponds to Monte Carlo updates and \(n=1\) corresponds to TD(0).

bootstrap_with_target_model : bool, optional

Whether to use the target_model when constructing a bootstrapped target. If False (default), the primary predict_model is used.

entropy_beta : float, optional

The coefficient of the entropy bonus term in the policy objective.

update_strategy : str, callable, optional

The strategy for updating our policy. This determines the loss function that we use for our policy function approximator. If you wish to use a custom policy loss, you can override the policy_loss_with_metrics() method.

Provided options are:

‘vanilla’

Plain vanilla policy gradient. The corresponding (surrogate) loss function that we use is:

\[J(\theta)\ =\ -\mathcal{A}(s,a)\,\ln\pi(a|s,\theta)\]
‘ppo’

Proximal policy optimization uses a clipped proximal loss:

\[J(\theta)\ =\ \min\Big( r(\theta)\,\mathcal{A}(s,a)\,,\ \text{clip}\big( r(\theta), 1-\epsilon, 1+\epsilon\big) \,\mathcal{A}(s,a)\Big)\]

where \(r(\theta)\) is the probability ratio:

\[r(\theta)\ =\ \frac {\pi(a|s,\theta)} {\pi(a|s,\theta_\text{old})}\]
‘cross_entropy’

Straightforward categorical cross-entropy (from logits). This loss function does not make use of the advantages Adv. Instead, it minimizes the cross entropy between the behavior policy \(\pi_b(a|s)\) and the learned policy \(\pi_\theta(a|s)\):

\[J(\theta)\ =\ \hat{\mathbb{E}}_t\left\{ -\sum_a \pi_b(a|S_t)\, \log \pi_\theta(a|S_t) \right\}\]
random_seed : int, optional

Sets the random state to get reproducible results.

greedy(s)

Draw a greedy action \(a=\arg\max_{a'}\pi(a'|s)\) and get the expected value \(v(s)\).

Parameters:
s : state observation

A single state observation.

Returns:
a, v : tuple (1d array of floats, float)

Returns a pair representing \((a, v(s))\).

sync_target_model(tau=1.0)

Synchronize the target model with the primary model.

Parameters:
tau : float between 0 and 1, optional

The amount of exponential smoothing to apply in the target update:

\[w_\text{target}\ \leftarrow\ (1 - \tau)\,w_\text{target} + \tau\,w_\text{primary}\]
update(s, a, r, done)

Update both actor and critic.

Parameters:
s : state observation

A single state observation.

a : action

A single action.

r : float

A single observed reward.

done : bool

Whether the episode has finished.

class keras_gym.SoftActorCritic(policy, v_func, q_func1, q_func2, value_loss_weight=1.0)[source]

Implementation of a soft actor-critic (SAC), which uses entropy regularization in the value function as well as in its policy updates.

Parameters:
policy : a policy object

An updateable policy object \(\pi(a|s)\).

v_func : v-function object

A state value function \(v(s)\). This is used as the entropy-regularized value function (critic).

q_func1 : q-function object

A type-I state-action value function. This is used as the target for both the policy (actor) and the state value function (critic).

q_func2 : q-function object

Same as q_func1. SAC uses two q-functions to avoid overfitting due to overly optimistic value estimates.

value_loss_weight : float, optional

Relative weight to give to the value-function loss:

loss = policy_loss + value_loss_weight * value_loss
__call__(s)

Draw an action from the current policy \(\pi(a|s)\) and get the expected value \(v(s)\).

Parameters:
s : state observation

A single state observation.

Returns:
a, v : tuple (1d array of floats, float)

Returns a pair representing \((a, v(s))\).

batch_eval(S, use_target_model=False)

Evaluate the actor-critic on a batch of state observations.

Parameters:
S : nd array, shape: [batch_size, …]

A batch of state observations.

use_target_model : bool, optional

Whether to use the target_model internally. If False (default), the predict_model is used.

Returns:
A, V : arrays, shapes: [batch_size, …] and [batch_size]

A batch of sampled actions A and state values V.

batch_update(S, A, Rn, In, S_next, A_next=None)[source]

Update both actor and critic on a batch of transitions.

Parameters:
S : nd array, shape: [batch_size, …]

A batch of state observations.

A : nd Tensor, shape: [batch_size, …]

A batch of actions taken.

Rn : 1d array, dtype: float, shape: [batch_size]

A batch of partial returns. For example, in n-step bootstrapping this is given by:

\[R^{(n)}_t\ =\ R_t + \gamma\,R_{t+1} + \dots \gamma^{n-1}\,R_{t+n-1}\]

In other words, it’s the non-bootstrapped part of the n-step return.

In : 1d array, dtype: float, shape: [batch_size]

A batch of bootstrapping factors. For instance, in n-step bootstrapping this is given by \(I^{(n)}_t=\gamma^n\) if the episode is ongoing and \(I^{(n)}_t=0\) otherwise. This allows us to write the bootstrapped target as \(G^{(n)}_t=R^{(n)}_t+I^{(n)}_tQ(S_{t+n}, A_{t+n})\).

S_next : nd array, shape: [batch_size, …]

A batch of next-state observations.

A_next : 2d Tensor, shape: [batch_size, …]

A batch of (potential) next actions A_next. This argument is only used if update_strategy='sarsa'.

Returns:
losses : dict

A dict of losses/metrics, of type {name <str>: value <float>}.

dist_params(s)

Get the distribution parameters under the current policy \(\pi(a|s)\) and get the expected value \(v(s)\).

Parameters:
s : state observation

A single state observation.

Returns:
dist_params, v : tuple (1d array of floats, float)

Returns a pair representing the distribution parameters of \(\pi(a|s)\) and the estimated state value \(v(s)\).

classmethod from_func(function_approximator, gamma=0.9, bootstrap_n=1, q_type=None, entropy_beta=0.01, random_seed=None)[source]

Create instance directly from a FunctionApproximator object.

Parameters:
function_approximator : FunctionApproximator object

The main function approximator.

gamma : float, optional

The discount factor for discounting future rewards.

bootstrap_n : positive int, optional

The number of steps in n-step bootstrapping. It specifies the number of steps over which we’re willing to delay bootstrapping. Large \(n\) corresponds to Monte Carlo updates and \(n=1\) corresponds to TD(0).

q_type : 1 or 2, optional

Whether to model the q-function as type-I or type-II. This defaults to type-II for discrete action spaces and type-I otherwise.

entropy_beta : float, optional

The coefficient of the entropy bonus term in the policy objective.

random_seed : int, optional

Sets the random state to get reproducible results.

greedy(s)

Draw a greedy action \(a=\arg\max_{a'}\pi(a'|s)\) and get the expected value \(v(s)\).

Parameters:
s : state observation

A single state observation.

Returns:
a, v : tuple (1d array of floats, float)

Returns a pair representing \((a, v(s))\).

sync_target_model(tau=1.0)

Synchronize the target model with the primary model.

Parameters:
tau : float between 0 and 1, optional

The amount of exponential smoothing to apply in the target update:

\[w_\text{target}\ \leftarrow\ (1 - \tau)\,w_\text{target} + \tau\,w_\text{primary}\]
update(s, a, r, done)

Update both actor and critic.

Parameters:
s : state observation

A single state observation.

a : action

A single action.

r : float

A single observed reward.

done : bool

Whether the episode has finished.

Policies

In reinforcement learning (RL), a policy can either be derived from a state-action value function or it can be learned directly as an updateable policy. These two approaches are called value-based and policy-based RL, respectively. The way we update our policies differs quite a bit between the two approaches.

For value-based RL, we have algorithms like TD(0), Monte Carlo and everything in between. The optimization problem that we use to update our function approximator is typically ordinary least-squares regression (or Huber loss).

In policy-based RL, on the other hand, we update our function approximators using direct policy gradient techniques. This makes the optimization problem quite different from ordinary supervised learning.

Below we list all policy objects provided by keras-gym.

Updateable Policies and Actor-Critics

For updateable policies, have a look at the relevant function approximator section.

Value-Based Policies

These policies are derived from a Q-function object. See example below:

import gym
import keras_gym as km

# the cart-pole MDP
env = gym.make('CartPole-v0')

# use linear function approximator for q(s,a)
func = km.predefined.LinearFunctionApproximator(env, lr=0.01)
q = km.Q(func, update_strategy='q_learning')
pi = km.policies.EpsilonGreedy(q, epsilon=0.1)

# get some dummy state observation
s = env.reset()

# draw an action, given state s
a = pi(s)

Special Policies

We’ve also got some special policies, which are policies that don’t depend on any learned function approximator. The two main examples that are available right now are RandomPolicy and UserInputPolicy. The latter allows you to pick the actions yourself as the episode runs.
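
For example, a minimal sketch (the environment is an arbitrary choice):

import gym
import keras_gym as km

env = gym.make('FrozenLake-v0')

pi_random = km.policies.RandomPolicy(env)
pi_manual = km.policies.UserInputPolicy(env, render_before_prompt=True)

s = env.reset()
a = pi_random(s)   # an action sampled uniformly from the action space
a = pi_manual(s)   # prompts you to type in the action yourself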

Objects

Value-Based Policies
keras_gym.policies.EpsilonGreedy Value-based policy to select actions using epsilon-greedy strategy.
class keras_gym.policies.EpsilonGreedy(q_function, epsilon=0.1, random_seed=None)[source]

Value-based policy to select actions using epsilon-greedy strategy.

Parameters:
q_function : callable

A state-action value function object.

epsilon : float between 0 and 1

The probability of selecting an action uniformly at random.

random_seed : int, optional

Sets the random state to get reproducible results.

__call__(s)[source]

Draw an action from the current policy \(\pi(a|s)\).

Parameters:
s : state observation

A single state observation.

Returns:
a : action

A single action proposed under the current policy.

dist_params(s)[source]

Get the parameters of the (conditional) probability distribution \(\pi(a|s)\).

Parameters:
s : state observation

A single state observation.

Returns:
params : nd array

An array containing the distribution parameters.

greedy(s)[source]

Draw the greedy action, i.e. \(\arg\max_a\pi(a|s)\).

Parameters:
s : state observation

A single state observation.

Returns:
a : action

A single action proposed under the current policy.

set_epsilon(epsilon)[source]

Change the value for epsilon.

Parameters:
epsilon : float between 0 and 1

The probability of selecting an action uniformly at random.

Returns:
self

The updated instance.

Special Policies
keras_gym.policies.RandomPolicy Policy that selects actions uniformly at random.
keras_gym.policies.UserInputPolicy A policy that prompts the user to take an action.
class keras_gym.policies.RandomPolicy(env, random_seed=None)[source]

Policy that selects actions uniformly at random, by sampling from the environment’s action space.

Parameters:
env : gym environment

The gym environment is used to sample from the action space.

random_seed : int, optional

Sets the random state to get reproducible results.

__call__(s)[source]

Draw an action from the current policy \(\pi(a|s)\).

Parameters:
s : state observation

A single state observation.

Returns:
a : action

A single action proposed under the current policy.

dist_params(s)[source]

Get the parameters of the (conditional) probability distribution \(\pi(a|s)\).

Parameters:
s : state observation

A single state observation.

Returns:
params : nd array

An array containing the distribution parameters.

greedy(s)[source]

Draw the greedy action, i.e. \(\arg\max_a\pi(a|s)\).

Parameters:
s : state observation

A single state observation.

Returns:
a : action

A single action proposed under the current policy.

class keras_gym.policies.UserInputPolicy(env, render_before_prompt=False)[source]

A policy that prompts the user to take an action.

Parameters:
env : gym environment

The gym environment is used to sample from the action space.

render_before_prompt : bool, optional

Whether to render the env before prompting the user to pick an action.

__call__(s)[source]

Draw an action from the current policy \(\pi(a|s)\).

Parameters:
s : state observation

A single state observation.

Returns:
a : action

A single action proposed under the current policy.

greedy(s)[source]

Draw the greedy action, i.e. \(\arg\max_a\pi(a|s)\).

Parameters:
s : state observation

A single state observation.

Returns:
a : action

A single action proposed under the current policy.

Probability Distributions

This is a collection of probability distributions that can be used as part of a computation graph.

All methods are differentiable, including the sample() method via the reparametrization trick or variations thereof. This means that they may be used in constructing loss functions that require quantities like (cross)entropy or KL-divergence.
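
For instance, here is a minimal sketch of how a NormalDist can be used inside a computation graph. It assumes the Tensorflow backend; the shapes follow the parameter descriptions below.

import numpy as np
import tensorflow as tf
import keras_gym as km

# a batch of 4 diagonal Gaussians over a 2-dimensional action space
mu = tf.constant(np.zeros((4, 2)), dtype='float32')
logvar = tf.constant(np.zeros((4, 2)), dtype='float32')
dist = km.proba_dists.NormalDist(mu, logvar)

x = dist.sample()          # differentiable sample, shape: [4, 2]
logp = dist.log_proba(x)   # shape: [4]
h = dist.entropy()         # shape: [4]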

Objects

Differentiable Probability Distributions
keras_gym.proba_dists.CategoricalDist Differentiable implementation of a categorical distribution.
keras_gym.proba_dists.NormalDist Implementation of a normal distribution.
class keras_gym.proba_dists.CategoricalDist(logits, boltzmann_tau=0.2, name='categorical_dist', random_seed=None)[source]

Differentiable implementation of a categorical distribution.

Parameters:
logits : 2d Tensor, dtype: float, shape: [batch_size, num_categories]

A batch of logits \(z\in \mathbb{R}^n\) with \(n=\) num_categories.

boltzmann_tau : float, optional

The Boltzmann temperature that is used in generating near one-hot propensities in sample(). A smaller number means closer to deterministic, one-hot encoded samples. A larger number means better numerical stability. A good value for \(\tau\) is one that offers a good trade-off between these two desired properties.

name : str, optional

Name scope of the distribution.

random_seed : int, optional

To get reproducible results.

cross_entropy(other)[source]

Compute the cross-entropy of a probability distribution \(p_\text{other}\) relative to the current probability distribution \(p_\text{self}\), symbolically:

\[\text{CE}[p_\text{self}, p_\text{other}]\ =\ -\sum p_\text{self}\,\log p_\text{other}\]
Parameters:
other : probability dist

The other probability dist must be of the same type as self.

Returns:
cross_entropy : 1d Tensor, shape: [batch_size]

The cross-entropy.

entropy()[source]

Compute the entropy of the probability distribution.


Returns:
entropy : 1d Tensor, shape: [batch_size]

The entropy of the probability distribution.

kl_divergence(other)[source]

Compute the Kullback-Leibler divergence of a probability distribution \(p_\text{other}\) relative to the current probability distribution \(p_\text{self}\), symbolically:

\[\text{KL}[p_\text{self}, p_\text{other}]\ =\ -\sum p_\text{self}\, \log\frac{p_\text{other}}{p_\text{self}}\]
Parameters:
other : probability dist

The other probability dist must be of the same type as self.

Returns:
kl_divergence : 1d Tensor, shape: [batch_size]

The KL-divergence.

log_proba(x)[source]

Compute the log-probability associated with specific variates.

Parameters:
x : nd Tensor, shape: [batch_size, …]

A batch of specific variates.

Returns:
log_proba : 1d Tensor, shape: [batch_size]

The log-probabilities.

sample()[source]

Sample from the probability distribution. In order to return a differentiable sample, this method uses the approach outlined in [ArXiv:1611.01144].

Returns:
sample : 2d array, shape: [batch_size, num_categories]

The sampled variates. The returned arrays are near one-hot encoded versions of deterministic variates.

class keras_gym.proba_dists.NormalDist(mu, logvar, name='normal_dist', random_seed=None)[source]

Implementation of a normal distribution.

Parameters:
mu : 2d Tensor, dtype: float, shape: [batch_size, n]

A batch of vectors of means \(\mu\in\mathbb{R}^n\).

logvar : 2d Tensor, dtype: float, shape: [batch_size, n]

A batch of vectors of log-variances \(\log(\sigma^2)\in\mathbb{R}^n\).

name : str, optional

Name scope of the distribution.

random_seed : int, optional

To get reproducible results.

cross_entropy(other)[source]

Compute the cross-entropy of a probability distribution \(p_\text{other}\) relative to the current probability distribution \(p_\text{self}\), symbolically:

\[\text{CE}[p_\text{self}, p_\text{other}]\ =\ -\sum p_\text{self}\,\log p_\text{other}\]
Parameters:
other : probability dist

The other probability dist must be of the same type as self.

Returns:
cross_entropy : 1d Tensor, shape: [batch_size]

The cross-entropy.

entropy()[source]

Compute the entropy of the probability distribution.


Returns:
entropy : 1d Tensor, shape: [batch_size]

The entropy of the probability distribution.

kl_divergence(other)[source]

Compute the Kullback-Leibler divergence of a probability distribution \(p_\text{other}\) relative to the current probability distribution \(p_\text{self}\), symbolically:

\[\text{KL}[p_\text{self}, p_\text{other}]\ =\ -\sum p_\text{self}\, \log\frac{p_\text{other}}{p_\text{self}}\]
Parameters:
other : probability dist

The other probability dist must be of the same type as self.

Returns:
kl_divergence : 1d Tensor, shape: [batch_size]

The KL-divergence.

log_proba(x)[source]

Compute the log-probability associated with specific variates.

Parameters:
x : nd Tensor, shape: [batch_size, …]

A batch of specific variates.

Returns:
log_proba : 1d Tensor, shape: [batch_size]

The log-probabilities.

sample()[source]

Sample from the (multi) normal distribution.

Returns:
sample : 2d Tensor, shape: [batch_size, actions_ndim]

The sampled normally-distributed variates.

Caching

In RL we often make use of data caching. This might be short-term caching, over the course of an episode, or it might be long-term caching as is done in experience replay.

Short-term Caching

Our short-term caching objects allow us to cache experience within an episode. For instance MonteCarloCache caches all transitions collected over an entire episode and then gives us back the \(\gamma\)-discounted returns when the episode finishes.

Another short-term caching object is NStepCache, which keeps an \(n\)-sized sliding window of transitions that allows us to do \(n\)-step bootstrapping.
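
As a sketch of how this is typically used (the environment and the random behavior policy are placeholders):

import gym
import keras_gym as km

env = gym.make('CartPole-v0')
cache = km.caching.NStepCache(env, n=5, gamma=0.99)

s = env.reset()
done = False
while not done:
    a = env.action_space.sample()            # stand-in for a real policy
    s_next, r, done, info = env.step(a)
    cache.add(s, a, r, done)
    s = s_next

# when the episode is over, flush the cached transitions in one go
S, A, Rn, In, S_next, A_next = cache.flush()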

Experience Replay Buffer

At the moment, we only have one long-term caching object, which is the ExperienceReplayBuffer. This object can hold an arbitrary number of transitions; the only constraint is the amount of available memory on your machine.

The way we learn from the experience stored in the ExperienceReplayBuffer is by sampling from it and then feeding the resulting batch of transitions to our function approximator.
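
Here is a rough sketch of that pattern for an Atari-style setup. The function approximator and the Q-function’s batch_update() call are assumptions made for illustration; the buffer methods shown here follow the documentation below.

import gym
import keras_gym as km

# Atari-style preprocessing, as in the Wrappers section
env = gym.make('PongDeterministic-v4')
env = km.wrappers.ImagePreprocessor(env, height=105, width=80, grayscale=True)
env = km.wrappers.FrameStacker(env, num_frames=4)

# illustrative function approximator and Q-function
func = km.predefined.AtariFunctionApproximator(env, lr=0.00025)
q = km.Q(func, update_strategy='q_learning')
pi = km.policies.EpsilonGreedy(q, epsilon=0.1)

buffer = km.caching.ExperienceReplayBuffer.from_value_function(
    q, capacity=1000000, batch_size=32)

T = 0  # global step counter
for ep in range(1000):
    s = env.reset()
    done = False
    while not done:
        a = pi(s)
        s_next, r, done, info = env.step(a)
        buffer.add(s, a, r, done, episode_id=ep)
        T += 1
        if T >= 50000:                        # start learning once the buffer has filled up a bit
            q.batch_update(*buffer.sample())  # assumed Q-function method
        s = s_next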

Objects

Short-term Caching
keras_gym.caching.MonteCarloCache A helper class for caching an entire episode, used for Monte Carlo updates.
keras_gym.caching.NStepCache A convenient helper class for n-step bootstrapping.
class keras_gym.caching.MonteCarloCache(env, gamma)[source]
add(s, a, r, done)[source]

Add a transition to the experience cache.

Parameters:
s : state observation

A single state observation.

a : action

A single action.

r : float

A single observed reward.

done : bool

Whether the episode has finished.

flush()[source]

Flush all transitions from the cache.

Returns:
S, A, G : tuple of arrays

The returned tuple represents a batch of preprocessed transitions:

(S, A, G)

pop()[source]

Pop a single transition from the cache.

Returns:
S, A, G : tuple of arrays, batch_size=1

The returned tuple represents a batch of preprocessed transitions:

(S, A, G)

reset()[source]

Reset the cache to the initial state.

class keras_gym.caching.NStepCache(env, n, gamma)[source]

A convenient helper class for n-step bootstrapping.

Parameters:
env : gym environment

The main gym environment. This is needed to determine num_actions.

n : positive int

The number of steps over which to bootstrap.

gamma : float between 0 and 1

The amount by which to discount future rewards.

add(s, a, r, done)[source]

Add a transition to the experience cache.

Parameters:
s : state observation

A single state observation.

a : action

A single action.

r : float

A single observed reward.

done : bool

Whether the episode has finished.

flush()[source]

Flush all transitions from the cache.

Returns:
S, A, Rn, In, S_next, A_next : tuple of arrays

The returned tuple represents a batch of preprocessed transitions:

(S, A, Rn, In, S_next, A_next)

These are typically used for bootstrapped updates, e.g. minimizing the bootstrapped MSE:

\[\left( R^{(n)}_t + I^{(n)}_t\,Q(S_{t+n},A_{t+n}) - Q(S_t,A_t) \right)^2\]
pop()[source]

Pop a single transition from the cache.

Returns:
S, A, Rn, In, S_next, A_next : tuple of arrays, batch_size=1

The returned tuple represents a batch of preprocessed transitions:

(S, A, Rn, In, S_next, A_next)

These are typically used for bootstrapped updates, e.g. minimizing the bootstrapped MSE:

\[\left( R^{(n)}_t + I^{(n)}_t\,Q(S_{t+n},A_{t+n}) - Q(S_t,A_t) \right)^2\]
reset()[source]

Reset the cache to the initial state.

Experience Replay
keras_gym.caching.ExperienceReplayBuffer A simple numpy implementation of an experience replay buffer.
class keras_gym.caching.ExperienceReplayBuffer(env, capacity, batch_size=32, bootstrap_n=1, gamma=0.99, random_seed=None)[source]

A simple numpy implementation of an experience replay buffer. This is written primarily with computer game environments (Atari) in mind.

It implements a generic experience replay buffer for environments in which individual observations (frames) are stacked to represent the state.

Parameters:
env : gym environment

The main gym environment. This is needed to infer the number of stacked frames num_frames as well as the number of actions num_actions.

capacity : positive int

The capacity of the experience replay buffer. DQN typically uses capacity=1000000.

batch_size : positive int, optional

The desired batch size of the sample.

bootstrap_n : positive int

The number of steps over which to delay bootstrapping, i.e. n-step bootstrapping.

gamma : float between 0 and 1

Reward discount factor.

random_seed : int or None

To get reproducible results.

add(s, a, r, done, episode_id)[source]

Add a transition to the experience replay buffer.

Parameters:
s : state

A single state observation.

a : action

A single action.

r : float

The observed rewards associated with this transition.

done : bool

Whether the episode has finished.

episode_id : int

The episode in which the transition took place. This is needed for generating consistent samples.

clear()[source]

Clear the experience replay buffer.

classmethod from_value_function(value_function, capacity, batch_size=32)[source]

Create a new instance by extracting some settings from a Q-function.

The settings that are extracted from the value function are: gamma, bootstrap_n and num_frames. The latter is taken from the value function’s env attribute.

Parameters:
value_function : value-function object

A state value function or a state-action value function.

capacity : positive int

The capacity of the experience replay buffer. DQN typically uses capacity=1000000.

batch_size : positive int, optional

The desired batch size of the sample.

Returns:
experience_replay_buffer

A new instance.

sample()[source]

Get a batch of transitions to be used for bootstrapped updates.

Returns:
S, A, Rn, In, S_next, A_next : tuple of arrays

The returned tuple represents a batch of preprocessed transitions:

(S, A, Rn, In, S_next, A_next)

These are typically used for bootstrapped updates, e.g. minimizing the bootstrapped MSE:

\[\left( R^{(n)}_t + I^{(n)}_t\,\sum_aP(a|S_{t+n})\,Q(S_{t+n},a) - \sum_aP(a|S_t)\,Q(S_t,a) \right)^2\]

Planning

keras-gym provides planning methods. The only planning method that is currently implemented is the variant of Monte Carlo tree search (MCTS) that is used in AlphaZero. The goal is to implement more planning methods in the near future.

Objects

Wrappers

OpenAI gym provides a nice modular interface to extend existing environments using environment wrappers. Here we list some wrappers that are used throughout the keras-gym package.

Preprocessors

The default preprocessor tries to create a feature vector from any environment state observation on a best-effort basis. For instance, if the observation space is discrete \(s\in\{0, 1, \dots, n-1\}\), it will create a one-hot encoded vector such that the wrapped environment yields state observations \(s\in\mathbb{R}^n\).

import gym
import keras_gym as km

env = gym.make('FrozenLake-v0')
env = km.wrappers.DefaultPreprocessor(env)

s = env.unwrapped.reset()  # s == 0
s = env.reset()            # s == [1, 0, 0, ..., 0]

Other preprocessors that are particularly useful when dealing with video input are ImagePreprocessor and FrameStacker. For instance, for Atari 2600 environments we usually apply preprocessing as follows:

env = gym.make('PongDeterministic-v4')
env = km.wrappers.ImagePreprocessor(env, height=105, width=80, grayscale=True)
env = km.wrappers.FrameStacker(env, num_frames=4)

s = env.unwrapped.reset()  # s.shape == (210, 160, 3)
s = env.reset()            # s.shape == (105,  80, 4)

The first wrapper down-scales each input frame and converts it to grayscale. The second wrapper then stacks consecutive frames together, which allows the function approximator to learn velocities/accelerations as well as positions for each input pixel.

Monitors

Another type of environment wrapper is a monitor, which is used to keep track of the progress of the training process. At the moment, keras-gym only provides a generic train monitor called TrainMonitor.
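
A minimal usage sketch (the environment and log directory are arbitrary choices):

import gym
import keras_gym as km

env = gym.make('CartPole-v0')
env = km.wrappers.TrainMonitor(env, tensorboard_dir='./data/tensorboard')

for ep in range(10):
    s = env.reset()
    done = False
    while not done:
        s, r, done, info = env.step(env.action_space.sample())

print(env.ep, env.T, env.avg_G)   # episode count, global step count, average return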

Objects

Preprocessors
keras_gym.wrappers.BoxActionsToReals This wrapper decompactifies a Box action space to the reals.
keras_gym.wrappers.ImagePreprocessor Preprocessor for images.
keras_gym.wrappers.FrameStacker Stack multiple frames into one state observation.
class keras_gym.wrappers.BoxActionsToReals(env)[source]

This wrapper decompactifies a Box action space to the reals. This is required in order to be able to use a GaussianPolicy.

In practice, the wrapped environment expects the input action \(a_\text{real}\in\mathbb{R}^n\) and then it compactifies it back to a Box of the right size:

\[a_\text{box}\ =\ \text{low} + (\text{high}-\text{low}) \times\text{sigmoid}(a_\text{real})\]

Technically, the transformed space is still a Box, but that’s only because we assume that the values lie between large but finite bounds, \(a_\text{real}\in[-10^{15}, 10^{15}]^n\).
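
For example (a small sketch; Pendulum-v0 is just a convenient environment with a Box action space):

import numpy as np
import gym
import keras_gym as km

env = gym.make('Pendulum-v0')             # action space: Box(low=-2, high=2, shape=(1,))
env = km.wrappers.BoxActionsToReals(env)

s = env.reset()
a_real = np.array([3.7])                  # any real-valued action is accepted ...
s_next, r, done, info = env.step(a_real)  # ... and squashed back into the Box internally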

close()

Override close in your subclass to perform any necessary cleanup.

Environments will automatically close() themselves when garbage collected or when the program exits.

render(mode='human', **kwargs)

Renders the environment.

The set of supported modes varies per environment. (And some environments do not support rendering at all.) By convention, if mode is:

  • human: render to the current display or terminal and return nothing. Usually for human consumption.
  • rgb_array: Return an numpy.ndarray with shape (x, y, 3), representing RGB values for an x-by-y pixel image, suitable for turning into a video.
  • ansi: Return a string (str) or StringIO.StringIO containing a terminal-style text representation. The text can include newlines and ANSI escape sequences (e.g. for colors).
Note:
Make sure that your class’s metadata ‘render.modes’ key includes
the list of supported modes. It’s recommended to call super() in implementations to use the functionality of this method.
Args:
mode (str): the mode to render with

Example:

class MyEnv(Env):
    metadata = {'render.modes': ['human', 'rgb_array']}

    def render(self, mode='human'):
        if mode == 'rgb_array':
            return np.array(...)  # return RGB frame suitable for video
        elif mode == 'human':
            ...  # pop up a window and render
        else:
            super(MyEnv, self).render(mode=mode)  # just raise an exception
reset(**kwargs)

Resets the environment to an initial state and returns an initial observation.

Note that this function should not reset the environment’s random number generator(s); random variables in the environment’s state should be sampled independently between multiple calls to reset(). In other words, each call of reset() should yield an environment suitable for a new episode, independent of previous episodes.

Returns:
observation (object): the initial observation.
seed(seed=None)

Sets the seed for this env’s random number generator(s).

Note:
Some environments use multiple pseudorandom number generators. We want to capture all such seeds used in order to ensure that there aren’t accidental correlations between multiple generators.
Returns:
list<bigint>: Returns the list of seeds used in this env’s random
number generators. The first value in the list should be the “main” seed, or the value which a reproducer should pass to ‘seed’. Often, the main seed equals the provided ‘seed’, but this won’t be true if seed=None, for example.
step(a)[source]

Run one timestep of the environment’s dynamics. When end of episode is reached, you are responsible for calling reset() to reset this environment’s state.

Accepts an action and returns a tuple (observation, reward, done, info).

Args:
action (object): an action provided by the agent
Returns:
observation (object): agent’s observation of the current environment
reward (float): amount of reward returned after previous action
done (bool): whether the episode has ended, in which case further step() calls will return undefined results
info (dict): contains auxiliary diagnostic information (helpful for debugging, and sometimes learning)
unwrapped

Completely unwrap this env.

Returns:
gym.Env: The base non-wrapped gym.Env instance
class keras_gym.wrappers.ImagePreprocessor(env, height, width, grayscale=True, assert_input_shape=None)[source]

Preprocessor for images.

This preprocessing is adapted from the frame preprocessing that is commonly used for DQN on Atari 2600 games.

Parameters:
env : gym environment

A gym environment.

height : positive int

Output height (number of pixels).

width : positive int

Output width (number of pixels).

grayscale : bool, optional

Whether to convert RGB image to grayscale.

assert_input_shape : shape tuple, optional

If provided, the preprocessor will assert the given input shape.

close()

Override close in your subclass to perform any necessary cleanup.

Environments will automatically close() themselves when garbage collected or when the program exits.

render(mode='human', **kwargs)

Renders the environment.

The set of supported modes varies per environment. (And some environments do not support rendering at all.) By convention, if mode is:

  • human: render to the current display or terminal and return nothing. Usually for human consumption.
  • rgb_array: Return an numpy.ndarray with shape (x, y, 3), representing RGB values for an x-by-y pixel image, suitable for turning into a video.
  • ansi: Return a string (str) or StringIO.StringIO containing a terminal-style text representation. The text can include newlines and ANSI escape sequences (e.g. for colors).
Note:
Make sure that your class’s metadata ‘render.modes’ key includes
the list of supported modes. It’s recommended to call super() in implementations to use the functionality of this method.
Args:
mode (str): the mode to render with

Example:

class MyEnv(Env):
    metadata = {'render.modes': ['human', 'rgb_array']}

    def render(self, mode='human'):
        if mode == 'rgb_array':
            return np.array(...)  # return RGB frame suitable for video
        elif mode == 'human':
            ...  # pop up a window and render
        else:
            super(MyEnv, self).render(mode=mode)  # just raise an exception
reset()[source]

Resets the environment to an initial state and returns an initial observation.

Note that this function should not reset the environment’s random number generator(s); random variables in the environment’s state should be sampled independently between multiple calls to reset(). In other words, each call of reset() should yield an environment suitable for a new episode, independent of previous episodes.

Returns:
observation (object): the initial observation.
seed(seed=None)

Sets the seed for this env’s random number generator(s).

Note:
Some environments use multiple pseudorandom number generators. We want to capture all such seeds used in order to ensure that there aren’t accidental correlations between multiple generators.
Returns:
list<bigint>: Returns the list of seeds used in this env’s random
number generators. The first value in the list should be the “main” seed, or the value which a reproducer should pass to ‘seed’. Often, the main seed equals the provided ‘seed’, but this won’t be true if seed=None, for example.
step(a)[source]

Run one timestep of the environment’s dynamics. When end of episode is reached, you are responsible for calling reset() to reset this environment’s state.

Accepts an action and returns a tuple (observation, reward, done, info).

Args:
action (object): an action provided by the agent
Returns:
observation (object): agent’s observation of the current environment
reward (float): amount of reward returned after previous action
done (bool): whether the episode has ended, in which case further step() calls will return undefined results
info (dict): contains auxiliary diagnostic information (helpful for debugging, and sometimes learning)
unwrapped

Completely unwrap this env.

Returns:
gym.Env: The base non-wrapped gym.Env instance
class keras_gym.wrappers.FrameStacker(env, num_frames=4)[source]

Stack multiple frames into one state observation.

Parameters:
env : gym environment

A gym environment.

num_frames : positive int, optional

Number of frames to stack in order to build a state feature vector.

close()

Override close in your subclass to perform any necessary cleanup.

Environments will automatically close() themselves when garbage collected or when the program exits.

render(mode='human', **kwargs)

Renders the environment.

The set of supported modes varies per environment. (And some environments do not support rendering at all.) By convention, if mode is:

  • human: render to the current display or terminal and return nothing. Usually for human consumption.
  • rgb_array: Return an numpy.ndarray with shape (x, y, 3), representing RGB values for an x-by-y pixel image, suitable for turning into a video.
  • ansi: Return a string (str) or StringIO.StringIO containing a terminal-style text representation. The text can include newlines and ANSI escape sequences (e.g. for colors).
Note:
Make sure that your class’s metadata ‘render.modes’ key includes
the list of supported modes. It’s recommended to call super() in implementations to use the functionality of this method.
Args:
mode (str): the mode to render with

Example:

class MyEnv(Env):
    metadata = {'render.modes': ['human', 'rgb_array']}

    def render(self, mode='human'):
        if mode == 'rgb_array':
            return np.array(...)  # return RGB frame suitable for video
        elif mode == 'human':
            ...  # pop up a window and render
        else:
            super(MyEnv, self).render(mode=mode)  # just raise an exception
reset()[source]

Resets the environment to an initial state and returns an initial observation.

Note that this function should not reset the environment’s random number generator(s); random variables in the environment’s state should be sampled independently between multiple calls to reset(). In other words, each call of reset() should yield an environment suitable for a new episode, independent of previous episodes.

Returns:
observation (object): the initial observation.
seed(seed=None)

Sets the seed for this env’s random number generator(s).

Note:
Some environments use multiple pseudorandom number generators. We want to capture all such seeds used in order to ensure that there aren’t accidental correlations between multiple generators.
Returns:
list<bigint>: Returns the list of seeds used in this env’s random
number generators. The first value in the list should be the “main” seed, or the value which a reproducer should pass to ‘seed’. Often, the main seed equals the provided ‘seed’, but this won’t be true if seed=None, for example.
step(a)[source]

Run one timestep of the environment’s dynamics. When end of episode is reached, you are responsible for calling reset() to reset this environment’s state.

Accepts an action and returns a tuple (observation, reward, done, info).

Args:
action (object): an action provided by the agent
Returns:
observation (object): agent’s observation of the current environment
reward (float): amount of reward returned after previous action
done (bool): whether the episode has ended, in which case further step() calls will return undefined results
info (dict): contains auxiliary diagnostic information (helpful for debugging, and sometimes learning)
unwrapped

Completely unwrap this env.

Returns:
gym.Env: The base non-wrapped gym.Env instance
Monitors
keras_gym.wrappers.TrainMonitor Environment wrapper for monitoring the training process.
class keras_gym.wrappers.TrainMonitor(env, tensorboard_dir=None)[source]

Environment wrapper for monitoring the training process.

This wrapper logs some diagnostics at the end of each episode and it also gives us some handy attributes (listed below).

Parameters:
env : gym environment

A gym environment.

tensorboard_dir : str, optional

If provided, TrainMonitor will log all diagnostics to be viewed in tensorboard. To view these, point tensorboard to the same dir:

$ tensorboard --logdir {tensorboard_dir}
Attributes:
T : positive int

Global step counter. This is not reset by env.reset(), use env.reset_global() instead.

ep : positive int

Global episode counter. This is not reset by env.reset(), use env.reset_global() instead.

t : positive int

Step counter within an episode.

G : float

The return, i.e. amount of reward accumulated from the start of the current episode.

avg_G : float

The average return G, averaged over the past 100 episodes.

dt_ms : float

The average wall time of a single step, in milliseconds.

close()

Override close in your subclass to perform any necessary cleanup.

Environments will automatically close() themselves when garbage collected or when the program exits.

record_losses(losses)[source]

Record losses during the training process.

These are used to print more diagnostics.

Parameters:
losses : dict

A dict of losses/metrics, of type {name <str>: value <float>}.

render(mode='human', **kwargs)

Renders the environment.

The set of supported modes varies per environment. (And some environments do not support rendering at all.) By convention, if mode is:

  • human: render to the current display or terminal and return nothing. Usually for human consumption.
  • rgb_array: Return an numpy.ndarray with shape (x, y, 3), representing RGB values for an x-by-y pixel image, suitable for turning into a video.
  • ansi: Return a string (str) or StringIO.StringIO containing a terminal-style text representation. The text can include newlines and ANSI escape sequences (e.g. for colors).
Note:
Make sure that your class’s metadata ‘render.modes’ key includes
the list of supported modes. It’s recommended to call super() in implementations to use the functionality of this method.
Args:
mode (str): the mode to render with

Example:

class MyEnv(Env):
    metadata = {'render.modes': ['human', 'rgb_array']}

    def render(self, mode='human'):
        if mode == 'rgb_array':
            return np.array(...)  # return RGB frame suitable for video
        elif mode == 'human':
            ...  # pop up a window and render
        else:
            super(MyEnv, self).render(mode=mode)  # just raise an exception
reset()[source]

Resets the environment to an initial state and returns an initial observation.

Note that this function should not reset the environment’s random number generator(s); random variables in the environment’s state should be sampled independently between multiple calls to reset(). In other words, each call of reset() should yield an environment suitable for a new episode, independent of previous episodes.

Returns:
observation (object): the initial observation.
reset_global()[source]

Reset the global counters, not just the episodic ones.

seed(seed=None)

Sets the seed for this env’s random number generator(s).

Note:
Some environments use multiple pseudorandom number generators. We want to capture all such seeds used in order to ensure that there aren’t accidental correlations between multiple generators.
Returns:
list<bigint>: Returns the list of seeds used in this env’s random
number generators. The first value in the list should be the “main” seed, or the value which a reproducer should pass to ‘seed’. Often, the main seed equals the provided ‘seed’, but this won’t be true if seed=None, for example.
step(a)[source]

Run one timestep of the environment’s dynamics. When end of episode is reached, you are responsible for calling reset() to reset this environment’s state.

Accepts an action and returns a tuple (observation, reward, done, info).

Args:
action (object): an action provided by the agent
Returns:
observation (object): agent’s observation of the current environment
reward (float): amount of reward returned after previous action
done (bool): whether the episode has ended, in which case further step() calls will return undefined results
info (dict): contains auxiliary diagnostic information (helpful for debugging, and sometimes learning)
unwrapped

Completely unwrap this env.

Returns:
gym.Env: The base non-wrapped gym.Env instance

Environments

This is a collection of environments currently not included in OpenAI Gym.

Self-Play Environments

These environments are typically games. They are implemented in such a way that they can be played from a single-player perspective. The environment switches the current player and opponent between turns. The way to picture this is that the environment swaps the colors of all pieces between turns, so that the agent always gets the perspective of the player whose turn it is. The first such environment we include is the ConnectFourEnv.
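
For instance, a random-vs-random self-play rollout might look like the following sketch, where the action selection is a placeholder for a real agent:

import random
import keras_gym as km

env = km.envs.ConnectFourEnv()

s = env.reset()
done = False
while not done:
    # both "players" act through the same env, one turn at a time
    a = random.choice(list(env.available_actions))
    s, r, done, info = env.step(a)
    env.render()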

Objects

Self-Play Environments
keras_gym.envs.ConnectFourEnv An adversarial environment for playing the Connect-Four game.
class keras_gym.envs.ConnectFourEnv[source]

An adversarial environment for playing the Connect-Four game.

Attributes:
action_space : gym.spaces.Discrete(7)

The action space.

observation_space : MultiDiscrete(nvec)

The state observation space, representing the position of the current player’s tokens (s[1:,:,0]) and the other player’s tokens (s[1:,:,1]) as well as a mask over the space of actions, indicating which actions are available to the current player (s[0,:,0]) or the other player (s[0,:,1]).

Note: The “current” player is relative to whose turn it is, which means that the entries s[:,:,0] and s[:,:,1] swap between turns.

max_time_steps : int

Maximum number of timesteps within each episode.

available_actions : array of int

Array of available actions. This list shrinks when columns saturate.

win_reward : 1.0

The reward associated with a win.

loss_reward : -1.0

The reward associated with a loss.

draw_reward : 0.0

The reward associated with a draw.

close()

Override close in your subclass to perform any necessary cleanup.

Environments will automatically close() themselves when garbage collected or when the program exits.

render(*args, **kwargs)[source]

Render the current state of the environment.

reset()[source]

Reset the environment to the starting position.

Returns:
s : 3d-array, shape: [num_rows + 1, num_cols, num_players]

A state observation, representing the position of the current player’s tokens (s[1:,:,0]) and the other player’s tokens (s[1:,:,1]) as well as a mask over the space of actions, indicating which actions are available to the current player (s[0,:,0]) or the other player (s[0,:,1]).

Note: The “current” player is relative to whose turn it is, which means that the entries s[:,:,0] and s[:,:,1] swap between turns.

seed(seed=None)

Sets the seed for this env’s random number generator(s).

Note:
Some environments use multiple pseudorandom number generators. We want to capture all such seeds used in order to ensure that there aren’t accidental correlations between multiple generators.
Returns:
list<bigint>: Returns the list of seeds used in this env’s random
number generators. The first value in the list should be the “main” seed, or the value which a reproducer should pass to ‘seed’. Often, the main seed equals the provided ‘seed’, but this won’t be true if seed=None, for example.
step(a)[source]

Take one step in the MDP, following the single-player convention from gym.

Parameters:
a : int, options: {0, 1, 2, 3, 4, 5, 6}

The action to be taken. The action is the zero-based count of the possible insertion slots, starting from the left of the board.

Returns:
s_next : array, shape [6, 7, 2]

A next-state observation, representing the position of the current player’s tokens (s[1:,:,0]) and the other player’s tokens (s[1:,:,1]) as well as a mask over the space of actions, indicating which actions are available to the current player (s[0,:,0]) or the other player (s[0,:,1]).

Note: The “current” player is relative to whose turn it is, which means that the entries s[:,:,0] and s[:,:,1] swap between turns.

r : float

Reward associated with the transition \((s, a)\to s_\text{next}\).

Note: Since “current” player is relative to whose turn it is, you need to be careful about aligning the rewards with the correct state or state-action pair. In particular, this reward \(r\) is the one associated with the \(s\) and \(a\), i.e. not aligned with \(s_\text{next}\).

done : bool

Whether the episode is done.

info : dict or None

A dict with some extra information (or None).

unwrapped

Completely unwrap this env.

Returns:
gym.Env: The base non-wrapped gym.Env instance

Loss Functions

This is a collection of custom keras-compatible loss functions that are used throughout this package.

Note

These functions generally require the Tensorflow backend.

Value Losses

These loss functions can be applied to learning a value function. Most of the losses are actually already provided by keras. The value-function losses included here are minor adaptations of the available keras losses.

Policy Losses

The way policy losses are implemented is slightly different from value losses due to their non-standard structure. A policy loss is implemented in a method on updateable policy objects (see below). If you need to implement a custom policy loss, you can override this policy_loss_with_metrics() method.

BaseUpdateablePolicy.policy_loss_with_metrics(Adv, A=None)[source]

This method constructs the policy loss as a scalar-valued Tensor, together with a dictionary of metrics (also scalars).

This method may be overridden to construct a custom policy loss and/or to change the accompanying metrics.

Parameters:
Adv : 1d Tensor, shape: [batch_size]

A batch of advantages.

A : nd Tensor, shape: [batch_size, …]

A batch of actions taken under the behavior policy. For some choices of policy loss, e.g. update_strategy='sac' this input is ignored.

Returns:
loss, metrics : (Tensor, dict of Tensors)

The policy loss along with some metrics, which is a dict of type {name <str>: metric <Tensor>}. The loss and each of the metrics (dict values) are scalar Tensors, i.e. Tensors with ndim=0.

The loss is passed to a keras Model using train_model.add_loss(loss). Similarly, each metric in the metric dict is passed to the model using train_model.add_metric(metric, name=name, aggregation='mean').
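
As a sketch of what such an override might look like, consider the following. The parent class and the self.dist attribute are assumptions made for illustration; check your policy class for the attribute that actually exposes its probability distribution.

import tensorflow as tf
import keras_gym as km

class MyPolicy(km.GaussianPolicy):                     # hypothetical subclass
    def policy_loss_with_metrics(self, Adv, A=None):
        # plain REINFORCE-style surrogate: -E[ Adv * log pi(A|S) ]
        log_pi = self.dist.log_proba(A)                # assumed attribute
        loss = -tf.reduce_mean(Adv * log_pi)
        metrics = {'policy/loss': loss}
        return loss, metrics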

Objects

Value Losses
keras_gym.losses.ProjectedSemiGradientLoss Loss function for type-II Q-function.
keras_gym.losses.RootMeanSquaredError Root-mean-squared error (RMSE) loss.
keras_gym.losses.LoglossSign Logloss implemented for predicted logits \(z\in\mathbb{R}\) and ground truth \(y=\pm1\).
class keras_gym.losses.ProjectedSemiGradientLoss(G, base_loss=<tensorflow.python.keras.losses.Huber object>)[source]

Loss function for type-II Q-function.

This loss function projects the predictions \(q(s, .)\) onto the actions for which we actually received a feedback signal.

Parameters:
G : 1d Tensor, dtype: float, shape: [batch_size]

The returns that we wish to fit our value function on.

base_loss : keras loss function, optional

Keras loss function. Default: huber_loss.

__call__(A, Q_pred, sample_weight=None)[source]

Compute the projected MSE.

Parameters:
A : 2d Tensor, dtype: int, shape: [batch_size, num_actions]

A batch of (one-hot encoded) discrete actions A.

Q_pred : 2d Tensor, shape: [batch_size, num_actions]

The predicted values \(q(s,.)\), a.k.a. y_pred.

sample_weight : Tensor, dtype: float, optional

Tensor whose rank is either 0 or is broadcastable to y_true. sample_weight acts as a coefficient for the loss. If a scalar is provided, then the loss is simply scaled by the given value. If sample_weight is a tensor of size [batch_size], then the total loss for each sample of the batch is rescaled by the corresponding element in the sample_weight vector.

Returns:
loss : 0d Tensor (scalar)

The batch loss.

call(y_true, y_pred)

Invokes the Loss instance.

Args:
y_true: Ground truth values. shape = [batch_size, d0, .. dN], except
sparse loss functions such as sparse categorical crossentropy where shape = [batch_size, d0, .. dN-1]

y_pred: The predicted values. shape = [batch_size, d0, .. dN]

Returns:
Loss values with the shape [batch_size, d0, .. dN-1].
get_config()

Returns the config dictionary for a Loss instance.

class keras_gym.losses.RootMeanSquaredError(delta=1.0, name='root_mean_squared_error')[source]

Root-mean-squared error (RMSE) loss.

Parameters:
name : str, optional

Optional name for the op.

__call__(y_true, y_pred, sample_weight=None)[source]

Compute the RMSE loss.

Parameters:
y_true : Tensor, shape: [batch_size, …]

Ground truth values.

y_pred : Tensor, shape: [batch_size, …]

The predicted values.

sample_weight : Tensor, dtype: float, optional

Tensor whose rank is either 0, or the same rank as y_true, or is broadcastable to y_true. sample_weight acts as a coefficient for the loss. If a scalar is provided, then the loss is simply scaled by the given value. If sample_weight is a tensor of size [batch_size], then the total loss for each sample of the batch is rescaled by the corresponding element in the sample_weight vector. If the shape of sample_weight matches the shape of y_pred, then the loss of each measurable element of y_pred is scaled by the corresponding value of sample_weight.

Returns:
loss : 0d Tensor (scalar)

The batch loss.

call(y_true, y_pred)

Invokes the Loss instance.

Args:
y_true: Ground truth values. shape = [batch_size, d0, .. dN], except
sparse loss functions such as sparse categorical crossentropy where shape = [batch_size, d0, .. dN-1]

y_pred: The predicted values. shape = [batch_size, d0, .. dN]

Returns:
Loss values with the shape [batch_size, d0, .. dN-1].
get_config()

Returns the config dictionary for a Loss instance.

class keras_gym.losses.LoglossSign[source]

Logloss implemented for predicted logits \(z\in\mathbb{R}\) and ground truth \(y=\pm1\).

\[L\ =\ \log\left( 1 + \exp(-y\,z) \right)\]
__call__(y_true, z_pred, sample_weight=None)[source]
Parameters:
y_true : Tensor, shape: [batch_size, …]

Ground truth values \(y=\pm1\).

z_pred : Tensor, shape: [batch_size, …]

The predicted logits \(z\in\mathbb{R}\).

sample_weight : Tensor, dtype: float, optional

Not yet implemented.

#TODO: implement this

call(y_true, y_pred)

Invokes the Loss instance.

Args:
y_true: Ground truth values. shape = [batch_size, d0, .. dN], except
sparse loss functions such as sparse categorical crossentropy where shape = [batch_size, d0, .. dN-1]

y_pred: The predicted values. shape = [batch_size, d0, .. dN]

Returns:
Loss values with the shape [batch_size, d0, .. dN-1].
get_config()

Returns the config dictionary for a Loss instance.

Utilities

The helper functions are organized by what objects they act on. The three categories are tensor helpers, numpy-array helpers and miscellaneous.

Objects

Miscellaneous Utilities
keras_gym.utils.enable_logging Enable logging output.
keras_gym.utils.generate_gif Store a gif from the episode frames.
keras_gym.utils.get_env_attr Get the given attribute from a potentially wrapped environment.
keras_gym.utils.get_transition Generate a transition from the environment.
keras_gym.utils.has_env_attr Check if a potentially wrapped environment has a given attribute.
keras_gym.utils.is_policy Check whether an object is an (updateable) policy.
keras_gym.utils.is_qfunction Check whether an object is a state-action value function, or Q-function.
keras_gym.utils.is_vfunction Check whether an object is a state value function, or V-function.
keras_gym.utils.render_episode Run a single episode with env.render() calls with each time step.
keras_gym.utils.set_tf_loglevel Set the logging level for Tensorflow logger.
keras_gym.utils.enable_logging(level=20, level_tf=40)[source]

Enable logging output.

This executes the following lines of code:

import logging
logging.basicConfig(level=logging.INFO)
set_tf_loglevel(logging.ERROR)

Note that set_tf_loglevel() is another keras-gym utility function.

Parameters:
level : int, optional

Log level for native python logging. For instance, if you’d like to see more verbose logging messages you might set level=logging.DEBUG.

level_tf : int, optional

Log level for tensorflow-specific logging (logs coming from the C++ layer).

keras_gym.utils.generate_gif(env, policy, filepath, resize_to=None, duration=50)[source]

Store a gif from the episode frames.

Parameters:
env : gym environment

The environment to record from.

policy : keras-gym policy object

The policy that is used to take actions.

filepath : str

Location of the output gif file.

resize_to : tuple of ints, optional

The size of the output frames, (width, height). Notice the ordering: first width, then height. This is the convention PIL uses.

duration : float, optional

Time between frames in the animated gif, in milliseconds.
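
For example (the policy here is a stand-in for a trained one; the file path and frame size are arbitrary):

import gym
import keras_gym as km

env = gym.make('PongDeterministic-v4')
pi = km.policies.RandomPolicy(env)   # stand-in for a trained policy

km.utils.generate_gif(
    env, policy=pi, filepath='./data/gifs/pong.gif',
    resize_to=(320, 420), duration=50)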

keras_gym.utils.get_env_attr(env, attr, default='__ERROR__', max_depth=100)[source]

Get the given attribute from a potentially wrapped environment.

Note that the wrapped envs are traversed from the outside in. Once the attribute is found, the search stops. This means that an inner wrapped env may carry the same (possibly conflicting) attribute. This situation is not resolved by this function.

Parameters:
env : gym environment

A potentially wrapped environment.

attr : str

The attribute name.

max_depth : positive int, optional

The maximum depth of wrappers to traverse.

keras_gym.utils.get_transition(env)[source]

Generate a transition from the environment.

This basically does a single step on the environment and then closes it.

Parameters:
env : gym environment

A gym environment.

Returns:
s, a, r, s_next, a_next, done, info : tuple

A single transition. Note that the order and the number of items returned are different from what env.step() returns.

keras_gym.utils.has_env_attr(env, attr, max_depth=100)[source]

Check if a potentially wrapped environment has a given attribute.

Parameters:
env : gym environment

A potentially wrapped environment.

attr : str

The attribute name.

max_depth : positive int, optional

The maximum depth of wrappers to traverse.

keras_gym.utils.is_policy(obj, check_updateable=False)[source]

Check whether an object is an (updateable) policy.

Parameters:
obj

Object to check.

check_updateable : bool, optional

If the obj is a policy, also check whether or not the policy is updateable.

Returns:
bool

Whether obj is a (updateable) policy.

keras_gym.utils.is_qfunction(obj, qtype=None)[source]

Check whether an object is a state-action value function, or Q-function.

Parameters:
obj

Object to check.

qtype : 1 or 2, optional

Check for specific Q-function type, i.e. type-I or type-II.

Returns:
bool

Whether obj is a (type-I/II) Q-function.

keras_gym.utils.is_vfunction(obj)[source]

Check whether an object is a state value function, or V-function.

Parameters:
obj

Object to check.

Returns:
bool

Whether obj is a V-function.

keras_gym.utils.render_episode(env, policy, step_delay_ms=0)[source]

Run a single episode with env.render() calls with each time step.

Parameters:
env : gym environment

A gym environment.

policy : callable

A policy object that is used to pick actions: a = policy(s).

step_delay_ms : non-negative float

The number of milliseconds to wait between consecutive timesteps. This can be used to slow down the rendering.

keras_gym.utils.set_tf_loglevel(level)[source]

Set the logging level for Tensorflow logger. This also sets the logging level of the underlying C++ layer.

Parameters:
level : int

A logging level as provided by the builtin logging module, e.g. level=logging.INFO.

Numpy-Array Utilities
keras_gym.utils.argmax This is a little hack to ensure that argmax breaks ties randomly, which is something that numpy.argmax() doesn’t do.
keras_gym.utils.argmin This is a little hack to ensure that argmin breaks ties randomly, which is something that numpy.argmin() doesn’t do.
keras_gym.utils.box_to_reals_np Transform array values from a Box space to the reals.
keras_gym.utils.box_to_unit_interval_np Rescale array values from Box space to the unit interval.
keras_gym.utils.check_numpy_array This helper function is mostly for internal use.
keras_gym.utils.clipped_logit_np A safe implementation of the logit function \(x\mapsto\log(x/(1-x))\).
keras_gym.utils.feature_vector Create a feature vector out of a state observation \(s\) or an action \(a\).
keras_gym.utils.idx Given a numpy array, return its corresponding integer index array.
keras_gym.utils.log_softmax Compute the log-softmax.
keras_gym.utils.one_hot Create a dense one-hot encoded vector.
keras_gym.utils.project_onto_actions_np Project tensor onto specific actions taken: numpy implementation.
keras_gym.utils.reals_to_box_np Transform array values from the reals to a Box space.
keras_gym.utils.softmax Compute the softmax (normalized point-wise exponential).
keras_gym.utils.unit_interval_to_box_np Rescale array values from the unit interval to a Box space.
keras_gym.utils.argmax(arr, axis=-1, random_state=None)[source]

This is a little hack to ensure that argmax breaks ties randomly, which is something that numpy.argmax() doesn’t do.

Note: random tie breaking is only done for 1d arrays; for multidimensional inputs, we fall back to the numpy version.

Parameters:
a : array_like

Input array.

axis : int, optional

The axis along which to find the argmax; by default this is the last axis.

random_state : int or RandomState

This can either be a random seed (int) or an instance of numpy.random.RandomState.

Returns:
index_array : ndarray of ints

Array of indices into the array. It has the same shape as a.shape with the dimension along axis removed.
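
For example, the random tie-breaking in action (values are illustrative):

import numpy as np
from keras_gym.utils import argmax

q_values = np.array([0.5, 0.5, 0.1])
argmax(q_values)      # returns 0 or 1, chosen at random
np.argmax(q_values)   # always returns 0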

keras_gym.utils.argmin(arr, axis=None, random_state=None)[source]

This is a little hack to ensure that argmin breaks ties randomly, which is something that numpy.argmin() doesn’t do.

Note: random tie breaking is only done for 1d arrays; for multidimensional inputs, we fall back to the numpy version.

Parameters:
a : array_like

Input array.

axis : int, optional

By default, the index is into the flattened array, otherwise along the specified axis.

random_state : int or RandomState

This can either be a random seed (int) or an instance of numpy.random.RandomState.

Returns:
index_array : ndarray of ints

Array of indices into the array. It has the same shape as a.shape with the dimension along axis removed.

keras_gym.utils.box_to_reals_np(arr, space, epsilon=1e-15)[source]

Transform array values from a Box space to the reals. This is done by first mapping the Box values to the unit interval \(x\in[0, 1]\) and then feeding it to the clipped_logit_np() function.

Parameters:
arr : nd array

A numpy array containing a single instance or a batch of elements of a Box space.

space : gym.spaces.Box

The Box space. This is needed to determine the shape and size of the space.

epsilon : float, optional

The cut-off value used by clipped_logit_np().

Returns:
out : nd array, same shape as input

A numpy array with the transformed values. The output values are real-valued.

keras_gym.utils.box_to_unit_interval_np(arr, space)[source]

Rescale array values from Box space to the unit interval. This is essentially just min-max scaling:

\[x\ \mapsto\ \frac{x-x_\text{low}}{x_\text{high}-x_\text{low}}\]
Parameters:
arr : nd array

A numpy array containing a single instance or a batch of elements of a Box space.

space : gym.spaces.Box

The Box space. This is needed to determine the shape and size of the space.

Returns:
out : nd array, same shape as input

A numpy array with the transformed values. The output values lie on the unit interval \([0, 1]\).
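
For intuition, the min-max scaling above amounts to the following plain-numpy computation (a sketch of the formula, not necessarily the library's exact code):

import numpy as np
import gym

space = gym.spaces.Box(low=np.array([-2.0, 0.0]), high=np.array([2.0, 10.0]), dtype='float64')
x = np.array([1.0, 2.5])             # a single element of the Box space

x_unit = (x - space.low) / (space.high - space.low)
print(x_unit)                        # [0.75 0.25], same as box_to_unit_interval_np(x, space)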

keras_gym.utils.check_numpy_array(arr, ndim=None, ndim_min=None, dtype=None, shape=None, axis_size=None, axis=None)[source]

This helper function is mostly for internal use. It is used to check a few common properties of a numpy array.

Raises:
NumpyArrayCheckError

If one of the checks fails, it raises a NumpyArrayCheckError.

keras_gym.utils.clipped_logit_np(x, epsilon=1e-15)[source]

A safe implementation of the logit function \(x\mapsto\log(x/(1-x))\). It clips the arguments of the log function from below so as to avoid evaluating it at 0:

\[\text{logit}_\epsilon(x)\ =\ \log(\max(\epsilon, x)) - \log(\max(\epsilon, 1 - x))\]
Parameters:
x : nd array

Input numpy array whose entries lie on the unit interval, \(x_i\in [0, 1]\).

epsilon : float, optional

The small number with which to clip the arguments of the logarithm from below.

Returns:
z : nd array, dtype: float, shape: same as input

The output logits whose entries lie on the real line, \(z_i\in\mathbb{R}\).
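
The formula translates directly into numpy; this sketch mirrors the definition (not necessarily the exact library code):

import numpy as np

def clipped_logit(x, epsilon=1e-15):
    # logit_eps(x) = log(max(eps, x)) - log(max(eps, 1 - x))
    x = np.asarray(x, dtype='float64')
    return np.log(np.maximum(epsilon, x)) - np.log(np.maximum(epsilon, 1 - x))

print(clipped_logit([0.0, 0.5, 1.0]))   # approx. [-34.54, 0.0, 34.54]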

keras_gym.utils.feature_vector(x, space)[source]

Create a feature vector out of a state observation \(s\) or an action \(a\). This is used in the DefaultPreprocessor.

Parameters:
x : state or action

A state observation \(s\) or an action \(a\).

space : gym space

A gym space, e.g. gym.spaces.Box, gym.spaces.Discrete, etc.

keras_gym.utils.idx(arr, axis=0)[source]

Given a numpy array, return its corresponding integer index array.

Parameters:
arr : array

Input array.

axis : int, optional

The axis along which we’d like to get an index.

Returns:
index : 1d array, shape: arr.shape[axis]

An index array [0, 1, 2, …].

keras_gym.utils.log_softmax(arr, axis=-1)[source]

Compute the log-softmax.

Note: This is the numpy implementation.

Parameters:
arr : numpy array

The input array.

axis : int, optional

The axis along which to normalize, default is -1.

Returns:
out : array of same shape

The entries may be interpreted as log-probabilities.

keras_gym.utils.one_hot(i, n, dtype='float')[source]

Create a dense one-hot encoded vector.

Parameters:
i : int or 1d array of ints

The index of the non-zero entry.

n : int

The dimensionality of the dense vector. Note that n must be greater than i.

dtype : str or datatype

The output data type, default is ‘float’.

Returns:
x : 1d array of length n

The dense one-hot encoded vector.
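
A quick usage sketch:

from keras_gym.utils import one_hot

x = one_hot(2, 5)
print(x)        # [0. 0. 1. 0. 0.]
print(x.shape)  # (5,)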

keras_gym.utils.project_onto_actions_np(Y, A)[source]

Project tensor onto specific actions taken: numpy implementation.

Note: This only applies to discrete action spaces.

Parameters:
Y : 2d array, shape: [batch_size, num_actions]

The tensor to project down.

A : 1d array, shape: [batch_size]

The batch of actions used to project.

Returns:
Y_projected : 1d array, shape: [batch_size]

The tensor projected onto the actions taken.
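
The projection amounts to picking, for each row in the batch, the column of the action that was taken. A minimal numpy sketch (not necessarily the library's exact implementation):

import numpy as np

Y = np.array([[1.0, 2.0, 3.0],
              [4.0, 5.0, 6.0]])    # shape: [batch_size=2, num_actions=3]
A = np.array([2, 0])               # actions taken for each batch entry

Y_projected = Y[np.arange(len(A)), A]
print(Y_projected)                 # [3. 4.], same as project_onto_actions_np(Y, A)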

keras_gym.utils.reals_to_box_np(arr, space)[source]

Transform array values from the reals to a Box space. This is done by first applying the logistic sigmoid to map the reals onto the unit interval and then applying unit_interval_to_box_np() to rescale to the Box space.

Parameters:
arr : nd array

A numpy array containing a single instance or a batch of elements of a Box space, encoded as logits.

space : gym.spaces.Box

The Box space. This is needed to determine the shape and size of the space.

Returns:
out : nd array, same shape as input

A numpy array with the transformed values. The output values are contained in the provided Box space.

keras_gym.utils.softmax(arr, axis=-1)[source]

Compute the softmax (normalized point-wise exponential).

Note: This is the numpy implementation.

Parameters:
arr : numpy array

The input array.

axis : int, optional

The axis along which to normalize, default is -1.

Returns:
out : array of same shape

The entries of the output array are non-negative and normalized, which make them good candidates for modeling probabilities.
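
As a sketch of the standard numerically stable recipe (shift by the max before exponentiating, which leaves the result unchanged); the library's version may differ in details:

import numpy as np

def stable_softmax(arr, axis=-1):
    z = arr - arr.max(axis=axis, keepdims=True)   # shift for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

x = np.array([[1.0, 2.0, 3.0]])
print(stable_softmax(x))           # approx. [[0.090 0.245 0.665]]
print(stable_softmax(x).sum())     # 1.0 (normalized)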

keras_gym.utils.unit_interval_to_box_np(arr, space)[source]

Rescale array values from the unit interval to a Box space. This is essentially inverted min-max scaling:

\[x\ \mapsto\ x_\text{low} + (x_\text{high} - x_\text{low})\,x\]
Parameters:
arr : nd array

A numpy array containing a single instance or a batch of elements of a Box space, scaled to the unit interval.

space : gym.spaces.Box

The Box space. This is needed to determine the shape and size of the space.

Returns:
out : nd array, same shape as input

A numpy array with the transformed values. The output values are contained in the provided Box space.
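
Taken together, box_to_reals_np() and reals_to_box_np() are (approximately) each other's inverse. A round-trip usage sketch:

import numpy as np
import gym
from keras_gym.utils import box_to_reals_np, reals_to_box_np

space = gym.spaces.Box(low=np.array([-1.0]), high=np.array([1.0]), dtype='float64')
x = np.array([0.3])

z = box_to_reals_np(x, space)        # real-valued representation
x_back = reals_to_box_np(z, space)   # back inside the Box space

print(np.allclose(x, x_back))        # True (up to the epsilon clipping)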

Tensor Utilities
keras_gym.utils.box_to_reals_tf Transform Tensor values from a Box space to the reals.
keras_gym.utils.box_to_unit_interval_tf Rescale Tensor values from Box space to the unit interval.
keras_gym.utils.check_tensor This helper function is mostly for internal use.
keras_gym.utils.diff_transform_matrix A helper function that implements discrete differentiation for stacked state observations.
keras_gym.utils.log_softmax_tf Compute the log-softmax.
keras_gym.utils.project_onto_actions_tf Project tensor onto specific actions taken: tensorflow implementation.
keras_gym.utils.unit_interval_to_box_tf Rescale Tensor values from the unit interval to a Box space.
keras_gym.utils.box_to_reals_tf(tensor, space, epsilon=1e-15)[source]

Transform Tensor values from a Box space to the reals. This is done by first mapping the Box values to the unit interval \(x\in[0, 1]\) and then feeding it to the clipped_logit_tf() function.

Parameters:
tensor : nd Tensor

A tensor containing a single instance or a batch of elements of a Box space.

space : gym.spaces.Box

The Box space. This is needed to determine the shape and size of the space.

epsilon : float, optional

The cut-off value used by clipped_logit_tf().

Returns:
out : nd Tensor, same shape as input

A Tensor with the transformed values. The output values are real-valued.

keras_gym.utils.box_to_unit_interval_tf(tensor, space)[source]

Rescale Tensor values from Box space to the unit interval. This is essentially just min-max scaling:

\[x\ \mapsto\ \frac{x-x_\text{low}}{x_\text{high}-x_\text{low}}\]
Parameters:
tensor : nd Tensor

A tensor containing a single instance or a batch of elements of a Box space.

space : gym.spaces.Box

The Box space. This is needed to determine the shape and size of the space.

Returns:
out : nd Tensor, same shape as input

A Tensor with the transformed values. The output values lie on the unit interval \([0,1]\).

keras_gym.utils.check_tensor(tensor, ndim=None, ndim_min=None, dtype=None, same_dtype_as=None, same_shape_as=None, same_as=None, int_shape=None, axis_size=None, axis=None)[source]

This helper function is mostly for internal use. It is used to check a few common properties of a Tensor.

Parameters:
ndim : int or list of ints

Check K.ndim(tensor).

ndim_min : int

Check if K.ndim(tensor) is at least ndim_min.

dtype : Tensor dtype or list of Tensor dtypes

Check tensor.dtype.

same_dtype_as : Tensor

Check if dtypes match.

same_shape_as : Tensor

Check if shapes match.

same_as : Tensor

Check if both dtypes and shapes match.

int_shape : tuple of ints

Check K.int_shape(tensor).

axis_size : int

Check the size along a specific axis, where the axis is specified by the axis=... kwarg.

axis : int

The axis along which to check the size.

Raises:
TensorCheckError

If one of the checks fails, it raises a TensorCheckError.
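
A usage sketch (which checks you combine depends on the model at hand):

import numpy as np
from tensorflow.keras import backend as K
from keras_gym.utils import check_tensor

Q_s = K.constant(np.zeros((32, 6)))   # e.g. a batch of type-II Q-values, shape [batch_size, num_actions]

# raises TensorCheckError if Q_s is not rank-2 with 6 entries along axis 1
check_tensor(Q_s, ndim=2, axis_size=6, axis=1)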

keras_gym.utils.diff_transform_matrix(num_frames, dtype='float32')[source]

A helper function that implements discrete differentiation for stacked state observations.

Let’s say we have a feature vector \(X\) consisting of four stacked frames, i.e. the shape would be: [batch_size, height, width, 4].

The corresponding diff-transform matrix with num_frames=4 is a \(4\times 4\) matrix given by:

\[\begin{split}M_\text{diff}^{(4)}\ =\ \begin{pmatrix} -1 & 0 & 0 & 0 \\ 3 & 1 & 0 & 0 \\ -3 & -2 & -1 & 0 \\ 1 & 1 & 1 & 1 \end{pmatrix}\end{split}\]

such that the diff-transformed feature vector is readily computed as:

\[X_\text{diff}\ =\ X\, M_\text{diff}^{(4)}\]

The diff-transformation preserves the shape, but it reorganizes the frames in such a way that they look more like canonical variables. You can think of \(X_\text{diff}\) as the stacked variables \(x\), \(\dot{x}\), \(\ddot{x}\), etc. (in reverse order). These represent the position, velocity, acceleration, etc. of pixels in a single frame.

Parameters:
num_frames : positive int

The number of stacked frames in the original \(X\).

dtype : keras dtype, optional

The output data type.

Returns:
M : 2d-Tensor, shape: [num_frames, num_frames]

A square matrix that is intended to be multiplied from the right onto the stacked frames, e.g. X_diff = K.dot(X_orig, M), where we assume that the frames are stacked in axis=-1 of X_orig, in chronological order.
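
To see what the transform does, here is a numpy sketch applying the 4-frame matrix above to a single pixel whose intensity increases linearly across the stacked frames (i.e. constant velocity):

import numpy as np

# the 4-frame diff-transform matrix spelled out in the formula above
M = np.array([[-1,  0,  0, 0],
              [ 3,  1,  0, 0],
              [-3, -2, -1, 0],
              [ 1,  1,  1, 1]], dtype='float64')

x = np.array([0.0, 1.0, 2.0, 3.0])   # pixel intensity over 4 frames, chronological order

print(x @ M)   # [0. 0. 1. 3.]: higher derivatives 0, velocity 1, position 3 (the latest frame)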

keras_gym.utils.log_softmax_tf(Z, axis=-1)[source]

Compute the log-softmax.

Note: This is the tensorflow implementation.

Parameters:
Z : Tensor

The input logits.

axis : int, optional

The axis along which to normalize, default is -1.

Returns:
out : Tensor of same shape as input

The entries may be interpreted as log-probabilities.

keras_gym.utils.project_onto_actions_tf(Y, A)[source]

Project tensor onto specific actions taken: tensorflow implementation.

Note: This only applies to discrete action spaces.

Parameters:
Y : 2d Tensor, shape: [batch_size, num_actions]

The tensor to project down.

A : 1d Tensor, shape: [batch_size]

The batch of actions used to project.

Returns:
Y_projected : 1d Tensor, shape: [batch_size]

The tensor projected onto the actions taken.

keras_gym.utils.unit_interval_to_box_tf(tensor, space)[source]

Rescale Tensor values from the unit interval to a Box space. This is essentially inverted min-max scaling:

\[x\ \mapsto\ x_\text{low} + (x_\text{high} - x_\text{low})\,x\]
Parameters:
tensor : nd Tensor

A tensor containing a single instance or a batch of elements of a Box space, scaled to the unit interval.

space : gym.spaces.Box

The Box space. This is needed to determine the shape and size of the space.

Returns:
out : nd Tensor, same shape as input

A Tensor with the transformed values. The output values are contained in the provided Box space.

Glossary

In this package we make heavy use of function approximators built from keras.Model objects. In Section 1 we list the available types of function approximators. A function approximator uses multiple keras models to support its full functionality. The different types of keras models are listed in Section 2. Finally, in Section 3 we list the different kinds of inputs and outputs that our keras models expect.

1. Function approximator types

function approximator
A function approximator is any object that can be updated from interaction with the environment, e.g. a value function or an updateable policy (see the four types listed below).
body
The body is what we call the part of the computation graph that may be shared between e.g. the policy (actor) and the value function (critic). It is typically the part of a neural net that does most of the heavy lifting. One may think of the body() as an elaborate automatic feature extractor.
head

The head is the part of the computation graph that actually generates the desired output format/shape. As its input, it takes the output of body. The different heads that FunctionApproximator class provides are:

head_v

This is the state value head. It returns a batch of scalar values V.

head_q1

This is the type-I Q-value head. It returns a batch of scalar values Q_sa.

head_q2

This is the type-II Q-value head. It returns a batch of vectors Q_s.

head_pi

This is the policy head. It returns a batch of distribution parameters Z.
forward_pass
This is just the consecutive application of head after body (a concrete sketch is given at the end of this section).

In this package we have four distinct types of function approximators:

state value function
State value functions \(v(s)\) are implemented by V.
type-I state-action value function

This is the standard state-action value function \(q(s,a)\). It models the Q-function as

\[(s, a) \mapsto q(s,a)\ \in\ \mathbb{R}\]

This function approximator is implemented by QTypeI.

type-II state-action value function

This type of state-action value function is different from type-I in that it models the Q-function as

\[s \mapsto q(s,.)\ \in\ \mathbb{R}^n\]

where \(n\) is the number of actions. The type-II Q-function is implemented by QTypeII.

updateable policy
This function approximator represents a policy directly. It is implemented by e.g. SoftmaxPolicy.
actor-critic
This is a special function approximator that allows for the sharing of parts of the computation graph between a value function (critic) and a policy (actor).

Note

At the moment, type-II Q-functions and updateable policies are only implemented for environments with a Discrete action space.
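
To make the body/head split concrete, here is a minimal sketch, assuming the same constructor pattern as in the example notebooks: a FunctionApproximator subclass that only overrides body, which is then wrapped by a value function and/or an updateable policy (the constructor calls in the comments are illustrative):

import keras_gym as km
from tensorflow import keras


class MLP(km.FunctionApproximator):
    """ small multi-layer perceptron body; the heads come from the base class """
    def body(self, X):
        X = keras.layers.Flatten()(X)
        X = keras.layers.Dense(64, activation='relu')(X)
        return keras.layers.Dense(64, activation='relu')(X)


# the same 'func' (and hence the same body) can back different heads, e.g.:
# func = MLP(env, lr=0.001)
# v = km.V(func)                   # state value function, uses head_v
# pi = km.SoftmaxPolicy(func)      # updateable policy, uses head_pi
# ac = km.ActorCritic(pi, v)       # actor and critic share the body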

2. Keras model types

Each function approximator uses multiple keras.Model objects. The different models are named according to the role they play in the function approximator object:

train_model
This keras.Model is used for training.
predict_model
This keras.Model is used for predicting.
target_model
This keras.Model is a kind of shadow copy of predict_model that is used in off-policy methods. For instance, in DQN we use it for reducing the variance of the bootstrapped target by synchronizing with predict_model only periodically.

Note

The specific inputs depend on the type of function approximator you’re using. These are provided in each individual class’s docs.

3. Keras model inputs/outputs

Each keras.Model object expects specific inputs and outputs. These are provided in each individual function approximator’s docs.

Below we list the different available arrays that we might use as inputs/outputs to our keras models.

S
A batch of (preprocessed) state observations. The shape is [batch_size, ...] where the ellipses might be any number of dimensions.
A
A batch of actions taken, with shape [batch_size].
P
A batch of distribution parameters that allow us to construct action propensities according to the behavior/target policy \(b(a|s)\). For instance, the parameters of a keras_gym.SoftmaxPolicy (for discrete action spaces) are those of a categorical distribution. For continuous action spaces, on the other hand, we use a keras_gym.GaussianPolicy, whose parameters are those of the underlying normal distribution.
Z
Similar to P, this is a batch of distribution parameters. In contrast to P, however, Z represents the primary updateable policy \(\pi_\theta(a|s)\) instead of the behavior/target policy \(b(a|s)\).
G
A batch of (\(\gamma\)-discounted) returns, shape: [batch_size].
Rn

A batch of partial (\(\gamma\)-discounted) returns. For instance, in n-step bootstrapping these are given by:

\[R^{(n)}_t\ =\ R_t + \gamma\,R_{t+1} + \dots + \gamma^{n-1}\,R_{t+n-1}\]

In other words, it’s the part of the n-step return without the bootstrapping term. The shape is [batch_size].

In

A batch of bootstrap factors. For instance, in n-step bootstrapping these are given by \(I^{(n)}_t=\gamma^n\) when bootstrapping and \(I^{(n)}_t=0\) otherwise. It is used in bootstrapped updates. For instance, the n-step bootstrapped target makes use of it as follows:

\[G^{(n)}_t\ =\ R^{(n)}_t + I^{(n)}_t\,Q(S_{t+1}, A_{t+1})\]

The shape is [batch_size]. A worked numpy sketch combining Rn and In is given at the end of this list.

S_next
A batch of (preprocessed) next-state observations. This is typically used in bootstrapping (see In). The shape is [batch_size, ...] where the ellipses might be any number of dimensions.
A_next
A batch of next-actions to be taken. These can be actions that were actually taken (on-policy), but they can also be any other would-be next-actions (off-policy). The shape is [batch_size].
P_next
A batch of action propensities according to the policy \(\pi(a|s)\).
V
A batch of V-values \(v(s)\) of shape [batch_size].
Q_sa
A batch of Q-values \(q(s,a)\) of shape [batch_size].
Q_s
A batch of Q-values \(q(s,.)\) of shape [batch_size, num_actions].
Adv
A batch of advantages \(\mathcal{A}(s,a) = q(s,a) - v(s)\), which has shape: [batch_size].
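
Putting the Rn and In entries together, here is a hedged numpy sketch of how an n-step bootstrapped target could be assembled from a short trace of rewards (illustrative only):

import numpy as np

gamma, n = 0.99, 4
rewards = np.array([1.0, 0.0, 1.0, 1.0])   # R_t, R_{t+1}, ..., R_{t+n-1}
q_next = 2.5                               # bootstrapped Q-value for (S_next, A_next)
done = False                               # whether the episode ended within these n steps

# partial n-step return: R_t + gamma*R_{t+1} + ... + gamma^{n-1}*R_{t+n-1}
Rn = sum(gamma ** k * r for k, r in enumerate(rewards))

# bootstrap factor: gamma^n while bootstrapping, 0 at the end of an episode
In = 0.0 if done else gamma ** n

# n-step bootstrapped target
Gn = Rn + In * q_next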

Release Notes

v0.2.17

  • Made keras-gym compatible with tensorflow v2.0 (unfortunately had to disable eager mode)
  • Added SoftActorCritic class
  • Added frozen_lake/sac script and notebook
  • Added atari/sac script, which is still WIP

v0.2.16

Major update: support Box action spaces.

  • introduced keras_gym.proba_dists sub-module, which implements differentiable probability distributions (incl. differentiable sample() methods)
  • removed policy-based losses in favor of BaseUpdateablePolicy.policy_loss_with_metrics(), which now uses the differentiable ProbaDist objects
  • removed ConjointActorCritic (was redundant)
  • changed how we implement target models: no longer rely on global namespaces; instead we use keras.models.clone_model()
  • changed BaseFunctionApproximator.sync_target_model(): use model.{get,set}_weights()
  • added script and notebook for Pendulum-v0 with PPO

v0.2.15

This is a relatively minor update. Just a couple of small bug fixes.

  • fixed logging, which was broken by abseil (a dependency of tensorflow>=1.14)
  • added enable_logging helper
  • updated some docs

v0.2.13

This version is another major overhaul. In particular, the FunctionApproximator class is introduced, which offers a unified interface for all function approximator types, i.e. state(-action) value functions and updateable policies. This makes it a lot easier to create your own custom function approximator: you only have to define your own forward pass by creating a subclass of FunctionApproximator and providing a body method. Further flexibility is provided by allowing the head method(s) to be overridden.

  • added FunctionApproximator class
  • refactored value functions and policies to just be a wrapper around a FunctionApproximator object
  • MILESTONE: got AlphaZero to work on ConnectFour (although this game is likely too simple to see the real power of AlphaZero - MCTS on its own works fine)

v0.2.12

  • MILESTONE: got PPO working on Atari Pong
  • added PolicyKLDivergence and PolicyEntropy
  • added entropy_beta and ppo_clip_eps kwargs to updateable policies

v0.2.11

  • optimized ActorCritic so that S is fed in only once instead of three times
  • removed all mention of bootstrap_model
  • implemented PPO with ClippedSurrogateLoss

v0.2.10

This is the second overhaul, a complete rewrite in fact. There was just too much of the old scikit-gym structure that was standing in the way of progress.

The main thing that changed in this version is that I ditched the notion of an algorithm. Instead, function approximators carry their own “update strategy”. In the case of Q-functions, this is ‘sarsa’, ‘q_learning’ etc., while policies have the options ‘vanilla’, ‘ppo’, etc.

Value functions carry another property that was previously attributed to algorithm objects. This is the bootstrap-n, i.e. the number of steps over which to delay bootstrapping.

This new structure accommodates modularity much better than the old one.

  • removed algorithms, replaced by ‘bootstrap_n’ and ‘update_strategy’ settings on function approximators
  • implemented ExperienceReplayBuffer
  • milestone: added DQN implementation for Atari 2600 envs.
  • other than that: too much to mention; it really was a complete rewrite

v0.2.9

  • changed definitions of Q-functions to GenericQ and GenericQTypeII
  • added option for efficient bootstrapped updating (bootstrap_model argument in value functions, see example usage: NStepBootstrapV)
  • renamed ValuePolicy to ValueBasedPolicy

v0.2.8

  • implemented base class for updateable policy objects
  • implemented first example of updateable policy: GenericSoftmaxPolicy
  • implemented predefined softmax policy: LinearSoftmaxPolicy
  • added first policy gradient algorithm: Reinforce
  • added REINFORCE example notebook
  • updated documentation

v0.2.7

This was a MAJOR overhaul in which I ported everything from scikit-learn to Keras. The reason for this is that I was stuck on the implementation of policy gradient methods due to the lack of flexibility of the scikit-learn ecosystem. I chose Keras as a replacement: it’s nice and modular like scikit-learn, but much more flexible. In particular, the ability to provide custom loss functions has been the main selling point. Another selling point was that some environments require more sophisticated neural nets than a simple MLP, which are readily available in Keras.

  • added compatibility wrapper for scikit-learn function approximators
  • ported all value functions to use keras.Model
  • ported predefined models LinearV and LinearQ to keras
  • ported algorithms to keras
  • ported all notebooks to keras
  • changed name of the package keras-gym and root module keras_gym

Other changes:

  • added propensity score outputs to policy objects
  • created a stub for directly updateable policies

v0.2.6

  • refactored BaseAlgorithm to simplify implementation (at the cost of more code, but it’s worth it)
  • refactored notebooks: they are now bundled by environment / algo type
  • added n-step bootstrap algorithms:
    • NStepQLearning
    • NStepSarsa
    • NStepExpectedSarsa

v0.2.5

  • added algorithm: keras_gym.algorithms.ExpectedSarsa
  • added object: keras_gym.utils.ExperienceCache
  • rewrote MonteCarlo to use ExperienceCache

v0.2.4

  • added algorithm: keras_gym.algorithms.MonteCarlo

v0.2.3

  • added algorithm: keras_gym.algorithms.Sarsa

v0.2.2

  • changed doc theme from sklearn to readthedocs

v0.2.1

  • first working implementation value function + policy + algorithm
  • added first working example in a notebook
  • added algorithm: keras_gym.algorithms.QLearning


Example

To get started, check out the Example Notebooks, or watch this short tutorial video:


Here’s one of the examples from the notebooks, in which we solve the CartPole-v0 environment with the SARSA algorithm, using a simple linear function approximator for our Q-function:

import gym
import keras_gym as km
from tensorflow import keras


# the cart-pole MDP
env = gym.make('CartPole-v0')


class Linear(km.FunctionApproximator):
    """ linear function approximator """
    def body(self, X):
        # body is trivial, only flatten and then pass to head (one dense layer)
        return keras.layers.Flatten()(X)


# value function and its derived policy
func = Linear(env, lr=0.001)
q = km.QTypeI(func, update_strategy='sarsa')
policy = km.EpsilonGreedy(q)

# static parameters
num_episodes = 200
num_steps = env.spec.max_episode_steps

# used for early stopping
num_consecutive_successes = 0


# train
for ep in range(num_episodes):
    s = env.reset()
    policy.epsilon = 0.1 if ep < 10 else 0.01

    for t in range(num_steps):
        a = policy(s)
        s_next, r, done, info = env.step(a)

        q.update(s, a, r, done)

        if done:
            if t == num_steps - 1:
                num_consecutive_successes += 1
                print("num_consecutive_successes: {}"
                      .format(num_consecutive_successes))
            else:
                num_consecutive_successes = 0
                print("failed after {} steps".format(t))
            break

        s = s_next

    if num_consecutive_successes == 10:
        break


# run env one more time to render
km.render_episode(env, policy, step_delay_ms=25)