In this package we make heavy use of function approximators built on keras.Model objects. Section 1 lists the available types of function approximators. A function approximator uses multiple keras models to support its full functionality; the different types of keras models are listed in Section 2. Finally, Section 3 lists the different kinds of inputs and outputs that our keras models expect.

1. Function approximator types

function approximator
A function approximator is any object whose parameters can be updated through training.
The body is what we call the part of the computation graph that may be shared between e.g. the policy (actor) and the value function (critic). It is typically the part of a neural net that does most of the heavy lifting. One may think of the body as an elaborate automatic feature extractor.

The head is the part of the computation graph that actually generates the desired output format/shape. As its input, it takes the output of the body. The different heads that the FunctionApproximator class provides are:

state value head
This head returns a batch of scalar state values V.

type-I Q-value head
This head returns a batch of scalar Q-values Q_sa.

type-II Q-value head
This head returns a batch of vector Q-values Q_s.

policy head
This head returns a batch of distribution parameters Z.

forward pass
This is just the consecutive application of the head after the body.
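The body/head decomposition can be sketched with plain NumPy stand-ins (the weight shapes and function names here are hypothetical, not the actual keras.Model objects):

```python
import numpy as np

# Illustrative sketch: the body maps raw observations to learned features,
# and each head maps those shared features to its own output format.
rng = np.random.default_rng(0)
W_body = rng.normal(size=(4, 8))    # feature-extractor weights (hypothetical shapes)
w_v = rng.normal(size=(8,))         # state value head weights
W_pi = rng.normal(size=(8, 3))      # policy head weights (3 discrete actions)

def body(S):
    """Shared feature extractor: [batch_size, 4] -> [batch_size, 8]."""
    return np.tanh(S @ W_body)

def head_v(X):
    """State value head: features -> batch of scalars V."""
    return X @ w_v

def head_pi(X):
    """Policy head: features -> batch of distribution parameters Z (logits)."""
    return X @ W_pi

def forward_pass(S):
    """Consecutive application of a head after the body."""
    return head_v(body(S))

S = rng.normal(size=(5, 4))     # batch of 5 observations
print(forward_pass(S).shape)    # (5,)
print(head_pi(body(S)).shape)   # (5, 3)
```

Because both heads consume the output of the same body, gradient updates through either head improve the shared feature extractor.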

In this package we have the following distinct types of function approximators:

state value function
State value functions \(v(s)\) are implemented by V.
type-I state-action value function

This is the standard state-action value function \(q(s,a)\). It models the Q-function as

\[(s, a) \mapsto q(s,a)\ \in\ \mathbb{R}\]

This function approximator is implemented by QTypeI.

type-II state-action value function

This type of state-action value function is different from type-I in that it models the Q-function as

\[s \mapsto q(s,.)\ \in\ \mathbb{R}^n\]

where \(n\) is the number of actions. The type-II Q-function is implemented by QTypeII.
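The contrast between the two Q-function types can be illustrated with a toy lookup table in place of a neural net (the names q_type1/q_type2 are illustrative only):

```python
import numpy as np

num_states, num_actions = 4, 3
rng = np.random.default_rng(1)
Q_table = rng.normal(size=(num_states, num_actions))  # toy stand-in for a trained model

def q_type1(s, a):
    """Type-I: (s, a) -> scalar q(s, a)."""
    return Q_table[s, a]

def q_type2(s):
    """Type-II: s -> vector q(s, .) of length num_actions."""
    return Q_table[s]

s, a = 2, 1
assert q_type1(s, a) == q_type2(s)[a]  # the two views agree on any (s, a) pair
print(q_type2(s).shape)                # (3,)
```

The type-II form is convenient for discrete action spaces because a single forward pass yields the values of all actions, e.g. for greedy action selection via `q_type2(s).argmax()`.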

updateable policy
This function approximator represents a policy directly. It is implemented by e.g. SoftmaxPolicy.

actor-critic
This is a special function approximator that allows for the sharing of parts of the computation graph between a value function (critic) and a policy (actor).


At the moment, type-II Q-functions and updateable policies are only implemented for environments with a Discrete action space.

2. Keras model types

Each function approximator uses multiple keras.Model objects. The different models are named according to the role they play in the function approximator object:

train_model
This keras.Model is used for training.

predict_model
This keras.Model is used for predicting.

target_model
This keras.Model is a kind of shadow copy of predict_model that is used in off-policy methods. For instance, in DQN we use it to reduce the variance of the bootstrapped target by synchronizing it with predict_model only periodically.
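The periodic-synchronization mechanic can be sketched with plain weight arrays (the variable names and training loop here are illustrative, not keras-gym's API):

```python
import numpy as np

# The target weights are a shadow copy of the predict weights, refreshed
# only once every `sync_period` training steps; between syncs they lag
# behind, which stabilizes the bootstrapped target.
rng = np.random.default_rng(2)
predict_weights = rng.normal(size=(8,))
target_weights = predict_weights.copy()

sync_period = 10
for t in range(1, 31):
    predict_weights += 0.01 * rng.normal(size=(8,))  # stand-in for a training step
    if t % sync_period == 0:
        target_weights = predict_weights.copy()      # periodic synchronization

# After step 30 (a multiple of sync_period) the two coincide again.
assert np.allclose(target_weights, predict_weights)
```

A common alternative to this hard periodic copy is a soft (Polyak) update that blends a small fraction of the predict weights into the target weights at every step.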



3. Keras model inputs/outputs

Each keras.Model object expects specific inputs and outputs. These are provided in each individual function approximator’s docs.

Below we list the different available arrays that we might use as inputs/outputs to our keras models.

S
A batch of (preprocessed) state observations. The shape is [batch_size, ...] where the ellipses may be any number of dimensions.

A
A batch of actions taken, with shape [batch_size].

P
A batch of distribution parameters that allow us to construct action propensities according to the behavior/target policy \(b(a|s)\). For instance, the parameters of a keras_gym.SoftmaxPolicy (for discrete action spaces) are those of a categorical distribution, whereas for continuous action spaces we use a keras_gym.GaussianPolicy, whose parameters are those of the underlying normal distribution.

Z
Similar to P, this is a batch of distribution parameters. In contrast to P, however, Z represents the primary updateable policy \(\pi_\theta(a|s)\) rather than the behavior/target policy \(b(a|s)\).
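For a discrete action space, the distribution parameters are the logits of a categorical distribution, and propensities follow via a softmax. A generic sketch (not keras_gym.SoftmaxPolicy itself):

```python
import numpy as np

def softmax(Z, axis=-1):
    """Map a batch of logits to categorical probabilities."""
    Z = Z - Z.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    expZ = np.exp(Z)
    return expZ / expZ.sum(axis=axis, keepdims=True)

Z = np.array([[2.0, 1.0, 0.1],
              [0.0, 0.0, 0.0]])  # batch of logits, shape [batch_size, num_actions]
probs = softmax(Z)
print(probs.shape)               # (2, 3); each row sums to 1
```

A Gaussian policy works analogously for continuous actions, except that the parameters are means and (log-)variances of a normal distribution instead of logits.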
G
A batch of (\(\gamma\)-discounted) returns, with shape [batch_size].

Rn
A batch of partial (\(\gamma\)-discounted) returns. For instance, in n-step bootstrapping these are given by:

\[R^{(n)}_t\ =\ R_t + \gamma\,R_{t+1} + \dots + \gamma^{n-1}\,R_{t+n-1}\]

In other words, it’s the part of the n-step return without the bootstrapping term. The shape is [batch_size].
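For a single time step the partial return above can be computed directly from a reward sequence (an illustrative helper, not a library function):

```python
# Partial n-step return: R(n)_t = R_t + g*R_{t+1} + ... + g^(n-1)*R_{t+n-1},
# i.e. the n-step return without the bootstrapping term.
def partial_return(rewards, t, n, gamma):
    return sum(gamma ** k * rewards[t + k] for k in range(n))

rewards = [1.0, 2.0, 3.0, 4.0]
rn = partial_return(rewards, t=0, n=3, gamma=0.9)
print(rn)  # 1 + 0.9*2 + 0.81*3 = 5.23
```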


In
A batch of bootstrap factors. For instance, in n-step bootstrapping these are given by \(I^{(n)}_t=\gamma^n\) when bootstrapping and \(I^{(n)}_t=0\) otherwise. It is used in bootstrapped updates; for instance, the n-step bootstrapped target makes use of it as follows:

\[G^{(n)}_t\ =\ R^{(n)}_t + I^{(n)}_t\,Q(S_{t+1}, A_{t+1})\]

The shape is [batch_size].
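Because the bootstrap factor is zero at episode boundaries, the same vectorized formula covers both the bootstrapped and the terminal case. A sketch with illustrative arrays (batch_size = 3):

```python
import numpy as np

gamma, n = 0.9, 3
Rn = np.array([5.23, 2.0, 1.0])        # partial n-step returns
done = np.array([False, False, True])  # did the episode end within the n steps?
In = np.where(done, 0.0, gamma ** n)   # bootstrap factors: gamma**n or 0
Q_next = np.array([10.0, -1.0, 7.0])   # Q(S_{t+1}, A_{t+1}) estimates

# n-step bootstrapped target: G(n)_t = R(n)_t + I(n)_t * Q(S_{t+1}, A_{t+1})
G = Rn + In * Q_next
print(G)  # last entry equals Rn[2], since In[2] == 0
```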

S_next
A batch of (preprocessed) next-state observations. This is typically used in bootstrapping (see In). The shape is [batch_size, ...] where the ellipses may be any number of dimensions.

A_next
A batch of next-actions to be taken. These can be actions that were actually taken (on-policy), but they can also be any other would-be next-actions (off-policy). The shape is [batch_size].
A batch of action propensities according to the policy \(\pi(a|s)\).
V
A batch of V-values \(v(s)\), with shape [batch_size].

Q_sa
A batch of Q-values \(q(s,a)\), with shape [batch_size].

Q_s
A batch of Q-values \(q(s,.)\), with shape [batch_size, num_actions].

Adv
A batch of advantages \(\mathcal{A}(s,a) = q(s,a) - v(s)\), with shape [batch_size].
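The relation between these output arrays can be shown in a few lines: projecting a type-II output onto the actions taken yields the type-I values, and subtracting the state values yields the advantages (illustrative arrays only):

```python
import numpy as np

Q_s = np.array([[1.0, 3.0, 2.0],
                [0.5, 0.0, 4.0]])  # shape [batch_size, num_actions]
A = np.array([1, 2])               # actions taken, shape [batch_size]
V = np.array([2.0, 1.5])           # state values, shape [batch_size]

Q_sa = Q_s[np.arange(len(A)), A]   # pick q(s, a) per row -> shape [batch_size]
Adv = Q_sa - V                     # advantages q(s, a) - v(s)
print(Q_sa)  # [3. 4.]
print(Adv)   # [1.  2.5]
```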