# Glossary

In this package we make heavy use of function approximators built from
`keras.Model` objects. In Section 1 we list the available types of
function approximators. A function approximator uses multiple keras models to
support its full functionality. The different types of keras models are listed in
Section 2. Finally, in Section 3 we list the different kinds of inputs and
outputs that our keras models expect.

## 1. Function approximator types

- function approximator
- A function approximator is any object that can be updated.
- body
- The *body* is what we call the part of the computation graph that may be shared between e.g. the policy (actor) and the value function (critic). It is typically the part of a neural net that does most of the heavy lifting. One may think of the `body()` as an elaborate automatic feature extractor.
- head
- The *head* is the part of the computation graph that actually generates the desired output format/shape. As its input, it takes the output of the body. The different heads that the `FunctionApproximator` class provides are:
  - **head_v**: This is the state value head. It returns a batch of scalar values V.
  - **head_q1**: This is the type-I Q-value head. It returns a batch of scalar values Q_sa.
  - **head_q2**: This is the type-II Q-value head. It returns a batch of vectors Q_s.
  - **head_pi**: This is the policy head. It returns a batch of distribution parameters Z.
- forward_pass
- This is just the consecutive application of the head after the body (see the sketch below).
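To make the body/head split concrete, here is a minimal functional-API sketch. The layer sizes and names are purely illustrative assumptions, not keras-gym's actual implementation:

```python
from tensorflow import keras

# A minimal sketch of the body/head split (illustrative layer sizes and
# names; this is not keras-gym's actual implementation).
S = keras.Input(shape=(4,), name='S')  # batch of state observations

# body: the shared feature extractor that does the heavy lifting
X = keras.layers.Dense(64, activation='relu')(S)
X = keras.layers.Dense(64, activation='relu')(X)

# head_v: state value head, returns a batch of scalars V
V = keras.layers.Dense(1, name='head_v')(X)

# head_q2: type-II Q-value head, returns a batch of vectors Q_s
Q_s = keras.layers.Dense(3, name='head_q2')(X)

# forward_pass: the consecutive application of a head after the body
model = keras.Model(inputs=S, outputs=[V, Q_s])
```

Because both heads consume the same body output `X`, updating either head also updates the shared feature extractor.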

In this package we have the following distinct types of function approximators:

- state value function
- State value functions \(v(s)\) are implemented by `V`.
- type-I state-action value function
- This is the standard state-action value function \(q(s,a)\). It models the Q-function as

  \[(s, a)\ \mapsto\ q(s,a)\ \in\ \mathbb{R}\]

  This function approximator is implemented by `QTypeI`.
- type-II state-action value function
- This type of state-action value function is different from type-I in that it models the Q-function as

  \[s\ \mapsto\ q(s,.)\ \in\ \mathbb{R}^n\]

  where \(n\) is the number of actions (see the sketch after this list). The type-II Q-function is implemented by `QTypeII`.
- updateable policy
- This function approximator represents a policy directly. It is implemented by e.g. `SoftmaxPolicy`.
- actor-critic
- This is a special function approximator that allows for the sharing of parts of the computation graph between a value function (critic) and a policy (actor).
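The practical difference between type-I and type-II is the shape contract of their inputs and outputs. Below is a hedged numpy sketch of that contract, using a random linear model as a stand-in for the actual `QTypeI`/`QTypeII` classes:

```python
import numpy as np

# Shape contract of the two Q-function types, using a random linear model
# as a stand-in (purely illustrative; not the QTypeI/QTypeII API).
batch_size, state_dim, num_actions = 32, 8, 4
rng = np.random.default_rng(13)

S = rng.normal(size=(batch_size, state_dim))    # batch of states
A = rng.integers(num_actions, size=batch_size)  # batch of actions
W = rng.normal(size=(state_dim, num_actions))   # dummy model parameters

# type-II: s -> q(s,.) in R^n, i.e. one value per action
Q_s = S @ W
assert Q_s.shape == (batch_size, num_actions)

# type-I: (s, a) -> q(s,a) in R; here recovered from Q_s by indexing with A
Q_sa = Q_s[np.arange(batch_size), A]
assert Q_sa.shape == (batch_size,)
```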

Note

At the moment, type-II Q-functions and updateable policies are only
implemented for environments with a `Discrete` action space.

## 2. Keras model types

Each function approximator takes multiple `keras.Model` objects. The
different models are named according to the role they play in the function
approximator object:

- train_model
- This `keras.Model` is used for training.
- predict_model
- This `keras.Model` is used for predicting.
- target_model
- This `keras.Model` is a kind of shadow copy of predict_model that is used in off-policy methods. For instance, in DQN we use it to reduce the variance of the bootstrapped target by synchronizing it with predict_model only periodically (see the sketch below).
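As a concrete illustration of the target_model's role, here is a minimal sketch of the periodic synchronization used in DQN. The helper names are hypothetical; keras-gym constructs and synchronizes these models internally:

```python
from tensorflow import keras

# Sketch of the predict_model / target_model split (hypothetical helpers;
# keras-gym builds and synchronizes these models internally).
def build_q_model(state_dim=8, num_actions=4):
    S = keras.Input(shape=(state_dim,))
    X = keras.layers.Dense(64, activation='relu')(S)
    Q_s = keras.layers.Dense(num_actions)(X)
    return keras.Model(S, Q_s)

predict_model = build_q_model()
target_model = build_q_model()  # shadow copy with the same architecture

def sync_target_model():
    # overwrite the target weights with the current predict weights;
    # in DQN this is done only once every so many training steps
    target_model.set_weights(predict_model.get_weights())

sync_target_model()
```

Keeping the target weights frozen between syncs is what reduces the variance of the bootstrapped target.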

Note

The specific inputs depend on the type of function approximator you're using. These are listed in each individual class's documentation.

## 3. Keras model inputs/outputs

Each `keras.Model` object expects specific inputs and outputs. These are
provided in each individual function approximator's docs.

Below we list the different available arrays that we might use as inputs/outputs to our keras models.

- S
- A batch of (preprocessed) state observations. The shape is `[batch_size, ...]` where the ellipsis might be any number of dimensions.
- A
- A batch of actions taken, with shape `[batch_size]`.
- P
- A batch of distribution parameters that allow us to construct action propensities according to the behavior/target policy \(b(a|s)\). For instance, the parameters of a `keras_gym.SoftmaxPolicy` (for discrete action spaces) are those of a categorical distribution. On the other hand, for continuous action spaces we use a `keras_gym.GaussianPolicy`, whose parameters are those of the underlying normal distribution.
- Z
- Similar to P, this is a batch of distribution parameters. In contrast to P, however, Z represents the primary updateable policy \(\pi_\theta(a|s)\) rather than the behavior/target policy \(b(a|s)\).
- G
- A batch of (\(\gamma\)-discounted) returns, with shape `[batch_size]`.
- Rn
- A batch of partial (\(\gamma\)-discounted) returns. For instance, in n-step bootstrapping these are given by:

  \[R^{(n)}_t\ =\ R_t + \gamma\,R_{t+1} + \dots + \gamma^{n-1}\,R_{t+n-1}\]

  In other words, it's the part of the n-step return *without* the bootstrapping term. The shape is `[batch_size]`.
- In
- A batch of bootstrap factors. For instance, in n-step bootstrapping these are given by \(I^{(n)}_t=\gamma^n\) when bootstrapping and \(I^{(n)}_t=0\) otherwise. It is used in bootstrapped updates. For instance, the n-step bootstrapped target makes use of it as follows:

  \[G^{(n)}_t\ =\ R^{(n)}_t + I^{(n)}_t\,Q(S_{t+1}, A_{t+1})\]

  The shape is `[batch_size]` (see the sketch after this list).
- S_next
- A batch of (preprocessed) next-state observations. This is typically used in bootstrapping (see In). The shape is `[batch_size, ...]` where the ellipsis might be any number of dimensions.
- A_next
- A batch of next-actions to be taken. These can be actions that were actually taken (on-policy), but they can also be any other would-be next-actions (off-policy). The shape is `[batch_size]`.
- P_next
- A batch of action propensities according to the policy \(\pi(a|s)\).
- V
- A batch of V-values \(v(s)\) of shape `[batch_size]`.
- Q_sa
- A batch of Q-values \(q(s,a)\) of shape `[batch_size]`.
- Q_s
- A batch of Q-values \(q(s,.)\) of shape `[batch_size, num_actions]`.
- Adv
- A batch of advantages \(\mathcal{A}(s,a) = q(s,a) - v(s)\), which has shape `[batch_size]`.
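To tie Rn, In, and the bootstrapped target together, here is a hedged numpy sketch of the computation for a single transition. The reward values and the Q-value are dummies; this is not keras-gym's internal code:

```python
import numpy as np

# Assembling the n-step bootstrapped target G from Rn and In
# (simplified single-transition sketch; not keras-gym's internal code).
gamma, n = 0.9, 3
rewards = np.array([1.0, 0.0, 2.0])  # R_t, R_{t+1}, R_{t+2}

# Rn: the partial n-step return, without the bootstrapping term
Rn = sum(gamma**k * r for k, r in enumerate(rewards[:n]))

# In: gamma**n when bootstrapping, 0.0 when the episode has ended
done = False
In = 0.0 if done else gamma**n

# Q-value of the bootstrap state-action pair, as produced by the
# target_model; we plug in a dummy value here
Q_next = 0.5

G = Rn + In * Q_next  # the n-step bootstrapped target
print(Rn, In, G)      # approx. 2.62 0.729 2.9845
```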