Function Approximators

The central object in this package is the keras_gym.FunctionApproximator, which provides an interface between a gym-type environment and function approximators such as value functions and updateable policies.

FunctionApproximator class

We define a function approximator by specifying its body. For instance, the example below specifies a simple multi-layer perceptron:

import gym
import keras_gym as km
from tensorflow import keras

class MLP(km.FunctionApproximator):
    """ multi-layer perceptron with one hidden layer """
    def body(self, S):
        X = keras.layers.Flatten()(S)
        X = keras.layers.Dense(units=4)(X)
        return X

# environment
env = gym.make(...)

# value function and its derived policy
function_approximator = MLP(env, lr=0.01)

This function_approximator can now be used to construct a value function or updateable policy, which we cover in the remainder of this page.

Predefined Function Approximators

Although it’s pretty easy to create a custom function approximator, keras-gym also provides some predefined function approximators. They are listed here.

Value Functions

Value functions estimate the expected (discounted) sum of future rewards. For instance, state value functions are defined as:

\[v(s)\ =\ \mathbb{E}_t\left\{ R_t + \gamma R_{t+1} + \gamma^2 R_{t+2} + \dots\ \Big|\ S_t=s \right\}\]

Here, the \(R\) are the individual rewards we receive from the Markov Decision Process (MDP) at each time step.
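For intuition, the discounted sum inside the expectation can be computed directly from a finite reward sequence. This is a plain-Python sketch for illustration, not part of the keras-gym API:

```python
def discounted_return(rewards, gamma):
    """Compute R_t + gamma*R_{t+1} + gamma^2*R_{t+2} + ... for a finite episode."""
    G = 0.0
    # iterate backwards so each step folds in exactly one discount factor
    for r in reversed(rewards):
        G = r + gamma * G
    return G

print(discounted_return([1.0, 1.0, 1.0], gamma=0.9))  # 1 + 0.9 + 0.81 = 2.71
```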

In keras-gym we define a state value function as follows:

v = km.V(function_approximator, gamma=0.9, bootstrap_n=1)

The function_approximator is discussed above. The other arguments set the discount factor \(\gamma\in[0,1]\) and the number of steps over which to bootstrap.
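To make the bootstrap_n argument concrete: with n-step bootstrapping, the update target sums the first n discounted rewards and then falls back on the current value estimate. A plain-Python sketch of that target (illustrative only, not the keras-gym internals):

```python
def n_step_target(rewards, v_next, gamma, n):
    """n-step bootstrapped target:
        sum_{k=0}^{n-1} gamma^k * R_{t+k}  +  gamma^n * v(S_{t+n})
    'rewards' holds R_t, ..., R_{t+n-1}; 'v_next' is the current estimate v(S_{t+n}).
    """
    assert len(rewards) == n
    G = sum(gamma**k * r for k, r in enumerate(rewards))
    return G + gamma**n * v_next

# with bootstrap_n=1 this reduces to the familiar TD(0) target R_t + gamma*v(S_{t+1})
print(n_step_target([1.0], v_next=5.0, gamma=0.9, n=1))  # 1.0 + 0.9*5.0 = 5.5
```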

Similar to state value functions, we can also define state-action value functions:

\[q(s, a)\ =\ \mathbb{E}_t\left\{ R_t + \gamma R_{t+1} + \gamma^2 R_{t+2} + \dots\ \Big|\ S_t=s, A_t=a \right\}\]

keras-gym provides two distinct ways to define such a Q-function, which are referred to as type-I and type-II Q-functions. The difference between the two is in how the function approximator models the Q-function. A type-I Q-function models the Q-function as \((s, a)\mapsto q(s, a)\in\mathbb{R}\), whereas a type-II Q-function models it as \(s\mapsto q(s,\cdot)\in\mathbb{R}^n\). Here, \(n\) is the number of actions, which means that this is only well-defined for discrete action spaces.
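The shape difference between the two types can be illustrated with a toy lookup table standing in for the neural network (hypothetical values, not the keras-gym implementation):

```python
# toy "model" for a discrete action space with n=3 actions
q_table = {
    's0': [0.1, 0.5, 0.2],  # hypothetical per-action values for state 's0'
}

def q_type1(s, a):
    """type-I: (s, a) -> q(s, a); one scalar per call"""
    return q_table[s][a]

def q_type2(s):
    """type-II: s -> q(s, .); a vector of length n in a single call"""
    return q_table[s]

print(q_type1('s0', 1))  # 0.5
print(q_type2('s0'))     # [0.1, 0.5, 0.2]
```

Type-II Q-functions yield all action values in one forward pass, which is convenient for greedy action selection; type-I Q-functions also work for non-discrete action inputs.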

In keras-gym we define a type-I Q-function as follows:

q = km.QTypeI(function_approximator, update_strategy='sarsa')

and similarly for type-II:

q = km.QTypeII(function_approximator, update_strategy='sarsa')

The update_strategy argument specifies our bootstrapping target. Available choices are 'sarsa', 'q_learning' and 'double_q_learning'.
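The three strategies differ only in which next-state value they bootstrap on. A simplified sketch of the respective targets (plain Python for illustration; the actual keras-gym implementation operates on batches of network outputs):

```python
def sarsa_target(r, gamma, q_next, a_next):
    """SARSA: bootstrap on the action actually taken next."""
    return r + gamma * q_next[a_next]

def q_learning_target(r, gamma, q_next):
    """Q-learning: bootstrap on the greedy action."""
    return r + gamma * max(q_next)

def double_q_learning_target(r, gamma, q_next_primary, q_next_target):
    """Double Q-learning: one network selects the greedy action,
    the other evaluates it, reducing the overestimation bias."""
    a_greedy = max(range(len(q_next_primary)), key=q_next_primary.__getitem__)
    return r + gamma * q_next_target[a_greedy]

q_next = [1.0, 3.0, 2.0]
print(sarsa_target(0.5, 0.9, q_next, a_next=0))  # 0.5 + 0.9*1.0 = 1.4
print(q_learning_target(0.5, 0.9, q_next))       # 0.5 + 0.9*3.0 = 3.2
```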

The main reason for using a Q-function is for value-based control. In other words, we typically want to derive a policy from the Q-function. This is pretty straightforward too:

pi = km.EpsilonGreedy(q, epsilon=0.1)

# the epsilon parameter may be updated dynamically
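The epsilon-greedy rule itself is simple: explore uniformly at random with probability epsilon, otherwise act greedily with respect to the Q-values. A minimal stand-alone sketch (not the keras-gym class, just the underlying idea):

```python
import random

def epsilon_greedy_action(q_values, epsilon, rng=random):
    """With probability epsilon pick a uniformly random action,
    otherwise pick the greedy (argmax) action."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    return max(range(len(q_values)), key=q_values.__getitem__)

# with epsilon=0 the policy is fully greedy
print(epsilon_greedy_action([0.1, 0.9, 0.3], epsilon=0.0))  # 1
```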

Updateable Policies

Besides value-based control in which we derive a policy from a Q-function, we can also do policy-based control. In policy-based methods we learn a policy directly as a probability distribution over the space of actions \(\pi(a|s)\).

The updateable policies for discrete action spaces are known as softmax policies:

\[\pi(a|s)\ =\ \frac{\exp z(s,a)}{\sum_{a'}\exp z(s,a')}\]

where the logits are defined over the real line \(z(s,a)\in\mathbb{R}\).
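The formula above can be computed directly from the logits; in practice one subtracts the maximum logit first for numerical stability, which leaves the probabilities unchanged. A plain-Python sketch:

```python
import math

def softmax(logits):
    """pi(a|s) = exp z(s,a) / sum_{a'} exp z(s,a'),
    shifted by max(logits) for numerical stability."""
    z_max = max(logits)
    exps = [math.exp(z - z_max) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax([1.0, 2.0, 3.0])
print(probs)            # monotonically increasing in the logits
print(sum(probs))       # sums to 1: a valid probability distribution
```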

In keras-gym we define a softmax policy as follows:

pi = km.SoftmaxPolicy(function_approximator, update_strategy='vanilla')

Similar to Q-functions, we can pick different update strategies. Available options for policies are 'vanilla', 'ppo' and 'cross_entropy'. These specify the objective function used in our policy updates.
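To give a feel for how these objectives differ, here is a simplified sketch of the 'vanilla' policy-gradient objective and the PPO clipped surrogate for a single transition (illustrative only; the clipping threshold eps=0.2 is an assumed value, and the real implementations operate on batches):

```python
import math

def vanilla_objective(pi_a, advantage):
    """'vanilla' policy gradient: log pi(a|s) weighted by the advantage."""
    return math.log(pi_a) * advantage

def ppo_objective(pi_a, pi_a_old, advantage, eps=0.2):
    """PPO clipped surrogate: the probability ratio pi/pi_old is clipped
    to [1 - eps, 1 + eps], keeping the update close to the old policy."""
    ratio = pi_a / pi_a_old
    clipped = max(1.0 - eps, min(1.0 + eps, ratio))
    return min(ratio * advantage, clipped * advantage)

# a large probability ratio gets clipped, removing the incentive to move further
print(ppo_objective(pi_a=0.9, pi_a_old=0.3, advantage=1.0))  # clipped at 1.2
```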


It’s often useful to combine a policy with a value function into what is called an actor-critic. The value function (critic) can be used to aid the update procedure for the policy (actor). The keras-gym package provides a simple way of constructing an actor-critic using the ActorCritic class:

# separate policy and value function
pi = km.SoftmaxPolicy(function_approximator, update_strategy='vanilla')
v = km.V(function_approximator, gamma=0.9, bootstrap_n=1)

# combine them into a single actor-critic
actor_critic = km.ActorCritic(pi, v)
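To see how the critic aids the actor: policy updates typically weight the policy gradient by an advantage estimate, and the state value function serves as the baseline in that estimate. A plain-Python sketch of the one-step advantage (illustrative, not the keras-gym internals):

```python
def one_step_advantage(r, gamma, v_s, v_s_next, done):
    """A(s, a) ~= R + gamma * v(S') - v(S); the bootstrap term
    is dropped on terminal transitions."""
    bootstrap = 0.0 if done else gamma * v_s_next
    return r + bootstrap - v_s

# positive advantage: the action did better than the critic expected
print(one_step_advantage(r=1.0, gamma=0.9, v_s=0.5, v_s_next=1.0, done=False))  # 1.4
```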