# Function Approximators¶

The central object in this package is keras_gym.FunctionApproximator, which provides the interface between a gym-type environment and function approximators like value functions and updateable policies.

## FunctionApproximator class¶

We define a function approximator by specifying a body, i.e. the part of the model that maps an observation to a feature vector. For instance, the example below specifies a simple multi-layer perceptron:

import gym
import keras_gym as km
from tensorflow import keras

class MLP(km.FunctionApproximator):
    """ multi-layer perceptron with one hidden layer """
    def body(self, S):
        X = keras.layers.Flatten()(S)
        X = keras.layers.Dense(units=4)(X)
        return X

# environment
env = gym.make(...)

# function approximator
function_approximator = MLP(env, lr=0.01)
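
For instance, a concrete choice of environment would be (the environment id below is purely illustrative; any gym environment works here):

env = gym.make('CartPole-v0')  # illustrative: the classic cart-pole control task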


This function_approximator can now be used to construct a value function or updateable policy, which we cover in the remainder of this page.

## Predefined Function Approximators¶

Although it’s pretty easy to create a custom function approximator, keras-gym also provides some predefined function approximators; see the package’s API documentation for the full list.

## Value Functions¶

Value functions estimate the expected (discounted) sum of future rewards. For instance, state value functions are defined as:

$v(s)\ =\ \mathbb{E}_t\left\{ R_t + \gamma R_{t+1} + \gamma^2 R_{t+2} + \dots\ \Big|\ S_t=s \right\}$

Here, the $$R_t$$ are the individual rewards we receive from the Markov Decision Process (MDP) at each time step.

In keras-gym we define a state value function as follows:

v = km.V(function_approximator, gamma=0.9, bootstrap_n=1)


The function_approximator is discussed above. The other arguments set the discount factor $$\gamma\in[0,1]$$ and the number of steps over which to bootstrap.
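
Concretely, bootstrap_n=1 means the value function is updated towards the one-step bootstrapped target; for general $$n$$, the n-step bootstrapped target takes the standard form:

$G^{(n)}_t\ =\ R_t + \gamma R_{t+1} + \dots + \gamma^{n-1} R_{t+n-1} + \gamma^n\,v(S_{t+n})$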

Similar to state value functions, we can also define state-action value functions:

$q(s, a)\ =\ \mathbb{E}_t\left\{ R_t + \gamma R_{t+1} + \gamma^2 R_{t+2} + \dots\ \Big|\ S_t=s, A_t=a \right\}$

keras-gym provides two distinct ways to define such a Q-function, which are referred to as type-I and type-II Q-functions. The difference between the two is in how the function approximator models the Q-function. A type-I Q-function models the Q-function as $$(s, a)\mapsto q(s, a)\in\mathbb{R}$$, whereas a type-II Q-function models it as $$s\mapsto q(s,\cdot)\in\mathbb{R}^n$$. Here, $$n$$ is the number of actions, which means that a type-II Q-function is only well-defined for discrete action spaces.
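
To make the difference concrete, the snippet below sketches the two mappings as bare Keras heads. This is only a schematic illustration that glosses over keras-gym's internals; the feature sizes and layer choices are assumptions, not the library's actual implementation.

from tensorflow import keras

num_actions = 3  # illustrative: size of a discrete action space

# type-I: features of a (state, action) pair go in, a single scalar comes out
state_action_features = keras.Input(shape=(8,))              # assumed feature size
q_sa = keras.layers.Dense(units=1)(state_action_features)    # shape (batch, 1)

# type-II: features of the state alone go in, one value per action comes out
state_features = keras.Input(shape=(8,))
q_s = keras.layers.Dense(units=num_actions)(state_features)  # shape (batch, num_actions)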

In keras-gym we define a type-I Q-function as follows:

q = km.QTypeI(function_approximator, update_strategy='sarsa')


and similarly for type-II:

q = km.QTypeII(function_approximator, update_strategy='sarsa')


The update_strategy argument specifies our bootstrapping target. Available choices are 'sarsa', 'q_learning' and 'double_q_learning'.
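
Roughly speaking, these correspond to the following one-step bootstrap targets, where $$q_\text{targ}$$ denotes a separate target network (the exact implementation in keras-gym may differ in detail):

$G_t\ =\ R_t + \gamma\,q(S_{t+1}, A_{t+1}) \qquad \text{(sarsa)}$

$G_t\ =\ R_t + \gamma\,\max_{a'} q(S_{t+1}, a') \qquad \text{(q\_learning)}$

$G_t\ =\ R_t + \gamma\,q_\text{targ}\!\left(S_{t+1}, \arg\max_{a'} q(S_{t+1}, a')\right) \qquad \text{(double\_q\_learning)}$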

The main reason for using a Q-function is for value-based control. In other words, we typically want to derive a policy from the Q-function. This is pretty straightforward too:

pi = km.EpsilonGreedy(q, epsilon=0.1)

# the epsilon parameter may be updated dynamically
pi.set_epsilon(0.25)
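
A minimal interaction loop then looks roughly as follows. This is a sketch: it assumes that calling the policy samples an action and that the Q-function exposes an update(s, a, r, done) method, as in keras-gym's example scripts.

for episode in range(200):
    s = env.reset()
    for t in range(1000):
        a = pi(s)                          # epsilon-greedy action
        s_next, r, done, info = env.step(a)
        q.update(s, a, r, done)            # assumed update signature
        if done:
            break
        s = s_next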


## Updateable Policies¶

Besides value-based control in which we derive a policy from a Q-function, we can also do policy-based control. In policy-based methods we learn a policy directly as a probability distribution over the space of actions $$\pi(a|s)$$.

The updateable policies for discrete action spaces are known as softmax policies:

$\pi(a|s)\ =\ \frac{\exp z(s,a)}{\sum_{a'}\exp z(s,a')}$

where the logits are defined over the real line $$z(s,a)\in\mathbb{R}$$.

In keras-gym we define a softmax policy as follows:

pi = km.SoftmaxPolicy(function_approximator, update_strategy='vanilla')


Similar to Q-functions, we can pick different update strategies. Available options for policies are 'vanilla', 'ppo' and 'cross_entropy'. These specify the objective function used in our policy updates.
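
For reference, the 'vanilla' choice corresponds to the standard policy-gradient surrogate objective, in which $$\mathcal{A}(s,a)$$ denotes an estimate of the advantage function:

$J\ =\ \mathbb{E}_t\left\{ \mathcal{A}(S_t, A_t)\,\log\pi(A_t|S_t) \right\}$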

## Actor-Critics¶

It’s often useful to combine a policy with a value function into what is called an actor-critic. The value function (critic) can be used to aid the update procedure for the policy (actor). The keras-gym package provides a simple way of constructing an actor-critic using the ActorCritic class:

# separate policy and value function
pi = km.SoftmaxPolicy(function_approximator, update_strategy='vanilla')
v = km.V(function_approximator, gamma=0.9, bootstrap_n=1)

# combine them into a single actor-critic
actor_critic = km.ActorCritic(pi, v)
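
The combined object can then be updated in a single call. As before, this loop is a sketch: it assumes ActorCritic exposes a joint update(s, a, r, done) method that updates both the actor and the critic, in line with keras-gym's example scripts.

for episode in range(200):
    s = env.reset()
    done = False
    while not done:
        a = pi(s)                              # sample from the softmax policy
        s_next, r, done, info = env.step(a)
        actor_critic.update(s, a, r, done)     # assumed joint update signature
        s = s_next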