Value Functions

keras_gym.V: A state value function \(s\mapsto v(s)\).
keras_gym.QTypeI: A type-I state-action value function \((s,a)\mapsto q(s,a)\).
keras_gym.QTypeII: A type-II state-action value function \(s\mapsto q(s,.)\).
class keras_gym.V(function_approximator, gamma=0.9, bootstrap_n=1, bootstrap_with_target_model=False)[source]

A state value function \(s\mapsto v(s)\).

Parameters:
function_approximator : FunctionApproximator object

The main function approximator.

gamma : float, optional

The discount factor for discounting future rewards.

bootstrap_n : positive int, optional

The number of steps in n-step bootstrapping. It specifies the number of steps over which we’re willing to delay bootstrapping. Large \(n\) corresponds to Monte Carlo updates and \(n=1\) corresponds to TD(0).

bootstrap_with_target_model : bool, optional

Whether to use the target_model when constructing a bootstrapped target. If False (default), the primary predict_model is used.
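
A minimal construction sketch, along the lines of the library's getting-started example. The MLP body, the lr keyword and the CartPole environment are illustrative choices, not requirements of this class:

    import gym
    import keras_gym as km
    from tensorflow import keras

    env = gym.make('CartPole-v0')

    class MLP(km.FunctionApproximator):
        """Multi-layer perceptron with one hidden layer (illustrative)."""
        def body(self, S):
            X = keras.layers.Flatten()(S)
            X = keras.layers.Dense(units=16, activation='relu')(X)
            return X

    func = MLP(env, lr=0.01)
    v = km.V(func, gamma=0.9, bootstrap_n=1, bootstrap_with_target_model=False)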

__call__(self, s, use_target_model=False)[source]

Evaluate the state value function.

Parameters:
s : state observation

A single state observation.

use_target_model : bool, optional

Whether to use the target_model internally. If False (default), the predict_model is used.

Returns:
V : float or array of floats

The estimated value of the state \(v(s)\).

batch_eval(self, S, use_target_model=False)[source]

Evaluate the state value function on a batch of state observations.

Parameters:
S : nd array, shape: [batch_size, …]

A batch of state observations.

use_target_model : bool, optional

Whether to use the target_model internally. If False (default), the predict_model is used.

Returns:
V : 1d array, dtype: float, shape: [batch_size]

The predicted state values.

batch_update(self, S, Rn, In, S_next)[source]

Update the value function on a batch of transitions.

Parameters:
S : nd array, shape: [batch_size, …]

A batch of state observations.

Rn : 1d array, dtype: float, shape: [batch_size]

A batch of partial returns. For example, in n-step bootstrapping this is given by:

\[R^{(n)}_t\ =\ R_t + \gamma\,R_{t+1} + \dots + \gamma^{n-1}\,R_{t+n-1}\]

In other words, it’s the non-bootstrapped part of the n-step return.

In : 1d array, dtype: float, shape: [batch_size]

A batch of bootstrapping factors. For instance, in n-step bootstrapping this is given by \(I^{(n)}_t=\gamma^n\) if the episode is ongoing and \(I^{(n)}_t=0\) otherwise. This allows us to write the bootstrapped target as:

\[G^{(n)}_t\ =\ R^{(n)}_t + I^{(n)}_t\,v(S_{t+n})\]
S_next : nd array, shape: [batch_size, …]

A batch of next-state observations.

Returns:
losses : dict

A dict of losses/metrics, of type {name <str>: value <float>}.
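
To make the Rn and In arguments concrete, here is a plain-numpy sketch (not part of this API) of how one transition's partial return and bootstrapping factor could be computed from the first \(n\) rewards observed after time \(t\):

    import numpy as np

    def n_step_partial_return(rewards, gamma, n, done):
        """Return (Rn, In) for one transition, given the first n rewards after time t."""
        rewards = np.asarray(rewards[:n], dtype='float64')
        discounts = gamma ** np.arange(len(rewards))
        Rn = float(np.sum(discounts * rewards))   # non-bootstrapped part of the n-step return
        In = 0.0 if done else gamma ** n          # zero if the episode ended within the n steps
        return Rn, In

    Rn, In = n_step_partial_return([1.0, 0.0, 2.0], gamma=0.9, n=3, done=False)
    # Rn == 1.0 + 0.9*0.0 + 0.81*2.0 == 2.62 and In == 0.9**3 == 0.729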

sync_target_model(self, tau=1.0)

Synchronize the target model with the primary model.

Parameters:
tau : float between 0 and 1, optional

The amount of exponential smoothing to apply in the target update:

\[w_\text{target}\ \leftarrow\ (1 - \tau)\,w_\text{target} + \tau\,w_\text{primary}\]
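
The update above is plain exponential smoothing (Polyak averaging); the same arithmetic on toy weight vectors (the arrays below are placeholders, not the actual Keras weights):

    import numpy as np

    w_target  = np.array([0.0, 1.0])
    w_primary = np.array([1.0, 3.0])

    tau = 0.1  # tau=1.0 simply copies the primary weights
    w_target = (1 - tau) * w_target + tau * w_primary
    # w_target is now [0.1, 1.2]
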
update(self, s, r, done)[source]

Update the state value function.

Parameters:
s : state observation

A single state observation.

r : float

A single observed reward.

done : bool

Whether the episode has finished.
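
A usage sketch of the single-transition updater inside an episode loop, reusing the hypothetical env and v from the construction sketch above (the random behaviour policy is just for illustration):

    s = env.reset()
    done = False

    while not done:
        a = env.action_space.sample()       # any behaviour policy works for estimating v(s)
        s_next, r, done, info = env.step(a)

        v.update(s, r, done)                # feeds one (s, r, done) transition to the n-step updater
        s = s_next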

class keras_gym.QTypeI(function_approximator, gamma=0.9, bootstrap_n=1, bootstrap_with_target_model=False, update_strategy='sarsa')[source]

A type-I state-action value function \((s,a)\mapsto q(s,a)\).

Parameters:
function_approximator : FunctionApproximator object

The main function approximator.

gamma : float, optional

The discount factor for discounting future rewards.

bootstrap_n : positive int, optional

The number of steps in n-step bootstrapping. It specifies the number of steps over which we’re willing to delay bootstrapping. Large \(n\) corresponds to Monte Carlo updates and \(n=1\) corresponds to TD(0).

bootstrap_with_target_model : bool, optional

Whether to use the target_model when constructing a bootstrapped target. If False (default), the primary predict_model is used.

update_strategy : str, optional

The update strategy that we use to select the (would-be) next action \(A_{t+n}\) in the bootstrapped target:

\[G^{(n)}_t\ =\ R^{(n)}_t + \gamma^n Q(S_{t+n}, A_{t+n})\]

Options are:

‘sarsa’

Sample the next action, i.e. use the action that was actually taken.

‘q_learning’

Take the action with the highest Q-value under the current estimate, i.e. \(A_{t+n} = \arg\max_aQ(S_{t+n}, a)\). This is an off-policy method.

‘double_q_learning’

Same action selection as ‘q_learning’, i.e. \(A_{t+n} = \arg\max_aQ_\text{primary}(S_{t+n}, a)\), except that the value of the selected action is computed using the target_model rather than the primary model, i.e.

\[\begin{split}A_{t+n}\ &=\ \arg\max_aQ_\text{primary}(S_{t+n}, a)\\ G^{(n)}_t\ &=\ R^{(n)}_t + \gamma^n Q_\text{target}(S_{t+n}, A_{t+n})\end{split}\]
‘expected_sarsa’

Similar to SARSA in that it’s on-policy, except that we take the expected Q-value rather than a sample of it, i.e.

\[G^{(n)}_t\ =\ R^{(n)}_t + \gamma^n\sum_a\pi(a|S_{t+n})\,Q(S_{t+n}, a)\]
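
The four strategies differ only in how the bootstrapped term is formed. A plain-numpy sketch of the resulting targets for a single transition (all Q-values and the policy below are made-up numbers):

    import numpy as np

    Rn, In = 1.0, 0.729                      # partial return and bootstrapping factor (gamma**n)
    q_primary = np.array([0.2, 0.8, 0.5])    # Q_primary(S_next, .)
    q_target  = np.array([0.3, 0.6, 0.7])    # Q_target(S_next, .)
    a_next = 2                               # action actually taken at t+n
    pi = np.array([0.1, 0.6, 0.3])           # pi(.|S_next)

    G_sarsa      = Rn + In * q_primary[a_next]             # sample the taken action
    G_q_learning = Rn + In * q_primary.max()               # greedy w.r.t. primary model
    G_double_q   = Rn + In * q_target[q_primary.argmax()]  # select with primary, evaluate with target
    G_exp_sarsa  = Rn + In * np.dot(pi, q_primary)         # expectation over the policy
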
__call__(self, s, a=None, use_target_model=False)

Evaluate the Q-function.

Parameters:
s : state observation

A single state observation.

a : action, optional

A single action.

use_target_model : bool, optional

Whether to use the target_model internally. If False (default), the predict_model is used.

Returns:
Q : float or array of floats

If action a is provided, a single float representing \(q(s,a)\) is returned. If, on the other hand, a is left unspecified, a vector representing \(q(s,.)\) is returned instead. The shape of the latter return value is [num_actions], which is only well-defined for discrete action spaces.

batch_eval(self, S, A=None, use_target_model=False)[source]

Evaluate the Q-function on a batch of state (or state-action) observations.

Parameters:
S : nd array, shape: [batch_size, …]

A batch of state observations.

A : 1d array, dtype: int, shape: [batch_size], optional

A batch of actions that were taken.

use_target_model : bool, optional

Whether to use the target_model internally. If False (default), the predict_model is used.

Returns:
Q : 1d or 2d array of floats

If action A is provided, a 1d array representing a batch of \(q(s,a)\) is returned. If, on the other hand, A is left unspecified, a 2d array representing a batch of \(q(s,.)\) is returned instead. The shape of the latter return value is [batch_size, num_actions], which is only well-defined for discrete action spaces.

batch_update(self, S, A, Rn, In, S_next, A_next=None)

Update the value function on a batch of transitions.

Parameters:
S : nd array, shape: [batch_size, …]

A batch of state observations.

A : nd Tensor, shape: [batch_size, …]

A batch of actions taken.

Rn : 1d array, dtype: float, shape: [batch_size]

A batch of partial returns. For example, in n-step bootstrapping this is given by:

\[R^{(n)}_t\ =\ R_t + \gamma\,R_{t+1} + \dots + \gamma^{n-1}\,R_{t+n-1}\]

In other words, it’s the non-bootstrapped part of the n-step return.

In : 1d array, dtype: float, shape: [batch_size]

A batch of bootstrapping factors. For instance, in n-step bootstrapping this is given by \(I^{(n)}_t=\gamma^n\) if the episode is ongoing and \(I^{(n)}_t=0\) otherwise. This allows us to write the bootstrapped target as:

\[G^{(n)}_t=R^{(n)}_t+I^{(n)}_tQ(S_{t+n}, A_{t+n})\]
S_next : nd array, shape: [batch_size, …]

A batch of next-state observations.

A_next : 2d Tensor, shape: [batch_size, …]

A batch of (potential) next actions A_next. This argument is only used if update_strategy='sarsa'.

Returns:
losses : dict

A dict of losses/metrics, of type {name <str>: value <float>}.

bootstrap_target(self, Rn, In, S_next, A_next=None)

Get the bootstrapped target \(G^{(n)}_t=R^{(n)}_t+I^{(n)}_t\,Q(S_{t+n}, A_{t+n})\).

Parameters:
Rn : 1d array, dtype: float, shape: [batch_size]

A batch of partial returns. For example, in n-step bootstrapping this is given by:

\[R^{(n)}_t\ =\ R_t + \gamma\,R_{t+1} + \dots + \gamma^{n-1}\,R_{t+n-1}\]

In other words, it’s the non-bootstrapped part of the n-step return.

In : 1d array, dtype: float, shape: [batch_size]

A batch of bootstrapping factors. For instance, in n-step bootstrapping this is given by \(I^{(n)}_t=\gamma^n\) if the episode is ongoing and \(I^{(n)}_t=0\) otherwise. This allows us to write the bootstrapped target as:

\[G^{(n)}_t=R^{(n)}_t+I^{(n)}_tQ(S_{t+n},A_{t+n})\]
S_next : nd array, shape: [batch_size, …]

A batch of next-state observations.

A_next : 2d Tensor, dtype: int, shape: [batch_size, num_actions]

A batch of (potential) next actions A_next. This argument is only used if update_strategy='sarsa'.

Returns:
Gn : 1d array, dtype: float, shape: [batch_size]

A batch of bootstrap-estimated returns \(G^{(n)}_t=R^{(n)}_t+I^{(n)}_tQ(S_{t+n},A_{t+n})\), computed according to the given update_strategy.
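
For the default ‘sarsa’ strategy, the returned batch is equivalent to the following numpy expression (illustrative only; Q_next stands in for the model’s predictions on S_next):

    import numpy as np

    Rn     = np.array([1.0, 0.5])             # partial returns
    In     = np.array([0.729, 0.0])           # the second transition ended the episode
    Q_next = np.array([[0.2, 0.8],            # Q(S_next, .) for each transition
                       [0.4, 0.1]])
    A_next = np.array([1, 0])                 # next actions actually taken

    Gn = Rn + In * Q_next[np.arange(len(Rn)), A_next]
    # Gn == [1.0 + 0.729*0.8, 0.5 + 0.0] == [1.5832, 0.5]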

sync_target_model(self, tau=1.0)

Synchronize the target model with the primary model.

Parameters:
tau : float between 0 and 1, optional

The amount of exponential smoothing to apply in the target update:

\[w_\text{target}\ \leftarrow\ (1 - \tau)\,w_\text{target} + \tau\,w_\text{primary}\]
update(self, s, a, r, done)

Update the Q-function.

Parameters:
s : state observation

A single state observation.

a : action

A single action.

r : float

A single observed reward.

done : bool

Whether the episode has finished.
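
A usage sketch analogous to the one for keras_gym.V.update, now passing the action as well; the epsilon-greedy action selection is written out by hand here instead of using one of the library’s policy objects, and func and env are the hypothetical objects from the sketch for keras_gym.V:

    import numpy as np

    q = km.QTypeI(func, gamma=0.9, update_strategy='sarsa')

    s = env.reset()
    done = False

    while not done:
        # epsilon-greedy behaviour policy based on the current Q-estimates
        if np.random.rand() < 0.1:
            a = env.action_space.sample()
        else:
            a = int(np.argmax(q(s)))          # q(s) returns the vector q(s, .)

        s_next, r, done, info = env.step(a)
        q.update(s, a, r, done)
        s = s_next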

class keras_gym.QTypeII(function_approximator, gamma=0.9, bootstrap_n=1, bootstrap_with_target_model=False, update_strategy='sarsa')[source]

A type-II state-action value function \(s\mapsto q(s,.)\).

Parameters:
function_approximator : FunctionApproximator object

The main function approximator.

gamma : float, optional

The discount factor for discounting future rewards.

bootstrap_n : positive int, optional

The number of steps in n-step bootstrapping. It specifies the number of steps over which we’re willing to delay bootstrapping. Large \(n\) corresponds to Monte Carlo updates and \(n=1\) corresponds to TD(0).

bootstrap_with_target_model : bool, optional

Whether to use the target_model when constructing a bootstrapped target. If False (default), the primary predict_model is used.

update_strategy : str, optional

The update strategy that we use to select the (would-be) next action \(A_{t+n}\) in the bootstrapped target:

\[G^{(n)}_t\ =\ R^{(n)}_t + \gamma^n Q(S_{t+n}, A_{t+n})\]

Options are:

‘sarsa’

Sample the next action, i.e. use the action that was actually taken.

‘q_learning’

Take the action with the highest Q-value under the current estimate, i.e. \(A_{t+n} = \arg\max_aQ(S_{t+n}, a)\). This is an off-policy method.

‘double_q_learning’

Same action selection as ‘q_learning’, i.e. \(A_{t+n} = \arg\max_aQ_\text{primary}(S_{t+n}, a)\), except that the value of the selected action is computed using the target_model rather than the primary model, i.e.

\[\begin{split}A_{t+n}\ &=\ \arg\max_aQ_\text{primary}(S_{t+n}, a)\\ G^{(n)}_t\ &=\ R^{(n)}_t + \gamma^n Q_\text{target}(S_{t+n}, A_{t+n})\end{split}\]
‘expected_sarsa’

Similar to SARSA in that it’s on-policy, except that we take the expected Q-value rather than a sample of it, i.e.

\[G^{(n)}_t\ =\ R^{(n)}_t + \gamma^n\sum_a\pi(a|S_{t+n})\,Q(S_{t+n}, a)\]
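
The practical difference with QTypeI is in the underlying model: a type-I function approximates \((s,a)\mapsto q(s,a)\) with the action as an input, whereas a type-II function approximates \(s\mapsto q(s,.)\) with one output per action. The evaluation interface is the same either way, as this sketch (reusing the hypothetical func and env from the keras_gym.V example) illustrates:

    q1 = km.QTypeI(func, gamma=0.9, update_strategy='q_learning')
    q2 = km.QTypeII(func, gamma=0.9, update_strategy='q_learning')

    s = env.reset()

    q2(s)        # 1d array of shape [num_actions], i.e. q(s, .)
    q2(s, a=0)   # single float, q(s, 0)
    q1(s, a=0)   # same call signature for the type-I version
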
__call__(self, s, a=None, use_target_model=False)

Evaluate the Q-function.

Parameters:
s : state observation

A single state observation.

a : action, optional

A single action.

use_target_model : bool, optional

Whether to use the target_model internally. If False (default), the predict_model is used.

Returns:
Q : float or array of floats

If action a is provided, a single float representing \(q(s,a)\) is returned. If, on the other hand, a is left unspecified, a vector representing \(q(s,.)\) is returned instead. The shape of the latter return value is [num_actions], which is only well-defined for discrete action spaces.

batch_eval(self, S, A=None, use_target_model=False)[source]

Evaluate the Q-function on a batch of state (or state-action) observations.

Parameters:
S : nd array, shape: [batch_size, …]

A batch of state observations.

A : 1d array, dtype: int, shape: [batch_size], optional

A batch of actions that were taken.

use_target_model : bool, optional

Whether to use the target_model internally. If False (default), the predict_model is used.

Returns:
Q : 1d or 2d array of floats

If action A is provided, a 1d array representing a batch of \(q(s,a)\) is returned. If, on the other hand, A is left unspecified, a 2d array representing a batch of \(q(s,.)\) is returned instead. The shape of the latter return value is [batch_size, num_actions], which is only well-defined for discrete action spaces.

batch_update(self, S, A, Rn, In, S_next, A_next=None)

Update the value function on a batch of transitions.

Parameters:
S : nd array, shape: [batch_size, …]

A batch of state observations.

A : nd Tensor, shape: [batch_size, …]

A batch of actions taken.

Rn : 1d array, dtype: float, shape: [batch_size]

A batch of partial returns. For example, in n-step bootstrapping this is given by:

\[R^{(n)}_t\ =\ R_t + \gamma\,R_{t+1} + \dots + \gamma^{n-1}\,R_{t+n-1}\]

In other words, it’s the non-bootstrapped part of the n-step return.

In : 1d array, dtype: float, shape: [batch_size]

A batch of bootstrapping factors. For instance, in n-step bootstrapping this is given by \(I^{(n)}_t=\gamma^n\) if the episode is ongoing and \(I^{(n)}_t=0\) otherwise. This allows us to write the bootstrapped target as:

\[G^{(n)}_t=R^{(n)}_t+I^{(n)}_tQ(S_{t+n}, A_{t+n})\]
S_next : nd array, shape: [batch_size, …]

A batch of next-state observations.

A_next : 2d Tensor, shape: [batch_size, …]

A batch of (potential) next actions A_next. This argument is only used if update_strategy='sarsa'.

Returns:
losses : dict

A dict of losses/metrics, of type {name <str>: value <float>}.

bootstrap_target(self, Rn, In, S_next, A_next=None)

Get the bootstrapped target \(G^{(n)}_t=R^{(n)}_t+I^{(n)}_t\,Q(S_{t+n}, A_{t+n})\).

Parameters:
Rn : 1d array, dtype: float, shape: [batch_size]

A batch of partial returns. For example, in n-step bootstrapping this is given by:

\[R^{(n)}_t\ =\ R_t + \gamma\,R_{t+1} + \dots + \gamma^{n-1}\,R_{t+n-1}\]

In other words, it’s the non-bootstrapped part of the n-step return.

In : 1d array, dtype: float, shape: [batch_size]

A batch of bootstrapping factors. For instance, in n-step bootstrapping this is given by \(I^{(n)}_t=\gamma^n\) if the episode is ongoing and \(I^{(n)}_t=0\) otherwise. This allows us to write the bootstrapped target as:

\[G^{(n)}_t=R^{(n)}_t+I^{(n)}_tQ(S_{t+n},A_{t+n})\]
S_next : nd array, shape: [batch_size, …]

A batch of next-state observations.

A_next : 2d Tensor, dtype: int, shape: [batch_size, num_actions]

A batch of (potential) next actions A_next. This argument is only used if update_strategy='sarsa'.

Returns:
Gn : 1d array, dtype: float, shape: [batch_size]

A batch of bootstrap-estimated returns \(G^{(n)}_t=R^{(n)}_t+I^{(n)}_tQ(S_{t+n},A_{t+n})\), computed according to the given update_strategy.

sync_target_model(self, tau=1.0)

Synchronize the target model with the primary model.

Parameters:
tau : float between 0 and 1, optional

The amount of exponential smoothing to apply in the target update:

\[w_\text{target}\ \leftarrow\ (1 - \tau)\,w_\text{target} + \tau\,w_\text{primary}\]
update(self, s, a, r, done)

Update the Q-function.

Parameters:
s : state observation

A single state observation.

a : action

A single action.

r : float

A single observed reward.

done : bool

Whether the episode has finished.