Value Functions

keras_gym.V: A state value function \(s\mapsto v(s)\).
keras_gym.QTypeI: A type-I state-action value function \((s,a)\mapsto q(s,a)\).
keras_gym.QTypeII: A type-II state-action value function \(s\mapsto q(s,.)\).
class keras_gym.V(function_approximator, gamma=0.9, bootstrap_n=1, bootstrap_with_target_model=False)[source]

A state value function \(s\mapsto v(s)\).

Parameters:
function_approximator : FunctionApproximator object

The main function approximator.

gamma : float, optional

The discount factor for discounting future rewards.

bootstrap_n : positive int, optional

The number of steps in n-step bootstrapping. It specifies the number of steps over which we’re willing to delay bootstrapping. Large \(n\) corresponds to Monte Carlo updates and \(n=1\) corresponds to TD(0).

bootstrap_with_target_model : bool, optional

Whether to use the target_model when constructing a bootstrapped target. If False (default), the primary predict_model is used.
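
A minimal construction sketch, along the lines of the library's getting-started example. The MLP body, the lr keyword and the CartPole environment are illustrative choices, not requirements of this class:

    import gym
    import keras_gym as km
    from tensorflow import keras

    env = gym.make('CartPole-v0')

    class MLP(km.FunctionApproximator):
        """Multi-layer perceptron with one hidden layer (illustrative)."""
        def body(self, S):
            X = keras.layers.Flatten()(S)
            X = keras.layers.Dense(units=16, activation='relu')(X)
            return X

    func = MLP(env, lr=0.01)
    v = km.V(func, gamma=0.9, bootstrap_n=1, bootstrap_with_target_model=False)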

__call__(self, s, use_target_model=False)[source]

Evaluate the state value function.

Parameters:
s : state observation

A single state observation.

use_target_model : bool, optional

Whether to use the target_model internally. If False (default), the predict_model is used.

Returns:
V : float or array of floats

The estimated value of the state \(v(s)\).

batch_eval(self, S, use_target_model=False)[source]

Evaluate the state value function on a batch of state observations.

Parameters:
S : nd array, shape: [batch_size, …]

A batch of state observations.

use_target_model : bool, optional

Whether to use the target_model internally. If False (default), the predict_model is used.

Returns:
V : 1d array, dtype: float, shape: [batch_size]

The predicted state values.

batch_update(self, S, Rn, In, S_next)[source]

Update the value function on a batch of transitions.

Parameters:
S : nd array, shape: [batch_size, …]

A batch of state observations.

Rn : 1d array, dtype: float, shape: [batch_size]

A batch of partial returns. For example, in n-step bootstrapping this is given by:

\[R^{(n)}_t\ =\ R_t + \gamma\,R_{t+1} + \dots + \gamma^{n-1}\,R_{t+n-1}\]

In other words, it’s the non-bootstrapped part of the n-step return.

In : 1d array, dtype: float, shape: [batch_size]

A batch of bootstrapping factors. For instance, in n-step bootstrapping this is given by \(I^{(n)}_t=\gamma^n\) if the episode is ongoing and \(I^{(n)}_t=0\) otherwise. This allows us to write the bootstrapped target as:

\[G^{(n)}_t\ =\ R^{(n)}_t + I^{(n)}_t\,v(S_{t+n})\]
S_next : nd array, shape: [batch_size, …]

A batch of next-state observations.

Returns:
losses : dict

A dict of losses/metrics, of type {name <str>: value <float>}.
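
To make the Rn and In arguments concrete, here is a plain-numpy sketch (not part of this API) of how one transition's partial return and bootstrapping factor could be computed from the first \(n\) rewards observed after time \(t\):

    import numpy as np

    def n_step_partial_return(rewards, gamma, n, done):
        """Return (Rn, In) for one transition, given the first n rewards after time t."""
        rewards = np.asarray(rewards[:n], dtype='float64')
        discounts = gamma ** np.arange(len(rewards))
        Rn = float(np.sum(discounts * rewards))   # non-bootstrapped part of the n-step return
        In = 0.0 if done else gamma ** n          # zero if the episode ended within the n steps
        return Rn, In

    Rn, In = n_step_partial_return([1.0, 0.0, 2.0], gamma=0.9, n=3, done=False)
    # Rn == 1.0 + 0.9*0.0 + 0.81*2.0 == 2.62 and In == 0.9**3 == 0.729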

sync_target_model(self, tau=1.0)

Synchronize the target model with the primary model.

Parameters:
tau : float between 0 and 1, optional

The amount of exponential smoothing to apply in the target update:

\[w_\text{target}\ \leftarrow\ (1 - \tau)\,w_\text{target} + \tau\,w_\text{primary}\]
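
The update above is plain exponential smoothing (Polyak averaging); the same arithmetic on toy weight vectors (the arrays below are placeholders, not the actual Keras weights):

    import numpy as np

    w_target  = np.array([0.0, 1.0])
    w_primary = np.array([1.0, 3.0])

    tau = 0.1  # tau=1.0 simply copies the primary weights
    w_target = (1 - tau) * w_target + tau * w_primary
    # w_target is now [0.1, 1.2]
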
update(self, s, r, done)[source]

Update the state value function.

Parameters:
s : state observation

A single state observation.

r : float

A single observed reward.

done : bool

Whether the episode has finished.
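
A usage sketch of the single-transition updater inside an episode loop, reusing the hypothetical env and v from the construction sketch above (the random behaviour policy is just for illustration):

    s = env.reset()
    done = False

    while not done:
        a = env.action_space.sample()       # any behaviour policy works for estimating v(s)
        s_next, r, done, info = env.step(a)

        v.update(s, r, done)                # feeds one (s, r, done) transition to the n-step updater
        s = s_next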

class keras_gym.QTypeI(function_approximator, gamma=0.9, bootstrap_n=1, bootstrap_with_target_model=False, update_strategy='sarsa')[source]

A type-I state-action value function \((s,a)\mapsto q(s,a)\).

Parameters:
function_approximator : FunctionApproximator object

The main function approximator.

gamma : float, optional

The discount factor for discounting future rewards.

bootstrap_n : positive int, optional

The number of steps in n-step bootstrapping. It specifies the number of steps over which we’re willing to delay bootstrapping. Large \(n\) corresponds to Monte Carlo updates and \(n=1\) corresponds to TD(0).

bootstrap_with_target_model : bool, optional

Whether to use the target_model when constructing a bootstrapped target. If False (default), the primary predict_model is used.

update_strategy : str, optional

The update strategy that we use to select the (would-be) next action \(A_{t+n}\) in the bootstrapped target:

\[G^{(n)}_t\ =\ R^{(n)}_t + \gamma^n Q(S_{t+n}, A_{t+n})\]

Options are:

‘sarsa’

Sample the next action, i.e. use the action that was actually taken.

‘q_learning’

Take the action with the highest Q-value under the current estimate, i.e. \(A_{t+n} = \arg\max_aQ(S_{t+n}, a)\). This is an off-policy method.

‘double_q_learning’

Same action selection as ‘q_learning’, i.e. \(A_{t+n} = \arg\max_aQ_\text{primary}(S_{t+n}, a)\), except that the value of the selected action is computed using the target_model rather than the primary model, i.e.

\[\begin{split}A_{t+n}\ &=\ \arg\max_aQ_\text{primary}(S_{t+n}, a)\\ G^{(n)}_t\ &=\ R^{(n)}_t + \gamma^n Q_\text{target}(S_{t+n}, A_{t+n})\end{split}\]
‘expected_sarsa’

Similar to SARSA in that it’s on-policy, except that we take the expected Q-value rather than a sample of it, i.e.

\[G^{(n)}_t\ =\ R^{(n)}_t + \gamma^n\sum_a\pi(a|S_{t+n})\,Q(S_{t+n}, a)\]
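
The four strategies differ only in how the bootstrapped term is formed. A plain-numpy sketch of the resulting targets for a single transition (all Q-values and the policy below are made-up numbers):

    import numpy as np

    Rn, In = 1.0, 0.729                      # partial return and bootstrapping factor (gamma**n)
    q_primary = np.array([0.2, 0.8, 0.5])    # Q_primary(S_next, .)
    q_target  = np.array([0.3, 0.6, 0.7])    # Q_target(S_next, .)
    a_next = 2                               # action actually taken at t+n
    pi = np.array([0.1, 0.6, 0.3])           # pi(.|S_next)

    G_sarsa      = Rn + In * q_primary[a_next]             # sample the taken action
    G_q_learning = Rn + In * q_primary.max()               # greedy w.r.t. primary model
    G_double_q   = Rn + In * q_target[q_primary.argmax()]  # select with primary, evaluate with target
    G_exp_sarsa  = Rn + In * np.dot(pi, q_primary)         # expectation over the policy
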
__call__(self, s, a=None, use_target_model=False)

Evaluate the Q-function.

Parameters:
s : state observation

A single state observation.

a : action, optional

A single action.

use_target_model : bool, optional

Whether to use the target_model internally. If False (default), the predict_model is used.

Returns:
Q : float or array of floats

If action a is provided, a single float representing \(q(s,a)\) is returned. If, on the other hand, a is left unspecified, a vector representing \(q(s,.)\) is returned instead. The shape of the latter return value is [num_actions], which is only well-defined for discrete action spaces.

batch_eval(self, S, A=None, use_target_model=False)[source]

Evaluate the Q-function on a batch of state (or state-action) observations.

Parameters:
S : nd array, shape: [batch_size, …]

A batch of state observations.

A : 1d array, dtype: int, shape: [batch_size], optional

A batch of actions that were taken.

use_target_model : bool, optional

Whether to use the target_model internally. If False (default), the predict_model is used.

Returns:
Q : 1d or 2d array of floats

If action A is provided, a 1d array representing a batch of \(q(s,a)\) is returned. If, on the other hand, A is left unspecified, a 2d array representing a batch of \(q(s,.)\) is returned instead. The shape of the latter return value is [batch_size, num_actions], which is only well-defined for discrete action spaces.

batch_update(self, S, A, Rn, In, S_next, A_next=None)

Update the value function on a batch of transitions.

Parameters:
S : nd array, shape: [batch_size, …]

A batch of state observations.

A : nd Tensor, shape: [batch_size, …]

A batch of actions taken.

Rn : 1d array, dtype: float, shape: [batch_size]

A batch of partial returns. For example, in n-step bootstrapping this is given by:

\[R^{(n)}_t\ =\ R_t + \gamma\,R_{t+1} + \dots + \gamma^{n-1}\,R_{t+n-1}\]

In other words, it’s the non-bootstrapped part of the n-step return.

In : 1d array, dtype: float, shape: [batch_size]

A batch of bootstrapping factors. For instance, in n-step bootstrapping this is given by \(I^{(n)}_t=\gamma^n\) if the episode is ongoing and \(I^{(n)}_t=0\) otherwise. This allows us to write the bootstrapped target as:

\[G^{(n)}_t=R^{(n)}_t+I^{(n)}_tQ(S_{t+n}, A_{t+n})\]
S_next : nd array, shape: [batch_size, …]

A batch of next-state observations.

A_next : 2d Tensor, shape: [batch_size, …]

A batch of (potential) next actions A_next. This argument is only used if update_strategy='sarsa'.

Returns:
losses : dict

A dict of losses/metrics, of type {name <str>: value <float>}.

bootstrap_target(self, Rn, In, S_next, A_next=None)

Get the bootstrapped target \(G^{(n)}_t=R^{(n)}_t+I^{(n)}_t\,Q(S_{t+n}, A_{t+n})\).

Parameters:
Rn : 1d array, dtype: float, shape: [batch_size]

A batch of partial returns. For example, in n-step bootstrapping this is given by:

\[R^{(n)}_t\ =\ R_t + \gamma\,R_{t+1} + \dots + \gamma^{n-1}\,R_{t+n-1}\]

In other words, it’s the non-bootstrapped part of the n-step return.

In : 1d array, dtype: float, shape: [batch_size]

A batch of bootstrapping factors. For instance, in n-step bootstrapping this is given by \(I^{(n)}_t=\gamma^n\) if the episode is ongoing and \(I^{(n)}_t=0\) otherwise. This allows us to write the bootstrapped target as:

\[G^{(n)}_t=R^{(n)}_t+I^{(n)}_tQ(S_{t+n},A_{t+n})\]
S_next : nd array, shape: [batch_size, …]

A batch of next-state observations.

A_next : 2d Tensor, dtype: int, shape: [batch_size, num_actions]

A batch of (potential) next actions A_next. This argument is only used if update_strategy='sarsa'.

Returns:
Gn : 1d array, dtype: float, shape: [batch_size]

A batch of bootstrap-estimated returns \(G^{(n)}_t=R^{(n)}_t+I^{(n)}_tQ(S_{t+n},A_{t+n})\), computed according to the given update_strategy.
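
For the default ‘sarsa’ strategy, the returned batch is equivalent to the following numpy expression (illustrative only; Q_next stands in for the model’s predictions on S_next):

    import numpy as np

    Rn     = np.array([1.0, 0.5])             # partial returns
    In     = np.array([0.729, 0.0])           # the second transition ended the episode
    Q_next = np.array([[0.2, 0.8],            # Q(S_next, .) for each transition
                       [0.4, 0.1]])
    A_next = np.array([1, 0])                 # next actions actually taken

    Gn = Rn + In * Q_next[np.arange(len(Rn)), A_next]
    # Gn == [1.0 + 0.729*0.8, 0.5 + 0.0] == [1.5832, 0.5]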

sync_target_model(self, tau=1.0)

Synchronize the target model with the primary model.

Parameters:
tau : float between 0 and 1, optional

The amount of exponential smoothing to apply in the target update:

\[w_\text{target}\ \leftarrow\ (1 - \tau)\,w_\text{target} + \tau\,w_\text{primary}\]
update(self, s, a, r, done)

Update the Q-function.

Parameters:
s : state observation

A single state observation.

a : action

A single action.

r : float

A single observed reward.

done : bool

Whether the episode has finished.
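
A usage sketch analogous to the one for keras_gym.V.update, now passing the action as well; the epsilon-greedy action selection is written out by hand here instead of using one of the library’s policy objects, and func and env are the hypothetical objects from the sketch for keras_gym.V:

    import numpy as np

    q = km.QTypeI(func, gamma=0.9, update_strategy='sarsa')

    s = env.reset()
    done = False

    while not done:
        # epsilon-greedy behaviour policy based on the current Q-estimates
        if np.random.rand() < 0.1:
            a = env.action_space.sample()
        else:
            a = int(np.argmax(q(s)))          # q(s) returns the vector q(s, .)

        s_next, r, done, info = env.step(a)
        q.update(s, a, r, done)
        s = s_next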

class keras_gym.QTypeII(function_approximator, gamma=0.9, bootstrap_n=1, bootstrap_with_target_model=False, update_strategy='sarsa')[source]

A type-II state-action value function \(s\mapsto q(s,.)\).

Parameters:
function_approximator : FunctionApproximator object

The main function approximator.

gamma : float, optional

The discount factor for discounting future rewards.

bootstrap_n : positive int, optional

The number of steps in n-step bootstrapping. It specifies the number of steps over which we’re willing to delay bootstrapping. Large \(n\) corresponds to Monte Carlo updates and \(n=1\) corresponds to TD(0).

bootstrap_with_target_model : bool, optional

Whether to use the target_model when constructing a bootstrapped target. If False (default), the primary predict_model is used.

update_strategy : str, optional

The update strategy that we use to select the (would-be) next action \(A_{t+n}\) in the bootstrapped target:

\[G^{(n)}_t\ =\ R^{(n)}_t + \gamma^n Q(S_{t+n}, A_{t+n})\]

Options are:

‘sarsa’

Sample the next action, i.e. use the action that was actually taken.

‘q_learning’

Take the action with the highest Q-value under the current estimate, i.e. \(A_{t+n} = \arg\max_aQ(S_{t+n}, a)\). This is an off-policy method.

‘double_q_learning’

Same action selection as ‘q_learning’, i.e. \(A_{t+n} = \arg\max_aQ_\text{primary}(S_{t+n}, a)\), except that the value of the selected action is computed using the target_model rather than the primary model, i.e.

\[\begin{split}A_{t+n}\ &=\ \arg\max_aQ_\text{primary}(S_{t+n}, a)\\ G^{(n)}_t\ &=\ R^{(n)}_t + \gamma^n Q_\text{target}(S_{t+n}, A_{t+n})\end{split}\]
‘expected_sarsa’

Similar to SARSA in that it’s on-policy, except that we take the expected Q-value rather than a sample of it, i.e.

\[G^{(n)}_t\ =\ R^{(n)}_t + \gamma^n\sum_a\pi(a|S_{t+n})\,Q(S_{t+n}, a)\]
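
The practical difference with QTypeI is in the underlying model: a type-I function approximates \((s,a)\mapsto q(s,a)\) with the action as an input, whereas a type-II function approximates \(s\mapsto q(s,.)\) with one output per action. The evaluation interface is the same either way, as this sketch (reusing the hypothetical func and env from the keras_gym.V example) illustrates:

    q1 = km.QTypeI(func, gamma=0.9, update_strategy='q_learning')
    q2 = km.QTypeII(func, gamma=0.9, update_strategy='q_learning')

    s = env.reset()

    q2(s)        # 1d array of shape [num_actions], i.e. q(s, .)
    q2(s, a=0)   # single float, q(s, 0)
    q1(s, a=0)   # same call signature for the type-I version
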
__call__(self, s, a=None, use_target_model=False)

Evaluate the Q-function.

Parameters:
s : state observation

A single state observation.

a : action, optional

A single action.

use_target_model : bool, optional

Whether to use the target_model internally. If False (default), the predict_model is used.

Returns:
Q : float or array of floats

If action a is provided, a single float representing \(q(s,a)\) is returned. If, on the other hand, a is left unspecified, a vector representing \(q(s,.)\) is returned instead. The shape of the latter return value is [num_actions], which is only well-defined for discrete action spaces.

batch_eval(self, S, A=None, use_target_model=False)[source]

Evaluate the Q-function on a batch of state (or state-action) observations.

Parameters:
S : nd array, shape: [batch_size, …]

A batch of state observations.

A : 1d array, dtype: int, shape: [batch_size], optional

A batch of actions that were taken.

use_target_model : bool, optional

Whether to use the target_model internally. If False (default), the predict_model is used.

Returns:
Q : 1d or 2d array of floats

If action A is provided, a 1d array representing a batch of \(q(s,a)\) is returned. If, on the other hand, A is left unspecified, a 2d array representing a batch of \(q(s,.)\) is returned instead. The shape of the latter return value is [batch_size, num_actions], which is only well-defined for discrete action spaces.

batch_update(self, S, A, Rn, In, S_next, A_next=None)

Update the value function on a batch of transitions.

Parameters:
S : nd array, shape: [batch_size, …]

A batch of state observations.

A : nd Tensor, shape: [batch_size, …]

A batch of actions taken.

Rn : 1d array, dtype: float, shape: [batch_size]

A batch of partial returns. For example, in n-step bootstrapping this is given by:

\[R^{(n)}_t\ =\ R_t + \gamma\,R_{t+1} + \dots + \gamma^{n-1}\,R_{t+n-1}\]

In other words, it’s the non-bootstrapped part of the n-step return.

In : 1d array, dtype: float, shape: [batch_size]

A batch of bootstrapping factors. For instance, in n-step bootstrapping this is given by \(I^{(n)}_t=\gamma^n\) if the episode is ongoing and \(I^{(n)}_t=0\) otherwise. This allows us to write the bootstrapped target as:

\[G^{(n)}_t=R^{(n)}_t+I^{(n)}_tQ(S_{t+n}, A_{t+n})\]
S_next : nd array, shape: [batch_size, …]

A batch of next-state observations.

A_next : 2d Tensor, shape: [batch_size, …]

A batch of (potential) next actions A_next. This argument is only used if update_strategy='sarsa'.

Returns:
losses : dict

A dict of losses/metrics, of type {name <str>: value <float>}.

bootstrap_target(self, Rn, In, S_next, A_next=None)

Get the bootstrapped target \(G^{(n)}_t=R^{(n)}_t+I^{(n)}_t\,Q(S_{t+n}, A_{t+n})\).

Parameters:
Rn : 1d array, dtype: float, shape: [batch_size]

A batch of partial returns. For example, in n-step bootstrapping this is given by:

\[R^{(n)}_t\ =\ R_t + \gamma\,R_{t+1} + \dots + \gamma^{n-1}\,R_{t+n-1}\]

In other words, it’s the non-bootstrapped part of the n-step return.

In : 1d array, dtype: float, shape: [batch_size]

A batch of bootstrapping factors. For instance, in n-step bootstrapping this is given by \(I^{(n)}_t=\gamma^n\) if the episode is ongoing and \(I^{(n)}_t=0\) otherwise. This allows us to write the bootstrapped target as:

\[G^{(n)}_t=R^{(n)}_t+I^{(n)}_tQ(S_{t+n},A_{t+n})\]
S_next : nd array, shape: [batch_size, …]

A batch of next-state observations.

A_next : 2d Tensor, dtype: int, shape: [batch_size, num_actions]

A batch of (potential) next actions A_next. This argument is only used if update_strategy='sarsa'.

Returns:
Gn : 1d array, dtype: float, shape: [batch_size]

A batch of bootstrap-estimated returns \(G^{(n)}_t=R^{(n)}_t+I^{(n)}_tQ(S_{t+n},A_{t+n})\), computed according to the given update_strategy.

sync_target_model(self, tau=1.0)

Synchronize the target model with the primary model.

Parameters:
tau : float between 0 and 1, optional

The amount of exponential smoothing to apply in the target update:

\[w_\text{target}\ \leftarrow\ (1 - \tau)\,w_\text{target} + \tau\,w_\text{primary}\]
update(self, s, a, r, done)

Update the Q-function.

Parameters:
s : state observation

A single state observation.

a : action

A single action.

r : float

A single observed reward.

done : bool

Whether the episode has finished.