Value Functions¶
keras_gym.V        A state value function \(s\mapsto v(s)\).
keras_gym.QTypeI   A type-I state-action value function \((s,a)\mapsto q(s,a)\).
keras_gym.QTypeII  A type-II state-action value function \(s\mapsto q(s,.)\).
class keras_gym.V(function_approximator, gamma=0.9, bootstrap_n=1, bootstrap_with_target_model=False)[source]¶
A state value function \(s\mapsto v(s)\).
Parameters: - function_approximator : FunctionApproximator object
The main function approximator.
- gamma : float, optional
The discount factor for discounting future rewards.
- bootstrap_n : positive int, optional
The number of steps in n-step bootstrapping. It specifies the number of steps over which we’re willing to delay bootstrapping. Large \(n\) corresponds to Monte Carlo updates and \(n=1\) corresponds to TD(0).
- bootstrap_with_target_model : bool, optional
Whether to use the target_model when constructing a bootstrapped target. If False (default), the primary predict_model is used.
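For example, constructing a state value function could look like the following sketch (func stands in for a FunctionApproximator instance built elsewhere; see the FunctionApproximator docs):

    import keras_gym as km

    # `func` is assumed to be a keras_gym.FunctionApproximator built
    # elsewhere; the keyword arguments are the ones documented above.
    v = km.V(func, gamma=0.9, bootstrap_n=1, bootstrap_with_target_model=False)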
__call__(self, s, use_target_model=False)[source]¶
Evaluate the state value function.
Parameters: - s : state observation
A single state observation.
- use_target_model : bool, optional
Whether to use the target_model internally. If False (default), the predict_model is used.
Returns: - V : float or array of floats
The estimated value of the state \(v(s)\).
batch_eval(self, S, use_target_model=False)[source]¶
Evaluate the state value function on a batch of state observations.
Parameters: - S : nd array, shape: [batch_size, …]
A batch of state observations.
- use_target_model : bool, optional
Whether to use the target_model internally. If False (default), the predict_model is used.
Returns: - V : 1d array, dtype: float, shape: [batch_size]
The predicted state values.
batch_update(self, S, Rn, In, S_next)[source]¶
Update the value function on a batch of transitions.
Parameters: - S : nd array, shape: [batch_size, …]
A batch of state observations.
- Rn : 1d array, dtype: float, shape: [batch_size]
A batch of partial returns. For example, in n-step bootstrapping this is given by:
\[R^{(n)}_t\ =\ R_t + \gamma\,R_{t+1} + \dots + \gamma^{n-1}\,R_{t+n-1}\]
In other words, it’s the non-bootstrapped part of the n-step return.
- In : 1d array, dtype: float, shape: [batch_size]
A batch of bootstrapping factors. For instance, in n-step bootstrapping this is given by \(I^{(n)}_t=\gamma^n\) if the episode is ongoing and \(I^{(n)}_t=0\) otherwise. This allows us to write the bootstrapped target as:
\[G^{(n)}_t\ =\ R^{(n)}_t + I^{(n)}_t\,v(S_{t+n})\]
- S_next : nd array, shape: [batch_size, …]
A batch of next-state observations.
Returns: - losses : dict
A dict of losses/metrics, of type {name <str>: value <float>}.
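To make the Rn/In convention concrete, here is a small NumPy sketch (illustrative only, not library code) that builds the partial return and bootstrapping factor for a single n-step transition:

    import numpy as np

    def n_step_parts(rewards, gamma, n, done):
        """R^(n)_t and I^(n)_t for one transition, given rewards R_t..R_{t+n-1}."""
        rewards = np.asarray(rewards[:n], dtype=float)
        Rn = np.sum(gamma ** np.arange(len(rewards)) * rewards)  # partial return
        In = 0.0 if done else gamma ** n                         # bootstrap factor
        return Rn, In

    Rn, In = n_step_parts([1.0, 1.0, 1.0], gamma=0.9, n=3, done=False)
    # Rn = 1 + 0.9 + 0.81 = 2.71 and In = 0.9 ** 3 = 0.729, so the target
    # is Gn = Rn + In * v(s_next).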
sync_target_model(self, tau=1.0)¶
Synchronize the target model with the primary model.
Parameters: - tau : float between 0 and 1, optional
The amount of exponential smoothing to apply in the target update:
\[w_\text{target}\ \leftarrow\ (1 - \tau)\,w_\text{target} + \tau\,w_\text{primary}\]
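The effect of tau can be read off the formula directly; a plain NumPy illustration (not library code):

    import numpy as np

    w_target = np.array([0.0, 0.0])
    w_primary = np.array([1.0, 2.0])

    tau = 0.1  # soft update: move the target 10% of the way to the primary
    w_target = (1 - tau) * w_target + tau * w_primary  # -> [0.1, 0.2]

    tau = 1.0  # hard update (the default): exact copy of the primary weights
    w_target = (1 - tau) * w_target + tau * w_primary  # -> [1.0, 2.0]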
class keras_gym.QTypeI(function_approximator, gamma=0.9, bootstrap_n=1, bootstrap_with_target_model=False, update_strategy='sarsa')[source]¶
A type-I state-action value function \((s,a)\mapsto q(s,a)\).
Parameters: - function_approximator : FunctionApproximator object
The main function approximator.
- gamma : float, optional
The discount factor for discounting future rewards.
- bootstrap_n : positive int, optional
The number of steps in n-step bootstrapping. It specifies the number of steps over which we’re willing to delay bootstrapping. Large \(n\) corresponds to Monte Carlo updates and \(n=1\) corresponds to TD(0).
- bootstrap_with_target_model : bool, optional
Whether to use the target_model when constructing a bootstrapped target. If False (default), the primary predict_model is used.
- update_strategy : str, optional
The update strategy used to select the (would-be) next action \(A_{t+n}\) in the bootstrapped target:
\[G^{(n)}_t\ =\ R^{(n)}_t + \gamma^n Q(S_{t+n}, A_{t+n})\]
Options are:
- ‘sarsa’
Sample the next action, i.e. use the action that was actually taken.
- ‘q_learning’
Take the action with highest Q-value under the current estimate, i.e. \(A_{t+n} = \arg\max_aQ(S_{t+n}, a)\). This is an off-policy method.
- ‘double_q_learning’
Same as ‘q_learning’, \(A_{t+n} = \arg\max_aQ(S_{t+n}, a)\), except that the value itself is computed using the target_model rather than the primary model, i.e.
\[\begin{split}A_{t+n}\ &=\ \arg\max_aQ_\text{primary}(S_{t+n}, a)\\ G^{(n)}_t\ &=\ R^{(n)}_t + \gamma^n Q_\text{target}(S_{t+n}, A_{t+n})\end{split}\]
- ‘expected_sarsa’
Similar to SARSA in that it’s on-policy, except that we take the expected Q-value rather than a sample of it, i.e.
\[G^{(n)}_t\ =\ R^{(n)}_t + \gamma^n\sum_a\pi(a|S_{t+n})\,Q(S_{t+n}, a)\]
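The four strategies differ only in how the bootstrap term is formed from \(q(S_{t+n},.)\). A NumPy sketch on toy numbers (illustrative only; all names here are made up for the example):

    import numpy as np

    Rn, In = 2.71, 0.729                     # partial return, bootstrap factor
    Q_next = np.array([1.0, 3.0, 2.0])       # primary model:  q(s_next, .)
    Q_next_targ = np.array([0.8, 2.5, 2.2])  # target model:   q(s_next, .)
    a_next = 2                               # action actually taken in s_next
    pi_next = np.array([0.2, 0.5, 0.3])      # policy probabilities pi(.|s_next)

    G_sarsa = Rn + In * Q_next[a_next]                     # on-policy sample
    G_q_learning = Rn + In * Q_next.max()                  # greedy, off-policy
    G_double_q = Rn + In * Q_next_targ[Q_next.argmax()]    # argmax on primary,
                                                           # value from target
    G_expected_sarsa = Rn + In * (pi_next * Q_next).sum()  # on-policy expectation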
__call__(self, s, a=None, use_target_model=False)¶
Evaluate the Q-function.
Parameters: - s : state observation
A single state observation.
- a : action, optional
A single action.
- use_target_model : bool, optional
Whether to use the target_model internally. If False (default), the predict_model is used.
Returns: - Q : float or array of floats
If action a is provided, a single float representing \(q(s,a)\) is returned. If, on the other hand, a is left unspecified, a vector representing \(q(s,.)\) is returned instead. The shape of the latter return value is [num_actions], which is only well-defined for discrete action spaces.
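For example (a sketch assuming a QTypeI instance q, a state observation s, and an action a from a discrete action space):

    q_sa = q(s, a)            # float: q(s,a) for the given action
    q_s = q(s)                # 1d array of shape [num_actions]: q(s,.)
    greedy_a = q_s.argmax()   # e.g. pick the greedy action from the vector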
batch_eval(self, S, A=None, use_target_model=False)[source]¶
Evaluate the Q-function on a batch of state (or state-action) observations.
Parameters: - S : nd array, shape: [batch_size, …]
A batch of state observations.
- A : 1d array, dtype: int, shape: [batch_size], optional
A batch of actions that were taken.
- use_target_model : bool, optional
Whether to use the target_model internally. If False (default), the predict_model is used.
Returns: - Q : 1d or 2d array of floats
If A is provided, a 1d array representing a batch of \(q(s,a)\) is returned. If, on the other hand, A is left unspecified, a 2d array representing a batch of \(q(s,.)\) is returned instead. The shape of the latter return value is [batch_size, num_actions], which is only well-defined for discrete action spaces.
batch_update(self, S, A, Rn, In, S_next, A_next=None)¶
Update the value function on a batch of transitions.
Parameters: - S : nd array, shape: [batch_size, …]
A batch of state observations.
- A : nd Tensor, shape: [batch_size, …]
A batch of actions taken.
- Rn : 1d array, dtype: float, shape: [batch_size]
A batch of partial returns. For example, in n-step bootstrapping this is given by:
\[R^{(n)}_t\ =\ R_t + \gamma\,R_{t+1} + \dots + \gamma^{n-1}\,R_{t+n-1}\]
In other words, it’s the non-bootstrapped part of the n-step return.
- In : 1d array, dtype: float, shape: [batch_size]
A batch of bootstrapping factors. For instance, in n-step bootstrapping this is given by \(I^{(n)}_t=\gamma^n\) if the episode is ongoing and \(I^{(n)}_t=0\) otherwise. This allows us to write the bootstrapped target as:
\[G^{(n)}_t\ =\ R^{(n)}_t + I^{(n)}_t\,Q(S_{t+n}, A_{t+n})\]
- S_next : nd array, shape: [batch_size, …]
A batch of next-state observations.
- A_next : 2d Tensor, shape: [batch_size, …]
A batch of (potential) next actions. This argument is only used if update_strategy='sarsa'.
Returns: - losses : dict
A dict of losses/metrics, of type {name <str>: value <float>}.
bootstrap_target(self, Rn, In, S_next, A_next=None)¶
Get the bootstrapped target \(G^{(n)}_t\ =\ R^{(n)}_t + I^{(n)}_t\,Q(S_{t+n}, A_{t+n})\).
Parameters: - Rn : 1d array, dtype: float, shape: [batch_size]
A batch of partial returns. For example, in n-step bootstrapping this is given by:
\[R^{(n)}_t\ =\ R_t + \gamma\,R_{t+1} + \dots + \gamma^{n-1}\,R_{t+n-1}\]
In other words, it’s the non-bootstrapped part of the n-step return.
- In : 1d array, dtype: float, shape: [batch_size]
A batch of bootstrapping factors. For instance, in n-step bootstrapping this is given by \(I^{(n)}_t=\gamma^n\) if the episode is ongoing and \(I^{(n)}_t=0\) otherwise. This allows us to write the bootstrapped target as:
\[G^{(n)}_t\ =\ R^{(n)}_t + I^{(n)}_t\,Q(S_{t+n}, A_{t+n})\]
- S_next : nd array, shape: [batch_size, …]
A batch of next-state observations.
- A_next : 2d Tensor, dtype: int, shape: [batch_size, num_actions]
A batch of (potential) next actions. This argument is only used if update_strategy='sarsa'.
Returns: - Gn : 1d array, dtype: float, shape: [batch_size]
A batch of bootstrap-estimated returns \(G^{(n)}_t=R^{(n)}_t+I^{(n)}_t\,Q(S_{t+n},A_{t+n})\), computed according to the given update_strategy.
sync_target_model(self, tau=1.0)¶
Synchronize the target model with the primary model.
Parameters: - tau : float between 0 and 1, optional
The amount of exponential smoothing to apply in the target update:
\[w_\text{target}\ \leftarrow\ (1 - \tau)\,w_\text{target} + \tau\,w_\text{primary}\]
update(self, s, a, r, done)¶
Update the Q-function.
Parameters: - s : state observation
A single state observation.
- a : action
A single action.
- r : float
A single observed reward.
- done : bool
Whether the episode has finished.
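In a training loop, update is typically called once per environment step, e.g. (a sketch assuming a QTypeI instance q built as above; the epsilon-greedy exploration is stand-in code, not a keras-gym API, and the classic 4-tuple gym step interface is assumed):

    import gym
    import numpy as np

    env = gym.make('CartPole-v0')  # any gym env with a discrete action space

    for episode in range(500):
        s = env.reset()
        done = False
        while not done:
            # epsilon-greedy over q(s,.) -- stand-in exploration logic
            if np.random.rand() < 0.1:
                a = env.action_space.sample()
            else:
                a = q(s).argmax()
            s_next, r, done, info = env.step(a)
            q.update(s, a, r, done)  # single-transition update documented above
            s = s_next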
class keras_gym.QTypeII(function_approximator, gamma=0.9, bootstrap_n=1, bootstrap_with_target_model=False, update_strategy='sarsa')[source]¶
A type-II state-action value function \(s\mapsto q(s,.)\).
Parameters: - function_approximator : FunctionApproximator object
The main function approximator.
- gamma : float, optional
The discount factor for discounting future rewards.
- bootstrap_n : positive int, optional
The number of steps in n-step bootstrapping. It specifies the number of steps over which we’re willing to delay bootstrapping. Large \(n\) corresponds to Monte Carlo updates and \(n=1\) corresponds to TD(0).
- bootstrap_with_target_model : bool, optional
Whether to use the target_model when constructing a bootstrapped target. If False (default), the primary predict_model is used.
- update_strategy : str, optional
The update strategy used to select the (would-be) next action \(A_{t+n}\) in the bootstrapped target:
\[G^{(n)}_t\ =\ R^{(n)}_t + \gamma^n Q(S_{t+n}, A_{t+n})\]
Options are:
- ‘sarsa’
Sample the next action, i.e. use the action that was actually taken.
- ‘q_learning’
Take the action with highest Q-value under the current estimate, i.e. \(A_{t+n} = \arg\max_aQ(S_{t+n}, a)\). This is an off-policy method.
- ‘double_q_learning’
Same as ‘q_learning’, \(A_{t+n} = \arg\max_aQ(S_{t+n}, a)\), except that the value itself is computed using the target_model rather than the primary model, i.e.
\[\begin{split}A_{t+n}\ &=\ \arg\max_aQ_\text{primary}(S_{t+n}, a)\\ G^{(n)}_t\ &=\ R^{(n)}_t + \gamma^n Q_\text{target}(S_{t+n}, A_{t+n})\end{split}\]
- ‘expected_sarsa’
Similar to SARSA in that it’s on-policy, except that we take the expected Q-value rather than a sample of it, i.e.
\[G^{(n)}_t\ =\ R^{(n)}_t + \gamma^n\sum_a\pi(a|S_{t+n})\,Q(S_{t+n}, a)\]
__call__(self, s, a=None, use_target_model=False)¶
Evaluate the Q-function.
Parameters: - s : state observation
A single state observation.
- a : action, optional
A single action.
- use_target_model : bool, optional
Whether to use the target_model internally. If False (default), the predict_model is used.
Returns: - Q : float or array of floats
If action a is provided, a single float representing \(q(s,a)\) is returned. If, on the other hand, a is left unspecified, a vector representing \(q(s,.)\) is returned instead. The shape of the latter return value is [num_actions], which is only well-defined for discrete action spaces.
batch_eval(self, S, A=None, use_target_model=False)[source]¶
Evaluate the Q-function on a batch of state (or state-action) observations.
Parameters: - S : nd array, shape: [batch_size, …]
A batch of state observations.
- A : 1d array, dtype: int, shape: [batch_size], optional
A batch of actions that were taken.
- use_target_model : bool, optional
Whether to use the target_model internally. If False (default), the predict_model is used.
Returns: - Q : 1d or 2d array of floats
If A is provided, a 1d array representing a batch of \(q(s,a)\) is returned. If, on the other hand, A is left unspecified, a 2d array representing a batch of \(q(s,.)\) is returned instead. The shape of the latter return value is [batch_size, num_actions], which is only well-defined for discrete action spaces.
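The two return shapes can be checked directly (a sketch assuming a QTypeII instance q, a batch of states S, a matching batch of actions A, and a discrete-action gym environment env):

    Q_all = q.batch_eval(S)       # shape: [batch_size, num_actions]
    Q_taken = q.batch_eval(S, A)  # shape: [batch_size], one value per (s, a)
    assert Q_all.shape == (len(S), env.action_space.n)
    assert Q_taken.shape == (len(S),)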
batch_update(self, S, A, Rn, In, S_next, A_next=None)¶
Update the value function on a batch of transitions.
Parameters: - S : nd array, shape: [batch_size, …]
A batch of state observations.
- A : nd Tensor, shape: [batch_size, …]
A batch of actions taken.
- Rn : 1d array, dtype: float, shape: [batch_size]
A batch of partial returns. For example, in n-step bootstrapping this is given by:
\[R^{(n)}_t\ =\ R_t + \gamma\,R_{t+1} + \dots + \gamma^{n-1}\,R_{t+n-1}\]
In other words, it’s the non-bootstrapped part of the n-step return.
- In : 1d array, dtype: float, shape: [batch_size]
A batch of bootstrapping factors. For instance, in n-step bootstrapping this is given by \(I^{(n)}_t=\gamma^n\) if the episode is ongoing and \(I^{(n)}_t=0\) otherwise. This allows us to write the bootstrapped target as:
\[G^{(n)}_t\ =\ R^{(n)}_t + I^{(n)}_t\,Q(S_{t+n}, A_{t+n})\]
- S_next : nd array, shape: [batch_size, …]
A batch of next-state observations.
- A_next : 2d Tensor, shape: [batch_size, …]
A batch of (potential) next actions. This argument is only used if update_strategy='sarsa'.
Returns: - losses : dict
A dict of losses/metrics, of type {name <str>: value <float>}.
bootstrap_target(self, Rn, In, S_next, A_next=None)¶
Get the bootstrapped target \(G^{(n)}_t\ =\ R^{(n)}_t + I^{(n)}_t\,Q(S_{t+n}, A_{t+n})\).
Parameters: - Rn : 1d array, dtype: float, shape: [batch_size]
A batch of partial returns. For example, in n-step bootstrapping this is given by:
\[R^{(n)}_t\ =\ R_t + \gamma\,R_{t+1} + \dots + \gamma^{n-1}\,R_{t+n-1}\]
In other words, it’s the non-bootstrapped part of the n-step return.
- In : 1d array, dtype: float, shape: [batch_size]
A batch of bootstrapping factors. For instance, in n-step bootstrapping this is given by \(I^{(n)}_t=\gamma^n\) if the episode is ongoing and \(I^{(n)}_t=0\) otherwise. This allows us to write the bootstrapped target as:
\[G^{(n)}_t\ =\ R^{(n)}_t + I^{(n)}_t\,Q(S_{t+n}, A_{t+n})\]
- S_next : nd array, shape: [batch_size, …]
A batch of next-state observations.
- A_next : 2d Tensor, dtype: int, shape: [batch_size, num_actions]
A batch of (potential) next actions. This argument is only used if update_strategy='sarsa'.
Returns: - Gn : 1d array, dtype: float, shape: [batch_size]
A batch of bootstrap-estimated returns \(G^{(n)}_t=R^{(n)}_t+I^{(n)}_t\,Q(S_{t+n},A_{t+n})\), computed according to the given update_strategy.
sync_target_model(self, tau=1.0)¶
Synchronize the target model with the primary model.
Parameters: - tau : float between 0 and 1, optional
The amount of exponential smoothing to apply in the target update:
\[w_\text{target}\ \leftarrow\ (1 - \tau)\,w_\text{target} + \tau\,w_\text{primary}\]
update(self, s, a, r, done)¶
Update the Q-function.
Parameters: - s : state observation
A single state observation.
- a : action
A single action.
- r : float
A single observed reward.
- done : bool
Whether the episode has finished.