Updateable Policies
keras_gym.GaussianPolicy
    An updateable policy for environments with a continuous action space, i.e. a Box.
keras_gym.SoftmaxPolicy
    Updateable policy for discrete action spaces.
class keras_gym.GaussianPolicy(function_approximator, update_strategy='vanilla', ppo_clip_eps=0.2, entropy_beta=0.01, random_seed=None)[source]

An updateable policy for environments with a continuous action space, i.e. a Box. It models the policy \(\pi_\theta(a|s)\) as a normal distribution with conditional parameters \((\mu_\theta(s), \sigma_\theta(s))\).

Important

This policy requires that the env is wrapped with:

    env = km.wrappers.BoxToReals(env)

This wrapper decompactifies the Box action space.
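As a quick orientation, here is a minimal usage sketch. It assumes that km is keras_gym, that func is a previously constructed FunctionApproximator, and that env is a gym environment with a Box action space; none of this is prescriptive beyond the documented signatures below.

    import keras_gym as km

    env = km.wrappers.BoxToReals(env)    # decompactify the Box action space
    pi = km.GaussianPolicy(func, update_strategy='ppo', ppo_clip_eps=0.2)

    s = env.reset()
    a = pi(s)                    # sample an action from pi(a|s)
    a_greedy = pi.greedy(s)      # the greedy (most probable) action
    params = pi.dist_params(s)   # tuple of raw distribution parameters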
Parameters: - function_approximator : FunctionApproximator object
The main function approximator.
- update_strategy : str, optional
The strategy for updating our policy. This typically determines the loss function that we use for our policy function approximator.
Options are:
- ‘vanilla’
  Plain vanilla policy gradient. The corresponding (surrogate) loss function that we use is:
  \[J(\theta)\ =\ \hat{\mathbb{E}}_t \left\{ -\mathcal{A}_t\,\log\pi_\theta(A_t|S_t) \right\}\]
  where \(\mathcal{A}_t=\mathcal{A}(S_t,A_t)\) is the advantage at time step \(t\).
- ‘ppo’
  Proximal policy optimization uses a clipped proximal loss:
  \[J(\theta)\ =\ \hat{\mathbb{E}}_t \left\{ \min\Big( \rho_t(\theta)\,\mathcal{A}_t\,,\ \tilde{\rho}_t(\theta)\,\mathcal{A}_t \Big) \right\}\]
  where \(\rho_t(\theta)\) is the probability ratio:
  \[\rho_t(\theta)\ =\ \frac{\pi_\theta(A_t|S_t)}{\pi_{\theta_\text{old}}(A_t|S_t)}\]
  and \(\tilde{\rho}_t(\theta)\) is its clipped version:
  \[\tilde{\rho}_t(\theta)\ =\ \text{clip}\big(\rho_t(\theta), 1-\epsilon, 1+\epsilon\big)\]
  (A small NumPy sketch of this clipped surrogate is given after this parameter list.)
- ‘cross_entropy’
  Straightforward categorical cross-entropy (from logits). This loss function does not make use of the advantages Adv. Instead, it minimizes the cross entropy between the behavior policy \(\pi_b(a|s)\) and the learned policy \(\pi_\theta(a|s)\):
  \[J(\theta)\ =\ \hat{\mathbb{E}}_t\left\{ -\sum_a \pi_b(a|S_t)\, \log \pi_\theta(a|S_t) \right\}\]
- ppo_clip_eps : float, optional
  The clipping parameter \(\epsilon\) in the PPO clipped surrogate loss. This option is only applicable if update_strategy='ppo'.
- entropy_beta : float, optional
  The coefficient of the entropy bonus term in the policy objective.
- random_seed : int, optional
  Sets the random state to get reproducible results.
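The following NumPy sketch (not part of keras_gym) mirrors the ‘ppo’ surrogate above, purely to make the formulas for \(\rho_t(\theta)\) and \(\tilde{\rho}_t(\theta)\) concrete:

    import numpy as np

    def ppo_surrogate(log_pi_new, log_pi_old, adv, eps=0.2):
        # probability ratio rho_t(theta) = pi_new / pi_old, from log-probabilities
        rho = np.exp(log_pi_new - log_pi_old)
        # clipped ratio rho~_t(theta)
        rho_clipped = np.clip(rho, 1.0 - eps, 1.0 + eps)
        # J(theta) = E_t[ min(rho_t * Adv_t, rho~_t * Adv_t) ]; the training
        # loss is the negation of this objective
        return np.mean(np.minimum(rho * adv, rho_clipped * adv))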
__call__(self, s, use_target_model=False)

Draw an action from the current policy \(\pi(a|s)\).
Parameters: - s : state observation
A single state observation.
- use_target_model : bool, optional
Whether to use the target_model internally. If False (default), the predict_model is used.
Returns: - a : action
A single action proposed under the current policy.
batch_eval(self, S, use_target_model=False)

Evaluate the policy on a batch of state observations.
Parameters: - S : nd array, shape: [batch_size, …]
A batch of state observations.
- use_target_model : bool, optional
Whether to use the target_model internally. If False (default), the predict_model is used.
Returns: - A : nd array, shape: [batch_size, …]
A batch of sampled actions.
batch_update(self, S, A, Adv)

Update the policy on a batch of transitions.
Parameters: - S : nd array, shape: [batch_size, …]
A batch of state observations.
- A : nd array, shape: [batch_size, …]
A batch of actions taken by the behavior policy.
- Adv : 1d array, dtype: float, shape: [batch_size]
A value for the advantage \(\mathcal{A}(s,a) = q(s,a) - v(s)\). This might be a sampled and/or estimated version of the true advantage.
Returns: - losses : dict
A dict of losses/metrics, of type {name <str>: value <float>}.
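To make the expected shapes concrete, here is a purely illustrative call with synthetic NumPy data; the state dimension (4), action dimension (2) and batch size (32) are arbitrary assumptions, and pi stands for a previously constructed GaussianPolicy:

    import numpy as np

    S = np.random.randn(32, 4)    # batch of 32 state observations
    A = np.random.randn(32, 2)    # actions taken by the behavior policy
    Adv = np.random.randn(32)     # one advantage estimate per transition

    losses = pi.batch_update(S, A, Adv)
    # losses is a dict of {name <str>: value <float>} losses/metrics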
dist_params(self, s, use_target_model=False)

Get the parameters of the (conditional) probability distribution \(\pi(a|s)\).
Parameters: - s : state observation
A single state observation.
- use_target_model : bool, optional
Whether to use the target_model internally. If False (default), the predict_model is used.
Returns: - *params : tuple of arrays
The raw distribution parameters.
greedy(self, s, use_target_model=False)

Draw the greedy action, i.e. \(\arg\max_a\pi(a|s)\).
Parameters: - s : state observation
A single state observation.
- use_target_model : bool, optional
Whether to use the target_model internally. If False (default), the predict_model is used.
Returns: - a : action
A single action proposed under the current policy.
policy_loss_with_metrics(self, Adv, A=None)

This method constructs the policy loss as a scalar-valued Tensor, together with a dictionary of metrics (also scalars).
This method may be overridden to construct a custom policy loss and/or to change the accompanying metrics.
Parameters: - Adv : 1d Tensor, shape: [batch_size]
A batch of advantages.
- A : nd Tensor, shape: [batch_size, …]
A batch of actions taken under the behavior policy. For some choices of policy loss, e.g. update_strategy='sac', this input is ignored.
Returns: - loss, metrics : (Tensor, dict of Tensors)
The policy loss along with some metrics, which is a dict of type {name <str>: metric <Tensor>}. The loss and each of the metrics (dict values) are scalar Tensors, i.e. Tensors with ndim=0.

The loss is passed to a keras Model using train_model.add_loss(loss). Similarly, each metric in the metric dict is passed to the model using train_model.add_metric(metric, name=name, aggregation='mean').
sync_target_model(self, tau=1.0)

Synchronize the target model with the primary model.
Parameters: - tau : float between 0 and 1, optional
The amount of exponential smoothing to apply in the target update:
\[w_\text{target}\ \leftarrow\ (1 - \tau)\,w_\text{target} + \tau\,w_\text{primary}\]
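For example, tau=1.0 performs a hard copy of the primary weights into the target model, while a small tau gives a slowly moving (Polyak-averaged) target (assuming pi is a constructed policy):

    pi.sync_target_model(tau=1.0)    # hard update: w_target <- w_primary
    pi.sync_target_model(tau=0.01)   # soft update: mix in 1% of the primary weights per call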
update(self, s, a, advantage)

Update the policy.
Parameters: - s : state observation
A single state observation.
- a : action
A single action.
- advantage : float
A value for the advantage \(\mathcal{A}(s,a) = q(s,a) - v(s)\). This might be a sampled and/or estimated version of the true advantage.
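Schematically, an online agent would compute an advantage estimate each step and feed it straight into update(). The loop below is only a sketch: the advantage is set to the raw reward as a placeholder, whereas in practice it would come from a critic or a sampled return.

    s = env.reset()
    done = False
    while not done:
        a = pi(s)
        s_next, r, done, info = env.step(a)
        adv = r               # placeholder advantage; use a proper estimate in practice
        pi.update(s, a, adv)
        s = s_next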
class keras_gym.SoftmaxPolicy(function_approximator, update_strategy='vanilla', ppo_clip_eps=0.2, entropy_beta=0.01, random_seed=None)[source]

Updateable policy for discrete action spaces.
Parameters: - function_approximator : FunctionApproximator object
The main function approximator.
- update_strategy : str, callable, optional
The strategy for updating our policy. This determines the loss function that we use for our policy function approximator. If you wish to use a custom policy loss, you can override the policy_loss_with_metrics() method.

Provided options are:

- ‘vanilla’
  Plain vanilla policy gradient. The corresponding (surrogate) loss function that we use is:
  \[J(\theta)\ =\ -\mathcal{A}(s,a)\,\ln\pi(a|s,\theta)\]
- ‘ppo’
  Proximal policy optimization uses a clipped proximal loss:
  \[J(\theta)\ =\ \min\Big( r(\theta)\,\mathcal{A}(s,a)\,,\ \text{clip}\big( r(\theta), 1-\epsilon, 1+\epsilon\big)\,\mathcal{A}(s,a)\Big)\]
  where \(r(\theta)\) is the probability ratio:
  \[r(\theta)\ =\ \frac{\pi(a|s,\theta)}{\pi(a|s,\theta_\text{old})}\]
- ‘cross_entropy’
  Straightforward categorical cross-entropy (from logits). This loss function does not make use of the advantages Adv. Instead, it minimizes the cross entropy between the behavior policy \(\pi_b(a|s)\) and the learned policy \(\pi_\theta(a|s)\) (see the NumPy sketch after this parameter list):
  \[J(\theta)\ =\ \hat{\mathbb{E}}_t\left\{ -\sum_a \pi_b(a|S_t)\, \log \pi_\theta(a|S_t) \right\}\]
- ppo_clip_eps : float, optional
  The clipping parameter \(\epsilon\) in the PPO clipped surrogate loss. This option is only applicable if update_strategy='ppo'.
- entropy_beta : float, optional
  The coefficient of the entropy bonus term in the policy objective.
- random_seed : int, optional
Sets the random state to get reproducible results.
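The NumPy sketch below (not part of keras_gym) spells out the ‘cross_entropy’ loss above for a batch of logits and behavior-policy distributions; a policy using this strategy would simply be constructed with SoftmaxPolicy(func, update_strategy='cross_entropy').

    import numpy as np

    def softmax(logits):
        z = logits - logits.max(axis=-1, keepdims=True)  # for numerical stability
        e = np.exp(z)
        return e / e.sum(axis=-1, keepdims=True)

    def cross_entropy_loss(logits, pi_b):
        # J(theta) = E_t[ -sum_a pi_b(a|S_t) * log pi_theta(a|S_t) ]
        log_pi = np.log(softmax(logits))
        return float(np.mean(-(pi_b * log_pi).sum(axis=-1)))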
__call__(self, s, use_target_model=False)[source]

Draw an action from the current policy \(\pi(a|s)\).
Parameters: - s : state observation
A single state observation.
- use_target_model : bool, optional
Whether to use the target_model internally. If False (default), the predict_model is used.
Returns: - a : action
A single action proposed under the current policy.
batch_eval(self, S, use_target_model=False)

Evaluate the policy on a batch of state observations.
Parameters: - S : nd array, shape: [batch_size, …]
A batch of state observations.
- use_target_model : bool, optional
Whether to use the target_model internally. If False (default), the predict_model is used.
Returns: - A : nd array, shape: [batch_size, …]
A batch of sampled actions.
batch_update(self, S, A, Adv)

Update the policy on a batch of transitions.
Parameters: - S : nd array, shape: [batch_size, …]
A batch of state observations.
- A : nd array, shape: [batch_size, …]
A batch of actions taken by the behavior policy.
- Adv : 1d array, dtype: float, shape: [batch_size]
A value for the advantage \(\mathcal{A}(s,a) = q(s,a) - v(s)\). This might be a sampled and/or estimated version of the true advantage.
Returns: - losses : dict
A dict of losses/metrics, of type {name <str>: value <float>}.
dist_params(self, s, use_target_model=False)

Get the parameters of the (conditional) probability distribution \(\pi(a|s)\).
Parameters: - s : state observation
A single state observation.
- use_target_model : bool, optional
Whether to use the target_model internally. If False (default), the predict_model is used.
Returns: - *params : tuple of arrays
The raw distribution parameters.
greedy(self, s, use_target_model=False)

Draw the greedy action, i.e. \(\arg\max_a\pi(a|s)\).
Parameters: - s : state observation
A single state observation.
- use_target_model : bool, optional
Whether to use the target_model internally. If False (default), the predict_model is used.
Returns: - a : action
A single action proposed under the current policy.
policy_loss_with_metrics(self, Adv, A=None)

This method constructs the policy loss as a scalar-valued Tensor, together with a dictionary of metrics (also scalars).
This method may be overridden to construct a custom policy loss and/or to change the accompanying metrics.
Parameters: - Adv : 1d Tensor, shape: [batch_size]
A batch of advantages.
- A : nd Tensor, shape: [batch_size, …]
A batch of actions taken under the behavior policy. For some choices of policy loss, e.g. update_strategy='sac', this input is ignored.
Returns: - loss, metrics : (Tensor, dict of Tensors)
The policy loss along with some metrics, which is a dict of type {name <str>: metric <Tensor>}. The loss and each of the metrics (dict values) are scalar Tensors, i.e. Tensors with ndim=0.

The loss is passed to a keras Model using train_model.add_loss(loss). Similarly, each metric in the metric dict is passed to the model using train_model.add_metric(metric, name=name, aggregation='mean').
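As a rough illustration of this contract (one scalar loss Tensor plus a dict of scalar metric Tensors), here is a hedged sketch of an override. It assumes a TensorFlow backend, and self.log_pi(A) is a hypothetical helper standing in for however the log-probabilities \(\log\pi_\theta(A_t|S_t)\) would actually be obtained inside the class:

    import tensorflow as tf
    import keras_gym as km

    class MyPolicy(km.SoftmaxPolicy):
        def policy_loss_with_metrics(self, Adv, A=None):
            log_pi = self.log_pi(A)               # hypothetical helper, illustration only
            loss = -tf.reduce_mean(Adv * log_pi)  # plain REINFORCE-style surrogate loss
            metrics = {'policy/mean_adv': tf.reduce_mean(Adv)}  # scalar metric Tensors
            return loss, metrics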
sync_target_model(self, tau=1.0)

Synchronize the target model with the primary model.
Parameters: - tau : float between 0 and 1, optional
The amount of exponential smoothing to apply in the target update:
\[w_\text{target}\ \leftarrow\ (1 - \tau)\,w_\text{target} + \tau\,w_\text{primary}\]
update(self, s, a, advantage)

Update the policy.
Parameters: - s : state observation
A single state observation.
- a : action
A single action.
- advantage : float
A value for the advantage \(\mathcal{A}(s,a) = q(s,a) - v(s)\). This might be a sampled and/or estimated version of the true advantage.