# Updateable Policies

| Class | Description |
|---|---|
| keras_gym.GaussianPolicy | An updateable policy for environments with a continuous action space. |
| keras_gym.SoftmaxPolicy | Updateable policy for discrete action spaces. |

class keras_gym.GaussianPolicy(function_approximator, update_strategy='vanilla', ppo_clip_eps=0.2, entropy_beta=0.01, random_seed=None)

An updateable policy for environments with a continuous action space, i.e. a Box. It models the policy $$\pi_\theta(a|s)$$ as a normal distribution with conditional parameters $$(\mu_\theta(s), \sigma_\theta(s))$$.
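To make the model above concrete, here is a minimal standard-library sketch of drawing an action from a normal policy and evaluating its log-density. It assumes a diagonal covariance, which the text above does not state, and the values of `mu` and `sigma` are hypothetical stand-ins for $$\mu_\theta(s)$$ and $$\sigma_\theta(s)$$. It is an illustration of the math, not the library's implementation.

```python
import math
import random


def gaussian_log_prob(a, mu, sigma):
    """Log-density of a diagonal Gaussian, summed over action dimensions."""
    return sum(
        -0.5 * ((x - m) / s) ** 2 - math.log(s) - 0.5 * math.log(2.0 * math.pi)
        for x, m, s in zip(a, mu, sigma)
    )


def sample_action(mu, sigma, rng):
    """Draw one action a ~ N(mu, diag(sigma^2)), one component at a time."""
    return [rng.gauss(m, s) for m, s in zip(mu, sigma)]


# Hypothetical outputs of mu_theta(s) and sigma_theta(s) for a single state s.
mu, sigma = [0.5, -1.0], [1.0, 0.2]

rng = random.Random(0)  # fixed seed, for reproducibility only
a = sample_action(mu, sigma, rng)
log_pi = gaussian_log_prob(a, mu, sigma)
```

The log-density is the quantity $$\log\pi_\theta(a|s)$$ that enters the policy-gradient losses.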

Important

This policy requires that the env is wrapped as follows (km is the usual alias for keras_gym):

env = km.wrappers.BoxToReals(env)


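This wrapper decompactifies the Box action space, i.e. it maps the bounded Box onto the unbounded reals, which is the natural domain for a normal distribution.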
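The exact transform used by BoxToReals is not given here. As a rough illustration of what decompactifying a bounded interval can look like, the sketch below assumes a sigmoid squashing and its inverse (a logit). The function names `reals_to_box` and `box_to_reals` are hypothetical and are not part of keras_gym.

```python
import math


def reals_to_box(x, low, high):
    """Map an unbounded real x into the open interval (low, high)."""
    return low + (high - low) / (1.0 + math.exp(-x))


def box_to_reals(a, low, high):
    """Inverse map: send a in (low, high) back onto the real line."""
    p = (a - low) / (high - low)
    return math.log(p / (1.0 - p))


low, high = -2.0, 2.0          # hypothetical bounds of a one-dimensional Box
x = 1.3                        # unbounded action, e.g. a Gaussian sample
a = reals_to_box(x, low, high)  # bounded action passed to the environment
x_back = box_to_reals(a, low, high)
```

With a map of this kind the policy can sample from an unbounded normal distribution while the environment still receives actions inside its bounds.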

Parameters:

- function_approximator : FunctionApproximator object. The main function approximator.
- update_strategy : str, optional. The strategy for updating our policy. This typically determines the loss function that we use for our policy function approximator. Options are:

    - 'vanilla': Plain vanilla policy gradient. The corresponding (surrogate) loss function that we use is:

      $J(\theta)\ =\ \hat{\mathbb{E}}_t \left\{ -\mathcal{A}_t\,\log\pi_\theta(A_t|S_t) \right\}$

      where $$\mathcal{A}_t=\mathcal{A}(S_t,A_t)$$ is the advantage at time step $$t$$.

    - 'ppo': Proximal policy optimization uses a clipped proximal loss:

      $J(\theta)\ =\ \hat{\mathbb{E}}_t \left\{ \min\Big( \rho_t(\theta)\,\mathcal{A}_t\,,\ \tilde{\rho}_t(\theta)\,\mathcal{A}_t \Big) \right\}$

      where $$\rho_t(\theta)$$ is the probability ratio:

      $\rho_t(\theta)\ =\ \frac {\pi_\theta(A_t|S_t)} {\pi_{\theta_\text{old}}(A_t|S_t)}$

      and $$\tilde{\rho}_t(\theta)$$ is its clipped version:

      $\tilde{\rho}_t(\theta)\ =\ \text{clip}\big( \rho_t(\theta), 1-\epsilon, 1+\epsilon\big)$

    - 'cross_entropy': Straightforward categorical cross-entropy (from logits). This loss function does not make use of the advantages Adv. Instead, it minimizes the cross entropy between the behavior policy $$\pi_b(a|s)$$ and the learned policy $$\pi_\theta(a|s)$$:

      $J(\theta)\ =\ \hat{\mathbb{E}}_t\left\{ -\sum_a \pi_b(a|S_t)\, \log \pi_\theta(a|S_t) \right\}$

- ppo_clip_eps : float, optional. The clipping parameter $$\epsilon$$ in the PPO clipped surrogate loss. This option is only applicable if update_strategy='ppo'.
- entropy_beta : float, optional. The coefficient of the entropy bonus term in the policy objective.

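The 'vanilla' and 'ppo' expressions above can be evaluated by hand on a small batch. The following standard-library sketch does this, taking per-sample log-probabilities as plain lists. The function names and the example numbers are hypothetical, and the PPO expression is returned exactly as written above, without any sign flip the library may apply before minimizing.

```python
import math


def vanilla_loss(log_pi, adv):
    """Batch mean of -A_t * log pi_theta(A_t|S_t)."""
    return sum(-a * lp for lp, a in zip(log_pi, adv)) / len(adv)


def ppo_clipped_term(log_pi, log_pi_old, adv, eps=0.2):
    """Batch mean of min(rho * A, clip(rho, 1 - eps, 1 + eps) * A)."""
    total = 0.0
    for lp, lp_old, a in zip(log_pi, log_pi_old, adv):
        rho = math.exp(lp - lp_old)                        # probability ratio
        rho_clipped = min(max(rho, 1.0 - eps), 1.0 + eps)  # clipped ratio
        total += min(rho * a, rho_clipped * a)
    return total / len(adv)


# Hypothetical batch of two transitions.
log_pi = [math.log(0.5), math.log(0.9)]       # under the current policy
log_pi_old = [math.log(0.25), math.log(0.9)]  # under the old policy
adv = [1.0, -2.0]

loss_vanilla = vanilla_loss(log_pi, adv)
term_ppo = ppo_clipped_term(log_pi, log_pi_old, adv, eps=0.2)
```

In the first transition the ratio is about 2, so with a positive advantage the clipped branch (1.2 times the advantage) is the smaller one and gets selected. This is how clipping limits the incentive to move far from the old policy.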
__call__(self, s, use_target_model=False)

Draw an action from the current policy $$\pi(a|s)$$.

Parameters:

- s : state observation. A single state observation.
- use_target_model : bool, optional. Whether to use the target_model internally. If False (default), the predict_model is used.

Returns:

- a : action. A single action proposed under the current policy.

batch_eval(self, S, use_target_model=False)

Evaluate the policy on a batch of state observations.

Parameters:

- S : nd array, shape: [batch_size, …]. A batch of state observations.
- use_target_model : bool, optional. Whether to use the target_model internally. If False (default), the predict_model is used.

Returns:

- A : nd array, shape: [batch_size, …]. A batch of sampled actions.

batch_update(self, S, A, Adv)

Update the policy on a batch of transitions.

Parameters:

- S : nd array, shape: [batch_size, …]. A batch of state observations.
- A : nd array, shape: [batch_size, …]. A batch of actions taken by the behavior policy.
- Adv : 1d array, dtype: float, shape: [batch_size]. A value for the advantage $$\mathcal{A}(s,a) = q(s,a) - v(s)$$. This may be a sampled and/or estimated version of the true advantage.

Returns:

- losses : dict. A dict of losses/metrics, of type {name : value }.

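The Adv argument is typically an estimate rather than the true advantage. One common estimator, shown here purely as an illustration and not necessarily the one paired with this class, is the one-step temporal-difference estimate $$r + \gamma\,v(s') - v(s)$$. The function name, the discount `gamma`, and all numbers below are hypothetical.

```python
def td_advantages(rewards, values, next_values, dones, gamma=0.9):
    """One-step TD advantage estimates: r + gamma * v(s') - v(s).

    The bootstrap term gamma * v(s') is dropped on terminal transitions.
    """
    return [
        r + (0.0 if done else gamma * v_next) - v
        for r, v, v_next, done in zip(rewards, values, next_values, dones)
    ]


# Hypothetical batch of two transitions; the second one ends the episode.
Adv = td_advantages(
    rewards=[1.0, 0.0],
    values=[0.5, 0.2],
    next_values=[0.2, 0.0],
    dones=[False, True],
    gamma=0.9,
)
```

The resulting list has one float per transition, matching the documented shape [batch_size].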
dist_params(self, s, use_target_model=False)

Get the parameters of the (conditional) probability distribution $$\pi(a|s)$$.

Parameters:

- s : state observation. A single state observation.
- use_target_model : bool, optional. Whether to use the target_model internally. If False (default), the predict_model is used.

Returns:

- *params : tuple of arrays. The raw distribution parameters.

greedy(self, s, use_target_model=False)

Draw the greedy action, i.e. $$\arg\max_a\pi(a|s)$$.

Parameters:

- s : state observation. A single state observation.
- use_target_model : bool, optional. Whether to use the target_model internally. If False (default), the predict_model is used.

Returns:

- a : action. A single action proposed under the current policy.

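For a normal distribution the maximizer of the density is its mean, so the greedy action coincides with $$\mu_\theta(s)$$. The one-dimensional sketch below checks this numerically over a few candidate actions. The helper names and the values of `mu` and `sigma` are hypothetical, and the sketch makes no claim about how the library computes the greedy action internally.

```python
import math


def gaussian_pdf(a, mu, sigma):
    """Density of a one-dimensional normal distribution at a."""
    z = (a - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))


def gaussian_greedy(mu):
    """The mode of a normal distribution is its mean."""
    return mu


mu, sigma = 0.3, 0.5  # hypothetical distribution parameters for one state
candidates = [mu + d for d in (-1.0, -0.1, 0.0, 0.1, 1.0)]

# Pick the candidate with the highest density and compare it to the mean.
best = max(candidates, key=lambda a: gaussian_pdf(a, mu, sigma))
```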
policy_loss_with_metrics(self, Adv, A=None)

This method constructs the policy loss as a scalar-valued Tensor, together with a dictionary of metrics (also scalars).

This method may be overridden to construct a custom policy loss and/or to change the accompanying metrics.

Parameters:

- Adv : 1d Tensor, shape: [batch_size]. A batch of advantages.
- A : nd Tensor, shape: [batch_size, …]. A batch of actions taken under the behavior policy. For some choices of policy loss, e.g. update_strategy='sac', this input is ignored.

Returns:

- loss, metrics : (Tensor, dict of Tensors). The policy loss along with some metrics, which is a dict of type {name : metric }. The loss and each of the metrics (dict values) are scalar Tensors, i.e. Tensors with ndim=0.

The loss is passed to a keras Model using train_model.add_loss(loss). Similarly, each metric in the metric dict is passed to the model using train_model.add_metric(metric, name=name, aggregation='mean').

sync_target_model(self, tau=1.0)

Synchronize the target model with the primary model.

Parameters:

- tau : float between 0 and 1, optional. The amount of exponential smoothing to apply in the target update:

  $w_\text{target}\ \leftarrow\ (1 - \tau)\,w_\text{target} + \tau\,w_\text{primary}$

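The target update rule is easy to check on plain lists of weights. In this minimal sketch, `soft_update` is a hypothetical helper that applies the formula elementwise. It is not the library's own code, which operates on the weights of Keras models.

```python
def soft_update(w_target, w_primary, tau=1.0):
    """Return (1 - tau) * w_target + tau * w_primary, elementwise."""
    return [(1.0 - tau) * wt + tau * wp for wt, wp in zip(w_target, w_primary)]


w_target = [1.0, 2.0]   # hypothetical target-model weights
w_primary = [3.0, 4.0]  # hypothetical primary-model weights

hard_copy = soft_update(w_target, w_primary, tau=1.0)  # full synchronization
frozen = soft_update(w_target, w_primary, tau=0.0)     # target left unchanged
smoothed = soft_update(w_target, w_primary, tau=0.1)   # small step to primary
```

With the default tau=1.0 the target weights are replaced by the primary weights outright, while smaller values make the target model trail the primary model more slowly.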
update(self, s, a, advantage)

Update the policy.

Parameters:

- s : state observation. A single state observation.
- a : action. A single action.
- advantage : float. A value for the advantage $$\mathcal{A}(s,a) = q(s,a) - v(s)$$. This may be a sampled and/or estimated version of the true advantage.

class keras_gym.SoftmaxPolicy(function_approximator, update_strategy='vanilla', ppo_clip_eps=0.2, entropy_beta=0.01, random_seed=None)

Updateable policy for discrete action spaces.

Parameters:

- function_approximator : FunctionApproximator object. The main function approximator.
- update_strategy : str, callable, optional. The strategy for updating our policy. This determines the loss function that we use for our policy function approximator. If you wish to use a custom policy loss, you can override the policy_loss_with_metrics() method. Provided options are:

    - 'vanilla': Plain vanilla policy gradient. The corresponding (surrogate) loss function that we use is:

      $J(\theta)\ =\ -\mathcal{A}(s,a)\,\ln\pi(a|s,\theta)$

    - 'ppo': Proximal policy optimization uses a clipped proximal loss:

      $J(\theta)\ =\ \min\Big( r(\theta)\,\mathcal{A}(s,a)\,,\ \text{clip}\big( r(\theta), 1-\epsilon, 1+\epsilon\big) \,\mathcal{A}(s,a)\Big)$

      where $$r(\theta)$$ is the probability ratio:

      $r(\theta)\ =\ \frac {\pi(a|s,\theta)} {\pi(a|s,\theta_\text{old})}$

    - 'cross_entropy': Straightforward categorical cross-entropy (from logits). This loss function does not make use of the advantages Adv. Instead, it minimizes the cross entropy between the behavior policy $$\pi_b(a|s)$$ and the learned policy $$\pi_\theta(a|s)$$:

      $J(\theta)\ =\ \hat{\mathbb{E}}_t\left\{ -\sum_a \pi_b(a|S_t)\, \log \pi_\theta(a|S_t) \right\}$

- ppo_clip_eps : float, optional. The clipping parameter $$\epsilon$$ in the PPO clipped surrogate loss. This option is only applicable if update_strategy='ppo'.
- entropy_beta : float, optional. The coefficient of the entropy bonus term in the policy objective.
- random_seed : int, optional. Sets the random state to get reproducible results.

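For a discrete action space, the 'cross_entropy' expression and the entropy that entropy_beta scales can both be computed from logits. The standard-library sketch below evaluates them for a single state. The function names and numbers are hypothetical, and how the library combines the entropy bonus with the loss is not specified here, so the sketch computes the two quantities separately.

```python
import math


def softmax(logits):
    """Convert logits to probabilities, shifting by the max for stability."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(pi_b, logits):
    """-sum_a pi_b(a|s) * log pi_theta(a|s) for a single state."""
    pi = softmax(logits)
    return -sum(p_b * math.log(p) for p_b, p in zip(pi_b, pi))


def entropy(logits):
    """Entropy -sum_a pi(a|s) * log pi(a|s) of the softmax policy."""
    pi = softmax(logits)
    return -sum(p * math.log(p) for p in pi)


logits = [0.0, 0.0]  # hypothetical network output: a uniform two-action policy
pi_b = [1.0, 0.0]    # hypothetical behavior policy that always picks action 0

ce = cross_entropy(pi_b, logits)
h = entropy(logits)
```

For a uniform policy over two actions both quantities equal $$\log 2$$. The entropy is largest for a uniform policy, which is why an entropy bonus encourages exploration.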
__call__(self, s, use_target_model=False)

Draw an action from the current policy $$\pi(a|s)$$.

Parameters:

- s : state observation. A single state observation.
- use_target_model : bool, optional. Whether to use the target_model internally. If False (default), the predict_model is used.

Returns:

- a : action. A single action proposed under the current policy.

batch_eval(self, S, use_target_model=False)

Evaluate the policy on a batch of state observations.

Parameters:

- S : nd array, shape: [batch_size, …]. A batch of state observations.
- use_target_model : bool, optional. Whether to use the target_model internally. If False (default), the predict_model is used.

Returns:

- A : nd array, shape: [batch_size, …]. A batch of sampled actions.

batch_update(self, S, A, Adv)

Update the policy on a batch of transitions.

Parameters:

- S : nd array, shape: [batch_size, …]. A batch of state observations.
- A : nd array, shape: [batch_size, …]. A batch of actions taken by the behavior policy.
- Adv : 1d array, dtype: float, shape: [batch_size]. A value for the advantage $$\mathcal{A}(s,a) = q(s,a) - v(s)$$. This may be a sampled and/or estimated version of the true advantage.

Returns:

- losses : dict. A dict of losses/metrics, of type {name : value }.

dist_params(self, s, use_target_model=False)

Get the parameters of the (conditional) probability distribution $$\pi(a|s)$$.

Parameters:

- s : state observation. A single state observation.
- use_target_model : bool, optional. Whether to use the target_model internally. If False (default), the predict_model is used.

Returns:

- *params : tuple of arrays. The raw distribution parameters.

greedy(self, s, use_target_model=False)

Draw the greedy action, i.e. $$\arg\max_a\pi(a|s)$$.

Parameters:

- s : state observation. A single state observation.
- use_target_model : bool, optional. Whether to use the target_model internally. If False (default), the predict_model is used.

Returns:

- a : action. A single action proposed under the current policy.

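For a discrete policy the difference between sampling an action and taking the greedy one is easy to see on a probability vector. This minimal sketch uses only the standard library. The helper names and the probabilities are hypothetical and do not mirror the library's internals.

```python
import random


def softmax_greedy(probs):
    """Index of the most probable action, i.e. argmax_a pi(a|s)."""
    return max(range(len(probs)), key=lambda i: probs[i])


def softmax_sample(probs, rng):
    """Draw an action index with probability proportional to probs."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


probs = [0.1, 0.7, 0.2]  # hypothetical pi(.|s) over three discrete actions
rng = random.Random(0)   # fixed seed, for reproducibility only

a_greedy = softmax_greedy(probs)
a_sampled = softmax_sample(probs, rng)
```

The greedy action is deterministic for a given state, whereas repeated sampling visits every action with non-zero probability.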
policy_loss_with_metrics(self, Adv, A=None)

This method constructs the policy loss as a scalar-valued Tensor, together with a dictionary of metrics (also scalars).

This method may be overridden to construct a custom policy loss and/or to change the accompanying metrics.

Parameters:

- Adv : 1d Tensor, shape: [batch_size]. A batch of advantages.
- A : nd Tensor, shape: [batch_size, …]. A batch of actions taken under the behavior policy. For some choices of policy loss, e.g. update_strategy='sac', this input is ignored.

Returns:

- loss, metrics : (Tensor, dict of Tensors). The policy loss along with some metrics, which is a dict of type {name : metric }. The loss and each of the metrics (dict values) are scalar Tensors, i.e. Tensors with ndim=0.

The loss is passed to a keras Model using train_model.add_loss(loss). Similarly, each metric in the metric dict is passed to the model using train_model.add_metric(metric, name=name, aggregation='mean').

sync_target_model(self, tau=1.0)

Synchronize the target model with the primary model.

Parameters:

- tau : float between 0 and 1, optional. The amount of exponential smoothing to apply in the target update:

  $w_\text{target}\ \leftarrow\ (1 - \tau)\,w_\text{target} + \tau\,w_\text{primary}$

update(self, s, a, advantage)

Update the policy.

Parameters:

- s : state observation. A single state observation.
- a : action. A single action.
- advantage : float. A value for the advantage $$\mathcal{A}(s,a) = q(s,a) - v(s)$$. This may be a sampled and/or estimated version of the true advantage.