Actor-Critics¶

keras_gym.ActorCritic
    A generic actor-critic, combining an updateable policy with a value function.

keras_gym.SoftActorCritic
    Implementation of a soft actor-critic (SAC), which uses entropy regularization in the value function as well as in its policy updates.
class keras_gym.ActorCritic(policy, v_func, value_loss_weight=1.0)[source]¶

A generic actor-critic, combining an updateable policy with a value function.

The added value of using an ActorCritic to combine a policy with a value function is that it avoids having to feed in S (potentially very large) three times at training time. Instead, it only feeds it in once.

Parameters:
- policy : Policy object
    An updateable policy object \(\pi(a|s)\).
- v_func : value-function object
    A state value function \(v(s)\).
- value_loss_weight : float, optional
    Relative weight to give to the value-function loss:

    loss = policy_loss + value_loss_weight * value_loss
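For concreteness, here is a minimal construction sketch. The MLP body, the CartPole environment, and the hyperparameter values are illustrative assumptions rather than part of this reference, and it presumes the library's SoftmaxPolicy and V wrappers:

    import gym
    from tensorflow import keras
    import keras_gym as km

    class MLP(km.FunctionApproximator):
        """A small MLP body; an illustrative choice."""
        def body(self, S):
            X = keras.layers.Flatten()(S)
            return keras.layers.Dense(units=16, activation='relu')(X)

    env = gym.make('CartPole-v0')
    func = MLP(env, lr=0.01)

    pi = km.SoftmaxPolicy(func, update_strategy='vanilla')  # updateable policy
    v = km.V(func, gamma=0.9, bootstrap_n=1)                # state value function

    actor_critic = km.ActorCritic(pi, v, value_loss_weight=1.0)

Because pi and v share the same FunctionApproximator body, the combined model feeds S through that body only once per training step.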
__call__(self, s)¶

Draw an action from the current policy \(\pi(a|s)\) and get the expected value \(v(s)\).

Parameters:
- s : state observation
    A single state observation.

Returns:
- a, v : tuple (1d array of floats, float)
    Returns a pair representing \((a, v(s))\).
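A one-line usage sketch, continuing from the construction example above:

    s = env.reset()
    a, v = actor_critic(s)  # sampled action and the value estimate v(s)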
batch_eval(self, S, use_target_model=False)¶

Evaluate the actor-critic on a batch of state observations.

Parameters:
- S : nd array, shape: [batch_size, …]
    A batch of state observations.
- use_target_model : bool, optional
    Whether to use the target_model internally. If False (default), the predict_model is used.

Returns:
- A, V : arrays, shapes: [batch_size, …] and [batch_size]
    A batch of drawn actions and the corresponding state-value estimates, i.e. the batch counterpart of the __call__ output.
batch_update(self, S, A, Rn, In, S_next, A_next=None)¶

Update both actor and critic on a batch of transitions.

Parameters:
- S : nd array, shape: [batch_size, …]
    A batch of state observations.
- A : nd Tensor, shape: [batch_size, …]
    A batch of actions taken.
- Rn : 1d array, dtype: float, shape: [batch_size]
    A batch of partial returns. For example, in n-step bootstrapping this is given by:

    \[R^{(n)}_t\ =\ R_t + \gamma\,R_{t+1} + \dots + \gamma^{n-1}\,R_{t+n-1}\]

    In other words, it's the non-bootstrapped part of the n-step return.
- In : 1d array, dtype: float, shape: [batch_size]
    A batch of bootstrapping factors. For instance, in n-step bootstrapping this is given by \(I^{(n)}_t=\gamma^n\) if the episode is ongoing and \(I^{(n)}_t=0\) otherwise. This allows us to write the bootstrapped target as \(G^{(n)}_t=R^{(n)}_t+I^{(n)}_t\,Q(S_{t+n}, A_{t+n})\).
- S_next : nd array, shape: [batch_size, …]
    A batch of next-state observations.
- A_next : 2d Tensor, shape: [batch_size, …]
    A batch of (potential) next actions. This argument is only used if update_strategy='sarsa'.

Returns:
- losses : dict
    A dict of losses/metrics, of type {name <str>: value <float>}.
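To make the Rn and In arguments concrete, the following NumPy sketch computes both quantities for a single n-step window; the reward values are made up for illustration:

    import numpy as np

    gamma, n = 0.9, 3
    rewards = np.array([1.0, 0.5, 2.0])  # R_t, R_{t+1}, R_{t+2}
    done = False                         # True if the episode ended inside the window

    # Non-bootstrapped part: R_t + gamma*R_{t+1} + ... + gamma**(n-1)*R_{t+n-1}
    Rn = np.sum(gamma ** np.arange(n) * rewards)

    # Bootstrapping factor: gamma**n while the episode is ongoing, 0 otherwise
    In = 0.0 if done else gamma ** n

    # Bootstrapped target: G = Rn + In * Q(S_next, A_next)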
dist_params(self, s)¶

Get the distribution parameters under the current policy \(\pi(a|s)\) along with the expected value \(v(s)\).

Parameters:
- s : state observation
    A single state observation.

Returns:
- dist_params, v : tuple (1d array of floats, float)
    Returns a pair representing the distribution parameters of \(\pi(a|s)\) and the estimated state value \(v(s)\).
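A short usage sketch, assuming the actor_critic from the construction example; for a softmax policy the returned parameters describe the categorical action distribution:

    params, v = actor_critic.dist_params(s)
    # params: 1d array of distribution parameters (one entry per action)
    # v: scalar state-value estimate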
classmethod from_func(function_approximator, gamma=0.9, bootstrap_n=1, bootstrap_with_target_model=False, entropy_beta=0.01, update_strategy='vanilla', random_seed=None)[source]¶

Create an instance directly from a FunctionApproximator object.

Parameters:
- function_approximator : FunctionApproximator object
    The main function approximator.
- gamma : float, optional
    The discount factor for discounting future rewards.
- bootstrap_n : positive int, optional
    The number of steps in n-step bootstrapping. It specifies the number of steps over which we're willing to delay bootstrapping. Large \(n\) corresponds to Monte Carlo updates and \(n=1\) corresponds to TD(0).
- bootstrap_with_target_model : bool, optional
    Whether to use the target_model when constructing a bootstrapped target. If False (default), the primary predict_model is used.
- entropy_beta : float, optional
    The coefficient of the entropy bonus term in the policy objective.
- update_strategy : str or callable, optional
    The strategy for updating our policy. This determines the loss function that we use for our policy function approximator. If you wish to use a custom policy loss, you can override the policy_loss_with_metrics() method.

    Provided options are:

    - 'vanilla'
        Plain vanilla policy gradient. The corresponding (surrogate) loss function that we use is:

        \[J(\theta)\ =\ -\mathcal{A}(s,a)\,\ln\pi(a|s,\theta)\]

    - 'ppo'
        Proximal policy optimization uses a clipped proximal loss:

        \[J(\theta)\ =\ \min\Big( r(\theta)\,\mathcal{A}(s,a)\,,\ \text{clip}\big( r(\theta), 1-\epsilon, 1+\epsilon\big)\,\mathcal{A}(s,a)\Big)\]

        where \(r(\theta)\) is the probability ratio:

        \[r(\theta)\ =\ \frac{\pi(a|s,\theta)}{\pi(a|s,\theta_\text{old})}\]

    - 'cross_entropy'
        Straightforward categorical cross-entropy (from logits). This loss function does not make use of the advantages \(\mathcal{A}(s,a)\). Instead, it minimizes the cross entropy between the behavior policy \(\pi_b(a|s)\) and the learned policy \(\pi_\theta(a|s)\):

        \[J(\theta)\ =\ \hat{\mathbb{E}}_t\left\{ -\sum_a \pi_b(a|S_t)\, \log \pi_\theta(a|S_t) \right\}\]

- random_seed : int, optional
    Sets the random state to get reproducible results.
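A usage sketch, reusing the func FunctionApproximator from the construction example; the hyperparameter values are illustrative:

    actor_critic = km.ActorCritic.from_func(
        func,                   # the shared FunctionApproximator
        gamma=0.99,             # discount factor
        bootstrap_n=1,          # TD(0)-style bootstrapping
        update_strategy='ppo')  # clipped proximal policy loss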
greedy(self, s)¶

Draw a greedy action \(a=\arg\max_{a'}\pi(a'|s)\) and get the expected value \(v(s)\).

Parameters:
- s : state observation
    A single state observation.

Returns:
- a, v : tuple (1d array of floats, float)
    Returns a pair representing \((a, v(s))\).
sync_target_model(self, tau=1.0)¶

Synchronize the target model with the primary model.

Parameters:
- tau : float between 0 and 1, optional
    The amount of exponential smoothing to apply in the target update:

    \[w_\text{target}\ \leftarrow\ (1 - \tau)\,w_\text{target} + \tau\,w_\text{primary}\]
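In plain NumPy the soft update step looks as follows; note that the default tau=1.0 amounts to a hard copy of the primary weights:

    import numpy as np

    tau = 0.1
    w_target = np.array([0.0, 0.0])
    w_primary = np.array([1.0, 2.0])

    # w_target <- (1 - tau) * w_target + tau * w_primary
    w_target = (1 - tau) * w_target + tau * w_primary  # -> [0.1, 0.2]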
update(self, s, a, r, done)¶

Update both actor and critic.

Parameters:
- s : state observation
    A single state observation.
- a : action
    A single action.
- r : float
    A single observed reward.
- done : bool
    Whether the episode has finished.
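A minimal training-loop sketch built around update(); the episode and step budgets are arbitrary, and env and actor_critic are assumed from the earlier examples:

    num_episodes, max_steps = 200, 500  # illustrative budgets

    for episode in range(num_episodes):
        s = env.reset()
        for t in range(max_steps):
            a, v = actor_critic(s)              # sample action, get value estimate
            s_next, r, done, info = env.step(a)
            actor_critic.update(s, a, r, done)  # single-transition update
            if done:
                break
            s = s_next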
class keras_gym.SoftActorCritic(policy, v_func, q_func1, q_func2, value_loss_weight=1.0)[source]¶

Implementation of a soft actor-critic (SAC), which uses entropy regularization in the value function as well as in its policy updates.

Parameters:
- policy : a policy object
    An updateable policy object \(\pi(a|s)\).
- v_func : v-function object
    A state value function \(v(s)\). This is used as the entropy-regularized value function (critic).
- q_func1 : q-function object
    A type-I state-action value function. This is used as the target for both the policy (actor) and the state value function (critic).
- q_func2 : q-function object
    Same as q_func1. SAC uses two q-functions to avoid overfitting due to overly optimistic value estimates.
- value_loss_weight : float, optional
    Relative weight to give to the value-function loss:

    loss = policy_loss + value_loss_weight * value_loss
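A composition sketch under the assumption that all four components share one FunctionApproximator func built for a continuous-action environment; GaussianPolicy and QTypeI are assumed names for the library's continuous-action policy and type-I q-function wrappers, so treat the constructors as illustrative:

    pi = km.GaussianPolicy(func)       # updateable policy pi(a|s); assumed wrapper
    v = km.V(func, gamma=0.99)         # entropy-regularized state value function
    q1 = km.QTypeI(func, gamma=0.99)   # twin q-functions: using two of them
    q2 = km.QTypeI(func, gamma=0.99)   # counteracts over-optimistic targets

    sac = km.SoftActorCritic(pi, v, q1, q2, value_loss_weight=1.0)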
__call__(self, s)¶

Draw an action from the current policy \(\pi(a|s)\) and get the expected value \(v(s)\).

Parameters:
- s : state observation
    A single state observation.

Returns:
- a, v : tuple (1d array of floats, float)
    Returns a pair representing \((a, v(s))\).
batch_eval(self, S, use_target_model=False)¶

Evaluate the actor-critic on a batch of state observations.

Parameters:
- S : nd array, shape: [batch_size, …]
    A batch of state observations.
- use_target_model : bool, optional
    Whether to use the target_model internally. If False (default), the predict_model is used.

Returns:
- A, V : arrays, shapes: [batch_size, …] and [batch_size]
    A batch of drawn actions and the corresponding state-value estimates, i.e. the batch counterpart of the __call__ output.
batch_update(self, S, A, Rn, In, S_next, A_next=None)[source]¶

Update both actor and critic on a batch of transitions.

Parameters:
- S : nd array, shape: [batch_size, …]
    A batch of state observations.
- A : nd Tensor, shape: [batch_size, …]
    A batch of actions taken.
- Rn : 1d array, dtype: float, shape: [batch_size]
    A batch of partial returns. For example, in n-step bootstrapping this is given by:

    \[R^{(n)}_t\ =\ R_t + \gamma\,R_{t+1} + \dots + \gamma^{n-1}\,R_{t+n-1}\]

    In other words, it's the non-bootstrapped part of the n-step return.
- In : 1d array, dtype: float, shape: [batch_size]
    A batch of bootstrapping factors. For instance, in n-step bootstrapping this is given by \(I^{(n)}_t=\gamma^n\) if the episode is ongoing and \(I^{(n)}_t=0\) otherwise. This allows us to write the bootstrapped target as \(G^{(n)}_t=R^{(n)}_t+I^{(n)}_t\,Q(S_{t+n}, A_{t+n})\).
- S_next : nd array, shape: [batch_size, …]
    A batch of next-state observations.
- A_next : 2d Tensor, shape: [batch_size, …]
    A batch of (potential) next actions. This argument is only used if update_strategy='sarsa'.

Returns:
- losses : dict
    A dict of losses/metrics, of type {name <str>: value <float>}.
dist_params(self, s)¶

Get the distribution parameters under the current policy \(\pi(a|s)\) along with the expected value \(v(s)\).

Parameters:
- s : state observation
    A single state observation.

Returns:
- dist_params, v : tuple (1d array of floats, float)
    Returns a pair representing the distribution parameters of \(\pi(a|s)\) and the estimated state value \(v(s)\).
classmethod from_func(function_approximator, gamma=0.9, bootstrap_n=1, q_type=None, entropy_beta=0.01, random_seed=None)[source]¶

Create an instance directly from a FunctionApproximator object.

Parameters:
- function_approximator : FunctionApproximator object
    The main function approximator.
- gamma : float, optional
    The discount factor for discounting future rewards.
- bootstrap_n : positive int, optional
    The number of steps in n-step bootstrapping. It specifies the number of steps over which we're willing to delay bootstrapping. Large \(n\) corresponds to Monte Carlo updates and \(n=1\) corresponds to TD(0).
- q_type : 1 or 2, optional
    Whether to model the q-function as type-I or type-II. This defaults to type-II for discrete action spaces and type-I otherwise.
- entropy_beta : float, optional
    The coefficient of the entropy bonus term in the policy objective.
- random_seed : int, optional
    Sets the random state to get reproducible results.
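A usage sketch; the Pendulum environment, the MLP body, and the hyperparameter values are illustrative assumptions:

    import gym
    from tensorflow import keras
    import keras_gym as km

    class MLP(km.FunctionApproximator):
        """A small MLP body; an illustrative choice."""
        def body(self, S):
            X = keras.layers.Flatten()(S)
            return keras.layers.Dense(units=64, activation='relu')(X)

    env = gym.make('Pendulum-v0')  # continuous action space
    func = MLP(env, lr=1e-3)

    sac = km.SoftActorCritic.from_func(
        func, gamma=0.99, bootstrap_n=1, entropy_beta=0.01)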
greedy(self, s)¶

Draw a greedy action \(a=\arg\max_{a'}\pi(a'|s)\) and get the expected value \(v(s)\).

Parameters:
- s : state observation
    A single state observation.

Returns:
- a, v : tuple (1d array of floats, float)
    Returns a pair representing \((a, v(s))\).
sync_target_model(self, tau=1.0)¶

Synchronize the target model with the primary model.

Parameters:
- tau : float between 0 and 1, optional
    The amount of exponential smoothing to apply in the target update:

    \[w_\text{target}\ \leftarrow\ (1 - \tau)\,w_\text{target} + \tau\,w_\text{primary}\]
update(self, s, a, r, done)¶

Update both actor and critic.

Parameters:
- s : state observation
    A single state observation.
- a : action
    A single action.
- r : float
    A single observed reward.
- done : bool
    Whether the episode has finished.