Release Notes


  • Made keras-gym compatible with tensorflow v2.0 (unfortunately had to disable eager mode)
  • Added SoftActorCritic class
  • Added frozen_lake/sac script and notebook
  • Added atari/sac script, which is still WIP


Major update: support for Box action spaces.

  • introduced keras_gym.proba_dists sub-module, which implements differentiable probability distributions (incl. differentiable sample() methods)
  • removed policy-based losses in favor of BaseUpdateablePolicy.policy_loss_with_metrics(), which now uses the differentiable ProbaDist objects
  • removed ConjointActorCritic (was redundant)
  • changed how we implement target models: no longer rely on global namespaces; instead we use keras.models.clone_model()
  • changed BaseFunctionApproximator.sync_target_model(): use model.{get,set}_weights()
  • added script and notebook for Pendulum-v0 with PPO
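Differentiable sample() methods for continuous (Box) distributions are typically built on the reparameterization trick: noise is drawn independently of the distribution parameters and then transformed, so gradients can flow through the sample. A minimal numpy sketch of the idea for a Normal distribution (function name hypothetical, not the keras-gym API):

```python
import numpy as np

def sample_normal(mu, sigma, rng=np.random.default_rng(0)):
    # Reparameterization trick: eps is drawn independently of (mu, sigma),
    # so the sample mu + sigma * eps is a differentiable function of the
    # distribution parameters.
    eps = rng.standard_normal(np.shape(mu))
    return mu + sigma * eps
```

In an actual differentiable implementation, mu and sigma would be tensors produced by the policy network, and the same transformation would be expressed in the framework's autodiff ops.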


This is a relatively minor update. Just a couple of small bug fixes.

  • fixed logging, which was broken by abseil (a dependency of tensorflow>=1.14)
  • added enable_logging helper
  • updated some docs


This version is another major overhaul. In particular, the FunctionApproximator class is introduced, which offers a unified interface for all function approximator types, i.e. state(-action) value functions and updateable policies. This makes it a lot easier to create your own custom function approximator: you only have to define your own forward pass by creating a subclass of FunctionApproximator and providing a body method. Further flexibility is provided by allowing the head method(s) to be overridden.
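A schematic of the body/head pattern in plain Python (not actual keras-gym code; in keras-gym the body and heads would be built from Keras layers):

```python
class FunctionApproximator:
    """Sketch of the unified interface: the forward pass is split into a
    shared 'body' (defined by the user) and one or more 'heads'."""
    def body(self, s):
        # subclasses provide the custom forward pass here
        raise NotImplementedError
    def head_v(self, x):
        # default state-value head; can be overridden for extra flexibility
        return sum(x)
    def __call__(self, s):
        return self.head_v(self.body(s))

class DoublingBody(FunctionApproximator):
    # hypothetical example body: scale each input feature by 2
    def body(self, s):
        return [2 * x for x in s]
```

The point of the split is that value functions and policies can share the same body while differing only in their heads.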

  • added FunctionApproximator class
  • refactored value functions and policies to just be a wrapper around a FunctionApproximator object
  • MILESTONE: got AlphaZero to work on ConnectFour (although this game is likely too simple to see the real power of AlphaZero - MCTS on its own works fine)


  • MILESTONE: got PPO working on Atari Pong
  • added PolicyKLDivergence and PolicyEntropy
  • added entropy_beta and ppo_clip_eps kwargs to updateable policies


  • optimized ActorCritic so that S is fed in once instead of three times
  • removed all mention of bootstrap_model
  • implemented PPO with ClippedSurrogateLoss
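The clipped surrogate objective from PPO, which a ClippedSurrogateLoss presumably computes, can be sketched in numpy as follows (a sketch of the formula, not the keras-gym implementation):

```python
import numpy as np

def clipped_surrogate(ratio, adv, eps=0.2):
    # ratio = pi_new(a|s) / pi_old(a|s); adv = advantage estimate.
    # Clipping the ratio to [1 - eps, 1 + eps] and taking the minimum
    # removes the incentive to push the policy too far in one update.
    return np.minimum(ratio * adv, np.clip(ratio, 1 - eps, 1 + eps) * adv)
```

In a Keras loss this objective would be negated (losses are minimized) and averaged over the batch.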


This is the second overhaul, in fact a complete rewrite. Too much of the old scikit-gym structure was standing in the way of progress.

The main thing that changed in this version is that I ditched the notion of an algorithm. Instead, function approximators carry their own “update strategy”. In the case of Q-functions, this is ‘sarsa’, ‘q_learning’ etc., while policies have the options ‘vanilla’, ‘ppo’, etc.

Value functions carry another property that was previously attributed to algorithm objects. This is the bootstrap-n, i.e. the number of steps over which to delay bootstrapping.

This new structure accommodates modularity much better than the old one.
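Concretely, with bootstrap_n set to n the update target is the n-step bootstrapped return. A sketch of that computation (function name hypothetical):

```python
def n_step_return(rewards, bootstrap_value, gamma=0.99):
    # G = r_0 + gamma*r_1 + ... + gamma^(n-1)*r_(n-1) + gamma^n * V(s_n),
    # computed backwards starting from the bootstrap value V(s_n).
    g = bootstrap_value
    for r in reversed(rewards):
        g = r + gamma * g
    return g
```

With bootstrap_n = 1 this reduces to the ordinary one-step TD target r + gamma * V(s').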

  • removed algorithms, replaced by ‘bootstrap_n’ and ‘update_strategy’ settings on function approximators
  • implemented ExperienceReplayBuffer
  • MILESTONE: added DQN implementation for Atari 2600 envs.
  • other than that: too much to mention. It really was a complete rewrite
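An experience replay buffer along these lines is typically a fixed-capacity ring buffer with uniform sampling; a minimal sketch (not the actual keras-gym ExperienceReplayBuffer):

```python
import random

class ReplayBuffer:
    def __init__(self, capacity):
        self.capacity = capacity
        self.storage = []
        self.pos = 0  # next write position (ring buffer)

    def add(self, transition):
        # once full, overwrite the oldest transition
        if len(self.storage) < self.capacity:
            self.storage.append(transition)
        else:
            self.storage[self.pos] = transition
        self.pos = (self.pos + 1) % self.capacity

    def sample(self, batch_size):
        # uniform sampling (without replacement within one batch)
        return random.sample(self.storage, batch_size)
```

Sampling uniformly from old transitions breaks the temporal correlations in the stream of experience, which is what makes DQN-style updates stable.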


  • changed definitions of Q-functions to GenericQ and GenericQTypeII
  • added option for efficient bootstrapped updating (bootstrap_model argument in value functions, see example usage: NStepBootstrapV)
  • renamed ValuePolicy to ValueBasedPolicy


  • implemented base class for updateable policy objects
  • implemented first example of updateable policy: GenericSoftmaxPolicy
  • implemented predefined softmax policy: LinearSoftmaxPolicy
  • added first policy gradient algorithm: Reinforce
  • added REINFORCE example notebook
  • updated documentation
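For a softmax policy, the REINFORCE loss -log pi(a|s) * G has a closed-form gradient with respect to the logits; a numpy sketch of that gradient (function names hypothetical):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))  # subtract max for numerical stability
    return e / e.sum()

def reinforce_logit_grad(logits, action, G):
    # Gradient of -log pi(action) * G with respect to the logits:
    # (pi - one_hot(action)) * G.
    p = softmax(np.asarray(logits, dtype=float))
    onehot = np.zeros_like(p)
    onehot[action] = 1.0
    return (p - onehot) * G
```

In practice the return G would be the (possibly discounted) Monte Carlo return collected per episode, and the gradient step is taken by the framework's optimizer rather than by hand.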


This was a MAJOR overhaul in which I ported everything from scikit-learn to Keras. The reason for this is that I was stuck on the implementation of policy gradient methods due to the lack of flexibility of the scikit-learn ecosystem. I chose Keras as a replacement: it’s nice and modular like scikit-learn, but much more flexible. In particular, the ability to provide custom loss functions has been the main selling point. Another selling point was that some environments require more sophisticated neural nets than a simple MLP, which are readily available in Keras.

  • added compatibility wrapper for scikit-learn function approximators
  • ported all value functions to use keras.Model
  • ported predefined models LinearV and LinearQ to keras
  • ported algorithms to keras
  • ported all notebooks to keras
  • changed the package name to keras-gym and the root module to keras_gym

Other changes:

  • added propensity score outputs to policy objects
  • created a stub for directly updateable policies


  • refactored BaseAlgorithm to simplify implementation (at the cost of more code, but it’s worth it)
  • refactored notebooks: they are now bundled by environment / algo type
  • added n-step bootstrap algorithms:
    • NStepQLearning
    • NStepSarsa
    • NStepExpectedSarsa


  • added algorithm: keras_gym.algorithms.ExpectedSarsa
  • added object: keras_gym.utils.ExperienceCache
  • rewrote MonteCarlo to use ExperienceCache


  • added algorithm: keras_gym.algorithms.MonteCarlo


  • added algorithm: keras_gym.algorithms.Sarsa


  • changed doc theme from sklearn to readthedocs


  • first working implementation of value function + policy + algorithm
  • added first working example in a notebook
  • added algorithm: keras_gym.algorithms.QLearning