rl8.algorithms package

Module contents

Definitions related to PPO algorithms (data collection and training steps).

Algorithms assume environments are parallelized much like IsaacGym environments and are infinite-horizon with no terminal conditions. These assumptions allow learning to proceed extremely quickly, even for complex, sequence-based models, because:

  • Environments run in parallel and their transitions are batched into a contiguous buffer.

  • All environments are reset in parallel after a predetermined horizon is reached.

  • All operations occur on the same device, removing overhead associated with data transfers between devices.
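A minimal end-to-end sketch of how collection and updates are typically interleaved (the iteration count is arbitrary; DiscreteDummyEnv is the dummy environment used in the examples below):

from rl8 import AlgorithmConfig
from rl8.env import DiscreteDummyEnv

algo = AlgorithmConfig().build(DiscreteDummyEnv)
for _ in range(10):
    algo.collect()  # Gather a horizon of transitions from all parallel envs.
    algo.step()     # Run PPO updates on the freshly buffered data.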

class rl8.algorithms.Algorithm(env_cls: EnvFactory, /, config: None | AlgorithmConfig = None)[source]

Bases: GenericAlgorithmBase[AlgorithmHparams, AlgorithmState, Policy]

An optimized feedforward PPO algorithm with common tricks for stabilizing and accelerating learning.

Parameters:
  • env_cls – Highly parallelized environment for sampling experiences. Will be stepped for horizon transitions each Algorithm.collect() call.

  • config – Algorithm config for building a feedforward PPO algorithm. See AlgorithmConfig for all parameters.

Examples

Instantiate an algorithm for a dummy environment and update the underlying policy once.

>>> from rl8 import AlgorithmConfig
>>> from rl8.env import DiscreteDummyEnv
>>> algo = AlgorithmConfig().build(DiscreteDummyEnv)
>>> algo.collect()  
>>> algo.step()  
collect(*, env_config: None | dict[str, Any] = None, deterministic: bool = False) CollectStats[source]

Collect environment transitions and policy samples in a buffer.

This is one of the main Algorithm methods. This is usually called immediately prior to Algorithm.step() to collect experiences used for learning.

The environment is reset immediately prior to collecting transitions according to horizons_per_env_reset. If the environment isn’t reset, then the last observation is used as the initial observation.

This method sets the buffered flag to enable calling of Algorithm.step() so it isn’t called with dummy data.

Parameters:
  • env_config – Optional config to pass to the environment’s reset method. This isn’t used if the environment isn’t scheduled to be reset according to horizons_per_env_reset.

  • deterministic – Whether to sample from the policy deterministically. This is usually False during learning and True during evaluation.

Returns:

Summary statistics related to the collected experiences and policy samples.
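A short usage sketch (deterministic sampling is usually reserved for evaluation; an env_config dict could also be passed, but its keys are entirely environment-specific, so none are shown here):

>>> from rl8 import AlgorithmConfig
>>> from rl8.env import DiscreteDummyEnv
>>> algo = AlgorithmConfig().build(DiscreteDummyEnv)
>>> stats = algo.collect(deterministic=True)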

step() StepStats[source]

Take a step with the algorithm, using collected environment experiences to update the policy.

Returns:

Data associated with the step (losses, loss coefficients, etc.).

validate() None[source]

Validate all the tensor/tensordict shapes within the algorithm.

Helpful when the algorithm is throwing an error on mismatched tensor/tensordict sizes. Call this at least once before running the algorithm for peace of mind.
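For example, a quick sanity check immediately after building an algorithm and before training:

>>> from rl8 import AlgorithmConfig
>>> from rl8.env import DiscreteDummyEnv
>>> algo = AlgorithmConfig().build(DiscreteDummyEnv)
>>> algo.validate()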

class rl8.algorithms.AlgorithmConfig(model: None | ~rl8.models._feedforward.Model = None, model_cls: None | ~rl8.models._feedforward.ModelFactory = None, model_config: None | dict[str, typing.Any] = None, distribution_cls: None | type[rl8.distributions.Distribution] = None, horizon: int = 32, horizons_per_env_reset: int = 1, num_envs: int = 8192, optimizer_cls: type[torch.optim.optimizer.Optimizer] = <class 'torch.optim.adam.Adam'>, optimizer_config: None | dict[str, typing.Any] = None, accumulate_grads: bool = False, enable_amp: bool = False, lr_schedule: None | list[tuple[int, float]] = None, lr_schedule_kind: ~typing.Literal['interp', 'step'] = 'step', entropy_coeff: float = 0.0, entropy_coeff_schedule: None | list[tuple[int, float]] = None, entropy_coeff_schedule_kind: ~typing.Literal['interp', 'step'] = 'step', gae_lambda: float = 0.95, gamma: float = 0.95, sgd_minibatch_size: None | int = None, num_sgd_iters: int = 4, shuffle_minibatches: bool = True, clip_param: float = 0.2, vf_clip_param: float = 5.0, dual_clip_param: None | float = None, vf_coeff: float = 1.0, target_kl_div: None | float = None, max_grad_norm: float = 5.0, normalize_advantages: bool = True, normalize_rewards: bool = True, device: str | ~torch.device | ~typing.Literal['auto'] = 'auto')[source]

Bases: object

Algorithm config for building a feedforward PPO algorithm.

model: None | Model = None

Model instance to use. Mutually exclusive with model_cls.

model_cls: None | ModelFactory = None

Optional custom policy model definition. A model class is provided for you based on the environment instance’s specs if you don’t provide one. Defaults to a simple feedforward neural network.

model_config: None | dict[str, Any] = None

Optional policy model config unpacked into the model during instantiation.

distribution_cls: None | type[rl8.distributions.Distribution] = None

Custom policy action distribution class. If not provided, an action distribution class is inferred from the environment specs. Defaults to a categorical distribution for discrete actions and a normal distribution for continuous actions. Complex actions are not supported by default distributions.

horizon: int = 32

Number of environment transitions to collect during Algorithm.collect(). The environment resets according to horizons_per_env_reset. Buffer size is [B, T] where T = horizon.

horizons_per_env_reset: int = 1

Number of times Algorithm.collect() can be called before resetting Algorithm.env. Increase this for cross-horizon learning. Default 1 resets after every horizon.

num_envs: int = 8192

Number of parallelized environment instances. Determines buffer size [B, T] where B = num_envs.

optimizer_cls

alias of Adam

optimizer_config: None | dict[str, Any] = None

Configuration passed to the optimizer during instantiation.

accumulate_grads: bool = False

Whether to accumulate gradients across minibatches before stepping the optimizer. Increases effective batch size while minimizing memory usage.

enable_amp: bool = False

Whether to enable Automatic Mixed Precision (AMP) for faster and more memory-efficient training.

lr_schedule: None | list[tuple[int, float]] = None

Optional schedule controlling the optimizer’s learning rate over environment transitions. Keeps learning rate constant if not provided.

lr_schedule_kind: Literal['interp', 'step'] = 'step'

Learning rate scheduler type if lr_schedule is provided. Options: "step" (jump and hold) or "interp" (interpolate between values).
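For example, a config with a stepped learning rate schedule (the breakpoints and rates below are hypothetical; each entry is a (number of environment transitions, learning rate) pair):

from rl8 import AlgorithmConfig

config = AlgorithmConfig(
    # Hold 1e-3 until 1M transitions have been collected, then drop to 1e-4.
    lr_schedule=[(0, 1e-3), (1_000_000, 1e-4)],
    lr_schedule_kind="step",
)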

entropy_coeff: float = 0.0

Entropy coefficient weight in total loss. Ignored if entropy_coeff_schedule is provided.

entropy_coeff_schedule: None | list[tuple[int, float]] = None

Optional schedule overriding entropy_coeff based on number of environment transitions.

entropy_coeff_schedule_kind: Literal['interp', 'step'] = 'step'

Entropy scheduler type. Options: "step" (jump and hold) or "interp" (interpolate between values).

gae_lambda: float = 0.95

Generalized Advantage Estimation (GAE) λ parameter for controlling the bias/variance tradeoff when estimating the state value function from collected environment transitions. A higher value reduces bias at the cost of higher variance, while a lower value reduces variance at the cost of higher bias.

gamma: float = 0.95

Reward discount factor often used in the Bellman operator, controlling how heavily future rewards are weighted relative to immediate rewards. Note that this does not control the bias/variance of the state value estimation; it only controls the weight future rewards have on the total discounted return.
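An illustrative sketch (not rl8's internal implementation) of how gamma and gae_lambda combine in standard GAE over a [B, T] buffer: deltas are one-step TD errors, and advantages are their discounted, λ-weighted sums accumulated backwards in time. No terminal handling is shown since environments are assumed to be infinite-horizon.

import torch

def gae_advantages(
    rewards: torch.Tensor,  # [B, T] collected rewards.
    values: torch.Tensor,   # [B, T + 1] value estimates, including a bootstrap value.
    gamma: float = 0.95,
    gae_lambda: float = 0.95,
) -> torch.Tensor:
    """Illustrative Generalized Advantage Estimation over a [B, T] buffer."""
    B, T = rewards.shape
    advantages = torch.zeros_like(rewards)
    last_advantage = torch.zeros(B)
    for t in reversed(range(T)):
        # One-step TD error at time t.
        delta = rewards[:, t] + gamma * values[:, t + 1] - values[:, t]
        # Discounted, lambda-weighted accumulation of future TD errors.
        last_advantage = delta + gamma * gae_lambda * last_advantage
        advantages[:, t] = last_advantage
    return advantages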

sgd_minibatch_size: None | int = None

PPO hyperparameter for minibatch size during policy update. Larger minibatches reduce update variance and accelerate CUDA computations. If None, the entire buffer is treated as one batch.

num_sgd_iters: int = 4

PPO hyperparameter for number of SGD iterations over the collected buffer.
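A rough worked example of how these two settings determine the number of minibatch passes per Algorithm.step() (default num_envs and horizon; the minibatch size below is hypothetical):

num_envs, horizon = 8192, 32        # Default buffer shape [B, T].
sgd_minibatch_size = 16_384         # Hypothetical minibatch size.
num_sgd_iters = 4                   # Default SGD iterations per step.

buffer_size = num_envs * horizon                           # 262_144 transitions.
minibatches_per_iter = buffer_size // sgd_minibatch_size   # 16 minibatches.
passes_per_step = minibatches_per_iter * num_sgd_iters     # 64 minibatch passes.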

shuffle_minibatches: bool = True

Whether to shuffle minibatches within Algorithm.step(). Recommended, but not necessary if the minibatch size is large enough (e.g., the buffer is the batch).

clip_param: float = 0.2

PPO hyperparameter indicating the max distance the policy can update away from previously collected policy sample data with respect to likelihoods of taking actions conditioned on observations. This is the main innovation of PPO.

vf_clip_param: float = 5.0

PPO hyperparameter similar to clip_param but for the value function estimate. A measure of max distance the model’s value function is allowed to update away from previous value function samples.

dual_clip_param: None | float = None

PPO hyperparameter that clips like clip_param but when advantage estimations are negative. Helps prevent instability for continuous action spaces when policies are making large updates. Leave as None to disable this clip; otherwise, typical values are around 5.
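An illustrative sketch (not rl8's exact loss code) of how clip_param and dual_clip_param enter the standard PPO clipped surrogate objective:

import torch

def clipped_policy_loss(
    advantages: torch.Tensor,         # Advantage estimates.
    ratio: torch.Tensor,              # Probability ratio new_pi / old_pi per sample.
    clip_param: float = 0.2,
    dual_clip_param: float | None = None,
) -> torch.Tensor:
    """Illustrative PPO clipped surrogate loss with optional dual clipping."""
    clipped_ratio = torch.clamp(ratio, 1 - clip_param, 1 + clip_param)
    surrogate = torch.min(ratio * advantages, clipped_ratio * advantages)
    if dual_clip_param is not None:
        # The dual clip only applies when advantages are negative, where the
        # standard clip alone can still yield arbitrarily large-magnitude terms.
        surrogate = torch.where(
            advantages < 0,
            torch.max(surrogate, dual_clip_param * advantages),
            surrogate,
        )
    return -surrogate.mean()  # Negated because optimizers minimize.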

vf_coeff: float = 1.0

Value function loss component weight. Only needs to be tuned when the policy and value function share parameters.

target_kl_div: None | float = None

Target maximum KL divergence when updating the policy. If the approximate KL divergence is greater than this value, then policy updates stop early for that algorithm step. If this is left None, then early stopping doesn’t occur. A higher value means the policy is allowed to diverge more from the previous policy during updates.

max_grad_norm: float = 5.0

Max gradient norm allowed when updating the policy’s model within Algorithm.step().

normalize_advantages: bool = True

Whether to normalize advantages computed for GAE using the batch’s mean and standard deviation. This has been shown to generally improve convergence speed and performance and should usually be True.

normalize_rewards: bool = True

Whether to normalize rewards using reversed discounted returns as described in https://arxiv.org/pdf/2005.12729.pdf. Reward normalization, although not exactly correct or optimal, typically improves convergence speed and performance and should usually be True.
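A rough sketch of return-based reward normalization in the spirit of the linked paper (the conventional formulation, not necessarily rl8's exact implementation): a reversed discounted return is accumulated per environment, and rewards are scaled by the running standard deviation of those returns.

import torch

class ReturnBasedRewardNormalizer:
    """Illustrative reward scaling via reversed discounted returns."""

    def __init__(self, num_envs: int, gamma: float = 0.95) -> None:
        self.gamma = gamma
        self.returns = torch.zeros(num_envs)  # Running discounted return per env.
        self.count = 0
        self.total = 0.0
        self.total_sq = 0.0

    def __call__(self, rewards: torch.Tensor) -> torch.Tensor:
        # Accumulate a reversed (backward-in-time) discounted return per env.
        self.returns = self.gamma * self.returns + rewards
        # Track a naive running variance of those returns across all samples.
        self.count += self.returns.numel()
        self.total += self.returns.sum().item()
        self.total_sq += (self.returns**2).sum().item()
        var = self.total_sq / self.count - (self.total / self.count) ** 2
        std = max(var, 1e-8) ** 0.5
        # Rewards are scaled but not mean-centered.
        return rewards / std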

device: str | device | Literal['auto'] = 'auto'

Device Algorithm.env, Algorithm.buffer, and Algorithm.policy all reside on.

build(env_cls: EnvFactory) Algorithm[source]

Build and validate an Algorithm from a config.
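For example (mirroring the class-level example above; the overridden hyperparameters are arbitrary):

>>> from rl8 import AlgorithmConfig
>>> from rl8.env import DiscreteDummyEnv
>>> algo = AlgorithmConfig(num_envs=4096, horizon=64, gamma=0.99).build(DiscreteDummyEnv)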

class rl8.algorithms.GenericAlgorithmBase[source]

Bases: Generic[_AlgorithmHparams, _AlgorithmState, _Policy]

The base class for PPO algorithm flavors.

buffer: TensorDict

Environment experience buffer used for aggregating environment transition data and policy sample data. The same buffer object is shared whenever using GenericAlgorithmBase.collect(). Buffer dimensions are determined by num_envs and horizon args.
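For instance, the buffer's leading batch dimensions can be inspected directly and should follow the [B, T] = [num_envs, horizon] layout described above:

>>> from rl8 import AlgorithmConfig
>>> from rl8.env import DiscreteDummyEnv
>>> algo = AlgorithmConfig().build(DiscreteDummyEnv)
>>> algo.buffer.batch_size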

buffer_spec: Composite

Tensor spec defining the environment experience buffer components and dimensions. Used for instantiating GenericAlgorithmBase.buffer at GenericAlgorithmBase instantiation and each GenericAlgorithmBase.step() call.

entropy_scheduler: EntropyScheduler

Entropy scheduler for updating the entropy_coeff after each GenericAlgorithmBase.step() call based on the number of environment transitions collected and learned on. By default, the entropy scheduler leaves the entropy coefficient constant; it only updates the coefficient if an entropy_coeff_schedule is provided.

env: Env

Environment used for experience collection within the GenericAlgorithmBase.collect() method. It’s ultimately up to the environment to make learning efficient by parallelizing simulations.

grad_scaler: GradScaler

Used for enabling Automatic Mixed Precision (AMP). Handles gradient scaling for the optimizer. Not all optimizers and hyperparameters are compatible with gradient scaling.

hparams: _AlgorithmHparams

PPO hyperparameters that are constant throughout training and can drastically affect training performance.

lr_scheduler: LRScheduler

Learning rate scheduler for updating the optimizer’s learning rate after each step call based on the number of environment transitions collected and learned on. By default, the learning rate scheduler leaves the optimizer’s learning rate constant; it only alters the learning rate if a learning_rate_schedule is provided.

optimizer: Optimizer

Underlying optimizer for updating the policy’s model parameters. Instantiated from an optimizer_cls and optimizer_config. Defaults to the Adam optimizer with generally well-performing parameters.

policy: _Policy

Policy constructed from the model_cls, model_config, and distribution_cls kwargs. A default policy is constructed according to the environment’s observation and action specs if these policy args aren’t provided. The policy is what does all the action sampling within GenericAlgorithmBase.collect() and is what is updated within GenericAlgorithmBase.step().

state: _AlgorithmState

Algorithm state for determining when to reset the environment, when the policy can be updated, etc.

abstract collect(*, env_config: None | dict[str, Any] = None, deterministic: bool = False) CollectStats[source]

Collect environment transitions and policy samples in a buffer.

This is one of the main GenericAlgorithmBase methods. This is usually called immediately prior to GenericAlgorithmBase.step() to collect experiences used for learning.

The environment is reset immediately prior to collecting transitions according to horizons_per_env_reset. If the environment isn’t reset, then the last observation is used as the initial observation.

This method sets the buffered flag to enable calling of GenericAlgorithmBase.step() so it isn’t called with dummy data.

Parameters:
  • env_config – Optional config to pass to the environment’s reset method. This isn’t used if the environment isn’t scheduled to be reset according to horizons_per_env_reset.

  • deterministic – Whether to sample from the policy deterministically. This is usually False during learning and True during evaluation.

Returns:

Summary statistics related to the collected experiences and policy samples.

property horizons_per_env_reset: int

Number of times GenericAlgorithmBase.collect() can be called before resetting GenericAlgorithmBase.env.

memory_stats() MemoryStats[source]

Return current algorithm memory usage.

property params: dict[str, Any]

Return algorithm parameters.

abstract step() StepStats[source]

Take a step with the algorithm, using collected environment experiences to update the policy.

Returns:

Data associated with the step (losses, loss coefficients, etc.).

class rl8.algorithms.RecurrentAlgorithm(env_cls: EnvFactory, /, config: None | RecurrentAlgorithmConfig = None)[source]

Bases: GenericAlgorithmBase[RecurrentAlgorithmHparams, RecurrentAlgorithmState, RecurrentPolicy]

An optimized recurrent PPO algorithm with common tricks for stabilizing and accelerating learning.

Parameters:
  • env_cls – Highly parallelized environment for sampling experiences. Will be stepped for horizon transitions each RecurrentAlgorithm.collect() call.

  • config – Recurrent algorithm config for building a recurrent PPO algorithm. See RecurrentAlgorithmConfig for all parameters.

Examples

Instantiate an algorithm for a dummy environment and update the underlying policy once.

>>> from rl8 import RecurrentAlgorithmConfig
>>> from rl8.env import DiscreteDummyEnv
>>> algo = RecurrentAlgorithmConfig().build(DiscreteDummyEnv)
>>> algo.collect()  
>>> algo.step()  
collect(*, env_config: None | dict[str, Any] = None, deterministic: bool = False) CollectStats[source]

Collect environment transitions and policy samples in a buffer.

This is one of the main RecurrentAlgorithm methods. This is usually called immediately prior to RecurrentAlgorithm.step() to collect experiences used for learning.

The environment is reset immediately prior to collecting transitions according to horizons_per_env_reset. If the environment isn’t reset, then the last observation is used as the initial observation.

This method sets the buffered flag to enable calling of RecurrentAlgorithm.step() so it isn’t called with dummy data.

Parameters:
  • env_config – Optional config to pass to the environment’s reset method. This isn’t used if the environment isn’t scheduled to be reset according to horizons_per_env_reset.

  • deterministic – Whether to sample from the policy deterministically. This is usually False during learning and True during evaluation.

Returns:

Summary statistics related to the collected experiences and policy samples.

step() StepStats[source]

Take a step with the algorithm, using collected environment experiences to update the policy.

Returns:

Data associated with the step (losses, loss coefficients, etc.).

validate() None[source]

Validate all the tensor/tensordict shapes within the algorithm.

Helpful when the algorithm is throwing an error on mismatched tensor/tensordict sizes. Call this at least once before running the algorithm for peace of mind.

class rl8.algorithms.RecurrentAlgorithmConfig(model: None | ~rl8.models._recurrent.RecurrentModel = None, model_cls: None | ~rl8.models._recurrent.RecurrentModelFactory = None, model_config: None | dict[str, typing.Any] = None, distribution_cls: None | type[rl8.distributions.Distribution] = None, horizon: int = 32, horizons_per_env_reset: int = 1, num_envs: int = 8192, seq_len: int = 4, seqs_per_state_reset: int = 8, optimizer_cls: type[torch.optim.optimizer.Optimizer] = <class 'torch.optim.adam.Adam'>, optimizer_config: None | dict[str, typing.Any] = None, accumulate_grads: bool = False, enable_amp: bool = False, lr_schedule: None | list[tuple[int, float]] = None, lr_schedule_kind: ~typing.Literal['interp', 'step'] = 'step', entropy_coeff: float = 0.0, entropy_coeff_schedule: None | list[tuple[int, float]] = None, entropy_coeff_schedule_kind: ~typing.Literal['interp', 'step'] = 'step', gae_lambda: float = 0.95, gamma: float = 0.95, sgd_minibatch_size: None | int = None, num_sgd_iters: int = 4, shuffle_minibatches: bool = True, clip_param: float = 0.2, vf_clip_param: float = 5.0, dual_clip_param: None | float = None, vf_coeff: float = 1.0, target_kl_div: None | float = None, max_grad_norm: float = 5.0, normalize_advantages: bool = True, normalize_rewards: bool = True, device: str | ~torch.device | ~typing.Literal['auto'] = 'auto')[source]

Bases: object

Recurrent algorithm config for building a recurrent PPO algorithm.

model: None | RecurrentModel = None

Model instance to use. Mutually exclusive with model_cls.

model_cls: None | RecurrentModelFactory = None

Optional custom policy model definition. A model class is provided for you based on the environment instance’s specs if you don’t provide one. Defaults to a simple recurrent neural network.

model_config: None | dict[str, Any] = None

Optional policy model config unpacked into the model during instantiation.

distribution_cls: None | type[rl8.distributions.Distribution] = None

Custom policy action distribution class. If not provided, an action distribution class is inferred from the environment specs. Defaults to a categorical distribution for discrete actions and a normal distribution for continuous actions. Complex actions are not supported by default distributions.

horizon: int = 32

Number of environment transitions to collect during RecurrentAlgorithm.collect(). The environment resets according to horizons_per_env_reset. Buffer size is [B, T] where T = horizon.

horizons_per_env_reset: int = 1

Number of times RecurrentAlgorithm.collect() can be called before resetting RecurrentAlgorithm.env. Increase this for cross-horizon learning. Default 1 resets after every horizon.

num_envs: int = 8192

Number of parallelized environment instances. Determines buffer size [B, T] where B = num_envs.

seq_len: int = 4

Truncated backpropagation through time sequence length. Not necessarily the sequence length the recurrent states are propagated for prior to being reset. This parameter coupled with seqs_per_state_reset controls how many environment transitions are made before recurrent model states are reset or reinitialized.

seqs_per_state_reset: int = 8

Number of sequences made within RecurrentAlgorithm.collect() before recurrent model states are reset or reinitialized. Recurrent model states are never reset or reinitialized if this parameter is negative.
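A rough worked example of how these two settings interact, using the defaults:

seq_len = 4               # Truncated BPTT sequence length (default).
seqs_per_state_reset = 8  # Sequences between recurrent state resets (default).

# Recurrent model states are reset roughly every
# seq_len * seqs_per_state_reset environment transitions, i.e. every
# 32 transitions with the defaults (equal to the default horizon).
transitions_per_state_reset = seq_len * seqs_per_state_reset  # 32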

optimizer_cls

alias of Adam

optimizer_config: None | dict[str, Any] = None

Configuration passed to the optimizer during instantiation.

accumulate_grads: bool = False

Whether to accumulate gradients across minibatches before stepping the optimizer. Increases effective batch size while minimizing memory usage.

enable_amp: bool = False

Whether to enable Automatic Mixed Precision (AMP) for faster and more memory-efficient training.

lr_schedule: None | list[tuple[int, float]] = None

Optional schedule controlling the optimizer’s learning rate over environment transitions. Keeps learning rate constant if not provided.

lr_schedule_kind: Literal['interp', 'step'] = 'step'

Learning rate scheduler type if lr_schedule is provided. Options: "step" (jump and hold) or "interp" (interpolate between values).

entropy_coeff: float = 0.0

Entropy coefficient weight in total loss. Ignored if entropy_coeff_schedule is provided.

entropy_coeff_schedule: None | list[tuple[int, float]] = None

Optional schedule overriding entropy_coeff based on number of environment transitions.

entropy_coeff_schedule_kind: Literal['interp', 'step'] = 'step'

Entropy scheduler type. Options: "step" (jump and hold) or "interp" (interpolate between values).

gae_lambda: float = 0.95

Generalized Advantage Estimation (GAE) λ parameter for controlling the bias/variance tradeoff when estimating the state value function from collected environment transitions. A higher value reduces bias at the cost of higher variance, while a lower value reduces variance at the cost of higher bias.

gamma: float = 0.95

Reward discount factor often used in the Bellman operator, controlling how heavily future rewards are weighted relative to immediate rewards. Note that this does not control the bias/variance of the state value estimation; it only controls the weight future rewards have on the total discounted return.

sgd_minibatch_size: None | int = None

PPO hyperparameter for minibatch size during policy update. Larger minibatches reduce update variance and accelerate CUDA computations. If None, the entire buffer is treated as one batch.

num_sgd_iters: int = 4

PPO hyperparameter for number of SGD iterations over the collected buffer.

shuffle_minibatches: bool = True

Whether to shuffle minibatches within RecurrentAlgorithm.step(). Recommended, but not necessary if the minibatch size is large enough (e.g., the buffer is the batch).

clip_param: float = 0.2

PPO hyperparameter indicating the max distance the policy can update away from previously collected policy sample data with respect to likelihoods of taking actions conditioned on observations. This is the main innovation of PPO.

vf_clip_param: float = 5.0

PPO hyperparameter similar to clip_param but for the value function estimate. A measure of max distance the model’s value function is allowed to update away from previous value function samples.

dual_clip_param: None | float = None

PPO hyperparameter that clips like clip_param but when advantage estimations are negative. Helps prevent instability for continuous action spaces when policies are making large updates. Leave as None to disable this clip; otherwise, typical values are around 5.

vf_coeff: float = 1.0

Value function loss component weight. Only needs to be tuned when the policy and value function share parameters.

target_kl_div: None | float = None

Target maximum KL divergence when updating the policy. If the approximate KL divergence is greater than this value, then policy updates stop early for that algorithm step. If this is left None, then early stopping doesn’t occur. A higher value means the policy is allowed to diverge more from the previous policy during updates.

max_grad_norm: float = 5.0

Max gradient norm allowed when updating the policy’s model within RecurrentAlgorithm.step().

normalize_advantages: bool = True

Whether to normalize advantages computed for GAE using the batch’s mean and standard deviation. This has been shown to generally improve convergence speed and performance and should usually be True.

normalize_rewards: bool = True

Whether to normalize rewards using reversed discounted returns as described in https://arxiv.org/pdf/2005.12729.pdf. Reward normalization, although not exactly correct or optimal, typically improves convergence speed and performance and should usually be True.

device: str | device | Literal['auto'] = 'auto'

Device RecurrentAlgorithm.env, RecurrentAlgorithm.buffer, and RecurrentAlgorithm.policy all reside on.

build(env_cls: EnvFactory) RecurrentAlgorithm[source]

Build and validate a RecurrentAlgorithm from a config.