rl8.algorithms package
Module contents
Definitions related to PPO algorithms (data collection and training steps).
Algorithms assume environments are parallelized much like IsaacGym environments and are infinite-horizon with no terminal conditions. These assumptions allow the learning procedure to occur extremely fast even for complex, sequence-based models because:
Environments run in parallel and their transitions are batched into a contiguous buffer.
All environments are reset in parallel after a predetermined horizon is reached.
All operations occur on the same device, removing overhead associated with data transfers between devices.
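For example, training alternates between collecting experiences and updating the policy. A minimal sketch using the feedforward algorithm documented below (the iteration count is arbitrary):
>>> from rl8 import AlgorithmConfig
>>> from rl8.env import DiscreteDummyEnv
>>> algo = AlgorithmConfig().build(DiscreteDummyEnv)
>>> for _ in range(10):
...     _ = algo.collect()  # fill the buffer with transitions
...     _ = algo.step()     # update the policy on the buffered data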
- class rl8.algorithms.Algorithm(env_cls: EnvFactory, /, config: None | AlgorithmConfig = None)[source]
Bases: GenericAlgorithmBase[AlgorithmHparams, AlgorithmState, Policy]
An optimized feedforward PPO algorithm with common tricks for stabilizing and accelerating learning.
- Parameters:
env_cls – Highly parallelized environment for sampling experiences. Will be stepped for horizon each Algorithm.collect() call.
config – Algorithm config for building a feedforward PPO algorithm. See AlgorithmConfig for all parameters.
Examples
Instantiate an algorithm for a dummy environment and update the underlying policy once.
>>> from rl8 import AlgorithmConfig
>>> from rl8.env import DiscreteDummyEnv
>>> algo = AlgorithmConfig().build(DiscreteDummyEnv)
>>> algo.collect()
>>> algo.step()
- collect(*, env_config: None | dict[str, Any] = None, deterministic: bool = False) CollectStats[source]
Collect environment transitions and policy samples in a buffer.
This is one of the main Algorithm methods. This is usually called immediately prior to Algorithm.step() to collect experiences used for learning.
The environment is reset immediately prior to collecting transitions according to horizons_per_env_reset. If the environment isn't reset, then the last observation is used as the initial observation.
This method sets the buffered flag to enable calling of Algorithm.step() so it isn't called with dummy data.
- Parameters:
env_config – Optional config to pass to the environment's reset method. This isn't used if the environment isn't scheduled to be reset according to horizons_per_env_reset.
deterministic – Whether to sample from the policy deterministically. This is usually False during learning and True during evaluation.
- Returns:
Summary statistics related to the collected experiences and policy samples.
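For example, a deterministic evaluation-style collection that also forwards a config to the environment's reset method (the "difficulty" key below is hypothetical and environment-specific):
>>> from rl8 import AlgorithmConfig
>>> from rl8.env import DiscreteDummyEnv
>>> algo = AlgorithmConfig().build(DiscreteDummyEnv)
>>> stats = algo.collect(env_config={"difficulty": 1}, deterministic=True)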
- class rl8.algorithms.AlgorithmConfig(model: None | ~rl8.models._feedforward.Model = None, model_cls: None | ~rl8.models._feedforward.ModelFactory = None, model_config: None | dict[str, typing.Any] = None, distribution_cls: None | type[rl8.distributions.Distribution] = None, horizon: int = 32, horizons_per_env_reset: int = 1, num_envs: int = 8192, optimizer_cls: type[torch.optim.optimizer.Optimizer] = <class 'torch.optim.adam.Adam'>, optimizer_config: None | dict[str, typing.Any] = None, accumulate_grads: bool = False, enable_amp: bool = False, lr_schedule: None | list[tuple[int, float]] = None, lr_schedule_kind: ~typing.Literal['interp', 'step'] = 'step', entropy_coeff: float = 0.0, entropy_coeff_schedule: None | list[tuple[int, float]] = None, entropy_coeff_schedule_kind: ~typing.Literal['interp', 'step'] = 'step', gae_lambda: float = 0.95, gamma: float = 0.95, sgd_minibatch_size: None | int = None, num_sgd_iters: int = 4, shuffle_minibatches: bool = True, clip_param: float = 0.2, vf_clip_param: float = 5.0, dual_clip_param: None | float = None, vf_coeff: float = 1.0, target_kl_div: None | float = None, max_grad_norm: float = 5.0, normalize_advantages: bool = True, normalize_rewards: bool = True, device: str | ~torch.device | ~typing.Literal['auto'] = 'auto')[source]
Bases: object
Algorithm config for building a feedforward PPO algorithm.
- model_cls: None | ModelFactory = None
Optional custom policy model definition. A model class is provided for you based on the environment instance’s specs if you don’t provide one. Defaults to a simple feedforward neural network.
- model_config: None | dict[str, Any] = None
Optional policy model config unpacked into the model during instantiation.
- distribution_cls: None | type[rl8.distributions.Distribution] = None
Custom policy action distribution class. If not provided, an action distribution class is inferred from the environment specs. Defaults to a categorical distribution for discrete actions and a normal distribution for continuous actions. Complex actions are not supported by default distributions.
- horizon: int = 32
Number of environment transitions to collect during Algorithm.collect(). The environment resets according to horizons_per_env_reset. Buffer size is [B, T] where T = horizon.
- horizons_per_env_reset: int = 1
Number of times Algorithm.collect() can be called before resetting Algorithm.env. Increase this for cross-horizon learning. The default of 1 resets after every horizon.
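For example (illustrative values), resetting only every fourth collected horizon so the policy learns across horizon boundaries:
>>> from rl8 import AlgorithmConfig
>>> from rl8.env import DiscreteDummyEnv
>>> algo = AlgorithmConfig(horizon=32, horizons_per_env_reset=4).build(DiscreteDummyEnv)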
- num_envs: int = 8192
Number of parallelized environment instances. Determines buffer size [B, T] where B = num_envs.
- optimizer_config: None | dict[str, Any] = None
Configuration passed to the optimizer during instantiation.
- accumulate_grads: bool = False
Whether to accumulate gradients across minibatches before stepping the optimizer. Increases effective batch size while minimizing memory usage.
- enable_amp: bool = False
Whether to enable Automatic Mixed Precision (AMP) for faster and more memory-efficient training.
- lr_schedule: None | list[tuple[int, float]] = None
Optional schedule controlling the optimizer’s learning rate over environment transitions. Keeps learning rate constant if not provided.
- lr_schedule_kind: Literal['interp', 'step'] = 'step'
Learning rate scheduler type if lr_schedule is provided. Options: "step" (jump and hold) or "interp" (interpolate between values).
- entropy_coeff: float = 0.0
Entropy coefficient weight in total loss. Ignored if entropy_coeff_schedule is provided.
- entropy_coeff_schedule: None | list[tuple[int, float]] = None
Optional schedule overriding entropy_coeff based on the number of environment transitions.
- entropy_coeff_schedule_kind: Literal['interp', 'step'] = 'step'
Entropy scheduler type. Options: "step" (jump and hold) or "interp" (interpolate between values).
- gae_lambda: float = 0.95
Generalized Advantage Estimation (GAE) λ parameter for controlling the variance and bias tradeoff when estimating the state value function from collected environment transitions. A higher value allows higher variance but lower bias, while a lower value allows lower variance but higher bias.
- gamma: float = 0.95
Reward discount factor, often used in the Bellman operator, for controlling the tradeoff between immediate and future rewards in collected experiences. Note, this does not control the bias/variance of the state value estimation and only controls the weight future rewards have on the total discounted return.
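For intuition, a minimal sketch of the standard discounted return that gamma weights (illustration only, not the library's internal implementation):
def discounted_return(rewards: list[float], gamma: float = 0.95) -> float:
    # Sum of gamma**k * r_k over a reward sequence.
    return sum((gamma**k) * reward for k, reward in enumerate(rewards))

discounted_return([1.0, 1.0, 1.0])  # 1 + 0.95 + 0.9025 = 2.8525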
- sgd_minibatch_size: None | int = None
PPO hyperparameter for minibatch size during policy update. Larger minibatches reduce update variance and accelerate CUDA computations. If None, the entire buffer is treated as one batch.
- shuffle_minibatches: bool = True
Whether to shuffle minibatches within Algorithm.step(). Recommended, but not necessary if the minibatch size is large enough (e.g., the buffer is the batch).
- clip_param: float = 0.2
PPO hyperparameter bounding how far the updated policy's action likelihoods (conditioned on observations) can move away from those of the policy that collected the samples. This is the main innovation of PPO.
- vf_clip_param: float = 5.0
PPO hyperparameter similar to clip_param but for the value function estimate. A measure of the max distance the model's value function is allowed to update away from previous value function samples.
- dual_clip_param: None | float = None
PPO hyperparameter that clips like clip_param but when advantage estimations are negative. Helps prevent instability for continuous action spaces when policies are making large updates. Leave None for this clip to not apply. Otherwise, typical values are around 5.
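For reference, a minimal sketch of the clipped PPO surrogate policy loss these parameters control, with dual clipping applied to negative advantages; this is illustrative and not necessarily the library's exact implementation:
import torch

def ppo_policy_loss(
    logp: torch.Tensor,       # action log probs under the current policy
    old_logp: torch.Tensor,   # action log probs under the data-collecting policy
    advantages: torch.Tensor,
    clip_param: float = 0.2,
    dual_clip_param: float | None = None,
) -> torch.Tensor:
    ratio = torch.exp(logp - old_logp)
    clipped = torch.clamp(ratio, 1 - clip_param, 1 + clip_param)
    surrogate = torch.min(ratio * advantages, clipped * advantages)
    if dual_clip_param is not None:
        # Dual clip: bound the surrogate from below when advantages are negative.
        surrogate = torch.where(
            advantages < 0,
            torch.max(surrogate, dual_clip_param * advantages),
            surrogate,
        )
    return -surrogate.mean()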
- vf_coeff: float = 1.0
Value function loss component weight. Only needs to be tuned when the policy and value function share parameters.
- target_kl_div: None | float = None
Target maximum KL divergence when updating the policy. If approximate KL divergence is greater than this value, then policy updates stop early for that algorithm step. If this is left None, then early stopping doesn't occur. A higher value means the policy is allowed to diverge more from the previous policy during updates.
- max_grad_norm: float = 5.0
Max gradient norm allowed when updating the policy's model within Algorithm.step().
- normalize_advantages: bool = True
Whether to normalize advantages computed for GAE using the batch's mean and standard deviation. This has been shown to generally improve convergence speed and performance and should usually be True.
- normalize_rewards: bool = True
Whether to normalize rewards using reversed discounted returns as from https://arxiv.org/pdf/2005.12729.pdf. Reward normalization, although not exactly correct and optimal, typically improves convergence speed and performance and should usually be True.
- device: str | device | Literal['auto'] = 'auto'
Device Algorithm.env, Algorithm.buffer, and Algorithm.policy all reside on.
- build(env_cls: EnvFactory) Algorithm[source]
Build and validate an Algorithm from a config.
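As a sketch of building an algorithm with non-default hyperparameters (values are illustrative, not recommendations):
>>> from rl8 import AlgorithmConfig
>>> from rl8.env import DiscreteDummyEnv
>>> config = AlgorithmConfig(
...     num_envs=4096,
...     horizon=64,
...     gamma=0.99,
...     sgd_minibatch_size=8192,
...     device="cpu",
... )
>>> algo = config.build(DiscreteDummyEnv)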
- class rl8.algorithms.GenericAlgorithmBase[source]
Bases: Generic[_AlgorithmHparams, _AlgorithmState, _Policy]
The base class for PPO algorithm flavors.
- buffer: TensorDict
Environment experience buffer used for aggregating environment transition data and policy sample data. The same buffer object is shared whenever using GenericAlgorithmBase.collect(). Buffer dimensions are determined by num_envs and horizon args.
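For example, a quick way to inspect the buffer's batch dimensions after a collection (assuming TensorDict's batch_size attribute; the value should follow the [num_envs, horizon] layout described above):
>>> from rl8 import AlgorithmConfig
>>> from rl8.env import DiscreteDummyEnv
>>> algo = AlgorithmConfig(num_envs=4, horizon=8).build(DiscreteDummyEnv)
>>> _ = algo.collect()
>>> algo.buffer.batch_size  # expected to follow [num_envs, horizon]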
- buffer_spec: Composite
Tensor spec defining the environment experience buffer components and dimensions. Used for instantiating GenericAlgorithmBase.buffer at GenericAlgorithmBase instantiation and each GenericAlgorithmBase.step() call.
- entropy_scheduler: EntropyScheduler
Entropy scheduler for updating the entropy_coeff after each GenericAlgorithmBase.step() call based on the number of environment transitions collected and learned on. By default, the entropy scheduler does not actually update the entropy coefficient; it only does so if an entropy_coeff_schedule is provided.
- env: Env
Environment used for experience collection within the GenericAlgorithmBase.collect() method. It's ultimately up to the environment to make learning efficient by parallelizing simulations.
- grad_scaler: GradScaler
Used for enabling Automatic Mixed Precision (AMP). Handles gradient scaling for the optimizer. Not all optimizers and hyperparameters are compatible with gradient scaling.
- hparams: _AlgorithmHparams
PPO hyperparameters that’re constant throughout training and can drastically affect training performance.
- lr_scheduler: LRScheduler
Learning rate scheduler for updating the optimizer learning rate after each step call based on the number of environment transitions collected and learned on. By default, the learning rate scheduler does not alter the optimizer learning rate (it leaves it constant); it only alters the learning rate if an lr_schedule is provided.
- optimizer: Optimizer
Underlying optimizer for updating the policy's model parameters. Instantiated from an optimizer_cls and optimizer_config. Defaults to the Adam optimizer with generally well-performing parameters.
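As a sketch, swapping in a different torch.optim optimizer via the config (the specific optimizer and values here are arbitrary choices, not recommendations):
>>> import torch
>>> from rl8 import AlgorithmConfig
>>> from rl8.env import DiscreteDummyEnv
>>> config = AlgorithmConfig(
...     optimizer_cls=torch.optim.AdamW,
...     optimizer_config={"lr": 1e-3, "weight_decay": 1e-2},
... )
>>> algo = config.build(DiscreteDummyEnv)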
- policy: _Policy
Policy constructed from the model_cls, model_config, and distribution_cls kwargs. A default policy is constructed according to the environment's observation and action specs if these policy args aren't provided. The policy is what does all the action sampling within GenericAlgorithmBase.collect() and is what is updated within GenericAlgorithmBase.step().
- state: _AlgorithmState
Algorithm state for determining when to reset the environment, when the policy can be updated, etc.
- abstract collect(*, env_config: None | dict[str, Any] = None, deterministic: bool = False) CollectStats[source]
Collect environment transitions and policy samples in a buffer.
This is one of the main GenericAlgorithmBase methods. This is usually called immediately prior to GenericAlgorithmBase.step() to collect experiences used for learning.
The environment is reset immediately prior to collecting transitions according to horizons_per_env_reset. If the environment isn't reset, then the last observation is used as the initial observation.
This method sets the buffered flag to enable calling of GenericAlgorithmBase.step() so it isn't called with dummy data.
- Parameters:
env_config – Optional config to pass to the environment's reset method. This isn't used if the environment isn't scheduled to be reset according to horizons_per_env_reset.
deterministic – Whether to sample from the policy deterministically. This is usually False during learning and True during evaluation.
- Returns:
Summary statistics related to the collected experiences and policy samples.
- property horizons_per_env_reset: int
Number of times GenericAlgorithmBase.collect() can be called before resetting GenericAlgorithmBase.env.
- class rl8.algorithms.RecurrentAlgorithm(env_cls: EnvFactory, /, config: None | RecurrentAlgorithmConfig = None)[source]
Bases: GenericAlgorithmBase[RecurrentAlgorithmHparams, RecurrentAlgorithmState, RecurrentPolicy]
An optimized recurrent PPO algorithm with common tricks for stabilizing and accelerating learning.
- Parameters:
env_cls – Highly parallelized environment for sampling experiences. Instantiated with env_config. Will be stepped for horizon each RecurrentAlgorithm.collect() call.
config – Recurrent algorithm config for building a recurrent PPO algorithm. See RecurrentAlgorithmConfig for all parameters.
Examples
Instantiate an algorithm for a dummy environment and update the underlying policy once.
>>> from rl8 import RecurrentAlgorithmConfig
>>> from rl8.env import DiscreteDummyEnv
>>> algo = RecurrentAlgorithmConfig().build(DiscreteDummyEnv)
>>> algo.collect()
>>> algo.step()
- collect(*, env_config: None | dict[str, Any] = None, deterministic: bool = False) CollectStats[source]
Collect environment transitions and policy samples in a buffer.
This is one of the main RecurrentAlgorithm methods. This is usually called immediately prior to RecurrentAlgorithm.step() to collect experiences used for learning.
The environment is reset immediately prior to collecting transitions according to horizons_per_env_reset. If the environment isn't reset, then the last observation is used as the initial observation.
This method sets the buffered flag to enable calling of RecurrentAlgorithm.step() so it isn't called with dummy data.
- Parameters:
env_config – Optional config to pass to the environment's reset method. This isn't used if the environment isn't scheduled to be reset according to horizons_per_env_reset.
deterministic – Whether to sample from the policy deterministically. This is usually False during learning and True during evaluation.
- Returns:
Summary statistics related to the collected experiences and policy samples.
- class rl8.algorithms.RecurrentAlgorithmConfig(model: None | ~rl8.models._recurrent.RecurrentModel = None, model_cls: None | ~rl8.models._recurrent.RecurrentModelFactory = None, model_config: None | dict[str, typing.Any] = None, distribution_cls: None | type[rl8.distributions.Distribution] = None, horizon: int = 32, horizons_per_env_reset: int = 1, num_envs: int = 8192, seq_len: int = 4, seqs_per_state_reset: int = 8, optimizer_cls: type[torch.optim.optimizer.Optimizer] = <class 'torch.optim.adam.Adam'>, optimizer_config: None | dict[str, typing.Any] = None, accumulate_grads: bool = False, enable_amp: bool = False, lr_schedule: None | list[tuple[int, float]] = None, lr_schedule_kind: ~typing.Literal['interp', 'step'] = 'step', entropy_coeff: float = 0.0, entropy_coeff_schedule: None | list[tuple[int, float]] = None, entropy_coeff_schedule_kind: ~typing.Literal['interp', 'step'] = 'step', gae_lambda: float = 0.95, gamma: float = 0.95, sgd_minibatch_size: None | int = None, num_sgd_iters: int = 4, shuffle_minibatches: bool = True, clip_param: float = 0.2, vf_clip_param: float = 5.0, dual_clip_param: None | float = None, vf_coeff: float = 1.0, target_kl_div: None | float = None, max_grad_norm: float = 5.0, normalize_advantages: bool = True, normalize_rewards: bool = True, device: str | ~torch.device | ~typing.Literal['auto'] = 'auto')[source]
Bases: object
Recurrent algorithm config for building a recurrent PPO algorithm.
- model: None | RecurrentModel = None
Model instance to use. Mutually exclusive with model_cls.
- model_cls: None | RecurrentModelFactory = None
Optional custom policy model definition. A model class is provided for you based on the environment instance’s specs if you don’t provide one. Defaults to a simple feedforward neural network.
- model_config: None | dict[str, Any] = None
Optional policy model config unpacked into the model during instantiation.
- distribution_cls: None | type[rl8.distributions.Distribution] = None
Custom policy action distribution class. If not provided, an action distribution class is inferred from the environment specs. Defaults to a categorical distribution for discrete actions and a normal distribution for continuous actions. Complex actions are not supported by default distributions.
- horizon: int = 32
Number of environment transitions to collect during RecurrentAlgorithm.collect(). The environment resets according to horizons_per_env_reset. Buffer size is [B, T] where T = horizon.
- horizons_per_env_reset: int = 1
Number of times RecurrentAlgorithm.collect() can be called before resetting RecurrentAlgorithm.env. Increase this for cross-horizon learning. The default of 1 resets after every horizon.
- num_envs: int = 8192
Number of parallelized environment instances. Determines buffer size [B, T] where B = num_envs.
- seq_len: int = 4
Truncated backpropagation through time sequence length. Not necessarily the sequence length the recurrent states are propagated for prior to being reset. This parameter coupled with seqs_per_state_reset controls how many environment transitions are made before recurrent model states are reset or reinitialized.
- seqs_per_state_reset: int = 8
Number of sequences made within RecurrentAlgorithm.collect() before recurrent model states are reset or reinitialized. Recurrent model states are never reset or reinitialized if this parameter is negative.
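As an illustrative sketch (values are arbitrary): with seq_len=4 and seqs_per_state_reset=8, recurrent states would be reinitialized roughly every 32 environment transitions, assuming the two parameters combine as described above:
>>> from rl8 import RecurrentAlgorithmConfig
>>> from rl8.env import DiscreteDummyEnv
>>> config = RecurrentAlgorithmConfig(
...     horizon=32,
...     seq_len=4,
...     seqs_per_state_reset=8,
... )
>>> algo = config.build(DiscreteDummyEnv)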
- optimizer_config: None | dict[str, Any] = None
Configuration passed to the optimizer during instantiation.
- accumulate_grads: bool = False
Whether to accumulate gradients across minibatches before stepping the optimizer. Increases effective batch size while minimizing memory usage.
- enable_amp: bool = False
Whether to enable Automatic Mixed Precision (AMP) for faster and more memory-efficient training.
- lr_schedule: None | list[tuple[int, float]] = None
Optional schedule controlling the optimizer’s learning rate over environment transitions. Keeps learning rate constant if not provided.
- lr_schedule_kind: Literal['interp', 'step'] = 'step'
Learning rate scheduler type if lr_schedule is provided. Options: "step" (jump and hold) or "interp" (interpolate between values).
- entropy_coeff: float = 0.0
Entropy coefficient weight in total loss. Ignored if entropy_coeff_schedule is provided.
- entropy_coeff_schedule: None | list[tuple[int, float]] = None
Optional schedule overriding entropy_coeff based on the number of environment transitions.
- entropy_coeff_schedule_kind: Literal['interp', 'step'] = 'step'
Entropy scheduler type. Options: "step" (jump and hold) or "interp" (interpolate between values).
- gae_lambda: float = 0.95
Generalized Advantage Estimation (GAE) λ parameter for controlling the variance and bias tradeoff when estimating the state value function from collected environment transitions. A higher value allows higher variance but lower bias, while a lower value allows lower variance but higher bias.
- gamma: float = 0.95
Reward discount factor, often used in the Bellman operator, for controlling the tradeoff between immediate and future rewards in collected experiences. Note, this does not control the bias/variance of the state value estimation and only controls the weight future rewards have on the total discounted return.
- sgd_minibatch_size: None | int = None
PPO hyperparameter for minibatch size during policy update. Larger minibatches reduce update variance and accelerate CUDA computations. If None, the entire buffer is treated as one batch.
- shuffle_minibatches: bool = True
Whether to shuffle minibatches within RecurrentAlgorithm.step(). Recommended, but not necessary if the minibatch size is large enough (e.g., the buffer is the batch).
- clip_param: float = 0.2
PPO hyperparameter bounding how far the updated policy's action likelihoods (conditioned on observations) can move away from those of the policy that collected the samples. This is the main innovation of PPO.
- vf_clip_param: float = 5.0
PPO hyperparameter similar to clip_param but for the value function estimate. A measure of the max distance the model's value function is allowed to update away from previous value function samples.
- dual_clip_param: None | float = None
PPO hyperparameter that clips like clip_param but when advantage estimations are negative. Helps prevent instability for continuous action spaces when policies are making large updates. Leave None for this clip to not apply. Otherwise, typical values are around 5.
- vf_coeff: float = 1.0
Value function loss component weight. Only needs to be tuned when the policy and value function share parameters.
- target_kl_div: None | float = None
Target maximum KL divergence when updating the policy. If approximate KL divergence is greater than this value, then policy updates stop early for that algorithm step. If this is left None, then early stopping doesn't occur. A higher value means the policy is allowed to diverge more from the previous policy during updates.
- max_grad_norm: float = 5.0
Max gradient norm allowed when updating the policy's model within RecurrentAlgorithm.step().
- normalize_advantages: bool = True
Whether to normalize advantages computed for GAE using the batch's mean and standard deviation. This has been shown to generally improve convergence speed and performance and should usually be True.
- normalize_rewards: bool = True
Whether to normalize rewards using reversed discounted returns as from https://arxiv.org/pdf/2005.12729.pdf. Reward normalization, although not exactly correct and optimal, typically improves convergence speed and performance and should usually be True.
- device: str | device | Literal['auto'] = 'auto'
Device RecurrentAlgorithm.env, RecurrentAlgorithm.buffer, and RecurrentAlgorithm.policy all reside on.
- build(env_cls: EnvFactory) RecurrentAlgorithm[source]
Build and validate a RecurrentAlgorithm from a config.