rl8.nn package

Submodules

rl8.nn.functional module

Functional PyTorch definitions.

rl8.nn.functional.binary_mask_to_float_mask(mask: Tensor, /) → Tensor

Convert 0 and 1 elements in a binary mask to -inf and 0, respectively.

Parameters:

mask – Binary mask tensor.

Returns:

Float mask tensor where 0 indicates an UNPADDED or VALID value.
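
Example (a minimal sketch; the mask values and dtype below are illustrative):

    import torch

    from rl8.nn.functional import binary_mask_to_float_mask

    # 1 marks valid (unpadded) elements, 0 marks padding.
    binary_mask = torch.tensor([[1, 1, 0], [1, 0, 0]])
    float_mask = binary_mask_to_float_mask(binary_mask)
    # Per the description above, 1 -> 0.0 and 0 -> -inf, so the result can be
    # added to logits to suppress padded positions before a softmax.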

rl8.nn.functional.float_mask_to_binary_mask(mask: Tensor, /) → Tensor

Convert 0 and -inf elements into a boolean mask of True and False, respectively.

Parameters:

mask – Float mask tensor.

Returns:

Boolean mask tensor where True indicates an UNPADDED or VALID value.
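
Example (a minimal sketch mirroring the previous function):

    import torch

    from rl8.nn.functional import float_mask_to_binary_mask

    float_mask = torch.tensor([[0.0, 0.0, float("-inf")]])
    binary_mask = float_mask_to_binary_mask(float_mask)
    # Per the description above, 0.0 -> True (valid) and -inf -> False (padded),
    # inverting binary_mask_to_float_mask().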

rl8.nn.functional.generalized_advantage_estimate(batch: TensorDict, /, *, gae_lambda: float = 0.95, gamma: float = 0.95, inplace: bool = False, normalize_advantages: bool = True, return_returns: bool = True, reward_scale: float = 1.0) → TensorDict

Compute a Generalized Advantage Estimate (GAE) and, optionally, returns using value function estimates and rewards.

GAE is most commonly used with PPO for computing a policy loss that incentivizes “good” actions.

Parameters:
  • batch

    Tensordict of batch size [B, T + 1, ...] that contains the following keys:

    • "rewards": Environment transition rewards.

    • "values": Policy value function estimates.

  • gae_lambda – Generalized Advantage Estimation (GAE) hyperparameter for controlling the bias-variance tradeoff when estimating the state value function from collected environment transitions. A higher value gives lower bias but higher variance, while a lower value gives lower variance but higher bias.

  • gamma – Reward discount factor, as used in the Bellman operator, for controlling the variance and bias tradeoff in collected rewards. Note that this does not control the bias/variance of the state value estimation; it only controls the weight future rewards have on the total discounted return.

  • inplace – Whether to store advantage and, optionally, return estimates in the given tensordict or whether to allocate a separate tensordict for the returned values.

  • normalize_advantages – Whether to normalize advantages using the mean and standard deviation of the advantage batch before storing in the returned tensordict.

  • return_returns – Whether to compute and return Monte Carlo return estimates with GAE.

  • reward_scale – Reward scale to use; useful for normalizing rewards for stabilizing learning and improving overall performance.

Returns:

A tensordict with at least advantages and, optionally, discounted returns.
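
Example (a minimal, hypothetical sketch; the rewards and values are random placeholders, and the output key names follow the descriptions above):

    import torch
    from tensordict import TensorDict

    from rl8.nn.functional import generalized_advantage_estimate

    B, T = 4, 16
    # Rollout data with the documented [B, T + 1] batch size.
    batch = TensorDict(
        {"rewards": torch.rand(B, T + 1), "values": torch.rand(B, T + 1)},
        batch_size=[B, T + 1],
    )
    out = generalized_advantage_estimate(batch, gae_lambda=0.95, gamma=0.99)
    # `out` is expected to contain "advantages" and, with the default
    # return_returns=True, Monte Carlo "returns" as well.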

rl8.nn.functional.mask_from_lengths(x: Tensor, lengths: Tensor, /) → Tensor

Return a sequence mask that indicates UNPADDED or VALID values according to the given sequence lengths.

Parameters:
  • x – Tensor with shape [B, T, ...].

  • lengths – Tensor with shape [B] that indicates the valid sequence length along T for each of the B elements in x.

Returns:

Sequence mask of shape [B, T].
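
Example (a minimal sketch; shapes and lengths are illustrative):

    import torch

    from rl8.nn.functional import mask_from_lengths

    x = torch.rand(2, 4, 8)
    lengths = torch.tensor([4, 2])
    mask = mask_from_lengths(x, lengths)
    # `mask` has shape [2, 4]; only the first lengths[b] steps of each row are
    # expected to be marked UNPADDED/VALID.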

rl8.nn.functional.masked_avg(x: Tensor, /, *, mask: None | Tensor = None, dim: int = 1, keepdim: bool = False) → Tensor

Apply a masked average to x along dim.

Useful for pooling potentially padded features.

Parameters:
  • x – Tensor with shape [B, T, ...] to apply pooling to.

  • mask – Mask with shape [B, T] indicating UNPADDED or VALID values.

  • dim – Dimension to pool along.

  • keepdim – Whether to keep the pooled dimension.

Returns:

Masked average of x along dim.
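
Example (a minimal sketch; shapes are illustrative):

    import torch

    from rl8.nn.functional import mask_from_lengths, masked_avg

    x = torch.rand(2, 4, 8)
    mask = mask_from_lengths(x, torch.tensor([4, 2]))
    pooled = masked_avg(x, mask=mask, dim=1)
    # Padded time steps are excluded from the mean; with the default
    # keepdim=False, `pooled` is expected to have shape [2, 8].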

rl8.nn.functional.masked_categorical_sample(x: Tensor, /, *, mask: None | Tensor = None, dim: int = 1) → tuple[torch.Tensor, torch.Tensor]

Masked categorical sampling of x.

Typically used for sampling from outputs of masked_log_softmax().

Parameters:
  • x – Logits with shape [B, T, ...] to sample from.

  • mask – Mask with shape [B, T] indicating UNPADDED or VALID values.

  • dim – Dimension to gather sampled values along.

Returns:

Sampled logits and the indices of those sampled logits.
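
Example (a minimal sketch following the masked_log_softmax() pairing mentioned above; the logits and mask are illustrative):

    import torch

    from rl8.nn.functional import masked_categorical_sample, masked_log_softmax

    logits = torch.rand(2, 5)
    mask = torch.tensor([[1, 1, 1, 0, 0], [1, 1, 1, 1, 1]], dtype=torch.bool)
    log_probs = masked_log_softmax(logits, mask=mask, dim=-1)
    samples, indices = masked_categorical_sample(log_probs, mask=mask, dim=1)
    # `indices` should only point at positions the mask marks as valid.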

rl8.nn.functional.masked_log_softmax(x: Tensor, /, *, mask: None | Tensor = None, dim: int = -1) → Tensor

Apply a masked log softmax to x along dim.

Typically used for getting logits from a model that predicts a sequence. The output of this function is typically passed to masked_categorical_sample().

Parameters:
  • x – Tensor with shape [B, T, ...].

  • mask – Mask with shape [B, T] indicating UNPADDED or VALID values.

  • dim – Dimension to apply log softmax along.

Returns:

Masked log softmax of x along dim.
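
Example (a minimal sketch; the mask dtype is an assumption):

    import torch

    from rl8.nn.functional import masked_log_softmax

    x = torch.rand(2, 4)
    mask = torch.tensor([[1, 1, 0, 0], [1, 1, 1, 1]], dtype=torch.bool)
    log_probs = masked_log_softmax(x, mask=mask, dim=-1)
    # Probability mass is expected to be normalized over the valid (unmasked)
    # positions only, so exp(log_probs) sums to ~1 over each row's valid entries.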

rl8.nn.functional.masked_max(x: Tensor, /, *, mask: None | Tensor = None, dim: int = 1) → tuple[torch.Tensor, torch.Tensor]

Apply a masked max to x along dim.

Useful for pooling potentially padded features.

Parameters:
  • x – Tensor with shape [B, T, ...] to apply pooling to.

  • mask – Mask with shape [B, T] indicating UNPADDED or VALID values.

  • dim – Dimension to pool along.

Returns:

Masked max of x along dim and the indices of those maximums.
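
Example (a minimal sketch; shapes are illustrative):

    import torch

    from rl8.nn.functional import mask_from_lengths, masked_max

    x = torch.rand(2, 4, 8)
    mask = mask_from_lengths(x, torch.tensor([4, 2]))
    values, indices = masked_max(x, mask=mask, dim=1)
    # `values` holds the per-feature max over valid time steps and `indices`
    # the positions of those maxima.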

rl8.nn.functional.ppo_losses(buffer_batch: TensorDict, sample_batch: TensorDict, sample_distribution: Distribution, /, *, clip_param: float = 0.2, dual_clip_param: None | float = 5.0, entropy_coeff: float = 0.0, vf_clip_param: float = 1.0, vf_coeff: float = 1.0) → TensorDict

Proximal Policy Optimization loss.

Includes a dual-clipped policy loss, value function estimate loss, and an optional entropy bonus loss. All losses are summed into a total loss and reduced with a mean operation.

Parameters:
  • buffer_batch

    Tensordict of batch size [B, ...] full of the following keys:

    • "actions": Policy action samples during environment transitions.

    • "advantages": Advantages from generalized_advantage_estimate().

    • "logp": Log probabilities of taking "actions".

    • "returns": Monte Carlo return estimates.

  • sample_batch

    Tensordict from sampling a policy of batch size [B, ...] full of the following keys:

    • "values": Policy value function estimates.

  • sample_distribution – A distribution instance created from the model that provided sample_batch used for computing the policy loss and entropy bonus loss.

  • clip_param – PPO hyperparameter indicating the max distance the policy can update away from previously collected policy sample data with respect to likelihoods of taking actions conditioned on observations. This is the main innovation of PPO.

  • dual_clip_param – PPO hyperparameter that clips like clip_param but when advantage estimations are negative. Helps prevent instability for continuous action spaces when policies are making large updates. Leave None for this clip to not apply. Otherwise, typical values are around 5.

  • entropy_coeff – Entropy coefficient value. Weight of the entropy loss w.r.t. other components of total loss.

  • vf_clip_param – PPO hyperparameter similar to clip_param but for the value function estimate. A measure of max distance the model’s value function is allowed to update away from previous value function samples.

  • vf_coeff – Value function loss component weight. Only needs to be tuned when the policy and value function share parameters.

Returns:

A tensordict containing each of the loss components.
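
Example (a minimal, hypothetical sketch with a discrete action space; the random tensors, their shapes, and the use of torch.distributions.Categorical are illustrative assumptions rather than requirements of the API):

    import torch
    from tensordict import TensorDict
    from torch.distributions import Categorical

    from rl8.nn.functional import ppo_losses

    B, num_actions = 8, 4
    buffer_batch = TensorDict(
        {
            "actions": torch.randint(num_actions, (B,)),
            "advantages": torch.rand(B),
            "logp": torch.rand(B).log(),
            "returns": torch.rand(B),
        },
        batch_size=[B],
    )
    sample_batch = TensorDict({"values": torch.rand(B)}, batch_size=[B])
    sample_distribution = Categorical(logits=torch.rand(B, num_actions))
    losses = ppo_losses(buffer_batch, sample_batch, sample_distribution)
    # `losses` is expected to contain the policy, value function, and (with a
    # nonzero entropy_coeff) entropy loss components described above.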

rl8.nn.functional.skip_connection(x: Tensor, y: Tensor, /, *, kind: None | str = 'cat', dim: int = -1) → Tensor

Perform a skip connection for x and y.

Parameters:
  • x – Skip connection seed with shape [B, T, ...].

  • y – Skip connection seed with the same shape as x.

  • kind

    Type of skip connection to use. Options include:

    • "residual" for a standard residual connection (summing outputs)

    • "cat" for concatenating outputs

    • None for no skip connection

  • dim – Dimension to apply concatenation along. Only valid when kind is "cat".

Returns:

A tensor with shape depending on kind.
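
Example (a minimal sketch; shapes are illustrative):

    import torch

    from rl8.nn.functional import skip_connection

    x = torch.rand(2, 4, 8)
    y = torch.rand(2, 4, 8)
    z_cat = skip_connection(x, y, kind="cat", dim=-1)  # Shape [2, 4, 16].
    z_res = skip_connection(x, y, kind="residual")     # Shape [2, 4, 8], i.e., x + y.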

Module contents

Top-level PyTorch neural network extensions.