rl8.nn package
Subpackages
- rl8.nn.modules package
- Submodules
- rl8.nn.modules.activations module
- rl8.nn.modules.attention module
- rl8.nn.modules.embeddings module
- rl8.nn.modules.mlp module
- rl8.nn.modules.module module
- rl8.nn.modules.perceiver module
- rl8.nn.modules.skip module
- Module contents
Submodules
rl8.nn.functional module
Functional PyTorch definitions.
- rl8.nn.functional.binary_mask_to_float_mask(mask: Tensor, /) Tensor [source]
Convert 0 and 1 elements in a binary mask to -inf and 0, respectively.
- Parameters:
mask – Binary mask tensor.
- Returns:
Float mask tensor where 0 indicates an UNPADDED or VALID value.
- rl8.nn.functional.float_mask_to_binary_mask(mask: Tensor, /) Tensor [source]
Convert 0 and -inf elements into a boolean mask of True and False, respectively.
- Parameters:
mask – Float mask tensor.
- Returns:
Boolean mask tensor where True indicates an UNPADDED or VALID value.
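A minimal round-trip sketch of the two conversions above; the boolean input dtype and the exact -inf/0 float representation are assumptions based on the descriptions.

```python
import torch

from rl8.nn.functional import binary_mask_to_float_mask, float_mask_to_binary_mask

# Boolean/0-1 input dtype assumed.
binary = torch.tensor([[True, True, False], [True, False, False]])
float_mask = binary_mask_to_float_mask(binary)      # 0.0 where VALID, -inf where padded
recovered = float_mask_to_binary_mask(float_mask)   # True where VALID again
```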
- rl8.nn.functional.generalized_advantage_estimate(batch: TensorDict, /, *, gae_lambda: float = 0.95, gamma: float = 0.95, inplace: bool = False, normalize_advantages: bool = True, return_returns: bool = True, reward_scale: float = 1.0) TensorDict [source]
Compute a Generalized Advantage Estimate (GAE) and, optionally, returns using value function estimates and rewards.
GAE is most commonly used with PPO for computing a policy loss that incentivizes “good” actions.
- Parameters:
batch – Tensordict of batch size [B, T + 1, ...] that contains the following keys:
- "rewards": Environment transition rewards.
- "values": Policy value function estimates.
gae_lambda – Generalized Advantage Estimation (GAE) hyperparameter for controlling the variance and bias tradeoff when estimating the state value function from collected environment transitions. A higher value accepts higher variance in exchange for lower bias, while a lower value reduces variance at the cost of higher bias.
gamma – Reward discount factor, often used in the Bellman operator, for controlling the variance and bias tradeoff in collected experience rewards. Note that this does not control the bias/variance of the state value estimation; it only controls the weight future rewards have on the total discounted return.
inplace – Whether to store advantage and, optionally, return estimates in the given tensordict or whether to allocate a separate tensordict for the returned values.
normalize_advantages – Whether to normalize advantages using the mean and standard deviation of the advantage batch before storing in the returned tensordict.
return_returns – Whether to compute and return Monte Carlo return estimates with GAE.
reward_scale – Reward scale to use; useful for normalizing rewards for stabilizing learning and improving overall performance.
- Returns:
A tensordict with at least advantages and, optionally, discounted returns.
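A minimal usage sketch based on the signature and key names above; the tensordict import, the per-step tensor shapes, and the "advantages"/"returns" output keys are assumptions.

```python
import torch
from tensordict import TensorDict  # assumed TensorDict provider

from rl8.nn.functional import generalized_advantage_estimate

B, T = 4, 16  # hypothetical batch size and rollout horizon
batch = TensorDict(
    {
        "rewards": torch.randn(B, T + 1),  # per-step scalars; a trailing feature dim may be required
        "values": torch.randn(B, T + 1),
    },
    batch_size=[B, T + 1],
)
out = generalized_advantage_estimate(batch, gae_lambda=0.95, gamma=0.99)
# Output key names are assumed; inspect out.keys() for the exact names.
advantages, returns = out["advantages"], out["returns"]
```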
- rl8.nn.functional.mask_from_lengths(x: Tensor, lengths: Tensor, /) Tensor [source]
Return sequence mask that indicates UNPADDED or VALID values according to tensor lengths.
- Parameters:
x – Tensor with shape [B, T, ...].
lengths – Tensor with shape [B] that indicates lengths of the T sequence for each B element in x.
- Returns:
Sequence mask of shape [B, T].
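A minimal sketch of building a sequence mask from per-sequence lengths; the boolean output dtype is an assumption.

```python
import torch

from rl8.nn.functional import mask_from_lengths

B, T, F = 2, 5, 8  # hypothetical sizes
x = torch.randn(B, T, F)
lengths = torch.tensor([3, 5])        # valid steps per batch element
mask = mask_from_lengths(x, lengths)  # shape [B, T]; marks UNPADDED/VALID steps
```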
- rl8.nn.functional.masked_avg(x: Tensor, /, *, mask: None | Tensor = None, dim: int = 1, keepdim: bool = False) Tensor [source]
Apply a masked average to x along dim.
Useful for pooling potentially padded features.
- Parameters:
x – Tensor with shape [B, T, ...] to apply pooling to.
mask – Mask with shape [B, T] indicating UNPADDED or VALID values.
dim – Dimension to pool along.
keepdim – Whether to keep the pooled dimension.
- Returns:
Masked average of x along dim.
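A minimal pooling sketch, assuming the mask produced by mask_from_lengths() is accepted directly.

```python
import torch

from rl8.nn.functional import mask_from_lengths, masked_avg

B, T, F = 2, 5, 8  # hypothetical sizes
x = torch.randn(B, T, F)
mask = mask_from_lengths(x, torch.tensor([3, 5]))  # [B, T] VALID-step mask
pooled = masked_avg(x, mask=mask, dim=1)           # [B, F]; padded steps excluded from the mean
```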
- rl8.nn.functional.masked_categorical_sample(x: Tensor, /, *, mask: None | Tensor = None, dim: int = 1) tuple[torch.Tensor, torch.Tensor] [source]
Masked categorical sampling of x.
Typically used for sampling from outputs of masked_log_softmax().
- Parameters:
x – Logits with shape [B, T, ...] to sample from.
mask – Mask with shape [B, T] indicating UNPADDED or VALID values.
dim – Dimension to gather sampled values along.
- Returns:
Sampled logits and the indices of those sampled logits.
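A minimal sampling sketch; the boolean mask dtype and the shape of the returned indices are assumptions.

```python
import torch

from rl8.nn.functional import masked_categorical_sample

B, T = 2, 5  # hypothetical sizes
logits = torch.randn(B, T)
mask = torch.tensor([[1, 1, 1, 0, 0], [1, 1, 1, 1, 1]], dtype=torch.bool)
samples, indices = masked_categorical_sample(logits, mask=mask, dim=1)
# Only VALID (unmasked) positions should ever be sampled.
```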
- rl8.nn.functional.masked_log_softmax(x: Tensor, /, *, mask: None | Tensor = None, dim: int = -1) Tensor [source]
Apply a masked log softmax to x along dim.
Typically used for getting logits from a model that predicts a sequence. The output of this function is typically passed to masked_categorical_sample().
- Parameters:
x – Tensor with shape [B, T, ...].
mask – Mask with shape [B, T] indicating UNPADDED or VALID values.
dim – Dimension to apply log softmax along.
- Returns:
Logits.
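A sketch of the pipeline described above: masked log softmax followed by masked categorical sampling. The final gather of per-sample log probabilities assumes the returned indices have shape [B, 1].

```python
import torch

from rl8.nn.functional import masked_categorical_sample, masked_log_softmax

B, T = 2, 5  # hypothetical sizes
logits = torch.randn(B, T)
mask = torch.tensor([[1, 1, 1, 0, 0], [1, 1, 1, 1, 1]], dtype=torch.bool)
logp = masked_log_softmax(logits, mask=mask, dim=-1)                 # [B, T]
samples, indices = masked_categorical_sample(logp, mask=mask, dim=1)
sample_logp = logp.gather(1, indices)  # log probabilities of the sampled positions (indices assumed [B, 1])
```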
- rl8.nn.functional.masked_max(x: Tensor, /, *, mask: None | Tensor = None, dim: int = 1) tuple[torch.Tensor, torch.Tensor] [source]
Apply a masked max to x along dim.
Useful for pooling potentially padded features.
- Parameters:
x – Tensor with shape [B, T, ...] to apply pooling to.
mask – Mask with shape [B, T] indicating UNPADDED or VALID values.
dim – Dimension to pool along.
- Returns:
Masked max of x along dim and the indices of those maximums.
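A minimal masked max-pooling sketch mirroring the masked_avg() example; mask compatibility is assumed.

```python
import torch

from rl8.nn.functional import mask_from_lengths, masked_max

B, T, F = 2, 5, 8  # hypothetical sizes
x = torch.randn(B, T, F)
mask = mask_from_lengths(x, torch.tensor([3, 5]))  # [B, T] VALID-step mask
values, indices = masked_max(x, mask=mask, dim=1)  # max over VALID steps and the positions of those maxima
```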
- rl8.nn.functional.ppo_losses(buffer_batch: TensorDict, sample_batch: TensorDict, sample_distribution: Distribution, /, *, clip_param: float = 0.2, dual_clip_param: None | float = 5.0, entropy_coeff: float = 0.0, vf_clip_param: float = 1.0, vf_coeff: float = 1.0) TensorDict [source]
Proximal Policy Optimization loss.
Includes a dual-clipped policy loss, value function estimate loss, and an optional entropy bonus loss. All losses are summed into a total loss and reduced with a mean operation.
- Parameters:
buffer_batch – Tensordict of batch size [B, ...] full of the following keys:
- "actions": Policy action samples during environment transitions.
- "advantages": Advantages from generalized_advantage_estimate().
- "logp": Log probabilities of taking "actions".
- "returns": Monte Carlo return estimates.
sample_batch – Tensordict from sampling a policy of batch size [B, ...] full of the following keys:
- "values": Policy value function estimates.
sample_distribution – A distribution instance created from the model that provided sample_batch, used for computing the policy loss and entropy bonus loss.
clip_param – PPO hyperparameter indicating the max distance the policy can update away from previously collected policy sample data with respect to likelihoods of taking actions conditioned on observations. This is the main innovation of PPO.
dual_clip_param – PPO hyperparameter that clips like clip_param but when advantage estimations are negative. Helps prevent instability for continuous action spaces when policies are making large updates. Leave None for this clip to not apply. Otherwise, typical values are around 5.
entropy_coeff – Entropy coefficient value. Weight of the entropy loss w.r.t. other components of the total loss.
vf_clip_param – PPO hyperparameter similar to clip_param but for the value function estimate. A measure of the max distance the model's value function is allowed to update away from previous value function samples.
vf_coeff – Value function loss component weight. Only needs to be tuned when the policy and value function share parameters.
- Returns:
A tensordict containing each of the loss components.
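A minimal wiring sketch using the key names from the docstring; the tensordict import, the tensor shapes, the discrete Categorical distribution, and the output key names are all illustrative assumptions.

```python
import torch
from tensordict import TensorDict  # assumed TensorDict provider
from torch.distributions import Categorical

from rl8.nn.functional import ppo_losses

B, A = 32, 4  # hypothetical batch size and number of discrete actions
buffer_batch = TensorDict(
    {
        "actions": torch.randint(0, A, (B,)),  # shapes illustrative
        "advantages": torch.randn(B),
        "logp": torch.randn(B),
        "returns": torch.randn(B),
    },
    batch_size=[B],
)
sample_batch = TensorDict({"values": torch.randn(B)}, batch_size=[B])
sample_distribution = Categorical(logits=torch.randn(B, A))
losses = ppo_losses(
    buffer_batch,
    sample_batch,
    sample_distribution,
    clip_param=0.2,
    entropy_coeff=0.01,
)
# Inspect losses.keys() for the exact loss component names.
```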
- rl8.nn.functional.skip_connection(x: Tensor, y: Tensor, /, *, kind: None | str = 'cat', dim: int = -1) Tensor [source]
Perform a skip connection for x and y.
- Parameters:
x – Skip connection seed with shape [B, T, ...].
y – Skip connection seed with the same shape as x.
kind – Type of skip connection to use. Options include:
- "residual" for a standard residual connection (summing outputs)
- "cat" for concatenating outputs
- None for no skip connection
dim – Dimension to apply concatenation along. Only valid when kind is "cat".
- Returns:
A tensor with shape depending on kind.
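A minimal sketch of the three kinds; the commented output shapes follow from the descriptions above, while the exact behavior for kind=None is left unspecified here.

```python
import torch

from rl8.nn.functional import skip_connection

B, T, F = 2, 5, 8  # hypothetical sizes
x = torch.randn(B, T, F)
y = torch.randn(B, T, F)
cat_out = skip_connection(x, y, kind="cat", dim=-1)  # [B, T, 2 * F]
res_out = skip_connection(x, y, kind="residual")     # [B, T, F]; outputs are summed
no_skip = skip_connection(x, y, kind=None)           # no skip connection applied
```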
Module contents
Top-level PyTorch neural network extensions.