Training utilities for reinforcement learning algorithms.
This module provides core computational functions for RL training, including advantage estimation (GAE and Monte Carlo returns), policy gradient losses, value function losses, and evaluation utilities. All functions operate on arrays for efficient batch processing.
Advantage Estimation
Advantage estimation computes how much better an action is than the policy's average behavior in a state. GAE balances bias and variance by exponentially weighting n-step returns with the lambda parameter. Monte Carlo returns provide unbiased but high-variance estimates by summing discounted rewards until episode termination.
Terminal states (dones = true) have zero value for the next state, truncating the advantage sum. This prevents bootstrapping across episode boundaries. Pass dones.(t) = terminated.(t) || truncated.(t) for compatibility with Gymnasium-style APIs.
Parameters:
rewards: Immediate rewards at each timestep
values: Value estimates V(s) from the critic at each timestep
dones: Terminal flags; true if episode ended (terminated OR truncated)
last_value: Value estimate V(s_T) for the state following the final timestep (used for bootstrapping unfinished episodes)
last_done: Whether the final timestep ended the episode
gamma: Discount factor, typically 0.99. Higher values increase weight of future rewards
gae_lambda: GAE lambda parameter, typically 0.95. Lambda = 0 gives one-step TD (low variance, high bias), lambda = 1 gives Monte Carlo (high variance, low bias)
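The recursion these parameters feed can be sketched as follows. This is a minimal illustration, not the module's actual implementation: the function name gae is an assumption, and it follows the convention stated above that dones.(t) = true zeroes the next-state value at timestep t (so last_done plays the role of dones.(n-1) for the bootstrap step):

```ocaml
(* Sketch of GAE: delta_t = r_t + gamma * V(s_{t+1}) * mask_t - V(s_t)
   and A_t = delta_t + gamma * lambda * mask_t * A_{t+1}, computed backwards,
   where mask_t = 0 when the episode ended at timestep t. *)
let gae ~rewards ~values ~dones ~last_value ~last_done ~gamma ~gae_lambda =
  let n = Array.length rewards in
  let advantages = Array.make n 0.0 in
  let next_adv = ref 0.0 in
  for t = n - 1 downto 0 do
    (* Value of the next state and whether we may bootstrap through it. *)
    let next_value, mask =
      if t = n - 1 then (last_value, if last_done then 0.0 else 1.0)
      else (values.(t + 1), if dones.(t) then 0.0 else 1.0)
    in
    (* One-step TD error with terminal masking. *)
    let delta = rewards.(t) +. (gamma *. next_value *. mask) -. values.(t) in
    next_adv := delta +. (gamma *. gae_lambda *. mask *. !next_adv);
    advantages.(t) <- !next_adv
  done;
  (* Critic targets: R_t = A_t + V(s_t). *)
  let returns = Array.mapi (fun t a -> a +. values.(t)) advantages in
  (advantages, returns)
```

Setting gae_lambda to 0.0 makes each advantage collapse to the one-step TD error; setting it to 1.0 recovers the discounted Monte Carlo return minus the baseline.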
Loss Functions
Loss functions for policy optimization. Policy gradient losses measure how well actions align with their advantages; value losses train the critic to predict returns accurately.
policy_gradient_loss ~log_probs ~advantages ~normalize () computes the policy gradient loss.
Returns the negative mean of log probabilities weighted by advantages:
-mean(log_prob * advantage)
This gradient estimator increases the probability of actions with positive advantages and decreases probability of actions with negative advantages. Used by REINFORCE and A2C.
Advantage normalization (enabled by default) standardizes advantages to zero mean and unit variance, improving training stability by keeping gradients in a consistent range.
Parameters:
log_probs: Log probabilities log π(a|s) of actions taken
advantages: Advantage estimates for each action
normalize: Whether to normalize advantages (default: true)
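A minimal sketch of this loss over plain float arrays, assuming the helper names mean and std (not part of the documented API):

```ocaml
(* Sketch of -mean(log_prob * advantage) with optional normalization. *)
let mean arr =
  Array.fold_left ( +. ) 0.0 arr /. float_of_int (Array.length arr)

let std arr =
  let m = mean arr in
  sqrt (mean (Array.map (fun x -> (x -. m) ** 2.0) arr))

let policy_gradient_loss ~log_probs ~advantages ?(normalize = true) () =
  let adv =
    if normalize then begin
      (* Standardize advantages to zero mean and unit variance. *)
      let m = mean advantages and s = std advantages in
      Array.map (fun a -> (a -. m) /. (s +. 1e-8)) advantages
    end
    else advantages
  in
  -. mean (Array.mapi (fun i lp -> lp *. adv.(i)) log_probs)
```

Minimizing this loss with gradient descent pushes log-probabilities of positive-advantage actions up and negative-advantage actions down, which is the REINFORCE/A2C update.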
Returns the negative mean of the clipped objective:
-mean(min(ratio * A, clip(ratio, 1-ε, 1+ε) * A))
where ratio = exp(log_prob - old_log_prob) and ε is clip_range.
The probability ratio measures how much the policy has changed since the data was collected. Clipping prevents excessively large updates that destabilize training: when the ratio moves outside [1 - clip_range, 1 + clip_range] and the clipped term attains the minimum, the gradient for that sample is zero, limiting how far the policy can move in a single update.
Advantages are automatically normalized to zero mean and unit variance.
Parameters:
log_probs: Log probabilities log π_new(a|s) from the current policy
old_log_probs: Log probabilities log π_old(a|s) from the policy that collected the data
advantages: Advantage estimates for each action
clip_range: Clipping parameter ε, typically 0.2
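The clipped objective can be sketched as below. The function name clipped_surrogate_loss is an assumption (this section does not show the real signature), and the block is self-contained:

```ocaml
(* Sketch of the PPO clipped surrogate loss:
   -mean(min(ratio * A, clip(ratio, 1 - eps, 1 + eps) * A)). *)
let clipped_surrogate_loss ~log_probs ~old_log_probs ~advantages ~clip_range =
  let n = Array.length log_probs in
  let mean a = Array.fold_left ( +. ) 0.0 a /. float_of_int (Array.length a) in
  (* Normalize advantages to zero mean and unit variance. *)
  let m = mean advantages in
  let s = sqrt (mean (Array.map (fun x -> (x -. m) ** 2.0) advantages)) in
  let adv = Array.map (fun a -> (a -. m) /. (s +. 1e-8)) advantages in
  let objectives =
    Array.init n (fun i ->
        (* ratio = pi_new(a|s) / pi_old(a|s), computed in log space. *)
        let ratio = exp (log_probs.(i) -. old_log_probs.(i)) in
        let clipped =
          Float.min (Float.max ratio (1.0 -. clip_range)) (1.0 +. clip_range)
        in
        Float.min (ratio *. adv.(i)) (clipped *. adv.(i)))
  in
  -. mean objectives
```

Taking the element-wise minimum makes the objective a pessimistic bound: the policy never benefits from moving the ratio further outside the clip region.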
value_loss ~values ~returns ?clip () computes the value function loss.
If clip is None, returns the mean squared error between predicted values and target returns:
mean((V - R)^2)
If clip = Some (clip_range, old_values), PPO-style value clipping is applied: value_clipped = old_values + clamp(values - old_values, ±clip_range), and the loss for each element is max((values - returns)^2, (value_clipped - returns)^2). This prevents large critic updates that would destabilize training.
Parameters:
values: Predicted values V(s) from the critic
returns: Target returns (from GAE or Monte Carlo)
clip: Optional (clip_range, old_values) pair; old_values are the critic's estimates from before the update
Time complexity: O(n) where n = length of the input arrays.
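Both branches can be sketched as follows; treat this as an illustration of the math rather than the module's implementation:

```ocaml
(* Sketch of the value loss with optional PPO-style clipping. *)
let value_loss ~values ~returns ?clip () =
  let n = Array.length values in
  let mean a = Array.fold_left ( +. ) 0.0 a /. float_of_int n in
  match clip with
  | None ->
      (* Plain MSE between predictions and targets. *)
      mean (Array.mapi (fun i v -> (v -. returns.(i)) ** 2.0) values)
  | Some (clip_range, old_values) ->
      mean
        (Array.init n (fun i ->
             (* Keep the new prediction within clip_range of the old one. *)
             let delta = values.(i) -. old_values.(i) in
             let clamped =
               Float.min clip_range (Float.max (-.clip_range) delta)
             in
             let v_clipped = old_values.(i) +. clamped in
             (* Pessimistic max over the clipped and unclipped errors. *)
             Float.max
               ((values.(i) -. returns.(i)) ** 2.0)
               ((v_clipped -. returns.(i)) ** 2.0)))
```

Taking the maximum of the two squared errors means clipping can only increase the loss, which discourages value updates that stray far from the pre-update critic.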
Evaluation
evaluate env ~policy ~n_episodes ~max_steps () evaluates a policy's performance.
Runs the policy for multiple episodes, collecting cumulative rewards and episode lengths. Returns statistics summarizing the policy's performance. Use this to assess training progress or compare different policies.
Episodes run until natural termination or truncation, or until max_steps is reached. The environment is automatically reset between episodes.
Parameters:
env: Environment to evaluate in
policy: Function mapping observations to actions. Should be deterministic for reproducible evaluation or stochastic for exploration assessment
n_episodes: Number of episodes to run (default: 10)
max_steps: Maximum steps per episode before truncation (default: 1000)
Time complexity: O(n_episodes × max_steps).
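The evaluation loop can be sketched as below. The environment interface here (a record with reset and step closures, and step returning an observation, reward, and done flag) is entirely hypothetical, since this section does not show the module's real env type:

```ocaml
(* Hypothetical minimal environment interface for illustration only. *)
type 'obs env = {
  reset : unit -> 'obs;
  step : int -> 'obs * float * bool;  (* next obs, reward, done *)
}

let evaluate env ~policy ?(n_episodes = 10) ?(max_steps = 1000) () =
  let rewards = Array.make n_episodes 0.0 in
  let lengths = Array.make n_episodes 0 in
  for ep = 0 to n_episodes - 1 do
    (* Reset between episodes; run until done or max_steps. *)
    let obs = ref (env.reset ()) in
    let finished = ref false in
    let steps = ref 0 in
    while (not !finished) && !steps < max_steps do
      let obs', r, done_ = env.step (policy !obs) in
      obs := obs';
      rewards.(ep) <- rewards.(ep) +. r;
      incr steps;
      finished := done_
    done;
    lengths.(ep) <- !steps
  done;
  let mean_reward =
    Array.fold_left ( +. ) 0.0 rewards /. float_of_int n_episodes
  in
  (mean_reward, rewards, lengths)
```

Returning per-episode rewards and lengths alongside the mean lets callers compute variance or other summary statistics without re-running episodes.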
Utilities
Helper functions for data processing and diagnostics.
normalize arr ~eps ~unbiased () normalizes an array to zero mean and unit variance.
Returns a new array where elements are transformed to:
(x - mean) / (std + eps)
Numerical stability constant eps prevents division by zero when the array has no variance (default: 1e-8). When unbiased is true, the standard deviation uses the n-1 denominator.
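A self-contained sketch matching the description above; the default for unbiased is an assumption, since only the eps default is documented:

```ocaml
(* Sketch of normalize: (x - mean) / (std + eps), with an optional
   unbiased (n - 1 denominator) standard deviation. *)
let normalize arr ?(eps = 1e-8) ?(unbiased = false) () =
  let n = Array.length arr in
  let mean = Array.fold_left ( +. ) 0.0 arr /. float_of_int n in
  (* n - 1 denominator when unbiased; plain n otherwise. *)
  let denom = if unbiased then n - 1 else n in
  let var =
    Array.fold_left (fun acc x -> acc +. ((x -. mean) ** 2.0)) 0.0 arr
    /. float_of_int denom
  in
  let std = sqrt var in
  Array.map (fun x -> (x -. mean) /. (std +. eps)) arr
```

On a constant array the standard deviation is zero, so every element maps to 0 / eps = 0 rather than raising a division error.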