Experience collection buffers for reinforcement learning algorithms.

Buffers accumulate trajectories for batch training or off-policy learning.
This module provides two buffer types for storing agent-environment interactions: replay buffers for off-policy algorithms and rollout buffers for on-policy algorithms. Both support efficient batch sampling and storage management.
Buffer Types
Replay buffers store transitions with complete state information, supporting off-policy algorithms like DQN, SAC, and TD3. They maintain a fixed-capacity circular buffer that overwrites oldest experiences when full.
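The overwrite-when-full behavior can be pictured as a ring over a fixed array. The sketch below is illustrative only; the type and function names are hypothetical, not the library's internals.

```ocaml
(* Minimal circular-buffer sketch: a fixed array, a write cursor, and a
   size counter capped at capacity. Names are illustrative. *)
type 'a ring = {
  data : 'a option array;  (* fixed-capacity storage *)
  mutable pos : int;       (* next write index *)
  mutable size : int;      (* number of stored items, <= capacity *)
}

let ring_create capacity = { data = Array.make capacity None; pos = 0; size = 0 }

let ring_add ring x =
  let capacity = Array.length ring.data in
  ring.data.(ring.pos) <- Some x;           (* overwrites the oldest slot when full *)
  ring.pos <- (ring.pos + 1) mod capacity;
  ring.size <- min (ring.size + 1) capacity
```

Once `size` reaches `capacity`, each new write lands on the oldest remaining experience, so memory stays bounded regardless of how long training runs.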
Rollout buffers store sequential steps with optional value estimates and log probabilities, supporting on-policy algorithms like PPO and A2C. They compute advantages using Generalized Advantage Estimation (GAE) before returning batches.
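The GAE pass can be sketched as a single backward sweep over the rollout. This is a standalone illustration over plain arrays, assuming the standard recurrence (delta_t = r_t + gamma * V(s_{t+1}) - V(s_t), accumulated with gamma * lambda), not the buffer's actual implementation.

```ocaml
(* Generalized Advantage Estimation over a finished rollout.
   rewards, values, dones are per-step arrays; last_value bootstraps the
   step after the rollout ends. *)
let gae ~rewards ~values ~dones ~last_value ~gamma ~gae_lambda =
  let n = Array.length rewards in
  let advantages = Array.make n 0.0 in
  let next_adv = ref 0.0 in
  for t = n - 1 downto 0 do
    let next_value = if t = n - 1 then last_value else values.(t + 1) in
    let mask = if dones.(t) then 0.0 else 1.0 in
    (* TD error: delta_t = r_t + gamma * V(s_{t+1}) - V(s_t) *)
    let delta = rewards.(t) +. (gamma *. next_value *. mask) -. values.(t) in
    next_adv := delta +. (gamma *. gae_lambda *. mask *. !next_adv);
    advantages.(t) <- !next_adv
  done;
  (* value targets: R_t = A_t + V(s_t) *)
  let returns = Array.mapi (fun t a -> a +. values.(t)) advantages in
  (advantages, returns)
```

With `gamma = 1.0` and `gae_lambda = 1.0` this reduces to Monte Carlo advantages; with `gae_lambda = 0.0` it reduces to one-step TD errors.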
Usage
Create a replay buffer and add transitions:
let buffer = Buffer.Replay.create ~capacity:10000 in
let transition =
{ observation; action; reward; next_observation; terminated; truncated }
in
Buffer.Replay.add buffer transition
Sample a batch for training:
let batch = Buffer.Replay.sample buffer ~rng ~batch_size:32 in
Array.iter (fun t -> ignore t (* train on transition t *)) batch
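One plausible way such sampling works is uniform draws with replacement over the filled portion of the buffer. The helper below is a hypothetical sketch, not the library's `sample`:

```ocaml
(* Draw batch_size uniform indices into the first `size` filled slots,
   with replacement. A batch is then gathered from these indices. *)
let sample_indices rng ~size ~batch_size =
  Array.init batch_size (fun _ -> Random.State.int rng size)
```

Sampling with replacement keeps the draw O(batch_size) and is the common choice for replay buffers, since batches are small relative to capacity.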
Use rollout buffers for on-policy data:
let buffer = Buffer.Rollout.create ~capacity:2048 in
Buffer.Rollout.add buffer
{ observation; action; reward; terminated; truncated; value; log_prob };
Buffer.Rollout.compute_advantages buffer ~last_value ~last_done
  ~gamma:0.99 ~gae_lambda:0.95;
let steps, advantages, returns = Buffer.Rollout.get buffer
Represents a complete state transition containing both the current and next observations. Used by replay buffers for algorithms that learn from arbitrary past experiences.
Log probability log π(a|s) from the policy, if available.
*)
}
Rollout step for on-policy algorithms.
Represents a single timestep with optional policy information. Unlike transitions, steps do not store next observations since on-policy data is processed sequentially. Value estimates and log probabilities support policy gradient methods.
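The contrast between the two records can be sketched as below. Field names follow the usage examples above; the type parameters and `option` choices are assumptions for illustration, not the library's exact declarations.

```ocaml
(* Off-policy transition: stores next_observation so a learner can
   bootstrap from arbitrary past experiences. *)
type ('obs, 'act) transition = {
  observation : 'obs;
  action : 'act;
  reward : float;
  next_observation : 'obs;  (* needed for off-policy bootstrapping *)
  terminated : bool;
  truncated : bool;
}

(* On-policy step: no next_observation (data is consumed sequentially),
   but optional policy-side quantities for policy-gradient updates. *)
type ('obs, 'act) step = {
  observation : 'obs;
  action : 'act;
  reward : float;
  terminated : bool;
  truncated : bool;
  value : float option;     (* critic estimate V(s), if available *)
  log_prob : float option;  (* log π(a|s) from the policy, if available *)
}
```

Dropping `next_observation` from steps avoids storing each observation twice: in sequential on-policy data, the next observation is simply the following step's `observation`.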