REINFORCE (Monte Carlo Policy Gradient) is a classic policy gradient method that collects complete episodes and updates the policy using Monte Carlo return estimates. See Reinforce for detailed documentation.
REINFORCE algorithm (Monte Carlo Policy Gradient).
REINFORCE is a classic policy gradient algorithm that optimizes policies by following the gradient of expected return. It collects complete episodes, computes returns using Monte Carlo estimation, and updates the policy to increase the probability of actions that led to high returns.
Algorithm
REINFORCE follows these steps:
1. Collect a complete episode following the current policy
2. Compute the discounted return G_t for each timestep (see the sketch after this list)
3. Update policy parameters by gradient ascent on log π(a_t|s_t) × G_t
4. Optionally subtract a baseline (a value-function estimate of V(s_t)) from G_t to reduce variance
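As a concrete illustration of steps 2-4, here is a minimal sketch in plain OCaml. It is independent of Fehu's actual internals and only spells out the return computation and the per-step objective:

(* Minimal sketch, not Fehu's implementation: discounted returns and the
   per-step REINFORCE objective for a single episode. *)
let discounted_returns ~gamma rewards =
  let n = Array.length rewards in
  let returns = Array.make n 0.0 in
  let acc = ref 0.0 in
  (* Walk the episode backwards: G_t = r_t + gamma * G_{t+1} *)
  for t = n - 1 downto 0 do
    acc := rewards.(t) +. (gamma *. !acc);
    returns.(t) <- !acc
  done;
  returns

(* The quantity maximised at each step is log pi(a_t|s_t) * G_t; with a
   baseline, G_t is replaced by the advantage G_t - V(s_t). *)
let per_step_objective ~log_prob ~return_ ?baseline () =
  match baseline with
  | None -> log_prob *. return_
  | Some v -> log_prob *. (return_ -. v)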
Usage
Basic usage:
open Fehu
(* Create policy network *)
let policy_net = Kaun.Layer.sequential [
    Kaun.Layer.linear ~in_features:4 ~out_features:32 ();
    Kaun.Layer.relu ();
    Kaun.Layer.linear ~in_features:32 ~out_features:2 ();
  ] in
(* Initialize agent *)
let agent = Reinforce.create
    ~policy_network:policy_net
    ~n_actions:2
    ~rng:(Rune.Rng.key 42)
    Reinforce.{ default_config with learning_rate = 0.001 }
in
(* Train *)
let agent = Reinforce.learn agent ~env ~total_timesteps:100_000 () in
(* Use trained policy *)
let action = Reinforce.predict agent obs
With baseline (actor-critic):
let value_net = Kaun.Layer.sequential [...] in
let agent = Reinforce.create
    ~policy_network:policy_net
    ~baseline_network:value_net
    ~n_actions:2
    ~rng:(Rune.Rng.key 42)
    Reinforce.{ default_config with use_baseline = true }
in
let agent = Reinforce.learn agent ~env ~total_timesteps:100_000 ()
create ~policy_network ~baseline_network ~n_actions ~rng config creates a REINFORCE agent.
Parameters:
policy_network: Kaun network that outputs action logits
baseline_network: Optional value network (required if config.use_baseline = true)
n_actions: Number of discrete actions
rng: Random number generator for initialization
config: Algorithm configuration
The policy network should take observations and output logits of shape (batch_size, n_actions). The baseline network (if provided) should output values of shape (batch_size, 1).
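For example, a baseline network satisfying this shape contract can be built the same way as the policy network. The sketch below assumes 4-dimensional observations and a hidden width of 32, matching the usage example above:

let value_net = Kaun.Layer.sequential [
    Kaun.Layer.linear ~in_features:4 ~out_features:32 ();
    Kaun.Layer.relu ();
    (* One output per observation: values of shape (batch_size, 1) *)
    Kaun.Layer.linear ~in_features:32 ~out_features:1 ();
  ]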
update agent trajectory performs one REINFORCE update.
Computes returns from trajectory rewards, calculates policy gradients, and updates both policy and baseline (if present). Returns updated agent and training metrics.
The trajectory should contain one complete episode.
learn agent ~env ~total_timesteps ~callback () trains the agent.
Repeatedly collects episodes and performs updates until total_timesteps environment steps have been executed. The callback is called after each episode update with the iteration number and metrics. If the callback returns false, training stops early.
Time complexity: O(total_timesteps × network_forward_time).
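For instance, a callback can log progress and stop training early. The sketch below assumes the callback receives the iteration number and metrics as positional arguments; the exact signature and metric fields are assumptions to check against the Fehu API:

let callback iteration _metrics =
  Printf.printf "finished update %d\n%!" iteration;
  true  (* return false to stop training early *)
in
let agent = Reinforce.learn agent ~env ~total_timesteps:100_000 ~callback ()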