Rewards: Negative absolute position (-|position|), encouraging the agent to stay near the origin. Terminal states at the boundaries yield reward -10.0 (see the sketch after this list).
Episode Termination:
Terminated: Agent reaches position -10.0 or +10.0 (boundaries)
Truncated: Episode exceeds 200 steps
Rendering: ASCII visualization showing agent position ('o') on a line.
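The reward and termination rules above are simple enough to state directly. The following is a minimal sketch of that logic in plain OCaml, for illustration only; it is not the library's own implementation:

(* Sketch of the reward and termination rules described above
   (illustrative only, not the library's code). *)
let boundary = 10.0
let max_steps = 200

let reward position =
  if Float.abs position >= boundary then -10.0   (* terminal boundary penalty *)
  else -. (Float.abs position)                   (* -|position| otherwise *)

let terminated position = Float.abs position >= boundary
let truncated step_count = step_count > max_steps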
Example
Run a simple heuristic policy that steps back toward the origin:
let rng = Rune.Rng.create () in
let env = Fehu_envs.Random_walk.make ~rng () in
let obs = ref (fst (Fehu.Env.reset env ())) in
for _ = 1 to 100 do
  (* Heuristic policy: step back toward the origin.
     Assumes action 0 moves left and action 1 moves right. *)
  let position = (Rune.to_array !obs).(0) in
  let action = if position > 0.0 then 0 else 1 in
  let t = Fehu.Env.step env action in
  obs := t.observation;
  Printf.printf "Position: %.2f, Reward: %.2f\n"
    (Rune.to_array t.observation).(0) t.reward
done
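The loop above ignores episode boundaries. Below is a variant that resets the environment when an episode ends; it assumes the step result also carries the terminated and truncated flags described under Episode Termination:

(* Same loop, but reset when the environment signals the end of an episode.
   Assumes the transition record exposes terminated/truncated flags. *)
let rng = Rune.Rng.create () in
let env = Fehu_envs.Random_walk.make ~rng () in
let obs = ref (fst (Fehu.Env.reset env ())) in
for _ = 1 to 100 do
  let position = (Rune.to_array !obs).(0) in
  let action = if position > 0.0 then 0 else 1 in
  let t = Fehu.Env.step env action in
  obs :=
    if t.terminated || t.truncated then fst (Fehu.Env.reset env ())
    else t.observation
done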
Tips
The environment is deterministic given the action sequence
Optimal policy alternates actions to minimize distance from origin
Good for testing value function approximation with continuous states
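As a concrete instance of the last tip, here is a minimal sketch of TD(0) with a linear value function over the scalar position. It reuses the same assumed environment API as the example above; the features, iteration count, and hyperparameters (alpha, gamma) are illustrative choices, not part of the library:

(* TD(0) with a linear value function v(s) = w.(0) +. w.(1) *. |s|.
   Illustrative sketch using the same assumed API as the example above. *)
let rng = Rune.Rng.create () in
let env = Fehu_envs.Random_walk.make ~rng () in
let w = [| 0.0; 0.0 |] in                       (* weights for features [1.; |pos|] *)
let alpha = 0.01 and gamma = 0.99 in
let features pos = [| 1.0; Float.abs pos |] in
let value pos =
  let f = features pos in
  (w.(0) *. f.(0)) +. (w.(1) *. f.(1))
in
let obs = ref (fst (Fehu.Env.reset env ())) in
for _ = 1 to 10_000 do
  let pos = (Rune.to_array !obs).(0) in
  let action = if pos > 0.0 then 0 else 1 in    (* fixed behaviour policy *)
  let t = Fehu.Env.step env action in
  let pos' = (Rune.to_array t.observation).(0) in
  (* Bootstrap from the next state unless the episode terminated. *)
  let target =
    if t.terminated then t.reward
    else t.reward +. (gamma *. value pos')
  in
  let delta = target -. value pos in
  let f = features pos in
  w.(0) <- w.(0) +. (alpha *. delta *. f.(0));
  w.(1) <- w.(1) +. (alpha *. delta *. f.(1));
  obs :=
    if t.terminated || t.truncated then fst (Fehu.Env.reset env ())
    else t.observation
done;
Printf.printf "v(0) = %.3f  v(5) = %.3f\n" (value 0.0) (value 5.0)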