package obandit

You can search for identifiers within the package.

in-package search v0.2.0

On This Page

Bandit Modules
Examples
References

package obandit

obandit
- Obandit
  - AlphaPhiUCBParam
  - AlphaUCBParam
  - Bandit
  - DecayingEpsilonGreedyParam
  - EpsilonGreedyParam
  - FixedExp3Param
  - HorizonExp3Param
  - KBanditParam
  - MakeAlphaPhiUCB
    
    P
  - MakeAlphaUCB
    
    P
  - MakeDecayingEpsilonGreedy
    
    P
  - MakeDecayingExp3
    
    P
  - MakeEpsilonGreedy
    
    P
  - MakeExp3
    
    P
  - MakeFixedExp3
    
    P
  - MakeHorizonExp3
    
    P
  - MakeParametrizableEpsilonGreedy
    
    P
  - MakeUCB1
    
    P
  - RangeParam
  - RangedBandit
  - RateBanditParam
  - WrapRange
    
    B
    
    R
  - WrapRange01
    
    B

Legend:
Library
Module
Module type
Parameter
Class
Class type

Ocaml Multi-Armed Bandits

Version 0.2 — homepage

Bandit Modules

This library implements multi-armed bandits as modules. A bandit module is obtained by calling a functor with a bandit module parameter. The parameter usually gives the number $K$ of arms and the hyperparameters of the bandit.

module type Bandit = sig ... end

A bandit algorithm.

$$ $$

Bandit modules are instanciated using functors. Depending on the algorithm type used, the module parameter varies.

For instance, the UCB1 bandit module for 3 arms is obtained with:

module UCB1 = let module P = struct let k=3 end in MakeUCB1(P)

The following algorithms are available:

The UCB family of algorithms.

We use the viewpoint of the survey [1].

MakeAlphaPhiUCB: $(\alpha,\psi)$-UCB

At time $t$, the $(\alpha,\psi)$-UCB algorithm[1] is taking action:

$$ \argmax_{i=1,\cdots,K} \quad \left[ \widehat{\mu_i}+(\psi^{*})^{-1}\left(\frac{\alpha\ln t}{T_i} \right) \right] $$

where $\alpha > 0$ is a hyperparameter, $\widehat{\mu_i}$ is the estimate of the average reward of arm $i$, $T_i$ is the number of times arm $ i$ was visited so far and $\psi^*$ denotes the Legendre-Fenchel transform of a convex function $\psi$.

The pseudo-regret $\bar{R_n}$ has the following bound at round $n$: $$ \bar{R_n} \leq \sum_{i:\Delta_i > 0} \left( \frac{\alpha \Delta_i}{\psi^* (\Delta_i / 2 )} \ln n + \frac{\alpha }{\alpha-2 } \right) $$

where $ \Delta_i = \mu^* - \mu_i $ is the suboptimality parameter of arm $ i$.

type banditEstimates = {

t : int;
(*
The index of the step
*)
a : int;
(*
The last action taken.
*)
nVisits : int list;
(*
The visit counts by arm.
*)
u : float list;
(*
The cumulative arm reward observations.
*)

}

The inner state of a bandit that maintains estimates of arm means.

module type AlphaPhiUCBParam = sig ... end

Use to instanciate a Bandit from MakeAlphaPhiUCB by giving $\alpha$ and $\phi$.

module MakeAlphaPhiUCB
  (P : AlphaPhiUCBParam) : 
  Bandit with type bandit = banditEstimates

The $(\alpha,\psi)$-UCB Bandit for stochastic regret minimization described in [1].

MakeAlphaUCB: $\alpha$-UCB

The $\alpha$-UCB algorithm[5] uses $ \psi(\lambda) = \lambda^2 / 8 $. It chooses the action:

$$ \argmax_{i=1,\cdots,K} \left[ \widehat{\mu_i} + \sqrt{ \frac{\alpha \ln t}{2 T_i} } \right] $$

This gives a pseudo-regret bound of:

$$ \bar{ R_n} \leq \sum_{i:\Delta_i > 0} \left( \frac{2 \alpha} { \Delta_i } \ln n + \frac{\alpha}{\alpha - 2} \right) $$

module type AlphaUCBParam = sig ... end

Use to instanciate a Bandit from MakeAlphaUCB by giving $\alpha$.

module MakeAlphaUCB
  (P : AlphaUCBParam) : 
  Bandit with type bandit = banditEstimates

The $\alpha$-UCB Bandit for stochastic regret minimization described in [1] .

MakeUCB1: UCB1

The UCB1 algorithm[5] uses $\alpha = 4$. It chooses the action:

$$ \argmax_{i=1,\cdots,K} \left[ \widehat{\mu_i} + \sqrt{ \frac{2 \ln t}{T_i} } \right] $$

At round $ n$, this gives a pseudo-regret bound of:

$$ \bar{R_n} \leq \sum_{i:\Delta_i > 0} \left( \frac{8}{\Delta_i} \ln n + 2 \right) $$

module type KBanditParam = sig ... end

Use to instanciate a Bandit from MakeUCB1.

module MakeUCB1 (P : KBanditParam) : Bandit with type bandit = banditEstimates

The UCB1 Bandit for stochastic regret minimization .

The Epsilon-Greedy family of algorithms.

MakeParametrizableEpsilonGreedy: $\epsilon$-Greedy with a parametrizable rate.

At round $t$, the $ \epsilon_t$-Greedy algorithm[5] takes action $\argmax_{i=1,\cdots,K} \widehat{\mu_i} $ with probability $ 1-\epsilon_t$ and an uniformly random action with probability $ \epsilon_t$.

module type RateBanditParam = sig ... end

Use to instanciate algorithms that need a parametrizable rate.

module MakeParametrizableEpsilonGreedy
  (P : RateBanditParam) : 
  Bandit with type bandit = banditEstimates

The $\epsilon$-Greedy Bandit with a parametrizable exploration rate.

MakeDecayingEpsilonGreedy: $ \epsilon_n$-Greedy with the decaying rate from `[5]`.

This uses the exploration rate decay: $$ \epsilon_t = \min \left\{ 1, \frac{cK}{d^2 t} \right\} $$ where $ d > 0 $ should be taken as a tight lower bound on $ \max_{i=1,\cdots,K} \Delta_i$ and $ c > 0$ is a hyperparameter.

module type DecayingEpsilonGreedyParam = sig ... end

Use to instanciate a Bandit from MakeDecayingEpsilonGreedy .

module MakeDecayingEpsilonGreedy
  (P : DecayingEpsilonGreedyParam) : 
  Bandit with type bandit = banditEstimates

The Epsilon-Greedy Bandit with the decaying exploration rate from [5].

MakeEpsilonGreedy: $ \epsilon_n$-Greedy with a fixed exploration rate.

This uses a fixed exploration rate $ \epsilon$.

module type EpsilonGreedyParam = sig ... end

Use to instanciate a Bandit from MakeEpsilonGreedy .

module MakeEpsilonGreedy
  (P : EpsilonGreedyParam) : 
  Bandit with type bandit = banditEstimates

The Epsilon-Greedy Bandit with a fixed exploration rate.

The Exp3 family of algorithms.

MakeExp3: EXP3 with a parametrizable rate.

At round $ t$, the EXP3 algorithm[1] draws an arm from a probability distribution $ p$ and updates this distribution with a softmax operator:

$ p_{i,t+1} = \frac{\exp ( - \eta_t \widetilde{L_{i,t}})}{\sum{k=1}^{K}\text{exp}(-\eta_t \widetilde{L_{k,t}})} $

where $\widetilde{L_{i,t}}$ is the cumulative probability-normalized loss at time $ t$ of arm $i$, $\eta_t$ is the rate at time $t$.

type banditPolicy = {

t : int;
(*
The index of the step $t$.
*)
a : int;
(*
The last action taken.
*)
w : float list;
(*
The weights of the arm that define the policy.
*)

}

The internal state of an Exp3 bandit

module MakeExp3 (P : RateBanditParam) : Bandit with type bandit = banditPolicy

The Exp3 Bandit for adversarial regret minimization with a parametrizable learning rate.

MakeDecayingExp3: EXP3 with the decaying rate from `[1]`.

This variant uses the learning rate decay:

$$ \eta_t = \sqrt{\frac{ln K}{tK}} $$

and enjoys the pseudo-regret bound: $$ \bar{R_n} \leq 2 \sqrt{nK \ln K} $$

module MakeDecayingExp3
  (P : KBanditParam) : 
  Bandit with type bandit = banditPolicy

The Exp3 Bandit for adversarial regret minimization with a decaying learning rate as per [1].

MakeFixedExp3: EXP3 with a fixed rate.

This uses a fixed rate $\eta$.

module type FixedExp3Param = sig ... end

Use to instanciate a Bandit from MakeFixedExp3 .

module MakeFixedExp3
  (P : FixedExp3Param) : 
  Bandit with type bandit = banditPolicy

The Exp3 Bandit for adversarial regret minimization with a decaying learning rate as per [1].

MakeHorizonExp3: EXP3 with a known horizon `[1]`.

This variant optimizes for a known horizon $ n$ and uses the learning rate:

$$ \eta = \sqrt{\frac{2 ln K}{nK}} $$

It has the pseudo-regret bound:

$$ \bar{R_n} \leq \sqrt{2 nK \ln K} $$

module type HorizonExp3Param = sig ... end

Use to instanciate a Bandit from MakeHorizonExp3 .

module MakeHorizonExp3
  (P : HorizonExp3Param) : 
  Bandit with type bandit = banditPolicy

The Exp3 Bandit for adversarial regret minimization with a horizon-based learning rate as per [1].

More Functors: The doubling trick.

Reward normalization in online stochastic and/or adversarial learning is a hard problem. While this is well studied in online learning [2][3][4], there is no well studied procedure for bandits yet. The WrapRange Functors applies the heuristic solution known as the doubling trick.

The WrapRange functor wraps a bandit algorithm with the doubling trick. This heuristic allows to use a bandit algorithm without knowing the reward ranges. All rewards are linearly rescaled to a range (initially given by a RangeParam). When a value is observed above the range, the bandit algorithm is restarted and the range interval is doubled in that direction.

A convenience WrapRange01 is provided for rewards that are initially thought to lie in $\left[0,1\right]$.

module type RangeParam = sig ... end

A Reward range.

type rangedAction =

| Reset of int
| Action of int

A ranged action: Action a in normal course of action, Reset a in case * the bandit was just restarted.

type 'b rangedBandit = {

bandit : 'b;
(*
The original type of the bandit.
*)
u : float;
(*
The upper reward bound.
*)
l : float;
(*
The lower reward bound.
*)

}

The type of a bandit with a range.

module type RangedBandit = sig ... end

The type of a bandit with reward scaling.

module WrapRange
  (R : RangeParam)
  (B : Bandit) : 
  RangedBandit with type bandit = B.bandit

module WrapRange01 (B : Bandit) : RangedBandit with type bandit = B.bandit

The WrapRange01 functor is a convenience aliasing of WrapRange with an initial "standard" range of $ \left[ 0,1 \right] $.

Examples

see test/test.ml for an example of bandit use.

References

[1] Regret Analysis of Stochastic and Nonstochastic Multi-armed Bandit Problems, Sebastien Bubeck and Nicolo Cesa-Bianchi.

[2] Adaptive Subgradient methods for Online Learning and Stochastic Optimization, John Duchi , Elad Hazan and Yoram Singer.

[3] Normalized Online Learning, Stephane Ross, Paul Mineiro, John Langford

[4] Scale-Free Online Learning, Francesco Orabona, Dávid Pál

[5] Finite-time Analysis of the Multiarmed Bandit Problem, Peter Auer, Nicolo Cesa-Bianchi, Paul Fischer

package obandit

Bandit Modules

The UCB family of algorithms.

MakeAlphaPhiUCB: $(\alpha,\psi)$-UCB

MakeAlphaUCB: $\alpha$-UCB

MakeUCB1: UCB1

The Epsilon-Greedy family of algorithms.

MakeParametrizableEpsilonGreedy: $\epsilon$-Greedy with a parametrizable rate.

MakeDecayingEpsilonGreedy: $ \epsilon_n$-Greedy with the decaying rate from [5].

MakeEpsilonGreedy: $ \epsilon_n$-Greedy with a fixed exploration rate.

The Exp3 family of algorithms.

MakeExp3: EXP3 with a parametrizable rate.

MakeDecayingExp3: EXP3 with the decaying rate from [1].

MakeFixedExp3: EXP3 with a fixed rate.

MakeHorizonExp3: EXP3 with a known horizon [1].

More Functors: The doubling trick.

Examples

References

MakeDecayingEpsilonGreedy: $ \epsilon_n$-Greedy with the decaying rate from `[5]`.

MakeDecayingExp3: EXP3 with the decaying rate from `[1]`.

MakeHorizonExp3: EXP3 with a known horizon `[1]`.