Module Saga_tokenizers.Trainers

Training module for tokenization models.

type t

Main trainer type

type training_result = {
  model : Models.t;
  special_tokens : string list;
}

Training result

Training Configurations

val bpe : ?vocab_size:int -> ?min_frequency:int -> ?special_tokens:string list -> ?limit_alphabet:int -> ?initial_alphabet:string list -> ?continuing_subword_prefix:string -> ?end_of_word_suffix:string -> ?show_progress:bool -> ?max_token_length:int -> unit -> t

Create a BPE trainer.

  • parameter vocab_size

    Target vocabulary size (default: 30000)

  • parameter min_frequency

    Minimum frequency for tokens (default: 0)

  • parameter show_progress

    Show training progress (default: true)

  • parameter special_tokens

    List of special tokens (default: [])

  • parameter limit_alphabet

    Maximum alphabet size (default: 1000)

  • parameter initial_alphabet

    Initial alphabet (default: [])

  • parameter continuing_subword_prefix

    Prefix for continuing subwords (default: None)

  • parameter end_of_word_suffix

    Suffix for end of word (default: None)
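
Because every option is an optional labelled argument and the last argument is unit, a trainer is built by passing only the options that differ from the defaults. The following is a minimal sketch, assuming the module is reachable as Saga_tokenizers.Trainers; the special-token strings and numeric values are illustrative choices, not library requirements.

```ocaml
open Saga_tokenizers

(* A BPE trainer that overrides a few defaults.
   The token strings and sizes are example values (assumptions). *)
let trainer : Trainers.t =
  Trainers.bpe
    ~vocab_size:8000
    ~min_frequency:2
    ~special_tokens:[ "<unk>"; "<pad>" ]
    ~show_progress:false
    ()  (* the trailing unit applies the remaining defaults *)
```

The other constructors (wordpiece, word_level, unigram, chars) follow the same calling pattern with their own option sets.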

val wordpiece : ?vocab_size:int -> ?min_frequency:int -> ?special_tokens:string list -> ?limit_alphabet:int -> ?initial_alphabet:string list -> ?continuing_subword_prefix:string -> ?end_of_word_suffix:string -> ?unk_token:string -> ?show_progress:bool -> unit -> t

Create a WordPiece trainer.

  • parameter vocab_size

    Target vocabulary size (default: 30000)

  • parameter min_frequency

    Minimum frequency for tokens (default: 0)

  • parameter show_progress

    Show training progress (default: true)

  • parameter special_tokens

    List of special tokens (default: [])

  • parameter limit_alphabet

    Maximum alphabet size (default: 1000)

  • parameter initial_alphabet

    Initial alphabet (default: [])

  • parameter continuing_subword_prefix

    Prefix for continuing subwords (default: "##")

val word_level : ?vocab_size:int -> ?min_frequency:int -> ?special_tokens:string list -> ?show_progress:bool -> unit -> t

Create a WordLevel trainer.

  • parameter vocab_size

    Target vocabulary size (default: 30000)

  • parameter min_frequency

    Minimum frequency for tokens (default: 0)

  • parameter show_progress

    Show training progress (default: true)

  • parameter special_tokens

    List of special tokens (default: [])

val unigram : ?vocab_size:int -> ?n_sub_iterations:int -> ?shrinking_factor:float -> ?unk_token:string -> ?special_tokens:string list -> ?show_progress:bool -> ?initial_alphabet:string list -> ?max_piece_length:int -> unit -> t

Create a Unigram trainer.

  • parameter vocab_size

    Target vocabulary size (default: 8000)

  • parameter show_progress

    Show training progress (default: true)

  • parameter special_tokens

    List of special tokens (default: [])

  • parameter shrinking_factor

    Shrinking factor (default: 0.75)

  • parameter unk_token

    Unknown token (default: None)

  • parameter max_piece_length

    Maximum piece length (default: 16)

  • parameter n_sub_iterations

    Number of sub-iterations (default: 2)

val chars : ?min_frequency:int -> ?special_tokens:string list -> ?show_progress:bool -> unit -> t

Create a character-level trainer.

  • parameter min_frequency

    Minimum frequency for characters (default: 0)

  • parameter special_tokens

    List of special tokens (default: [])

  • parameter show_progress

    Show training progress (default: true)

Training Operations

val train : t -> files:string list -> ?model:Models.t -> unit -> training_result

Train a model on the given files.

  • parameter t

    The trainer configuration

  • parameter files

    List of training files

  • parameter model

    Optional existing model to continue training from

  • returns

    The trained model and special tokens
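
As a rough illustration of the signature above, the sketch below trains a WordLevel model on one file and reads both fields of the result. It assumes the module path Saga_tokenizers.Trainers; "corpus.txt" is a placeholder path and "<unk>" is an example token, neither taken from the library.

```ocaml
open Saga_tokenizers

let () =
  let trainer =
    Trainers.word_level ~vocab_size:5000 ~special_tokens:[ "<unk>" ] ()
  in
  (* "corpus.txt" is a hypothetical training file. *)
  let result = Trainers.train trainer ~files:[ "corpus.txt" ] () in
  (* The record carries the trained model and the special tokens. *)
  let _model : Models.t = result.Trainers.model in
  Printf.printf "special tokens: %s\n"
    (String.concat ", " result.Trainers.special_tokens)
```

To continue from an existing model, the optional ~model argument would be supplied before the final unit.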

val train_from_iterator : t -> iterator:(unit -> string option) -> ?model:Models.t -> unit -> training_result

Train a model from an iterator.

  • parameter t

    The trainer configuration

  • parameter iterator

    Function that returns next line or None

  • parameter model

    Optional existing model to continue training from

  • returns

    The trained model and special tokens
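
The iterator is a plain closure of type unit -> string option that yields Some line until the input is exhausted, then None. The helpers below are a standard-library-only sketch of two ways to build one; the names iterator_of_list and iterator_of_channel are hypothetical and not part of this module.

```ocaml
(* Iterate over an in-memory list of lines.
   Returns None once the list is exhausted, and on every later call. *)
let iterator_of_list (lines : string list) : unit -> string option =
  let remaining = ref lines in
  fun () ->
    match !remaining with
    | [] -> None
    | line :: rest ->
        remaining := rest;
        Some line

(* Iterate over an input channel one line at a time.
   End_of_file is mapped to None. *)
let iterator_of_channel (ic : in_channel) : unit -> string option =
 fun () -> try Some (input_line ic) with End_of_file -> None
```

Such a closure could then be passed as Trainers.train_from_iterator trainer ~iterator:(iterator_of_list lines) (), which avoids writing the corpus to disk first.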

Serialization

val to_json : t -> Yojson.Basic.t
val of_json : Yojson.Basic.t -> t
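
The two functions suggest a round trip through Yojson.Basic.t. The sketch below assumes the module path Saga_tokenizers.Trainers and that of_json accepts whatever to_json produces; the page does not describe the JSON layout or how of_json reacts to malformed input, so neither is relied on here.

```ocaml
open Saga_tokenizers

let () =
  let trainer = Trainers.chars ~min_frequency:1 () in
  (* Serialize the trainer configuration to a JSON value. *)
  let json : Yojson.Basic.t = Trainers.to_json trainer in
  print_endline (Yojson.Basic.pretty_to_string json);
  (* Rebuild a trainer from the same JSON value. *)
  let _restored : Trainers.t = Trainers.of_json json in
  ()
```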