package saga


Module Saga_tokenizers.Bpe

Byte Pair Encoding (BPE) tokenization module

Core Types

type t

BPE model

type vocab = (string, int) Hashtbl.t

Vocabulary mapping tokens to indices

type merges = (string * string) list

List of merge operations
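
As a rough illustration of how a vocabulary and a merge list work together, the standalone sketch below applies merges to a single word using only the standard library. It assumes the usual BPE convention that a merge's position in the list is its priority (earlier merges apply first) and that the word is split into single bytes to start; neither detail is stated on this page, and the helper names (rank_table, best_pair, merge_pair, bpe_word) are hypothetical, not part of this module.

```ocaml
(* Standalone sketch: these aliases mirror the types declared above. *)
type vocab = (string, int) Hashtbl.t
type merges = (string * string) list

(* Assumption: rank of a merge = its position in the list; lower applies first. *)
let rank_table (merges : merges) : (string * string, int) Hashtbl.t =
  let tbl = Hashtbl.create 16 in
  List.iteri
    (fun i pair -> if not (Hashtbl.mem tbl pair) then Hashtbl.add tbl pair i)
    merges;
  tbl

(* Find the adjacent pair with the lowest rank, if any pair is mergeable. *)
let best_pair ranks symbols =
  let rec go best = function
    | a :: (b :: _ as rest) ->
      let best =
        match Hashtbl.find_opt ranks (a, b), best with
        | Some r, Some (r', _) when r < r' -> Some (r, (a, b))
        | Some r, None -> Some (r, (a, b))
        | _ -> best
      in
      go best rest
    | _ -> best
  in
  Option.map snd (go None symbols)

(* Replace every adjacent occurrence of (a, b) with the fused symbol a ^ b. *)
let rec merge_pair (a, b) = function
  | x :: y :: rest when x = a && y = b -> (a ^ b) :: merge_pair (a, b) rest
  | x :: rest -> x :: merge_pair (a, b) rest
  | [] -> []

(* Keep merging until no adjacent pair appears in the merge list. *)
let bpe_word (merges : merges) (word : string) : string list =
  let ranks = rank_table merges in
  let rec loop symbols =
    match best_pair ranks symbols with
    | None -> symbols
    | Some pair -> loop (merge_pair pair symbols)
  in
  loop (List.init (String.length word) (fun i -> String.make 1 word.[i]))

let () =
  let merges = [ ("l", "o"); ("lo", "w"); ("e", "r") ] in
  let vocab : vocab = Hashtbl.create 8 in
  List.iteri
    (fun i tok -> Hashtbl.replace vocab tok i)
    [ "low"; "er"; "l"; "o"; "w"; "e"; "r" ];
  (* "lower" becomes the pieces "low" and "er" under these merges. *)
  bpe_word merges "lower"
  |> List.iter (fun piece ->
         match Hashtbl.find_opt vocab piece with
         | Some id -> Printf.printf "%s -> %d\n" piece id
         | None -> Printf.printf "%s -> <not in vocab>\n" piece)
```

The sketch splits on bytes, so it is only meaningful for ASCII input; a real tokenizer would also handle pre-tokenization, prefixes, suffixes, and unknown symbols.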

type config = {
  vocab : vocab;
  merges : merges;
  cache_capacity : int;
  dropout : float option;
  unk_token : string option;
  continuing_subword_prefix : string option;
  end_of_word_suffix : string option;
  fuse_unk : bool;
  byte_fallback : bool;
  ignore_merges : bool;
}

BPE configuration

Model Creation

val create : config -> t

create config creates a new BPE model with the given configuration
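
A minimal usage sketch, assuming the library is installed and the module is reachable as Saga_tokenizers.Bpe. It builds a toy vocabulary and merge list in memory, fills in every config field declared above, and passes the record to create. The field values (cache size, disabled options) are illustrative choices, not documented defaults, and tokenize is the function declared under Tokenization.

```ocaml
(* Usage sketch: requires the saga library; values are illustrative only. *)
let () =
  let open Saga_tokenizers.Bpe in
  (* Toy vocabulary: every symbol that can appear while merging "low". *)
  let vocab : vocab = Hashtbl.create 8 in
  List.iteri
    (fun i tok -> Hashtbl.replace vocab tok i)
    [ "l"; "o"; "w"; "lo"; "low" ];
  let config =
    {
      vocab;
      merges = [ ("l", "o"); ("lo", "w") ];
      cache_capacity = 1000;
      dropout = None;
      unk_token = None;
      continuing_subword_prefix = None;
      end_of_word_suffix = None;
      fuse_unk = false;
      byte_fallback = false;
      ignore_merges = false;
    }
  in
  let model = create config in
  (* Print each token's ID, value, and offsets. *)
  tokenize model "low"
  |> List.iter (fun tok ->
         let s, e = tok.offsets in
         Printf.printf "%d %S (%d, %d)\n" tok.id tok.value s e)
```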

val from_files : vocab_file:string -> merges_file:string -> t

from_files ~vocab_file ~merges_file loads a BPE model from vocab.json and merges.txt files
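
The page does not describe the file formats. The sketch below assumes the widely used GPT-2 style layout for merges.txt: an optional header line starting with "#version", then one merge per line with the two symbols separated by a single space. Whether this module expects exactly that layout is an assumption, and parse_merge_line and parse_merges are hypothetical helpers, not part of the API.

```ocaml
(* Mirrors the merges type declared above. *)
type merges = (string * string) list

(* Parse one line. Blank lines, the "#version" header, and lines without a
   space yield None. Assumes the GPT-2 style layout described in the lead-in. *)
let parse_merge_line (line : string) : (string * string) option =
  let line = String.trim line in
  let is_header =
    String.length line >= 8 && String.sub line 0 8 = "#version"
  in
  if line = "" || is_header then None
  else
    match String.index_opt line ' ' with
    | Some i ->
      let left = String.sub line 0 i in
      let right = String.sub line (i + 1) (String.length line - i - 1) in
      Some (left, right)
    | None -> None

(* Parse the whole file contents, keeping merges in file order. *)
let parse_merges (text : string) : merges =
  String.split_on_char '\n' text |> List.filter_map parse_merge_line

let () =
  parse_merges "#version: 0.2\nl o\nlo w\n"
  |> List.iter (fun (a, b) -> Printf.printf "%s + %s\n" a b)
```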

val default : unit -> t

default () creates a BPE model with default configuration

Configuration Builder

module Builder : sig ... end

Tokenization

type token = {
  id : int;
  value : string;
  offsets : int * int;
}

Token with ID, string value, and character offsets
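
To make the offsets field concrete, here is a standalone sketch that lays out consecutive pieces of a word and records where each one sits in the input. It assumes offsets are half-open (start, end) pairs and counts bytes, which matches character positions only for ASCII text; the page does not state either convention. The with_offsets helper and the IDs used are hypothetical.

```ocaml
(* Mirrors the token record declared above. *)
type token = { id : int; value : string; offsets : int * int }

(* Lay out consecutive (piece, id) pairs starting at byte [start].
   Assumption: offsets are half-open, i.e. (start, end_exclusive). *)
let with_offsets ~(start : int) (pieces : (string * int) list) : token list =
  let _, rev =
    List.fold_left
      (fun (pos, acc) (value, id) ->
        let stop = pos + String.length value in
        (stop, { id; value; offsets = (pos, stop) } :: acc))
      (start, []) pieces
  in
  List.rev rev

let () =
  let text = "hello world" in
  (* "world" starts at byte 6; the IDs 7 and 9 are made up for the example. *)
  with_offsets ~start:6 [ ("wor", 7); ("ld", 9) ]
  |> List.iter (fun { id; value; offsets = (s, e) } ->
         (* With half-open offsets, slicing the input recovers the token. *)
         assert (String.sub text s (e - s) = value);
         Printf.printf "%d %S (%d, %d)\n" id value s e)
```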

val tokenize : t -> string -> token list

tokenize model text tokenizes text into tokens

val token_to_id : t -> string -> int option

token_to_id model token returns the ID for token, or None if the token is not in the vocabulary

val id_to_token : t -> int -> string option

id_to_token model id returns the token for id, or None if no token has that ID
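
The two lookups are inverses of each other. One plausible way to support both directions, sketched here with the standard library only, is to keep the vocab table for token-to-ID lookups and build a reverse table once for ID-to-token lookups. This is an illustration of the idea, not a description of how the module stores its data; reverse_vocab, lookup_id, and lookup_token are hypothetical names that operate on plain tables rather than on a model of type t.

```ocaml
(* Mirrors the vocab type declared above. *)
type vocab = (string, int) Hashtbl.t

(* Build the reverse index (id -> token) once so both directions are cheap. *)
let reverse_vocab (vocab : vocab) : (int, string) Hashtbl.t =
  let rev = Hashtbl.create (Hashtbl.length vocab) in
  Hashtbl.iter (fun tok id -> Hashtbl.replace rev id tok) vocab;
  rev

(* Both lookups return None for a missing key instead of raising. *)
let lookup_id (vocab : vocab) (tok : string) : int option =
  Hashtbl.find_opt vocab tok

let lookup_token (rev : (int, string) Hashtbl.t) (id : int) : string option =
  Hashtbl.find_opt rev id

let () =
  let vocab : vocab = Hashtbl.create 4 in
  List.iter
    (fun (tok, id) -> Hashtbl.replace vocab tok id)
    [ ("low", 0); ("er", 1) ];
  let rev = reverse_vocab vocab in
  (* Round trip: token -> id -> token gives back the original token. *)
  match lookup_id vocab "er" with
  | Some id -> assert (lookup_token rev id = Some "er")
  | None -> assert false
```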

Vocabulary Management

val get_vocab : t -> (string * int) list

get_vocab model returns the vocabulary as a list of (token, id) pairs

val get_vocab_size : t -> int

get_vocab_size model returns the size of the vocabulary

val get_unk_token : t -> string option

get_unk_token model returns the unknown token if configured

val get_continuing_subword_prefix : t -> string option

get_continuing_subword_prefix model returns the continuing subword prefix if configured

val get_end_of_word_suffix : t -> string option

get_end_of_word_suffix model returns the end-of-word suffix if configured

Cache Management

val clear_cache : t -> unit

clear_cache model clears the tokenization cache

val resize_cache : t -> int -> unit

resize_cache model capacity sets the capacity of the tokenization cache to capacity
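
The page does not say how the cache behaves when it is full or what happens to existing entries on resize. The standalone sketch below shows one simple policy for a capacity-bounded memo table: new entries are ignored once the table is full, and shrinking below the current size drops all entries. Both choices are assumptions made for illustration, and every name in the block is hypothetical.

```ocaml
(* A capacity-bounded memo table (illustrative policy, not the module's). *)
type ('k, 'v) cache = { mutable capacity : int; table : ('k, 'v) Hashtbl.t }

let make_cache capacity = { capacity; table = Hashtbl.create 16 }

let find cache key = Hashtbl.find_opt cache.table key

(* Existing keys are updated; new keys are stored only while below capacity. *)
let add cache key value =
  if Hashtbl.mem cache.table key
     || Hashtbl.length cache.table < cache.capacity
  then Hashtbl.replace cache.table key value

(* Drop every cached entry. *)
let clear cache = Hashtbl.reset cache.table

(* Change the capacity; if the table no longer fits, drop all entries. *)
let resize cache capacity =
  cache.capacity <- capacity;
  if Hashtbl.length cache.table > capacity then clear cache

(* Return the cached result for [key], computing and storing it on a miss. *)
let memo cache f key =
  match find cache key with
  | Some v -> v
  | None ->
    let v = f key in
    add cache key v;
    v

let () =
  let cache = make_cache 2 in
  let calls = ref 0 in
  (* Stand-in for an expensive tokenization of one word. *)
  let slow_length s = incr calls; String.length s in
  ignore (memo cache slow_length "ab");
  ignore (memo cache slow_length "ab");
  (* The second call is served from the cache. *)
  Printf.printf "calls after two lookups: %d\n" !calls
```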

Serialization

val save : t -> path:string -> ?name:string -> unit -> unit

save model ~path ?name () saves the model to vocab.json and merges.txt files

val read_files : vocab_file:string -> merges_file:string -> vocab * merges

read_files ~vocab_file ~merges_file reads vocabulary and merges from files

Training

module Trainer : sig ... end