package saga

Text processing and NLP extensions for Nx



Module Saga_tokenizers.Bpe

Byte Pair Encoding (BPE) tokenization module

Core Types

type t

BPE model

type vocab = (string, int) Hashtbl.t

Vocabulary mapping tokens to indices

type merges = (string * string) list

List of merge operations

type config = {
  vocab : vocab;
  merges : merges;
  cache_capacity : int;
  dropout : float option;
  unk_token : string option;
  continuing_subword_prefix : string option;
  end_of_word_suffix : string option;
  fuse_unk : bool;
  byte_fallback : bool;
  ignore_merges : bool;
}

BPE configuration

Model Creation

val create : config -> t

create config creates a new BPE model with the given configuration
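
For illustration, a minimal sketch of building a config by hand and creating a model from it. The toy vocabulary and merges are illustrative only, not a trained tokenizer.

module Bpe = Saga_tokenizers.Bpe

(* Toy vocabulary mapping tokens to ids; a real vocabulary would come
   from training or from a vocab.json file. *)
let vocab : Bpe.vocab = Hashtbl.create 16

let () =
  List.iteri (fun i tok -> Hashtbl.add vocab tok i)
    [ "l"; "o"; "w"; "e"; "r"; "lo"; "low"; "er" ]

let config : Bpe.config =
  Bpe.
    {
      vocab;
      merges = [ ("l", "o"); ("lo", "w"); ("e", "r") ];
      cache_capacity = 10_000;
      dropout = None;
      unk_token = Some "<unk>";
      continuing_subword_prefix = None;
      end_of_word_suffix = None;
      fuse_unk = false;
      byte_fallback = false;
      ignore_merges = false;
    }

let model = Bpe.create config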

val from_files : vocab_file:string -> merges_file:string -> t

from_files ~vocab_file ~merges_file loads a BPE model from vocab.json and merges.txt files
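
A sketch of loading a pre-trained tokenizer distributed as vocab.json and merges.txt; the file paths below are placeholders.

module Bpe = Saga_tokenizers.Bpe

(* Point these at a real vocab.json / merges.txt pair. *)
let model =
  Bpe.from_files ~vocab_file:"path/to/vocab.json"
    ~merges_file:"path/to/merges.txt"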

val default : unit -> t

default () creates a BPE model with default configuration

Configuration Builder

module Builder : sig ... end

Tokenization

type token = {
  id : int;
  value : string;
  offsets : int * int;
}

Token with ID, string value, and character offsets

val tokenize : t -> string -> token list

tokenize model text tokenizes text into tokens
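
A sketch of tokenizing a string and inspecting the resulting token records (id, value, and character offsets). Any Bpe.t works here; default () is used only to keep the sketch standalone.

module Bpe = Saga_tokenizers.Bpe

let () =
  let model = Bpe.default () in
  let tokens = Bpe.tokenize model "lower" in
  List.iter
    (fun { Bpe.id; value; offsets = (start_, stop) } ->
      Printf.printf "%d\t%s\t(%d, %d)\n" id value start_ stop)
    tokens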

val token_to_id : t -> string -> int option

token_to_id model token returns the ID for a token

val id_to_token : t -> int -> string option

id_to_token model id returns the token for an ID
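
A round-trip sketch between tokens and ids; both lookups return options because a token or id may be absent from the vocabulary. The helper name describe is illustrative, and model is any Bpe.t built as in the sketches above.

module Bpe = Saga_tokenizers.Bpe

let describe model =
  match Bpe.token_to_id model "low" with
  | None -> print_endline {|"low" is not in the vocabulary|}
  | Some id ->
      let back =
        Option.value (Bpe.id_to_token model id) ~default:"<missing>"
      in
      Printf.printf "low -> %d -> %s\n" id back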

Vocabulary Management

val get_vocab : t -> (string * int) list

get_vocab model returns the vocabulary as a list of (token, id) pairs

val get_vocab_size : t -> int

get_vocab_size model returns the size of the vocabulary
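
A small sketch that prints the vocabulary sorted by id, followed by its size; the helper name dump_vocab is illustrative.

module Bpe = Saga_tokenizers.Bpe

let dump_vocab model =
  Bpe.get_vocab model
  |> List.sort (fun (_, a) (_, b) -> compare a b)
  |> List.iter (fun (tok, id) -> Printf.printf "%5d  %s\n" id tok);
  Printf.printf "vocab size: %d\n" (Bpe.get_vocab_size model)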

val get_unk_token : t -> string option

get_unk_token model returns the unknown token if configured

val get_continuing_subword_prefix : t -> string option

get_continuing_subword_prefix model returns the continuing subword prefix if configured

val get_end_of_word_suffix : t -> string option

get_end_of_word_suffix model returns the end-of-word suffix if configured

Cache Management

val clear_cache : t -> unit

clear_cache model clears the tokenization cache

val resize_cache : t -> int -> unit

resize_cache model capacity resizes the cache
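
A sketch of both cache operations; the two calls are independent and are shown together only for illustration, and the capacity of 1_000 is arbitrary.

module Bpe = Saga_tokenizers.Bpe

let trim_cache model =
  Bpe.resize_cache model 1_000;
  Bpe.clear_cache model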

Serialization

val save : t -> path:string -> ?name:string -> unit -> unit

save model ~path ?name () saves the model to vocab.json and merges.txt files
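
A sketch of saving a model; the output directory is a placeholder, and the assumption that ?name acts as a prefix for the written file names is not confirmed by this page.

module Bpe = Saga_tokenizers.Bpe

let save_model model =
  (* ~name is optional; omit it to use the default file names. *)
  Bpe.save model ~path:"./bpe-out" ~name:"my_bpe" ()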

val read_files : vocab_file:string -> merges_file:string -> vocab * merges

read_files ~vocab_file ~merges_file reads vocabulary and merges from files
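
A sketch that reads the raw vocabulary and merges and then builds a model with an explicit configuration (here enabling byte_fallback) instead of calling from_files directly; the file paths are placeholders.

module Bpe = Saga_tokenizers.Bpe

let model_with_byte_fallback =
  let vocab, merges =
    Bpe.read_files ~vocab_file:"path/to/vocab.json"
      ~merges_file:"path/to/merges.txt"
  in
  let config : Bpe.config =
    Bpe.
      {
        vocab;
        merges;
        cache_capacity = 10_000;
        dropout = None;
        unk_token = Some "<unk>";
        continuing_subword_prefix = None;
        end_of_word_suffix = None;
        fuse_unk = false;
        byte_fallback = true;
        ignore_merges = false;
      }
  in
  Bpe.create config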

Training

module Trainer : sig ... end