Module Saga_tokenizers.Tokenizer
normalizer tokenizer retrieves the configured normalizer.
Returns None if no normalizer is set. The normalizer is applied before all other processing stages to clean and normalize text.
with_normalizer tokenizer norm replaces the tokenizer's normalizer.
Pass None to remove the normalization step entirely. Pass Some norm to install a new normalizer. Returns updated tokenizer.
let tokenizer = Tokenizer.bpe () in
let tokenizer = Tokenizer.with_normalizer tokenizer
  (Some (Normalizers.sequence [
    Normalizers.nfd ();
    Normalizers.lowercase ();
    Normalizers.strip_accents ();
  ]))
pre_tokenizer tokenizer retrieves the configured pre-tokenizer.
Returns None if no pre-tokenizer is set. The pre-tokenizer splits text into pieces before vocabulary-based encoding.
with_pre_tokenizer tokenizer pre replaces the tokenizer's pre-tokenizer.
Pass None to remove pre-tokenization (text processed as-is). Pass Some pre to install a new pre-tokenizer. Returns updated tokenizer.
let tokenizer = Tokenizer.bpe () in
let tokenizer = Tokenizer.with_pre_tokenizer tokenizer
  (Some (Pre_tokenizers.byte_level ~add_prefix_space:true ()))
post_processor tokenizer retrieves the configured post-processor.
Returns None if no post-processor is set. The post-processor adds special tokens and sets type IDs after encoding.
with_post_processor tokenizer post replaces the tokenizer's post-processor.
Pass None to remove post-processing. Pass Some post to install a new post-processor. Returns updated tokenizer.
let tokenizer = Tokenizer.bpe () in
let tokenizer = Tokenizer.with_post_processor tokenizer
  (Some (Processors.bert_processing
    ~sep:("[SEP]", 102) ~cls:("[CLS]", 101) ()))
decoder tokenizer retrieves the configured decoder.
Returns None if no decoder is set. The decoder converts token IDs back to text.
with_decoder tokenizer dec replaces the tokenizer's decoder.
Pass None to use default decoding (concatenate tokens). Pass Some dec to install a new decoder. Returns updated tokenizer.
let tokenizer = Tokenizer.bpe () in
let tokenizer = Tokenizer.with_decoder tokenizer
  (Some (Decoders.byte_level ()))
with_specials tokenizer specials replaces the special tokens with the provided list.
add_specials tokenizer specials extends the set of special tokens.
Special Token Roles
These functions configure which token strings serve specific roles in the tokenizer (BOS, EOS, PAD, UNK). This follows HuggingFace's design where roles are separate from token properties.
bos_token tokenizer returns the beginning-of-sequence token string, if configured.
set_bos_token tokenizer token sets which token serves as beginning-of-sequence marker. Pass None to unset. The token should already be in the vocabulary.
eos_token tokenizer returns the end-of-sequence token string, if configured.
set_eos_token tokenizer token sets which token serves as end-of-sequence marker. Pass None to unset.
pad_token tokenizer returns the padding token string, if configured.
set_pad_token tokenizer token sets which token serves as padding marker. Pass None to unset.
unk_token tokenizer returns the unknown token string, if configured.
set_unk_token tokenizer token sets which token serves as unknown token marker. Pass None to unset.
vocab tokenizer returns the vocabulary as (token, id) pairs.
token_to_id tokenizer token maps a token string to its id.
id_to_token tokenizer id maps an id back to its token string.
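For example, a usage sketch that assumes vocab returns an association list and token_to_id returns an option (neither is guaranteed by the summaries above):
let size = List.length (Tokenizer.vocab tokenizer) in
Printf.printf "vocabulary size: %d\n" size;
match Tokenizer.token_to_id tokenizer "hello" with
| Some id -> Printf.printf "hello -> %d\n" id
| None -> print_endline "hello is not in the vocabulary"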
val bpe :
?normalizer:Normalizers.t ->
?pre:Pre_tokenizers.t ->
?post:Processors.t ->
?decoder:Decoders.t ->
?specials:special list ->
?bos_token:string ->
?eos_token:string ->
?pad_token:string ->
?unk_token:string ->
?vocab:(string * int) list ->
?merges:(string * string) list ->
?cache_capacity:int ->
?dropout:float ->
?continuing_subword_prefix:string ->
?end_of_word_suffix:string ->
?fuse_unk:bool ->
?byte_fallback:bool ->
?ignore_merges:bool ->
unit ->
t
bpe ?normalizer ?pre ?post ?decoder ?specials ?vocab ?merges ?cache_capacity ?dropout ?unk_token ?continuing_subword_prefix ?end_of_word_suffix ?fuse_unk ?byte_fallback ?ignore_merges () creates a BPE (Byte Pair Encoding) tokenizer. Used by GPT-2, GPT-3, RoBERTa.
See Bpe module for algorithm details.
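For example, a construction sketch with a toy inline vocabulary and a single merge (illustrative values, not real model data):
let tokenizer =
  Tokenizer.bpe
    ~vocab:[ ("h", 0); ("e", 1); ("l", 2); ("o", 3); ("lo", 4); ("<unk>", 5) ]
    ~merges:[ ("l", "o") ]  (* merge "l" + "o" into "lo" *)
    ~unk_token:"<unk>"
    ()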
val wordpiece :
?normalizer:Normalizers.t ->
?pre:Pre_tokenizers.t ->
?post:Processors.t ->
?decoder:Decoders.t ->
?specials:special list ->
?bos_token:string ->
?eos_token:string ->
?pad_token:string ->
?unk_token:string ->
?vocab:(string * int) list ->
?continuing_subword_prefix:string ->
?max_input_chars_per_word:int ->
unit ->
t
wordpiece ?normalizer ?pre ?post ?decoder ?specials ?vocab ?unk_token ?continuing_subword_prefix ?max_input_chars_per_word () creates a WordPiece tokenizer. Used by BERT, DistilBERT, Electra.
WordPiece uses a greedy longest-match-first algorithm to split words into subword pieces. Continuation subwords carry a prefix to indicate they continue a word (e.g., "running" → "run", "##ning").
See Wordpiece module for algorithm details.
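For example, a construction sketch with a toy vocabulary using the "##" continuation prefix (illustrative values, not real model data):
let tokenizer =
  Tokenizer.wordpiece
    ~vocab:[ ("[UNK]", 0); ("run", 1); ("##ning", 2); ("##ing", 3) ]
    ~unk_token:"[UNK]"
    ~continuing_subword_prefix:"##"
    ()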
val word_level :
?normalizer:Normalizers.t ->
?pre:Pre_tokenizers.t ->
?post:Processors.t ->
?decoder:Decoders.t ->
?specials:special list ->
?bos_token:string ->
?eos_token:string ->
?pad_token:string ->
?unk_token:string ->
?vocab:(string * int) list ->
unit ->
t
word_level ?normalizer ?pre ?post ?decoder ?specials ?vocab ?unk_token () creates a word-level tokenizer.
Maps each word directly to a token ID from vocabulary. No subword splitting. Words not in vocabulary are mapped to unk_token. Simplest tokenization strategy, suitable for smaller vocabularies or domain-specific text.
See Word_level module for algorithm details.
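For example, a construction sketch with a toy vocabulary:
let tokenizer =
  Tokenizer.word_level
    ~vocab:[ ("<unk>", 0); ("hello", 1); ("world", 2) ]
    ~unk_token:"<unk>"
    ()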
val unigram :
?normalizer:Normalizers.t ->
?pre:Pre_tokenizers.t ->
?post:Processors.t ->
?decoder:Decoders.t ->
?specials:special list ->
?bos_token:string ->
?eos_token:string ->
?pad_token:string ->
?unk_token:string ->
?vocab:(string * float) list ->
?byte_fallback:bool ->
?max_piece_length:int ->
?n_sub_iterations:int ->
?shrinking_factor:float ->
unit ->
t
unigram ?normalizer ?pre ?post ?decoder ?specials ?vocab ?unk_token ?byte_fallback ?max_piece_length ?n_sub_iterations ?shrinking_factor () creates a Unigram tokenizer. Used by ALBERT, T5, mBART.
Unigram uses probabilistic segmentation with Viterbi algorithm to find optimal subword splits based on token probabilities. Vocabulary entries have associated scores (negative log probabilities).
See Unigram module for algorithm details.
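For example, a construction sketch; the scores are illustrative negative log probabilities, not values from a trained model:
let tokenizer =
  Tokenizer.unigram
    ~vocab:[ ("<unk>", -10.0); ("hello", -2.3); ("hell", -3.1); ("o", -1.2) ]
    ~unk_token:"<unk>"
    ()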
val chars :
?normalizer:Normalizers.t ->
?pre:Pre_tokenizers.t ->
?post:Processors.t ->
?decoder:Decoders.t ->
?specials:special list ->
?bos_token:string ->
?eos_token:string ->
?pad_token:string ->
?unk_token:string ->
unit ->
t
chars ?normalizer ?pre ?post ?decoder ?specials () creates a character-level tokenizer.
Splits text into individual characters. Each character in the input becomes a separate token. Vocabulary is built from unique characters seen. Useful for character-level models or languages with large character sets.
See Chars module for algorithm details.
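For example, a minimal construction sketch paired with a lowercasing normalizer:
let tokenizer =
  Tokenizer.chars ~normalizer:(Normalizers.lowercase ()) ()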
val regex :
string ->
?normalizer:Normalizers.t ->
?pre:Pre_tokenizers.t ->
?post:Processors.t ->
?decoder:Decoders.t ->
?specials:special list ->
?bos_token:string ->
?eos_token:string ->
?pad_token:string ->
?unk_token:string ->
unit ->
t
regex pattern ?normalizer ?pre ?post ?decoder ?specials () creates a regex-based tokenizer.
Splits text using a regular expression pattern. Each match of the pattern becomes a token. Useful for custom tokenization rules or domain-specific formats.
Pattern examples:
"[a-zA-Z]+"matches sequences of letters"[0-9]+"matches sequences of digits"[a-zA-Z]+|[0-9]+|[^a-zA-Z0-9 ]"matches words, numbers, or punctuation
val from_model_file :
vocab:string ->
?merges:string ->
?normalizer:Normalizers.t ->
?pre:Pre_tokenizers.t ->
?post:Processors.t ->
?decoder:Decoders.t ->
?specials:special list ->
?bos_token:string ->
?eos_token:string ->
?pad_token:string ->
?unk_token:string ->
unit ->
t
from_model_file ~vocab ?merges ?normalizer ?pre ?post ?decoder ?specials () loads a tokenizer from HuggingFace format model files.
Loads vocabulary and merge rules from separate files. The model type is inferred from the files provided: with a merges file, a BPE tokenizer is created; otherwise a WordPiece tokenizer is created.
let tokenizer =
  Tokenizer.from_model_file ~vocab:"vocab.json" ~merges:"merges.txt"
    ~normalizer:(Normalizers.lowercase ())
    ~pre:(Pre_tokenizers.byte_level ())
    ()
add_tokens tokenizer tokens adds regular tokens to the underlying vocabulary.
The underlying model is mutated in-place for performance, but the function returns an updated tokenizer value. Not thread-safe: concurrent calls to add_tokens or other mutating operations on the same tokenizer require external synchronization.
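A usage sketch, assuming the tokens are passed as a plain string list (the element container is not spelled out above):
let tokenizer = Tokenizer.add_tokens tokenizer [ "covid"; "mRNA" ]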
val encode :
t ->
?pair:string ->
?add_special_tokens:bool ->
?padding:padding ->
?truncation:truncation ->
string ->
Encoding.t
encode tokenizer ?pair ?add_special_tokens ?padding ?truncation text encodes a single sequence.
val encode_batch :
t ->
?pairs:string option list ->
?add_special_tokens:bool ->
?padding:padding ->
?truncation:truncation ->
string list ->
Encoding.t list
encode_batch tokenizer ?pairs ?add_special_tokens ?padding ?truncation texts encodes a batch of sequences.
val encode_ids :
t ->
?pair:string ->
?add_special_tokens:bool ->
?padding:padding ->
?truncation:truncation ->
string ->
int array
encode_ids tokenizer ?pair ?add_special_tokens ?padding ?truncation text is a convenience helper returning just the token ids.
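For example, assuming a tokenizer built or trained as shown above:
let ids = Tokenizer.encode_ids tokenizer "Hello world" in
Array.iter (Printf.printf "%d ") ids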
decode tokenizer ?skip_special_tokens ids decodes ids back into text.
decode_batch tokenizer ?skip_special_tokens ids_list decodes a batch of id sequences.
val train_bpe :
?init:t ->
?normalizer:Normalizers.t ->
?pre:Pre_tokenizers.t ->
?post:Processors.t ->
?decoder:Decoders.t ->
?specials:special list ->
?bos_token:string ->
?eos_token:string ->
?pad_token:string ->
?unk_token:string ->
?vocab_size:int ->
?min_frequency:int ->
?limit_alphabet:int ->
?initial_alphabet:string list ->
?continuing_subword_prefix:string ->
?end_of_word_suffix:string ->
?show_progress:bool ->
?max_token_length:int ->
data ->
t
train_bpe ?init ?normalizer ?pre ?post ?decoder ?specials ?vocab_size ?min_frequency ?limit_alphabet ?initial_alphabet ?continuing_subword_prefix ?end_of_word_suffix ?show_progress ?max_token_length data trains a BPE tokenizer from training data.
Learns merge rules by iteratively merging the most frequent adjacent character or subword pairs until reaching target vocabulary size.
let texts = ["Hello world"; "How are you?"; "Hello again"] in
let tokenizer = Tokenizer.train_bpe (`Seq (List.to_seq texts))
    ~vocab_size:1000
    ~min_frequency:2
    ~show_progress:false
val train_wordpiece :
?init:t ->
?normalizer:Normalizers.t ->
?pre:Pre_tokenizers.t ->
?post:Processors.t ->
?decoder:Decoders.t ->
?specials:special list ->
?bos_token:string ->
?eos_token:string ->
?pad_token:string ->
?unk_token:string ->
?vocab_size:int ->
?min_frequency:int ->
?limit_alphabet:int ->
?initial_alphabet:string list ->
?continuing_subword_prefix:string ->
?end_of_word_suffix:string ->
?show_progress:bool ->
data ->
t
train_wordpiece ?init ?normalizer ?pre ?post ?decoder ?specials ?vocab_size ?min_frequency ?limit_alphabet ?initial_alphabet ?continuing_subword_prefix ?end_of_word_suffix ?unk_token ?show_progress data trains a WordPiece tokenizer from training data.
Learns a subword vocabulary by maximizing the likelihood of the training corpus, selecting the subword pieces that represent the corpus most efficiently.
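For example, a training sketch mirroring the train_bpe example above; the settings are illustrative:
let texts = ["the runner was running"; "run fast"] in
let tokenizer =
  Tokenizer.train_wordpiece (`Seq (List.to_seq texts))
    ~vocab_size:8000
    ~unk_token:"[UNK]"
    ~continuing_subword_prefix:"##"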
val train_wordlevel :
?init:t ->
?normalizer:Normalizers.t ->
?pre:Pre_tokenizers.t ->
?post:Processors.t ->
?decoder:Decoders.t ->
?specials:special list ->
?bos_token:string ->
?eos_token:string ->
?pad_token:string ->
?unk_token:string ->
?vocab_size:int ->
?min_frequency:int ->
?show_progress:bool ->
data ->
t
train_wordlevel ?init ?normalizer ?pre ?post ?decoder ?specials ?vocab_size ?min_frequency ?show_progress data trains a word-level tokenizer from training data.
Builds vocabulary by collecting unique words from training data, optionally filtering by frequency. No subword splitting.
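For example, a training sketch with illustrative settings:
let texts = ["hello world"; "hello again"; "goodbye world"] in
let tokenizer =
  Tokenizer.train_wordlevel (`Seq (List.to_seq texts))
    ~vocab_size:5000
    ~min_frequency:1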
val train_unigram :
?init:t ->
?normalizer:Normalizers.t ->
?pre:Pre_tokenizers.t ->
?post:Processors.t ->
?decoder:Decoders.t ->
?specials:special list ->
?bos_token:string ->
?eos_token:string ->
?pad_token:string ->
?unk_token:string ->
?vocab_size:int ->
?show_progress:bool ->
?shrinking_factor:float ->
?max_piece_length:int ->
?n_sub_iterations:int ->
data ->
t
train_unigram ?init ?normalizer ?pre ?post ?decoder ?specials ?vocab_size ?show_progress ?shrinking_factor ?unk_token ?max_piece_length ?n_sub_iterations data trains a Unigram tokenizer from training data.
Learns probabilistic subword vocabulary using EM algorithm. Starts with large candidate vocabulary and iteratively prunes low-likelihood pieces until reaching target size.
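For example, a training sketch with illustrative settings:
let texts = ["hello world"; "hello again"] in
let tokenizer =
  Tokenizer.train_unigram (`Seq (List.to_seq texts))
    ~vocab_size:4000
    ~shrinking_factor:0.75
    ~n_sub_iterations:2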
export_tiktoken tokenizer ~merges_path ~vocab_path exports the BPE merges and vocabulary in a tiktoken-compatible format. Currently only supported for BPE models.
save_model_files tokenizer ~folder ?prefix () saves the underlying model files (e.g. vocab and merges).
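For example, a usage sketch assuming a trained BPE tokenizer; the results of both calls are discarded with ignore since their return types are not shown here:
ignore (Tokenizer.export_tiktoken tokenizer
          ~merges_path:"out/merges.txt" ~vocab_path:"out/vocab.json");
ignore (Tokenizer.save_model_files tokenizer ~folder:"out" ())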
Hugging Face Compatibility
from_file path loads a tokenizer from HuggingFace JSON format.
from_json json deserializes a tokenizer from HuggingFace JSON format.
to_json tokenizer serializes tokenizer to HuggingFace JSON format.