package saga

Text processing and NLP extensions for Nx

Module Saga_tokenizers.Tokenizer

type t

val normalizer : t -> Normalizers.t option

normalizer tokenizer retrieves the configured normalizer.

Returns None if no normalizer is set. The normalizer is applied before all other processing stages to clean and normalize text.

val with_normalizer : t -> Normalizers.t option -> t

with_normalizer tokenizer norm replaces the tokenizer's normalizer.

Pass None to remove the normalization step entirely. Pass Some norm to install a new normalizer. Returns the updated tokenizer.

  let tokenizer = Tokenizer.bpe ()
  let tokenizer = Tokenizer.with_normalizer tokenizer
    (Some (Normalizers.sequence [
      Normalizers.nfd ();
      Normalizers.lowercase ();
      Normalizers.strip_accents ();
    ]))

val pre_tokenizer : t -> Pre_tokenizers.t option

pre_tokenizer tokenizer retrieves the configured pre-tokenizer.

Returns None if no pre-tokenizer is set. The pre-tokenizer splits text into pieces before vocabulary-based encoding.

val with_pre_tokenizer : t -> Pre_tokenizers.t option -> t

with_pre_tokenizer tokenizer pre replaces the tokenizer's pre-tokenizer.

Pass None to remove pre-tokenization (the text is processed as-is). Pass Some pre to install a new pre-tokenizer. Returns the updated tokenizer.

  let tokenizer = Tokenizer.bpe ()
  let tokenizer = Tokenizer.with_pre_tokenizer tokenizer
    (Some (Pre_tokenizers.byte_level ~add_prefix_space:true ()))

val post_processor : t -> Processors.t option

post_processor tokenizer retrieves the configured post-processor.

Returns None if no post-processor is set. The post-processor adds special tokens and sets type IDs after encoding.

val with_post_processor : t -> Processors.t option -> t

with_post_processor tokenizer post replaces the tokenizer's post-processor.

Pass None to remove post-processing. Pass Some post to install a new post-processor. Returns the updated tokenizer.

  let tokenizer = Tokenizer.bpe ()
  let tokenizer = Tokenizer.with_post_processor tokenizer
    (Some (Processors.bert_processing
      ~sep:("[SEP]", 102) ~cls:("[CLS]", 101) ()))

val decoder : t -> Decoders.t option

decoder tokenizer retrieves the configured decoder.

Returns None if no decoder is set. The decoder converts token IDs back to text.

val with_decoder : t -> Decoders.t option -> t

with_decoder tokenizer dec replaces the tokenizer's decoder.

Pass None to use default decoding (tokens are concatenated). Pass Some dec to install a new decoder. Returns the updated tokenizer.

  let tokenizer = Tokenizer.bpe ()
  let tokenizer = Tokenizer.with_decoder tokenizer
    (Some (Decoders.byte_level ()))

val specials : t -> special list

specials tokenizer retrieves the configured special tokens.

val with_specials : t -> special list -> t

with_specials tokenizer specials replaces the special tokens with the provided list.

val add_specials : t -> special list -> t

add_specials tokenizer specials extends the set of special tokens.

Special Token Roles

These functions configure which token strings serve specific roles in the tokenizer (BOS, EOS, PAD, UNK). This follows HuggingFace's design where roles are separate from token properties.

val bos_token : t -> string option

bos_token tokenizer returns the beginning-of-sequence token string, if configured.

val set_bos_token : t -> string option -> t

set_bos_token tokenizer token sets which token serves as beginning-of-sequence marker. Pass None to unset. The token should already be in the vocabulary.

val eos_token : t -> string option

eos_token tokenizer returns the end-of-sequence token string, if configured.

val set_eos_token : t -> string option -> t

set_eos_token tokenizer token sets which token serves as end-of-sequence marker. Pass None to unset.

val pad_token : t -> string option

pad_token tokenizer returns the padding token string, if configured.

val set_pad_token : t -> string option -> t

set_pad_token tokenizer token sets which token serves as padding marker. Pass None to unset.

val unk_token : t -> string option

unk_token tokenizer returns the unknown token string, if configured.

val set_unk_token : t -> string option -> t

set_unk_token tokenizer token sets which token serves as unknown token marker. Pass None to unset.
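
The constructors and the role setters compose. A minimal sketch with BERT-style token names (illustrative only; the constructor is assumed to add the named tokens to the vocabulary, as described for bpe below):

  (* Sketch: install roles at construction time, then adjust one. *)
  let tokenizer =
    Tokenizer.wordpiece ~unk_token:"[UNK]" ~pad_token:"[PAD]"
      ~bos_token:"[CLS]" ~eos_token:"[SEP]" ()
  let padding = Tokenizer.pad_token tokenizer   (* Some "[PAD]" *)
  (* Clear the BOS role without touching the vocabulary. *)
  let tokenizer = Tokenizer.set_bos_token tokenizer None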

val vocab : t -> (string * int) list

vocab tokenizer returns the vocabulary as (token, id) pairs.

val vocab_size : t -> int

vocab_size tokenizer returns the size of the vocabulary.

val token_to_id : t -> string -> int option

token_to_id tokenizer token maps a token string to its id.

val id_to_token : t -> int -> string option

id_to_token tokenizer id maps an id back to its token string.
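
A short sketch of the vocabulary accessors on a hand-built word-level tokenizer:

  let tokenizer =
    Tokenizer.word_level ~vocab:[ ("hello", 0); ("world", 1) ] ()
  let () =
    (match Tokenizer.token_to_id tokenizer "hello" with
    | Some id -> Printf.printf "hello -> %d\n" id
    | None -> print_endline "hello is not in the vocabulary");
    Printf.printf "vocabulary size: %d\n" (Tokenizer.vocab_size tokenizer)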

val bpe : ?normalizer:Normalizers.t -> ?pre:Pre_tokenizers.t -> ?post:Processors.t -> ?decoder:Decoders.t -> ?specials:special list -> ?bos_token:string -> ?eos_token:string -> ?pad_token:string -> ?unk_token:string -> ?vocab:(string * int) list -> ?merges:(string * string) list -> ?cache_capacity:int -> ?dropout:float -> ?continuing_subword_prefix:string -> ?end_of_word_suffix:string -> ?fuse_unk:bool -> ?byte_fallback:bool -> ?ignore_merges:bool -> unit -> t

bpe ?normalizer ?pre ?post ?decoder ?specials ?vocab ?merges ?cache_capacity ?dropout ?unk_token ?continuing_subword_prefix ?end_of_word_suffix ?fuse_unk ?byte_fallback ?ignore_merges () creates a BPE (Byte Pair Encoding) tokenizer. Used by GPT-2, GPT-3, RoBERTa.

  • parameter normalizer

    Text normalization (e.g., lowercase, strip accents)

  • parameter pre

    Pre-tokenization strategy (e.g., whitespace splitting)

  • parameter post

    Post-processor for special tokens (CLS, SEP)

  • parameter decoder

    Decoding strategy to reverse tokenization

  • parameter specials

    Special tokens to add to vocabulary

  • parameter bos_token

    Token to use as beginning-of-sequence marker. Configures both the role and adds to vocabulary if not present.

  • parameter eos_token

    Token to use as end-of-sequence marker. Configures both the role and adds to vocabulary if not present.

  • parameter pad_token

    Token to use as padding marker. Configures both the role and adds to vocabulary if not present.

  • parameter unk_token

    Token for unknown characters. Configures both the role and the BPE model's unknown handling. Default: None (no unknown handling).

  • parameter vocab

    Initial vocabulary mapping tokens to IDs

  • parameter merges

    Merge rules as (token1, token2) pairs learned during training

  • parameter cache_capacity

    LRU cache size for tokenization results. Default: 10000. Higher = faster for repeated inputs but more memory.

  • parameter dropout

    Probability of skipping merges during tokenization (0.0-1.0). Default: None (no dropout). Used for data augmentation. At 1.0, no merges are applied (character-level output).

  • parameter continuing_subword_prefix

    Prefix for non-initial subwords (e.g., "##" for BERT). Default: None.

  • parameter end_of_word_suffix

    Suffix marking word boundaries (e.g., "</w>"). Default: None.

  • parameter fuse_unk

    Whether to merge consecutive unknown tokens. Default: false.

  • parameter byte_fallback

    Use byte-level fallback for unknown chars (e.g., "<0x00>") instead of UNK. Default: false.

  • parameter ignore_merges

    Skip merge application (character-level output). Default: false.

See Bpe module for algorithm details.
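
A minimal sketch of these parameters with a toy vocabulary and a single merge rule (not a realistic model; production vocabularies are trained or loaded from files):

  let tokenizer =
    Tokenizer.bpe
      ~vocab:[ ("h", 0); ("e", 1); ("l", 2); ("o", 3); ("lo", 4) ]
      ~merges:[ ("l", "o") ]   (* "l" followed by "o" merges into "lo" *)
      ~unk_token:"<unk>"
      ()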

val wordpiece : ?normalizer:Normalizers.t -> ?pre:Pre_tokenizers.t -> ?post:Processors.t -> ?decoder:Decoders.t -> ?specials:special list -> ?bos_token:string -> ?eos_token:string -> ?pad_token:string -> ?unk_token:string -> ?vocab:(string * int) list -> ?continuing_subword_prefix:string -> ?max_input_chars_per_word:int -> unit -> t

wordpiece ?normalizer ?pre ?post ?decoder ?specials ?vocab ?unk_token ?continuing_subword_prefix ?max_input_chars_per_word () creates a WordPiece tokenizer. Used by BERT, DistilBERT, Electra.

WordPiece uses a greedy longest-match-first algorithm to split words into subword pieces. Non-initial subwords are prefixed to indicate that they continue a word (e.g., "running" → "run", "##ning").

  • parameter normalizer

    Text normalization (e.g., lowercase, strip accents)

  • parameter pre

    Pre-tokenization strategy (e.g., whitespace splitting)

  • parameter post

    Post-processor for special tokens (CLS, SEP)

  • parameter decoder

    Decoding strategy to reverse tokenization

  • parameter specials

    Special tokens to add to vocabulary

  • parameter vocab

    Initial vocabulary mapping tokens to IDs

  • parameter unk_token

    Token for out-of-vocabulary words. Default: "UNK".

  • parameter continuing_subword_prefix

    Prefix for non-initial subwords (e.g., "##"). Default: "##".

  • parameter max_input_chars_per_word

    Maximum characters per word. Words longer than this are replaced with unk_token. Default: 100.

See Wordpiece module for algorithm details.
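
A toy BERT-style configuration as a sketch; with this vocabulary, "running" splits greedily into "run" and "##ning":

  let tokenizer =
    Tokenizer.wordpiece
      ~normalizer:(Normalizers.lowercase ())
      ~vocab:[ ("[UNK]", 0); ("run", 1); ("##ning", 2); ("##s", 3) ]
      ~unk_token:"[UNK]"
      ()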

val word_level : ?normalizer:Normalizers.t -> ?pre:Pre_tokenizers.t -> ?post:Processors.t -> ?decoder:Decoders.t -> ?specials:special list -> ?bos_token:string -> ?eos_token:string -> ?pad_token:string -> ?unk_token:string -> ?vocab:(string * int) list -> unit -> t

word_level ?normalizer ?pre ?post ?decoder ?specials ?vocab ?unk_token () creates a word-level tokenizer.

Maps each word directly to a token ID from vocabulary. No subword splitting. Words not in vocabulary are mapped to unk_token. Simplest tokenization strategy, suitable for smaller vocabularies or domain-specific text.

  • parameter normalizer

    Text normalization (e.g., lowercase, strip accents)

  • parameter pre

    Pre-tokenization strategy (e.g., whitespace splitting)

  • parameter post

    Post-processor for special tokens

  • parameter decoder

    Decoding strategy to reverse tokenization

  • parameter specials

    Special tokens to add to vocabulary

  • parameter vocab

    Initial vocabulary mapping words to IDs

  • parameter unk_token

    Token for out-of-vocabulary words. Default: "UNK".

See Word_level module for algorithm details.

val unigram : ?normalizer:Normalizers.t -> ?pre:Pre_tokenizers.t -> ?post:Processors.t -> ?decoder:Decoders.t -> ?specials:special list -> ?bos_token:string -> ?eos_token:string -> ?pad_token:string -> ?unk_token:string -> ?vocab:(string * float) list -> ?byte_fallback:bool -> ?max_piece_length:int -> ?n_sub_iterations:int -> ?shrinking_factor:float -> unit -> t

unigram ?normalizer ?pre ?post ?decoder ?specials ?vocab ?unk_token ?byte_fallback ?max_piece_length ?n_sub_iterations ?shrinking_factor () creates a Unigram tokenizer. Used by ALBERT, T5, mBART.

Unigram uses probabilistic segmentation with the Viterbi algorithm to find optimal subword splits based on token probabilities. Vocabulary entries have associated scores (negative log probabilities).

  • parameter normalizer

    Text normalization (e.g., lowercase, strip accents)

  • parameter pre

    Pre-tokenization strategy (e.g., whitespace splitting)

  • parameter post

    Post-processor for special tokens

  • parameter decoder

    Decoding strategy to reverse tokenization

  • parameter specials

    Special tokens to add to vocabulary

  • parameter vocab

    Initial vocabulary mapping tokens to scores (negative log probabilities). Higher scores = less likely.

  • parameter unk_token

    Token for unknown characters. Default: None.

  • parameter byte_fallback

    Use byte-level fallback for unknown chars instead of UNK. Default: false.

  • parameter max_piece_length

    Maximum characters per piece during training. Default: 16.

  • parameter n_sub_iterations

    Number of EM sub-iterations during training. Default: 2.

  • parameter shrinking_factor

    Fraction of vocabulary to keep in each pruning step during training (0.0-1.0). Default: 0.75.

See Unigram module for algorithm details.
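
A toy sketch; per the vocab description above, scores are negative log probabilities, so lower values mark more likely pieces (the numbers here are purely illustrative):

  let tokenizer =
    Tokenizer.unigram
      ~vocab:[ ("h", 4.0); ("e", 4.0); ("l", 4.0); ("o", 4.0);
               ("hello", 1.5); ("<unk>", 10.0) ]
      ~unk_token:"<unk>"
      ()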

val chars : ?normalizer:Normalizers.t -> ?pre:Pre_tokenizers.t -> ?post:Processors.t -> ?decoder:Decoders.t -> ?specials:special list -> ?bos_token:string -> ?eos_token:string -> ?pad_token:string -> ?unk_token:string -> unit -> t

chars ?normalizer ?pre ?post ?decoder ?specials () creates a character-level tokenizer.

Splits text into individual characters. Each character in the input becomes a separate token. Vocabulary is built from unique characters seen. Useful for character-level models or languages with large character sets.

  • parameter normalizer

    Text normalization (e.g., lowercase)

  • parameter pre

    Pre-tokenization strategy (usually None for char-level)

  • parameter post

    Post-processor for special tokens

  • parameter decoder

    Decoding strategy to reverse tokenization

  • parameter specials

    Special tokens to add to vocabulary

See Chars module for algorithm details.
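
A minimal sketch, assuming the vocabulary grows from the characters encountered, as described above:

  let tokenizer = Tokenizer.chars ()
  let ids = Tokenizer.encode_ids tokenizer "abc"   (* one id per character *)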

val regex : string -> ?normalizer:Normalizers.t -> ?pre:Pre_tokenizers.t -> ?post:Processors.t -> ?decoder:Decoders.t -> ?specials:special list -> ?bos_token:string -> ?eos_token:string -> ?pad_token:string -> ?unk_token:string -> unit -> t

regex pattern ?normalizer ?pre ?post ?decoder ?specials () creates a regex-based tokenizer.

Splits text using a regular expression pattern. Each match of the pattern becomes a token. Useful for custom tokenization rules or domain-specific formats.

  • parameter pattern

    Regular expression pattern (Str module syntax) used to match tokens. Each match becomes a separate token.

Pattern examples:

  • "[a-zA-Z]+" matches sequences of letters
  • "[0-9]+" matches sequences of digits
  • "[a-zA-Z]+|[0-9]+|[^a-zA-Z0-9 ]" matches words, numbers, or punctuation
  • parameter normalizer

    Text normalization (e.g., lowercase, strip accents)

  • parameter pre

    Pre-tokenization strategy (applied before pattern matching)

  • parameter post

    Post-processor for special tokens

  • parameter decoder

    Decoding strategy to reverse tokenization

  • parameter specials

    Special tokens to add to vocabulary
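
A sketch using the third pattern from the examples above, combined with lowercasing:

  let tokenizer =
    Tokenizer.regex "[a-zA-Z]+|[0-9]+|[^a-zA-Z0-9 ]"
      ~normalizer:(Normalizers.lowercase ())
      ()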

val from_model_file : vocab:string -> ?merges:string -> ?normalizer:Normalizers.t -> ?pre:Pre_tokenizers.t -> ?post:Processors.t -> ?decoder:Decoders.t -> ?specials:special list -> ?bos_token:string -> ?eos_token:string -> ?pad_token:string -> ?unk_token:string -> unit -> t

from_model_file ~vocab ?merges ?normalizer ?pre ?post ?decoder ?specials () loads a tokenizer from HuggingFace-format model files.

Loads the vocabulary and merge rules from separate files. The model type is inferred from the files: if a merges file is provided, a BPE tokenizer is created; otherwise a WordPiece tokenizer is created.

  • parameter vocab

    Path to vocabulary file (vocab.json). Expected format: JSON object mapping tokens to IDs: {"hello": 0, "world": 1, "[PAD]": 2}.

  • parameter merges

    Path to merges file (merges.txt). Expected format: one merge per line as space-separated token pairs: "he llo", "wor ld". The first line may be a header (it is ignored if it starts with "#version"). Optional for WordPiece, required for BPE.

  • parameter normalizer

    Text normalization to apply

  • parameter pre

    Pre-tokenization strategy

  • parameter post

    Post-processor for special tokens

  • parameter decoder

    Decoding strategy

  • parameter specials

    Special tokens to add to vocabulary

  let tokenizer =
    Tokenizer.from_model_file ~vocab:"vocab.json" ~merges:"merges.txt"
      ~normalizer:(Normalizers.lowercase ())
      ~pre:(Pre_tokenizers.byte_level ())
      ()

val add_tokens : t -> string list -> t

add_tokens tokenizer tokens adds regular tokens to the underlying vocabulary.

The underlying model is mutated in-place for performance, but the function returns an updated tokenizer value. Not thread-safe: concurrent calls to add_tokens or other mutating operations on the same tokenizer require external synchronization.

val encode : t -> ?pair:string -> ?add_special_tokens:bool -> ?padding:padding -> ?truncation:truncation -> string -> Encoding.t

encode tokenizer ?pair ?add_special_tokens ?padding ?truncation text encodes a single sequence.

val encode_batch : t -> ?pairs:string option list -> ?add_special_tokens:bool -> ?padding:padding -> ?truncation:truncation -> string list -> Encoding.t list

encode_batch tokenizer ?pairs ?add_special_tokens ?padding ?truncation texts encodes a batch of sequences.

val encode_ids : t -> ?pair:string -> ?add_special_tokens:bool -> ?padding:padding -> ?truncation:truncation -> string -> int array

encode_ids tokenizer ?pair ?add_special_tokens ?padding ?truncation text is a convenience helper returning just the token ids.

val decode : t -> ?skip_special_tokens:bool -> int array -> string

decode tokenizer ?skip_special_tokens ids decodes ids back into text.

val decode_batch : t -> ?skip_special_tokens:bool -> int array list -> string list

decode_batch tokenizer ?skip_special_tokens ids_list decodes a batch of id sequences.
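
A round-trip sketch over a hand-built word-level vocabulary. The whitespace pre-tokenizer is an assumption about Pre_tokenizers; substitute whichever splitter your pipeline uses:

  let tokenizer =
    Tokenizer.word_level
      ~pre:(Pre_tokenizers.whitespace ())   (* assumed splitter *)
      ~vocab:[ ("hello", 0); ("world", 1); ("[UNK]", 2) ]
      ~unk_token:"[UNK]"
      ()
  let ids = Tokenizer.encode_ids tokenizer "hello world"
  let text = Tokenizer.decode tokenizer ids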

val train_bpe : ?init:t -> ?normalizer:Normalizers.t -> ?pre:Pre_tokenizers.t -> ?post:Processors.t -> ?decoder:Decoders.t -> ?specials:special list -> ?bos_token:string -> ?eos_token:string -> ?pad_token:string -> ?unk_token:string -> ?vocab_size:int -> ?min_frequency:int -> ?limit_alphabet:int -> ?initial_alphabet:string list -> ?continuing_subword_prefix:string -> ?end_of_word_suffix:string -> ?show_progress:bool -> ?max_token_length:int -> data -> t

train_bpe ?init ?normalizer ?pre ?post ?decoder ?specials ?vocab_size ?min_frequency ?limit_alphabet ?initial_alphabet ?continuing_subword_prefix ?end_of_word_suffix ?show_progress ?max_token_length data trains a BPE tokenizer from training data.

Learns merge rules by iteratively merging the most frequent adjacent character or subword pairs until reaching target vocabulary size.

  • parameter init

    Existing tokenizer to extend. If provided, training adds to existing vocabulary. Default: create new tokenizer.

  • parameter normalizer

    Text normalization applied before training

  • parameter pre

    Pre-tokenization strategy applied before training

  • parameter post

    Post-processor for special tokens

  • parameter decoder

    Decoding strategy

  • parameter specials

    Special tokens to add to vocabulary

  • parameter vocab_size

    Target vocabulary size including special tokens. Training continues until this size is reached. Default: 30000.

  • parameter min_frequency

    Minimum occurrences for a token pair to be merged. Higher values create smaller vocabularies with more common subwords. Typical: 2-10. Default: 2.

  • parameter limit_alphabet

    Maximum initial characters in alphabet. Limits character set to most frequent characters. Default: None (unlimited).

  • parameter initial_alphabet

    Explicit initial character set. If provided, overrides automatic alphabet discovery. Default: None.

  • parameter continuing_subword_prefix

    Prefix for non-initial subwords (e.g., "##"). Default: None.

  • parameter end_of_word_suffix

    Suffix marking word boundaries (e.g., "</w>"). Default: None.

  • parameter show_progress

    Display progress bar during training. Requires training data to support progress tracking (e.g., `Files or `Seq). Default: true.

  • parameter max_token_length

    Maximum characters per token during merge operations. Pairs creating tokens longer than this are skipped. Default: None (unlimited).

  • parameter data

    Training data source

  let texts = ["Hello world"; "How are you?"; "Hello again"]
  let tokenizer = Tokenizer.train_bpe (`Seq (List.to_seq texts))
    ~vocab_size:1000
    ~min_frequency:2
    ~show_progress:false

val train_wordpiece : ?init:t -> ?normalizer:Normalizers.t -> ?pre:Pre_tokenizers.t -> ?post:Processors.t -> ?decoder:Decoders.t -> ?specials:special list -> ?bos_token:string -> ?eos_token:string -> ?pad_token:string -> ?unk_token:string -> ?vocab_size:int -> ?min_frequency:int -> ?limit_alphabet:int -> ?initial_alphabet:string list -> ?continuing_subword_prefix:string -> ?end_of_word_suffix:string -> ?show_progress:bool -> data -> t

train_wordpiece ?init ?normalizer ?pre ?post ?decoder ?specials ?vocab_size ?min_frequency ?limit_alphabet ?initial_alphabet ?continuing_subword_prefix ?end_of_word_suffix ?unk_token ?show_progress data trains a WordPiece tokenizer from training data.

Learns a subword vocabulary by maximizing language-model likelihood, selecting the subwords that represent the corpus most efficiently.

  • parameter init

    Existing tokenizer to extend. Default: create new tokenizer.

  • parameter normalizer

    Text normalization applied before training

  • parameter pre

    Pre-tokenization strategy applied before training

  • parameter post

    Post-processor for special tokens

  • parameter decoder

    Decoding strategy

  • parameter specials

    Special tokens to add to vocabulary

  • parameter vocab_size

    Target vocabulary size including special tokens. Default: 30000.

  • parameter min_frequency

    Minimum occurrences for a subword to be included. Higher values create smaller vocabularies. Typical: 2-10. Default: 2.

  • parameter limit_alphabet

    Maximum initial characters in alphabet. Default: None (unlimited).

  • parameter initial_alphabet

    Explicit initial character set. Default: None.

  • parameter continuing_subword_prefix

    Prefix for non-initial subwords (e.g., "##"). Default: "##".

  • parameter end_of_word_suffix

    Suffix marking word boundaries. Default: None.

  • parameter unk_token

    Token for out-of-vocabulary words. Default: "UNK".

  • parameter show_progress

    Display progress bar during training. Default: true.

  • parameter data

    Training data source

val train_wordlevel : ?init:t -> ?normalizer:Normalizers.t -> ?pre:Pre_tokenizers.t -> ?post:Processors.t -> ?decoder:Decoders.t -> ?specials:special list -> ?bos_token:string -> ?eos_token:string -> ?pad_token:string -> ?unk_token:string -> ?vocab_size:int -> ?min_frequency:int -> ?show_progress:bool -> data -> t

train_wordlevel ?init ?normalizer ?pre ?post ?decoder ?specials ?vocab_size ?min_frequency ?show_progress data trains a word-level tokenizer from training data.

Builds vocabulary by collecting unique words from training data, optionally filtering by frequency. No subword splitting.

  • parameter init

    Existing tokenizer to extend. Default: create new tokenizer.

  • parameter normalizer

    Text normalization applied before training

  • parameter pre

    Pre-tokenization strategy applied before training

  • parameter post

    Post-processor for special tokens

  • parameter decoder

    Decoding strategy

  • parameter specials

    Special tokens to add to vocabulary

  • parameter vocab_size

    Target vocabulary size including special tokens. Training includes most frequent words up to this limit. Default: 30000.

  • parameter min_frequency

    Minimum occurrences for a word to be included. Higher values create smaller vocabularies of common words. Typical: 2-10. Default: 0 (include all words).

  • parameter show_progress

    Display progress bar during training. Default: true.

  • parameter data

    Training data source
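
A training sketch over an in-memory corpus (the whitespace pre-tokenizer is again an assumption about Pre_tokenizers):

  let corpus = [ "the cat sat"; "the dog sat"; "the cat ran" ]
  let tokenizer =
    Tokenizer.train_wordlevel
      ~pre:(Pre_tokenizers.whitespace ())   (* assumed splitter *)
      ~vocab_size:100 ~min_frequency:1 ~show_progress:false
      (`Seq (List.to_seq corpus))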

val train_unigram : ?init:t -> ?normalizer:Normalizers.t -> ?pre:Pre_tokenizers.t -> ?post:Processors.t -> ?decoder:Decoders.t -> ?specials:special list -> ?bos_token:string -> ?eos_token:string -> ?pad_token:string -> ?unk_token:string -> ?vocab_size:int -> ?show_progress:bool -> ?shrinking_factor:float -> ?max_piece_length:int -> ?n_sub_iterations:int -> data -> t

train_unigram ?init ?normalizer ?pre ?post ?decoder ?specials ?vocab_size ?show_progress ?shrinking_factor ?unk_token ?max_piece_length ?n_sub_iterations data trains a Unigram tokenizer from training data.

Learns a probabilistic subword vocabulary using the EM algorithm. Starts with a large candidate vocabulary and iteratively prunes low-likelihood pieces until the target size is reached.

  • parameter init

    Existing tokenizer to extend. Default: create new tokenizer.

  • parameter normalizer

    Text normalization applied before training

  • parameter pre

    Pre-tokenization strategy applied before training

  • parameter post

    Post-processor for special tokens

  • parameter decoder

    Decoding strategy

  • parameter specials

    Special tokens to add to vocabulary

  • parameter vocab_size

    Target vocabulary size including special tokens. Default: 8000.

  • parameter show_progress

    Display progress bar during training. Default: true.

  • parameter shrinking_factor

    Fraction of vocabulary to retain in each pruning iteration (0.0-1.0). Larger values slow convergence but may improve quality. Typical: 0.75-0.95. Default: 0.75.

  • parameter unk_token

    Token for unknown characters. Default: None.

  • parameter max_piece_length

    Maximum characters per subword piece. Longer pieces are not considered. Typical: 8-32. Default: 16.

  • parameter n_sub_iterations

    Number of EM sub-iterations per pruning step. Higher values improve accuracy but slow training. Typical: 2-5. Default: 2.

  • parameter data

    Training data source

val export_tiktoken : t -> merges_path:string -> vocab_path:string -> unit

export_tiktoken tokenizer ~merges_path ~vocab_path exports the BPE merges and vocabulary in a tiktoken-compatible format. Currently only supported for BPE models.

val save_model_files : t -> folder:string -> ?prefix:string -> unit -> string list

save_model_files tokenizer ~folder ?prefix () saves the underlying model files (e.g. vocab and merges).
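
A sketch that saves the model files of a toy BPE tokenizer and writes a tiktoken-style export alongside them (folder and file names are illustrative, and the folder is assumed to exist):

  let tokenizer =
    Tokenizer.bpe ~vocab:[ ("a", 0); ("b", 1); ("ab", 2) ]
      ~merges:[ ("a", "b") ] ()
  let () =
    let saved = Tokenizer.save_model_files tokenizer ~folder:"./model" () in
    List.iter print_endline saved;
    Tokenizer.export_tiktoken tokenizer
      ~merges_path:"./model/merges.txt" ~vocab_path:"./model/vocab.json"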

Hugging Face Compatibility

val from_file : string -> (t, exn) result

from_file path loads a tokenizer from HuggingFace JSON format.

val from_json : Yojson.Basic.t -> (t, exn) result

from_json json deserializes a tokenizer from HuggingFace JSON format.

val to_json : t -> Yojson.Basic.t

to_json tokenizer serializes tokenizer to HuggingFace JSON format.

val save_pretrained : t -> path:string -> unit

save_pretrained tokenizer ~path saves tokenizer to directory in HuggingFace format.
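
A sketch of a round trip through the HuggingFace JSON format (file and directory names are illustrative):

  let () =
    match Tokenizer.from_file "tokenizer.json" with
    | Ok tokenizer -> Tokenizer.save_pretrained tokenizer ~path:"./exported"
    | Error exn -> prerr_endline (Printexc.to_string exn)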