This module provides the main tokenization API, modeled on the HuggingFace Tokenizers design. It supports multiple tokenization algorithms (BPE, WordPiece, Unigram, Word-level, Character-level), as well as text normalization, pre-tokenization, post-processing, and decoding.
Quick Start
Load a pretrained tokenizer:
let tokenizer = Tokenizer.from_file "tokenizer.json" |> Result.get_ok in
let encoding = Tokenizer.encode tokenizer "Hello world!" in
let ids = Encoding.get_ids encoding
Create a BPE tokenizer from scratch:
let tokenizer =
  Tokenizer.bpe
    ~vocab:[("hello", 0); ("world", 1); ("[PAD]", 2)]
    ~merges:[]
    ()
in
let encoding = Tokenizer.encode tokenizer "hello world" in
let text = Tokenizer.decode tokenizer [0; 1]
Train a new tokenizer:
let texts = [ "Hello world"; "How are you?"; "Hello again" ] in
let tokenizer =
  Tokenizer.train_bpe (`Seq (List.to_seq texts)) ~vocab_size:1000 ()
in
Tokenizer.save_pretrained tokenizer ~path:"./my_tokenizer"
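The saved tokenizer can then be reloaded with Tokenizer.from_file. This assumes save_pretrained writes a tokenizer.json file under the given path (a guess based on the HuggingFace layout, not confirmed here):
let tokenizer =
  (* Assumes the directory contains a tokenizer.json, as in the HuggingFace layout. *)
  Tokenizer.from_file "./my_tokenizer/tokenizer.json" |> Result.get_ok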
Architecture
Tokenization proceeds through stages:
Normalization: Clean and normalize text (lowercase, accent removal, etc.)
Pre-tokenization: Split text into words or subwords
Model: Apply the tokenization algorithm (BPE, WordPiece, Unigram, etc.) to produce tokens and IDs
Post-processing: Add special tokens and assemble the final encoding
Decoding: Convert token IDs back into text
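To make the stage boundaries concrete, here is a tiny self-contained sketch of the first stages on a toy vocabulary. It uses only the OCaml standard library and none of this module's functions, so every name in it is illustrative rather than part of the API:
(* Toy sketch of the pipeline stages; standard library only, not this module's API. *)
let normalize s = String.lowercase_ascii (String.trim s)

let pre_tokenize s =
  String.split_on_char ' ' s |> List.filter (fun w -> w <> "")

(* "Model" stage: look each pre-token up in a toy vocabulary, -1 for unknowns. *)
let toy_vocab = [ ("hello", 0); ("world", 1) ]

let encode text =
  text |> normalize |> pre_tokenize
  |> List.map (fun tok ->
         match List.assoc_opt tok toy_vocab with Some id -> id | None -> -1)

let () = List.iter (Printf.printf "%d ") (encode "  Hello world ")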
  Whether this token must match whole words only. Default: false. *)
  lstrip : bool;
  (* Whether to strip whitespace on the left. Default: false. *)
  rstrip : bool;
  (* Whether to strip whitespace on the right. Default: false. *)
  normalized : bool;
  (* Whether to apply normalization to this token. Default: true for regular
     tokens, false for special tokens. *)
}
Special token configuration.
Special tokens are not split during tokenization and can be skipped during decoding. Token IDs are assigned automatically when added to the vocabulary.
All special tokens share a single, uniform type: the semantic meaning (pad, unk, bos, etc.) is contextual, not encoded in the type.
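To illustrate that the semantic role lives outside the type, here is a self-contained sketch using a stand-in record. The content and single_word field names are assumptions made for this sketch; only lstrip, rstrip, and normalized appear in the documented fields above:
(* Stand-in for the special-token record; [content] and [single_word] are assumed names. *)
type special_token = {
  content : string;
  single_word : bool;
  lstrip : bool;
  rstrip : bool;
  normalized : bool;
}

let pad =
  { content = "[PAD]"; single_word = false; lstrip = false; rstrip = false;
    normalized = false }

(* Same type as [pad]: the "bos" role is contextual, not encoded in the type. *)
let bos = { pad with content = "<s>" }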
When the optional fields are None, padding falls back to the tokenizer's configured padding token. If the tokenizer has no padding token configured and these fields are also None, padding operations raise Invalid_argument.
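The resolution order can be sketched as follows; resolve_pad_token and its argument names are hypothetical, written only to spell out the fallback rule:
(* Hypothetical helper spelling out the fallback rule described above. *)
let resolve_pad_token ~(explicit : string option) ~(configured : string option) =
  match explicit, configured with
  | Some tok, _ -> tok        (* explicit field on the padding config wins *)
  | None, Some tok -> tok     (* otherwise use the tokenizer's configured pad token *)
  | None, None ->
      invalid_arg "padding requested but no padding token is configured"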