Tokenizers library - text tokenization for ML.

This module provides fast and flexible tokenization for machine learning applications, supporting multiple algorithms, from simple word splitting to subword tokenization such as BPE, Unigram, WordLevel, and WordPiece.

The API is designed to match Hugging Face Tokenizers v0.21 as closely as possible, adapted to idiomatic OCaml: functional style, records for configuration, polymorphic variants for enums, default values for optionals, and result types for fallible operations. The central type is Tokenizer.t, which represents a configurable tokenization pipeline.
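The pipeline stages mentioned above (model, normalizer, and so on) are combined when configuring a tokenizer. A minimal sketch of what that assembly might look like, assuming constructors and setters such as Model.bpe, Normalizers.lowercase, Pre_tokenizers.whitespace, Tokenizer.set_normalizer, and Tokenizer.set_pre_tokenizer that mirror the Hugging Face component names (hypothetical; check the actual module signatures):

```ocaml
open Saga.Tokenizers

(* Hypothetical pipeline assembly: the component names below are modeled
   on Hugging Face Tokenizers and may differ in this library. *)
let tokenizer =
  let t = Tokenizer.create ~model:(Model.bpe ()) in
  (* Lowercase input before tokenization (assumed normalizer). *)
  Tokenizer.set_normalizer t (Normalizers.lowercase ());
  (* Split on whitespace before the model runs (assumed pre-tokenizer). *)
  Tokenizer.set_pre_tokenizer t (Pre_tokenizers.whitespace ());
  t
```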
Quick Start
open Saga.Tokenizers

(* Create a character-level tokenizer *)
let tokenizer = Tokenizer.create ~model:(Model.chars ()) in

(* Add special tokens *)
Tokenizer.add_special_tokens tokenizer
  [ Added_token.create ~content:"." ~special:true () ];

(* Train on data (names : string list) *)
Tokenizer.train_from_iterator tokenizer (List.to_seq names)
  ~trainer:(Trainer.chars ()) ();

(* Encode with options *)
let encoding =
  Tokenizer.encode tokenizer ~sequence:"hello world" ~add_special_tokens:true ()
in

(* Get ids and decode *)
let ids = Encoding.ids encoding in
let text = Tokenizer.decode tokenizer ids ~skip_special_tokens:true in
print_endline text
Key Concepts
Tokenizer.t: The main tokenizer instance, configurable with model, normalizer, etc.
Encoding.t: Result of encoding, with ids, tokens, offsets, masks, etc., exposed as a record.
Special tokens: Handled via add_special_tokens and encoding options.
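To make Encoding.t concrete, here is a sketch that encodes a string and pairs up ids with tokens. It reuses the accessors from the Quick Start (Encoding.ids, Encoding.tokens); the assumption that these return lists, and the Encoding.tokens accessor itself, are hypothetical and should be checked against the real interface:

```ocaml
open Saga.Tokenizers

(* Inspect an encoding. Encoding.tokens and the list return types are
   assumptions modeled on Hugging Face Tokenizers. *)
let () =
  let tokenizer = Tokenizer.create ~model:(Model.chars ()) in
  let encoding =
    Tokenizer.encode tokenizer ~sequence:"hi" ~add_special_tokens:false ()
  in
  (* Print each id next to the token it maps to. *)
  List.iter2
    (fun id token -> Printf.printf "%d -> %s\n" id token)
    (Encoding.ids encoding)
    (Encoding.tokens encoding)
```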
All functions handle Unicode correctly via the Unicode module. The API surface tracks Hugging Face Tokenizers v0.21 (as of 2025), including fast Rust-backed operations where applicable.