Module Saga_tokenizers
Tokenizers library - text tokenization for ML. This module provides fast and flexible tokenization for machine learning applications, supporting multiple algorithms from simple word splitting to advanced subword tokenization such as BPE, Unigram, WordLevel, and WordPiece. The API is designed to match Hugging Face Tokenizers v0.21 as closely as possible, adapted to idiomatic OCaml: functional style, records for configurations, polymorphic variants for enums, default values for optionals, and result types for fallible operations. The central type is Tokenizer.t, which represents a configurable tokenization pipeline.
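As a minimal illustration of the result-type convention, loading a serialized tokenizer might look like the sketch below. The Tokenizer.from_file name is an assumption borrowed from Hugging Face's API, not a confirmed part of this module; only the result-handling style is the point.

(* Sketch only: [from_file] is a hypothetical constructor mirroring
   Hugging Face Tokenizers; it illustrates the [result] convention
   for fallible operations. *)
let tokenizer =
  match Saga.Tokenizers.Tokenizer.from_file "tokenizer.json" with
  | Ok t -> t
  | Error msg -> failwith msg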
Quick Start
open Saga.Tokenizers

(* Sample training corpus. *)
let names = [ "alice"; "bob"; "carol" ]

let () =
  (* Create a character-level tokenizer *)
  let tokenizer = Tokenizer.create ~model:(Model.chars ()) in
  (* Add special tokens *)
  Tokenizer.add_special_tokens tokenizer
    [ Added_token.create ~content:"." ~special:true () ];
  (* Train on data *)
  Tokenizer.train_from_iterator tokenizer (List.to_seq names)
    ~trainer:(Trainer.chars ()) ();
  (* Encode with options *)
  let encoding =
    Tokenizer.encode tokenizer ~sequence:"hello world" ~add_special_tokens:true ()
  in
  (* Get ids and decode *)
  let ids = Encoding.ids encoding in
  let text = Tokenizer.decode tokenizer ids ~skip_special_tokens:true in
  print_endline text
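Subword models plug into the same pipeline. The following is a minimal sketch assuming Model.bpe and Trainer.bpe constructors parallel to the character-level ones above; those two names and their argument lists are assumptions, while the rest matches the quick start.

open Saga.Tokenizers

(* Sketch: [Model.bpe] and [Trainer.bpe] are assumed analogues of
   [Model.chars] / [Trainer.chars]; everything else follows the
   quick start above. *)
let () =
  let tokenizer = Tokenizer.create ~model:(Model.bpe ()) in
  let corpus = List.to_seq [ "low"; "lower"; "lowest"; "newer"; "wider" ] in
  Tokenizer.train_from_iterator tokenizer corpus ~trainer:(Trainer.bpe ()) ();
  let encoding = Tokenizer.encode tokenizer ~sequence:"lowest" () in
  let ids = Encoding.ids encoding in
  print_endline (Tokenizer.decode tokenizer ids ~skip_special_tokens:true)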
Key Concepts
- Tokenizer.t: the main tokenizer instance, configurable with model, normalizer, etc.
- Models.t: the core tokenization algorithm (e.g., Chars, BPE).
- Encoding.t: the result of encoding, with ids, tokens, offsets, masks, etc., exposed as a record (see the inspection sketch below).
- Special tokens: handled via add_special_tokens and encoding options.
- Unicode: all functions handle Unicode correctly via the Unicode module.

This API aligns with Hugging Face Tokenizers v0.21 (as of 2025), including support for fast Rust-backed operations where applicable.
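A sketch of inspecting an encoding: Encoding.ids appears in the quick start, while Encoding.tokens and the array representations (int array, string array) are assumptions drawn from the record fields listed above.

open Saga.Tokenizers

(* Assumed accessors/representations: [ids : int array] and
   [tokens : string array]; only [Encoding.ids] is shown in the
   quick start. *)
let show_encoding encoding =
  Array.iter2
    (fun id token -> Printf.printf "%4d  %s\n" id token)
    (Encoding.ids encoding)
    (Encoding.tokens encoding)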
Supporting modules cover the remaining pipeline stages (composed in the sketch below):

- Text normalization matching HuggingFace tokenizers.
- Pre-tokenization for text processing pipelines.
- Post-processing of tokenization output.
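A minimal composition sketch follows. The constructor and setter names (Normalizers.lowercase, Pre_tokenizers.whitespace, Tokenizer.set_normalizer, Tokenizer.set_pre_tokenizer) are assumptions patterned on Hugging Face Tokenizers, not confirmed parts of this API.

open Saga.Tokenizers

(* Hypothetical composition: all names except [Tokenizer.create] and
   [Model.chars] are assumed by analogy with Hugging Face Tokenizers. *)
let make_pipeline () =
  let t = Tokenizer.create ~model:(Model.chars ()) in
  Tokenizer.set_normalizer t (Normalizers.lowercase ());
  Tokenizer.set_pre_tokenizer t (Pre_tokenizers.whitespace ());
  t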
Enums as Polymorphic Variants
These option enums are modeled as polymorphic variants (see the configuration sketch below):

- Padding or truncation direction.
- Behavior for splitting delimiters.
- Truncation strategy.
- Prepend scheme for metaspace.
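A configuration sketch showing the polymorphic-variant style; the setter names and variant constructors (`Right, `Left, `Longest_first) are assumptions mirroring the Hugging Face options these enums correspond to.

open Saga.Tokenizers

(* Hypothetical setters; the variant constructors illustrate the
   polymorphic-variant style only. *)
let configure tokenizer =
  Tokenizer.set_truncation tokenizer ~max_length:128
    ~strategy:`Longest_first ~direction:`Right;
  Tokenizer.set_padding tokenizer ~direction:`Left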