package saga

Module Saga_tokenizers

Tokenizers library - text tokenization for ML. This module provides fast and flexible tokenization for machine learning applications, supporting multiple algorithms, from simple word splitting to advanced subword tokenization such as BPE, Unigram, WordLevel, and WordPiece.

The API is designed to match Hugging Face Tokenizers v0.21 as closely as possible, adapted to idiomatic OCaml: functional style, records for configurations, polymorphic variants for enums, default values for optional arguments, and result types for fallible operations. The central type is Tokenizer.t, which represents a configurable tokenization pipeline.

Quick Start

  open Saga.Tokenizers

  let () =
    (* Create a character-level tokenizer *)
    let tokenizer = Tokenizer.create ~model:(Models.chars ()) in
    (* Add special tokens *)
    Tokenizer.add_special_tokens tokenizer
      [ Added_token.create ~content:"." ~special:true () ];
    (* Train on data *)
    let names = [ "alice"; "bob"; "carol" ] in
    Tokenizer.train_from_iterator tokenizer (List.to_seq names)
      ~trainer:(Trainers.chars ()) ();
    (* Encode with options *)
    let encoding =
      Tokenizer.encode tokenizer ~sequence:"hello world"
        ~add_special_tokens:true ()
    in
    (* Get ids and decode *)
    let ids = Encoding.ids encoding in
    let text = Tokenizer.decode tokenizer ids ~skip_special_tokens:true in
    print_endline text

Key Concepts

  • Tokenizer.t: The main tokenizer instance, configurable with model, normalizer, etc.
  • Models.t: Core tokenization algorithm (e.g., Chars, BPE).
  • Encoding.t: Result of encoding, with ids, tokens, offsets, masks, etc., exposed as a record (see the sketch after this list).
  • Special tokens: Handled via add_special_tokens and encoding options.
  • All functions handle Unicode correctly via the Unicode module.

This API aligns with Hugging Face Tokenizers v0.21 (as of 2025), including support for fast Rust-backed operations where applicable.
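
A minimal sketch of inspecting an Encoding.t. Encoding.ids appears in the Quick Start above; the tokens and offsets accessors are assumptions modelled on the Hugging Face API and may be named differently here:

  (* Hypothetical sketch: [Encoding.tokens] and [Encoding.offsets] are
     assumed by analogy with Hugging Face and may differ in this library. *)
  let inspect tokenizer =
    let encoding = Tokenizer.encode tokenizer ~sequence:"hello world" () in
    let ids = Encoding.ids encoding in         (* vocabulary indices *)
    let tokens = Encoding.tokens encoding in   (* surface token strings *)
    let offsets = Encoding.offsets encoding in (* spans into the input *)
    (ids, tokens, offsets)
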
module Either : sig ... end

Either type for API compatibility.

module Unicode : sig ... end

Unicode utilities.

module Models : sig ... end

Tokenization models module.

module Normalizers : sig ... end

Text normalization module matching Hugging Face tokenizers (see the sketch after Pre_tokenizers below).

module Pre_tokenizers : sig ... end

Pre-tokenization for text processing pipelines.
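
Normalizers and pre-tokenizers are attached to a tokenizer before encoding. A minimal sketch, assuming Hugging Face-style names; every identifier below except Tokenizer.create and Models.chars is hypothetical and may differ in this library:

  (* Hypothetical wiring of a normalization + pre-tokenization pipeline.
     [Normalizers.lowercase], [Pre_tokenizers.whitespace],
     [Tokenizer.set_normalizer] and [Tokenizer.set_pre_tokenizer] are
     assumptions modelled on the Hugging Face API. *)
  let tokenizer = Tokenizer.create ~model:(Models.chars ())

  let () =
    Tokenizer.set_normalizer tokenizer (Normalizers.lowercase ());
    Tokenizer.set_pre_tokenizer tokenizer (Pre_tokenizers.whitespace ())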

module Processors : sig ... end

Post-processing module for tokenization output.

module Decoders : sig ... end

Decoding module for converting token IDs back to text.

module Trainers : sig ... end

Training module for tokenization models.

module Encoding : sig ... end

Encoding module - represents the output of a tokenizer.

module Bpe : sig ... end

Byte Pair Encoding (BPE) tokenization module.

module Wordpiece : sig ... end

WordPiece tokenization module.

Enums as Polymorphic Variants

type direction = [ `Left | `Right ]

Padding or truncation direction.
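
Since these are ordinary polymorphic variants, values are consumed with a plain pattern match, for example:

  (* Render a [direction] as a string; the variants are exactly those above. *)
  let side_to_string : direction -> string = function
    | `Left -> "left"
    | `Right -> "right"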

type split_delimiter_behavior = [
  | `Removed
  | `Isolated
  | `Merged_with_previous
  | `Merged_with_next
  | `Contiguous
]

Behavior for the delimiter itself when splitting on it: `Removed drops it, `Isolated keeps it as a standalone token, `Merged_with_previous and `Merged_with_next attach it to the neighboring token, and `Contiguous groups runs of consecutive delimiters into a single token.
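
For example, splitting "the-final--countdown" on "-" produces the following pre-tokens under each behavior (per the Hugging Face semantics this API mirrors):

  `Removed              -> ["the"; "final"; "countdown"]
  `Isolated             -> ["the"; "-"; "final"; "-"; "-"; "countdown"]
  `Merged_with_previous -> ["the-"; "final-"; "-"; "countdown"]
  `Merged_with_next     -> ["the"; "-final"; "-"; "-countdown"]
  `Contiguous           -> ["the"; "-"; "final"; "--"; "countdown"]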

type strategy = [ `Longest_first | `Only_first | `Only_second ]

Truncation strategy: `Longest_first repeatedly removes tokens from the longer sequence of a pair until the total fits, while `Only_first and `Only_second truncate only that member of the pair.

type prepend_scheme = [ `Always | `Never | `First ]

Prepend scheme for the metaspace pre-tokenizer's replacement character: prepend it to every section (`Always), to none (`Never), or only to the first section (`First).

Core Types

module Added_token : sig ... end

Special-token descriptor, used with Tokenizer.add_special_tokens (see the Quick Start).

module Tokenizer : sig ... end

The main tokenizer pipeline; its type Tokenizer.t is described under Key Concepts.