package saga

Module Saga_tokenizers

Tokenization library for Saga.

This module provides the main tokenization API, matching the design of HuggingFace Tokenizers. It supports multiple tokenization algorithms (BPE, WordPiece, Unigram, Word-level, Character-level), text normalization, pre-tokenization, post-processing, and decoding.

Quick Start

Load a pretrained tokenizer:

  let tokenizer = Tokenizer.from_file "tokenizer.json" |> Result.get_ok in
  let encoding = Tokenizer.encode tokenizer "Hello world!" in
  let ids = Encoding.get_ids encoding

Create a BPE tokenizer from scratch:

  let tokenizer =
    Tokenizer.bpe
      ~vocab:[("hello", 0); ("world", 1); ("[PAD]", 2)]
      ~merges:[]
      ()
  in
  let encoding = Tokenizer.encode tokenizer "hello world" in
  let text = Tokenizer.decode tokenizer [0; 1]

Train a new tokenizer:

  let texts = [ "Hello world"; "How are you?"; "Hello again" ] in
  let tokenizer =
    Tokenizer.train_bpe (`Seq (List.to_seq texts)) ~vocab_size:1000 ()
  in
  Tokenizer.save_pretrained tokenizer ~path:"./my_tokenizer"

Architecture

Tokenization proceeds through stages:

  • Normalization: Clean and normalize text (lowercase, accent removal, etc.)
  • Pre-tokenization: Split text into words or subwords
  • Tokenization: Apply vocabulary-based encoding (BPE, WordPiece, etc.)
  • Post-processing: Add special tokens, set type IDs
  • Padding/Truncation: Adjust length for batching

Each stage is optional and configurable via builder methods.
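
A sketch of how these stages might be wired together is shown below. The builder names (Tokenizer.with_normalizer, Tokenizer.with_pre_tokenizer, Tokenizer.with_truncation) and the constructors Normalizers.lowercase and Pre_tokenizers.whitespace are assumptions about the API shape, not confirmed names; the module signatures below document the actual interface.

  (* Hypothetical builder and constructor names, used only for illustration;
     the real API lives in the Normalizers, Pre_tokenizers, and Tokenizer
     modules documented below. *)
  let tokenizer =
    Tokenizer.bpe ~vocab:[ ("hello", 0); ("world", 1) ] ~merges:[] ()
    |> Tokenizer.with_normalizer (Normalizers.lowercase ())
    |> Tokenizer.with_pre_tokenizer (Pre_tokenizers.whitespace ())
    |> Tokenizer.with_truncation { max_length = 512; direction = `Right }
  in
  Tokenizer.encode tokenizer "Hello World!"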

Post-processing patterns are model-specific:

  • BERT: Adds CLS at the start and SEP at the end; type IDs distinguish the two sequences
  • GPT-2: No special tokens by default; uses BOS/EOS if configured
  • RoBERTa: Uses <s> and </s> tokens, similar to BERT but with a different format
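
As a hedged illustration of the BERT pattern, the sketch below assumes a Processors.bert constructor and a Tokenizer.with_post_processor builder; both are assumptions about the API shape rather than confirmed names.

  (* Hypothetical names: Processors.bert and Tokenizer.with_post_processor are
     assumptions used only to illustrate the BERT pattern. *)
  let tokenizer =
    Tokenizer.from_file "bert-tokenizer.json" |> Result.get_ok
    |> Tokenizer.with_post_processor
         (Processors.bert ~cls:("[CLS]", 101) ~sep:("[SEP]", 102))
  in
  (* A single sequence then encodes as [CLS] tokens [SEP] with type ID 0
     throughout; a sentence pair encodes as [CLS] A [SEP] B [SEP] with
     type ID 1 for segment B. *)
  Tokenizer.encode tokenizer "Hello world!"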

module Unicode : sig ... end

Unicode utilities for normalization.

module Normalizers : sig ... end

Text normalization (lowercase, NFD/NFC, accent stripping, etc.).

module Pre_tokenizers : sig ... end

Pre-tokenization (whitespace splitting, punctuation handling, etc.).

module Processors : sig ... end

Post-processing (adding CLS/SEP, setting type IDs, etc.).

module Decoders : sig ... end

Decoding token IDs back to text.

module Encoding : sig ... end

Encoding representation (output of tokenization).

type direction = [
  | `Left
  | `Right
]

Direction for padding or truncation: `Left (beginning) or `Right (end).

type special = {
  token : string;
    (* The token text (e.g., "<pad>", "<unk>"). *)
  single_word : bool;
    (* Whether this token must match whole words only. Default: false. *)
  lstrip : bool;
    (* Whether to strip whitespace on the left. Default: false. *)
  rstrip : bool;
    (* Whether to strip whitespace on the right. Default: false. *)
  normalized : bool;
    (* Whether to apply normalization to this token. Default: true for regular
       tokens, false for special tokens. *)
}

Special token configuration.

Special tokens are not split during tokenization and can be skipped during decoding. Token IDs are assigned automatically when added to the vocabulary.

All special token types are uniform: the semantic meaning (pad, unk, bos, etc.) is contextual, not encoded in the type.
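
For example, a padding token configured as a special token could be described with a record like the one below; the field values are the documented defaults, and how the record is registered with a tokenizer is covered by the Special and Tokenizer modules at the end of this page.

  let pad : special =
    {
      token = "[PAD]";
      single_word = false;
      lstrip = false;
      rstrip = false;
      normalized = false; (* special tokens are not normalized by default *)
    }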

type pad_length = [
  | `Batch_longest
  | `Fixed of int
  | `To_multiple of int
]

Padding length strategy.

  • `Batch_longest: Pad to longest sequence in batch
  • `Fixed n: Pad all sequences to fixed length n
  • `To_multiple n: Pad to smallest multiple of n >= sequence length

type padding = {
  length : pad_length;
  direction : direction;
  pad_id : int option;
  pad_type_id : int option;
  pad_token : string option;
}

Padding configuration.

When the optional fields are None, padding falls back to the tokenizer's configured padding token. If the tokenizer has no padding token configured and these fields are None, padding operations raise Invalid_argument.
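
For instance, a configuration that right-pads every sequence to a fixed length of 128 while deferring to the tokenizer's configured padding token:

  let pad_config : padding =
    {
      length = `Fixed 128;
      direction = `Right;
      pad_id = None; (* fall back to the tokenizer's padding token *)
      pad_type_id = None;
      pad_token = None;
    }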

type truncation = {
  max_length : int;
  direction : direction;
}

Truncation configuration.

Limits sequences to max_length tokens, removing from specified direction.
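
For example, to keep at most 512 tokens and drop the excess from the end:

  let trunc : truncation = { max_length = 512; direction = `Right }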

type data = [
  | `Files of string list
  | `Seq of string Seq.t
  | `Iterator of unit -> string option
]

Training data source.

  • `Files paths: Read training text from files
  • `Seq seq: Use sequence of strings
  • `Iterator f: Pull training data via iterator (None signals end)
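
Each variant can be constructed directly; the file names and in-memory corpus below are illustrative:

  let from_files : data = `Files [ "corpus_a.txt"; "corpus_b.txt" ] in
  let from_seq : data = `Seq (List.to_seq [ "Hello world"; "How are you?" ]) in
  (* An iterator yields strings until it returns None to signal the end. *)
  let remaining = ref [ "first line"; "second line" ] in
  let from_iter : data =
    `Iterator
      (fun () ->
        match !remaining with
        | [] -> None
        | x :: rest ->
            remaining := rest;
            Some x)
  in
  ignore (from_files, from_seq, from_iter)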

Special Token Constructors

module Special : sig ... end

module Tokenizer : sig ... end