Tokenizers library - text tokenization for ML.

This module provides fast and flexible tokenization for machine learning applications, supporting multiple algorithms, from simple word splitting to subword tokenization such as BPE, Unigram, WordLevel, and WordPiece.

The API is designed to match Hugging Face Tokenizers v0.21 as closely as possible, adapted to idiomatic OCaml: functional style, records for configuration, polymorphic variants for enums, default values for optionals, and result types for fallible operations. The central type is Tokenizer.t, which represents a configurable tokenization pipeline.
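The pipeline stages mentioned above (model, normalizer, and so on) are combined when configuring a tokenizer. A minimal sketch of what that assembly might look like, assuming constructors and setters such as Model.bpe, Normalizers.lowercase, Pre_tokenizers.whitespace, Tokenizer.set_normalizer, and Tokenizer.set_pre_tokenizer that mirror the Hugging Face component names (hypothetical; check the actual module signatures):

```ocaml
open Saga.Tokenizers

(* Hypothetical pipeline assembly: the component names below are modeled
   on Hugging Face Tokenizers and may differ in this library. *)
let tokenizer =
  let t = Tokenizer.create ~model:(Model.bpe ()) in
  (* Lowercase input before tokenization (assumed normalizer). *)
  Tokenizer.set_normalizer t (Normalizers.lowercase ());
  (* Split on whitespace before the model runs (assumed pre-tokenizer). *)
  Tokenizer.set_pre_tokenizer t (Pre_tokenizers.whitespace ());
  t
```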
Quick Start
open Saga.Tokenizers

(* Create a character-level tokenizer *)
let tokenizer = Tokenizer.create ~model:(Model.chars ()) in

(* Add special tokens *)
Tokenizer.add_special_tokens tokenizer
  [ Added_token.create ~content:"." ~special:true () ];

(* Train on data (names : string list) *)
Tokenizer.train_from_iterator tokenizer (List.to_seq names)
  ~trainer:(Trainer.chars ()) ();

(* Encode with options *)
let encoding =
  Tokenizer.encode tokenizer ~sequence:"hello world" ~add_special_tokens:true ()
in

(* Get ids and decode *)
let ids = Encoding.ids encoding in
let text = Tokenizer.decode tokenizer ids ~skip_special_tokens:true in
print_endline text
Key Concepts
Tokenizer.t: The main tokenizer instance, configurable with model, normalizer, etc.
Encoding.t: Result of encoding, with ids, tokens, offsets, masks, etc., exposed as a record.
Special tokens: Handled via add_special_tokens and encoding options.
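To make Encoding.t concrete, here is a sketch that encodes a string and pairs up ids with tokens. It reuses the accessors from the Quick Start (Encoding.ids, Encoding.tokens); the assumption that these return lists, and the Encoding.tokens accessor itself, are hypothetical and should be checked against the real interface:

```ocaml
open Saga.Tokenizers

(* Inspect an encoding. Encoding.tokens and the list return types are
   assumptions modeled on Hugging Face Tokenizers. *)
let () =
  let tokenizer = Tokenizer.create ~model:(Model.chars ()) in
  let encoding =
    Tokenizer.encode tokenizer ~sequence:"hi" ~add_special_tokens:false ()
  in
  (* Print each id next to the token it maps to. *)
  List.iter2
    (fun id token -> Printf.printf "%d -> %s\n" id token)
    (Encoding.ids encoding)
    (Encoding.tokens encoding)
```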
All functions handle Unicode correctly via the Unicode module. The API surface tracks Hugging Face Tokenizers v0.21 (as of 2025), including fast Rust-backed operations where applicable.