Module Saga_tokenizers
Tokenizers library - text tokenization for ML. This module provides fast and flexible tokenization for machine learning applications, supporting multiple algorithms from simple word splitting to advanced subword tokenization such as BPE, Unigram, WordLevel, and WordPiece. The API is designed to match Hugging Face Tokenizers v0.21 as closely as possible, adapted to idiomatic OCaml: functional style, records for configurations, polymorphic variants for enums, default values for optionals, and result types for fallible operations. The central type is Tokenizer.t, which represents a configurable tokenization pipeline.
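As a minimal illustration of the result-type convention, loading a serialized tokenizer might look like the sketch below. The Tokenizer.from_file name is an assumption borrowed from Hugging Face's API, not a confirmed part of this module; only the result-handling style is the point.

(* Sketch only: [from_file] is a hypothetical constructor mirroring
   Hugging Face Tokenizers; it illustrates the [result] convention
   for fallible operations. *)
let tokenizer =
  match Saga.Tokenizers.Tokenizer.from_file "tokenizer.json" with
  | Ok t -> t
  | Error msg -> failwith msg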
Quick Start
open Saga.Tokenizers

(* Sample training corpus. *)
let names = [ "alice"; "bob"; "carol" ]

let () =
  (* Create a character-level tokenizer *)
  let tokenizer = Tokenizer.create ~model:(Model.chars ()) in
  (* Add special tokens *)
  Tokenizer.add_special_tokens tokenizer
    [ Added_token.create ~content:"." ~special:true () ];
  (* Train on data *)
  Tokenizer.train_from_iterator tokenizer (List.to_seq names)
    ~trainer:(Trainer.chars ()) ();
  (* Encode with options *)
  let encoding =
    Tokenizer.encode tokenizer ~sequence:"hello world" ~add_special_tokens:true ()
  in
  (* Get ids and decode *)
  let ids = Encoding.ids encoding in
  let text = Tokenizer.decode tokenizer ids ~skip_special_tokens:true in
  print_endline text
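Subword models plug into the same pipeline. The following is a minimal sketch assuming Model.bpe and Trainer.bpe constructors parallel to the character-level ones above; those two names and their argument lists are assumptions, while the rest matches the quick start.

open Saga.Tokenizers

(* Sketch: [Model.bpe] and [Trainer.bpe] are assumed analogues of
   [Model.chars] / [Trainer.chars]; everything else follows the
   quick start above. *)
let () =
  let tokenizer = Tokenizer.create ~model:(Model.bpe ()) in
  let corpus = List.to_seq [ "low"; "lower"; "lowest"; "newer"; "wider" ] in
  Tokenizer.train_from_iterator tokenizer corpus ~trainer:(Trainer.bpe ()) ();
  let encoding = Tokenizer.encode tokenizer ~sequence:"lowest" () in
  let ids = Encoding.ids encoding in
  print_endline (Tokenizer.decode tokenizer ids ~skip_special_tokens:true)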
Key Concepts
- Tokenizer.t: the main tokenizer instance, configurable with model, normalizer, etc.
- Models.t: the core tokenization algorithm (e.g., Chars, BPE).
- Encoding.t: the result of encoding, with ids, tokens, offsets, masks, etc., exposed as a record (see the inspection sketch below).
- Special tokens: handled via add_special_tokens and encoding options.
- Unicode: all functions handle Unicode correctly via the Unicode module.

This API aligns with Hugging Face Tokenizers v0.21 (as of 2025), including support for fast Rust-backed operations where applicable.
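A sketch of inspecting an encoding: Encoding.ids appears in the quick start, while Encoding.tokens and the array representations (int array, string array) are assumptions drawn from the record fields listed above.

open Saga.Tokenizers

(* Assumed accessors/representations: [ids : int array] and
   [tokens : string array]; only [Encoding.ids] is shown in the
   quick start. *)
let show_encoding encoding =
  Array.iter2
    (fun id token -> Printf.printf "%4d  %s\n" id token)
    (Encoding.ids encoding)
    (Encoding.tokens encoding)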
Supporting modules cover the remaining pipeline stages (composed in the sketch below):

- Text normalization matching HuggingFace tokenizers.
- Pre-tokenization for text processing pipelines.
- Post-processing of tokenization output.
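A minimal composition sketch follows. The constructor and setter names (Normalizers.lowercase, Pre_tokenizers.whitespace, Tokenizer.set_normalizer, Tokenizer.set_pre_tokenizer) are assumptions patterned on Hugging Face Tokenizers, not confirmed parts of this API.

open Saga.Tokenizers

(* Hypothetical composition: all names except [Tokenizer.create] and
   [Model.chars] are assumed by analogy with Hugging Face Tokenizers. *)
let make_pipeline () =
  let t = Tokenizer.create ~model:(Model.chars ()) in
  Tokenizer.set_normalizer t (Normalizers.lowercase ());
  Tokenizer.set_pre_tokenizer t (Pre_tokenizers.whitespace ());
  t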
Enums as Polymorphic Variants
These option enums are modeled as polymorphic variants (see the configuration sketch below):

- Padding or truncation direction.
- Behavior for splitting delimiters.
- Truncation strategy.
- Prepend scheme for metaspace.
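A configuration sketch showing the polymorphic-variant style; the setter names and variant constructors (`Right, `Left, `Longest_first) are assumptions mirroring the Hugging Face options these enums correspond to.

open Saga.Tokenizers

(* Hypothetical setters; the variant constructors illustrate the
   polymorphic-variant style only. *)
let configure tokenizer =
  Tokenizer.set_truncation tokenizer ~max_length:128
    ~strategy:`Longest_first ~direction:`Right;
  Tokenizer.set_padding tokenizer ~direction:`Left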