package saga

  1. Overview
  2. Docs
Text processing and NLP extensions for Nx

Install

dune-project
 Dependency

Authors

Maintainers

Sources

raven-1.0.0.alpha2.tbz
sha256=93abc49d075a1754442ccf495645bc4fdc83e4c66391ec8aca8fa15d2b4f44d2
sha512=5eb958c51f30ae46abded4c96f48d1825f79c7ce03f975f9a6237cdfed0d62c0b4a0774296694def391573d849d1f869919c49008acffca95946b818ad325f6f

doc/saga.tokenizers/Saga_tokenizers/Normalizers/index.html

Module Saga_tokenizers.NormalizersSource

Text normalization (lowercase, NFD/NFC, accent stripping, etc.).

Text normalization module matching HuggingFace tokenizers.

Normalizers are responsible for cleaning and transforming text before tokenization. This includes operations like lowercasing, accent removal, Unicode normalization, and handling special characters.

Overview

Normalization is the first stage in tokenization pipelines, applied before pre-tokenization and vocabulary-based encoding. It ensures consistent text representation, handles Unicode quirks, and removes irrelevant variations.

Common normalization operations:

  • Lowercasing for case-insensitive models
  • Accent removal (é → e) for language-agnostic matching
  • Unicode normalization (NFC/NFD/NFKC/NFKD) for canonical representation
  • Whitespace cleanup and control character removal
  • BERT-specific preprocessing (CJK handling, accent stripping)

When to Use Normalization

Apply normalization when you want to:

  • Reduce vocabulary size by merging case variants (Hello/hello)
  • Handle accented characters uniformly (café/cafe)
  • Clean noisy text (control characters, extra whitespace)
  • Match model-specific preprocessing (BERT, GPT-2)

Skip normalization when you need to:

  • Preserve case distinctions (proper nouns, acronyms)
  • Keep accent information (é vs e are different)
  • Process code or structured data with meaningful formatting

Usage Examples

Simple lowercasing:

  let normalizer = Normalizers.lowercase () in
  let result = Normalizers.normalize_str normalizer "Hello World!" in
  (* result = "hello world!" *)

BERT-style normalization:

  let normalizer = Normalizers.bert ~lowercase:true () in
  let result = Normalizers.normalize_str normalizer "  Héllo\tWorld!  " in
  (* Cleans whitespace, removes accents, lowercases *)

Combining multiple normalizers:

  let normalizer = Normalizers.sequence [
    Normalizers.nfd ();  (* Decompose accented chars *)
    Normalizers.strip_accents ();  (* Remove accent marks *)
    Normalizers.lowercase ();  (* Convert to lowercase *)
    Normalizers.strip ~left:true ~right:true ();  (* Trim whitespace *)
  ] in
  let result = Normalizers.normalize_str normalizer "  Café  " in
  (* result = "cafe" *)

Unicode Normalization Forms

Unicode provides four normalization forms for canonical representation:

  • NFC (Canonical Composition): Decomposes then recomposes characters. Preferred for most text processing. é stored as single character U+00E9.
  • NFD (Canonical Decomposition): Decomposes characters into base + combining marks. é stored as e (U+0065) + ́ (U+0301). Useful before accent removal.
  • NFKC (Compatibility Composition): Replaces compatibility characters with canonical equivalents, then composes. Converts fi (ligature) → fi. Lossy but reduces variation.
  • NFKD (Compatibility Decomposition): Compatibility decomposition without recomposition. Most aggressive normalization, useful for search.

Typical usage:

  • Use NFC for storage and display (most compact)
  • Use NFD before accent stripping
  • Use NFKC/NFKD for fuzzy matching and search
Sourcetype normalized_string = {
  1. normalized : string;
    (*

    The normalized text

    *)
  2. original : string;
    (*

    The original text

    *)
  3. alignments : (int * int) array;
    (*

    Alignment mappings from normalized to original positions

    *)
}

Type representing a normalized string with alignment information

Sourcetype t

Main normalizer type

Constructors

Sourceval bert : ?clean_text:bool -> ?handle_chinese_chars:bool -> ?strip_accents:bool option -> ?lowercase:bool -> unit -> t

bert ~clean_text ~handle_chinese_chars ~strip_accents ~lowercase () creates a BERT normalizer.

  • parameter clean_text

    Remove control characters and normalize whitespace (default: true)

  • parameter handle_chinese_chars

    Add spaces around CJK characters (default: true)

  • parameter strip_accents

    Strip accents (None means auto-detect based on lowercase) (default: None)

  • parameter lowercase

    Convert to lowercase (default: true)

Sourceval strip : ?left:bool -> ?right:bool -> unit -> t

strip ~left ~right () removes whitespace from text boundaries.

Trims whitespace characters from the beginning and/or end of text. Does not affect internal whitespace.

  • parameter left

    Strip whitespace from left (beginning). Default: true.

  • parameter right

    Strip whitespace from right (end). Default: true.

  let normalizer = Normalizers.strip ~left:true ~right:true () in
  let result = Normalizers.normalize_str normalizer "  Hello  " in
  (* result = "Hello" *)
Sourceval strip_accents : unit -> t

strip_accents () removes accent marks from characters.

Converts accented characters to their base forms (é → e, ñ → n). Uses Unicode NFD decomposition followed by removal of combining marks. Typically applied after NFD normalization.

  let normalizer = Normalizers.strip_accents () in
  let result = Normalizers.normalize_str normalizer "Café résumé" in
  (* result = "Cafe resume" *)
Sourceval nfc : unit -> t

nfc () applies Unicode NFC normalization.

Canonical Decomposition followed by Canonical Composition. Decomposes characters (é → e + ́), then recomposes them into precomposed forms (e + ́ → é). Produces canonical composed representation.

Use for: Standard text storage, ensuring consistent representation.

  let normalizer = Normalizers.nfc () in
  let result = Normalizers.normalize_str normalizer "e\u{0301}" in
  (* Combining e + accent → composed é *)
Sourceval nfd : unit -> t

nfd () applies Unicode NFD normalization.

Canonical Decomposition. Splits precomposed characters into base character

  1. combining marks (é → e + ́). Essential before accent stripping.

Use for: Accent removal pipelines, character-level analysis.

  let normalizer = Normalizers.nfd () in
  let result = Normalizers.normalize_str normalizer "é" in
  (* Composed é → e + combining accent *)
Sourceval nfkc : unit -> t

nfkc () applies Unicode NFKC normalization.

Compatibility Decomposition followed by Canonical Composition. Replaces compatibility characters with canonical equivalents, then composes. Converts ligatures (fi → fi), full-width characters (A → A), subscripts. Lossy transformation.

Use for: Fuzzy search, aggressive text normalization.

  let normalizer = Normalizers.nfkc () in
  let result = Normalizers.normalize_str normalizer "file" in
  (* Ligature fi → fi, result = "file" *)
Sourceval nfkd : unit -> t

nfkd () applies Unicode NFKD normalization.

Compatibility Decomposition. Most aggressive Unicode normalization. Decomposes compatibility characters and canonical characters. Useful for maximum normalization and search applications.

Use for: Aggressive fuzzy matching, search indexing.

  let normalizer = Normalizers.nfkd () in
  let result = Normalizers.normalize_str normalizer "fi" in
  (* Decomposes ligatures, compatibility forms *)
Sourceval lowercase : unit -> t

lowercase () converts text to lowercase.

Applies Unicode lowercase transformation. Language-agnostic but may not handle all language-specific casing rules correctly (e.g., Turkish i).

  let normalizer = Normalizers.lowercase () in
  let result = Normalizers.normalize_str normalizer "Hello World!" in
  (* result = "hello world!" *)
Sourceval replace : pattern:string -> replacement:string -> unit -> t

replace ~pattern ~replacement () replaces text matching regex pattern.

Finds all matches of pattern and replaces them with replacement string. Useful for custom text transformations.

  • parameter pattern

    Regular expression pattern (Str module syntax)

  • parameter replacement

    Replacement string (supports backreferences)

  let normalizer = Normalizers.replace ~pattern:"[0-9]+" ~replacement:"<NUM>" () in
  let result = Normalizers.normalize_str normalizer "I have 123 apples" in
  (* result = "I have <NUM> apples" *)
Sourceval prepend : prepend:string -> t

prepend ~prepend prepends string to text.

Adds fixed string to the beginning of text. Useful for adding prefixes or special markers.

  • parameter prepend

    String to add at beginning

  let normalizer = Normalizers.prepend ~prepend:">> " in
  let result = Normalizers.normalize_str normalizer "Hello" in
  (* result = ">> Hello" *)
Sourceval byte_level : ?add_prefix_space:bool -> ?use_regex:bool -> unit -> t

byte_level ~add_prefix_space ~use_regex () applies byte-level normalization.

Converts text to byte representation using special Unicode characters. Used by GPT-2 style models for robust handling of any byte sequence.

  • parameter add_prefix_space

    Add space prefix to text if it doesn't start with whitespace. Default: false.

  • parameter use_regex

    Use regex-based processing. Default: false.

  let normalizer = Normalizers.byte_level ~add_prefix_space:true () in
  let result = Normalizers.normalize_str normalizer "Hello" in
  (* Converts to byte representation, adds prefix space *)
Sourceval sequence : t list -> t

sequence normalizers combines multiple normalizers into a sequence.

Applies normalizers left-to-right. Each normalizer processes the output of the previous one. Useful for building complex normalization pipelines.

  • parameter normalizers

    List of normalizers to apply in order

  let normalizer = Normalizers.sequence [
    Normalizers.nfd ();
    Normalizers.strip_accents ();
    Normalizers.lowercase ();
  ] in
  let result = Normalizers.normalize_str normalizer "Café" in
  (* Applies: NFD decomposition → accent removal → lowercase *)
  (* result = "cafe" *)

Operations

Sourceval normalize : t -> string -> normalized_string

normalize t text applies normalization to a string, preserving alignment information.

Sourceval normalize_str : t -> string -> string

normalize_str t text applies normalization to a string, returning only the normalized text.

Serialization

Sourceval to_json : t -> Yojson.Basic.t

to_json t converts normalizer to JSON representation.

Sourceval of_json : Yojson.Basic.t -> t

of_json json creates normalizer from JSON representation.