package saga

Module Saga_tokenizers.Normalizers

Text normalization module matching HuggingFace tokenizers.

Normalizers are responsible for cleaning and transforming text before tokenization. This includes operations like lowercasing, accent removal, Unicode normalization, and handling special characters.

type normalized_string = {
  normalized : string;
    (* The normalized text *)
  original : string;
    (* The original text *)
  alignments : (int * int) array;
    (* Alignment mappings from normalized to original positions *)
}

Type representing a normalized string with alignment information
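
As a rough illustration of how the alignment array might be consumed, the sketch below redeclares the record locally (so it compiles without the library) and looks up the piece of the original text behind one position of the normalized text. It assumes that each entry is a (start, stop) byte span into original, stop exclusive, with one entry per byte of normalized. This page does not specify that layout, so both the layout and the helper name original_span are hypothetical.

```ocaml
(* Local copy of the record documented above, so the sketch is standalone. *)
type normalized_string = {
  normalized : string;
  original : string;
  alignments : (int * int) array;
}

(* ASSUMPTION: alignments.(i) = (start, stop) is the span of [original]
   (stop exclusive) that produced byte [i] of [normalized]. *)
let original_span (ns : normalized_string) (i : int) : string option =
  if i < 0 || i >= Array.length ns.alignments then None
  else
    let start, stop = ns.alignments.(i) in
    if start < 0 || stop < start || stop > String.length ns.original then None
    else Some (String.sub ns.original start (stop - start))

(* Hand-built value mimicking what lowercasing "Hi" could produce. *)
let example =
  { normalized = "hi"; original = "Hi"; alignments = [| (0, 1); (1, 2) |] }

let () =
  match original_span example 0 with
  | Some s -> print_endline s
  | None -> print_endline "<out of range>"
```

Returning an option keeps out-of-range positions and malformed spans from raising, which is convenient when the alignment data comes from an external source.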

type t

Main normalizer type

Constructors

val bert : ?clean_text:bool -> ?handle_chinese_chars:bool -> ?strip_accents:bool option -> ?lowercase:bool -> unit -> t

Create a BERT normalizer.

  • parameter clean_text

    Remove control characters and normalize whitespace (default: true)

  • parameter handle_chinese_chars

    Add spaces around CJK characters (default: true)

  • parameter strip_accents

    Strip accents. None means the choice follows the lowercase setting (default: None)

  • parameter lowercase

    Convert to lowercase (default: true)
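
The interaction between strip_accents and lowercase can be sketched as a small decision function. This is only an illustration of the rule described above, assuming that None falls back to the lowercase flag as in the original BERT tokenizer. The function name effective_strip_accents is hypothetical and not part of this module.

```ocaml
(* ASSUMPTION: when strip_accents is None, accent stripping follows the
   lowercase flag; an explicit Some value always wins. *)
let effective_strip_accents ~(lowercase : bool) (strip_accents : bool option)
    : bool =
  match strip_accents with
  | Some explicit -> explicit
  | None -> lowercase

let () =
  Printf.printf "%b %b %b\n"
    (effective_strip_accents ~lowercase:true None)
    (effective_strip_accents ~lowercase:false None)
    (effective_strip_accents ~lowercase:true (Some false))
```

Because the parameter is declared as ?strip_accents:bool option, an explicit choice would be passed to bert as ~strip_accents:(Some false) rather than ~strip_accents:false.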

val strip : ?left:bool -> ?right:bool -> unit -> t

Create a strip normalizer.

  • parameter left

    Strip whitespace from left (default: false)

  • parameter right

    Strip whitespace from right (default: true)
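
To make the effect of the two flags concrete, here is a standard-library-only sketch of one-sided whitespace stripping that uses the defaults listed above. It is not the library's implementation: strip_sketch is a hypothetical name, and treating only space, tab, newline and carriage return as whitespace is a simplifying assumption.

```ocaml
(* ASSUMPTION: only these four ASCII characters count as whitespace. *)
let is_space (c : char) : bool =
  c = ' ' || c = '\t' || c = '\n' || c = '\r'

(* Defaults mirror the documented ones: left = false, right = true. *)
let strip_sketch ?(left = false) ?(right = true) (s : string) : string =
  let n = String.length s in
  let first = ref 0 in
  if left then
    while !first < n && is_space s.[!first] do incr first done;
  let last = ref n in
  if right then
    while !last > !first && is_space s.[!last - 1] do decr last done;
  String.sub s !first (!last - !first)

let () =
  (* Brackets make the remaining whitespace visible. *)
  Printf.printf "[%s]\n" (strip_sketch "  hi  ");
  Printf.printf "[%s]\n" (strip_sketch ~left:true "  hi  ")
```

With the defaults only trailing whitespace is removed, so the first line printed keeps its two leading spaces.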

val strip_accents : unit -> t

Create an accent stripping normalizer

val nfc : unit -> t

Unicode NFC (Canonical Decomposition, followed by Canonical Composition) normalizer

val nfd : unit -> t

Unicode NFD (Canonical Decomposition) normalizer

val nfkc : unit -> t

Unicode NFKC (Compatibility Decomposition, followed by Canonical Composition) normalizer

val nfkd : unit -> t

Unicode NFKD (Compatibility Decomposition) normalizer
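
The difference between the composed and decomposed forms is easiest to see at the byte level. The sketch below uses only the OCaml standard library (it does not call this module) to build the two canonical spellings of "é" plus a compatibility character. The helper name utf8_of_code_points is introduced purely for illustration.

```ocaml
(* Encode a list of Unicode code points as a UTF-8 string. *)
let utf8_of_code_points (cps : int list) : string =
  let buf = Buffer.create 8 in
  List.iter (fun cp -> Buffer.add_utf_8_uchar buf (Uchar.of_int cp)) cps;
  Buffer.contents buf

(* U+00E9: precomposed "é", the form NFC produces. *)
let composed = utf8_of_code_points [ 0x00E9 ]

(* U+0065 U+0301: "e" plus combining acute accent, the form NFD produces. *)
let decomposed = utf8_of_code_points [ 0x0065; 0x0301 ]

(* U+FB01: the "fi" ligature. NFC and NFD leave it unchanged, while the
   compatibility forms NFKC and NFKD map it to the two letters "fi". *)
let ligature = utf8_of_code_points [ 0xFB01 ]

let () =
  Printf.printf "composed: %d bytes, decomposed: %d bytes, equal: %b\n"
    (String.length composed) (String.length decomposed)
    (composed = decomposed)
```

The two spellings render identically but compare as different strings, which is why a normalization form is usually fixed before tokenization.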

val lowercase : unit -> t

Simple lowercase normalizer

val nmt : unit -> t

NMT normalizer - handles special spacing around punctuation

val precompiled : bytes -> t

Create a normalizer from precompiled data

val replace : pattern:string -> replacement:string -> unit -> t

Create a replace normalizer.

  • parameter pattern

    Regex pattern to match

  • parameter replacement

    Replacement string

val prepend : prepend:string -> t

Create a prepend normalizer.

  • parameter prepend

    String to prepend

val byte_level : ?add_prefix_space:bool -> ?use_regex:bool -> unit -> t

Create a byte-level normalizer.

  • parameter add_prefix_space

    Add space prefix to first word (default: false)

  • parameter use_regex

    Use regex for splitting (default: false)

val sequence : t list -> t

Combine multiple normalizers into a sequence
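
Conceptually, a sequence behaves like function composition over the text. The sketch below models each normalizer as a plain string -> string function and assumes the steps run in list order, each receiving the previous step's output. That ordering is an assumption, and sequence_sketch is a hypothetical stand-in rather than the library's code.

```ocaml
(* ASSUMPTION: steps run left to right, each fed the previous output. *)
let sequence_sketch (steps : (string -> string) list) : string -> string =
 fun input -> List.fold_left (fun acc step -> step acc) input steps

(* Two standard-library functions stand in for real normalizers. *)
let pipeline = sequence_sketch [ String.trim; String.lowercase_ascii ]

let () = print_endline (pipeline "  Hello World ")
```

An empty list yields the identity transformation under this model, and reordering the steps can change the result whenever one step depends on what an earlier one removed or rewrote.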

Operations

val normalize : t -> string -> normalized_string

Apply normalization to a string, preserving alignment information

val normalize_str : t -> string -> string

Apply normalization to a string, returning only the normalized text
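
Putting the constructors and normalize_str together, a typical accent-folding pipeline might look like the sketch below. To keep it compilable without the library installed, it is written as a functor over a signature copied from the entries on this page. The names NORMALIZERS and Fold_case are hypothetical, and decomposing with nfd before strip_accents is an assumption borrowed from common practice, not something this page prescribes.

```ocaml
(* Subset of the signatures listed on this page. *)
module type NORMALIZERS = sig
  type t

  val nfd : unit -> t
  val strip_accents : unit -> t
  val lowercase : unit -> t
  val sequence : t list -> t
  val normalize_str : t -> string -> string
end

(* An accent-folding, lowercasing pipeline built only from those calls. *)
module Fold_case (N : NORMALIZERS) = struct
  (* ASSUMPTION: decomposing first exposes accents as separate marks. *)
  let normalizer : N.t =
    N.sequence [ N.nfd (); N.strip_accents (); N.lowercase () ]

  let apply (text : string) : string = N.normalize_str normalizer text
end
```

With the library available, module F = Fold_case (Saga_tokenizers.Normalizers) followed by F.apply "Café" would be the intended use, assuming the module is reachable under that path. When alignment information is needed, normalize can be used in place of normalize_str.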

Serialization

val to_json : t -> Yojson.Basic.t

Convert normalizer to JSON representation

val of_json : Yojson.Basic.t -> t

Create normalizer from JSON representation