Module Saga_tokenizers.Normalizers
Text normalization (lowercase, NFD/NFC, accent stripping, etc.).
Text normalization module matching HuggingFace tokenizers.
Normalizers are responsible for cleaning and transforming text before tokenization. This includes operations like lowercasing, accent removal, Unicode normalization, and handling special characters.
Overview
Normalization is the first stage in tokenization pipelines, applied before pre-tokenization and vocabulary-based encoding. It ensures consistent text representation, handles Unicode quirks, and removes irrelevant variations.
Common normalization operations:
- Lowercasing for case-insensitive models
- Accent removal (é → e) for language-agnostic matching
- Unicode normalization (NFC/NFD/NFKC/NFKD) for canonical representation
- Whitespace cleanup and control character removal
- BERT-specific preprocessing (CJK handling, accent stripping)
When to Use Normalization
Apply normalization when you want to:
- Reduce vocabulary size by merging case variants (Hello/hello)
- Handle accented characters uniformly (café/cafe)
- Clean noisy text (control characters, extra whitespace)
- Match model-specific preprocessing (BERT, GPT-2)
Skip normalization when you need to:
- Preserve case distinctions (proper nouns, acronyms)
- Keep accent information (é vs e are different)
- Process code or structured data with meaningful formatting
Usage Examples
Simple lowercasing:
let normalizer = Normalizers.lowercase () in
let result = Normalizers.normalize_str normalizer "Hello World!" in
(* result = "hello world!" *)
BERT-style normalization:
let normalizer = Normalizers.bert ~lowercase:true () in
let result = Normalizers.normalize_str normalizer " Héllo\tWorld! " in
(* Cleans whitespace, removes accents, lowercases *)
Combining multiple normalizers:
let normalizer = Normalizers.sequence [
Normalizers.nfd (); (* Decompose accented chars *)
Normalizers.strip_accents (); (* Remove accent marks *)
Normalizers.lowercase (); (* Convert to lowercase *)
Normalizers.strip ~left:true ~right:true (); (* Trim whitespace *)
] in
let result = Normalizers.normalize_str normalizer " Café " in
(* result = "cafe" *)
Unicode Normalization Forms
Unicode provides four normalization forms for canonical representation:
- NFC (Canonical Composition): Decomposes then recomposes characters. Preferred for most text processing. é stored as single character U+00E9.
- NFD (Canonical Decomposition): Decomposes characters into base + combining marks. é stored as e (U+0065) + ́ (U+0301). Useful before accent removal.
- NFKC (Compatibility Composition): Replaces compatibility characters with canonical equivalents, then composes. Converts the ﬁ ligature (U+FB01) → fi. Lossy but reduces variation.
- NFKD (Compatibility Decomposition): Compatibility decomposition without recomposition. Most aggressive normalization, useful for search.
Typical usage:
- Use NFC for storage and display (most compact)
- Use NFD before accent stripping
- Use NFKC/NFKD for fuzzy matching and search
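The four forms can be compared side by side using the constructors documented below. This is an illustrative sketch; it only assumes the nfc/nfd/nfkc/nfkd constructors and normalize_str as documented in this module:

```ocaml
(* Compare the four Unicode normalization forms on the same input.
   Byte length differs between composed and decomposed forms:
   NFC/NFKC keep é as one code point (2 UTF-8 bytes), while
   NFD/NFKD split it into e + U+0301 (3 UTF-8 bytes). *)
let compare_forms text =
  [ ("NFC", Normalizers.nfc ());
    ("NFD", Normalizers.nfd ());
    ("NFKC", Normalizers.nfkc ());
    ("NFKD", Normalizers.nfkd ()) ]
  |> List.iter (fun (name, n) ->
         let out = Normalizers.normalize_str n text in
         Printf.printf "%s: %d bytes\n" name (String.length out))

let () = compare_forms "caf\xc3\xa9" (* "café" in NFC form *)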
type normalized_string = {
  normalized : string;  (* The normalized text *)
  original : string;  (* The original text *)
  alignments : (int * int) array;  (* Alignment mappings from normalized to original positions *)
}

Type representing a normalized string with alignment information.
type t

Main normalizer type.
Constructors
val bert :
  ?clean_text:bool ->
  ?handle_chinese_chars:bool ->
  ?strip_accents:bool option ->
  ?lowercase:bool ->
  unit ->
  t

bert ~clean_text ~handle_chinese_chars ~strip_accents ~lowercase () creates a BERT normalizer.
strip ~left ~right () removes whitespace from text boundaries.
Trims whitespace characters from the beginning and/or end of text. Does not affect internal whitespace.
let normalizer = Normalizers.strip ~left:true ~right:true () in
let result = Normalizers.normalize_str normalizer " Hello " in
(* result = "Hello" *)
strip_accents () removes accent marks from characters.
Converts accented characters to their base forms (é → e, ñ → n). Uses Unicode NFD decomposition followed by removal of combining marks. Typically applied after NFD normalization.
let normalizer = Normalizers.strip_accents () in
let result = Normalizers.normalize_str normalizer "Café résumé" in
(* result = "Cafe resume" *)
nfc () applies Unicode NFC normalization.
Canonical Decomposition followed by Canonical Composition. Decomposes characters (é → e + ́), then recomposes them into precomposed forms (e + ́ → é). Produces canonical composed representation.
Use for: Standard text storage, ensuring consistent representation.
let normalizer = Normalizers.nfc () in
let result = Normalizers.normalize_str normalizer "e\u{0301}" in
(* Combining e + accent → composed é *)
nfd () applies Unicode NFD normalization.
Canonical Decomposition. Splits precomposed characters into base character + combining marks (é → e + ́). Essential before accent stripping.
Use for: Accent removal pipelines, character-level analysis.
let normalizer = Normalizers.nfd () in
let result = Normalizers.normalize_str normalizer "é" in
(* Composed é → e + combining accent *)
nfkc () applies Unicode NFKC normalization.
Compatibility Decomposition followed by Canonical Composition. Replaces compatibility characters with canonical equivalents, then composes. Converts ligatures (ﬁ → fi), full-width characters (Ａ → A), and subscripts. Lossy transformation.
Use for: Fuzzy search, aggressive text normalization.
let normalizer = Normalizers.nfkc () in
let result = Normalizers.normalize_str normalizer "ﬁle" in
(* Ligature ﬁ → fi, result = "file" *)
nfkd () applies Unicode NFKD normalization.
Compatibility Decomposition. Most aggressive Unicode normalization. Decomposes compatibility characters and canonical characters. Useful for maximum normalization and search applications.
Use for: Aggressive fuzzy matching, search indexing.
let normalizer = Normalizers.nfkd () in
let result = Normalizers.normalize_str normalizer "ﬁ" in
(* Decomposes ligatures, compatibility forms *)
lowercase () converts text to lowercase.
Applies Unicode lowercase transformation. Language-agnostic but may not handle all language-specific casing rules correctly (e.g., Turkish i).
let normalizer = Normalizers.lowercase () in
let result = Normalizers.normalize_str normalizer "Hello World!" in
(* result = "hello world!" *)
replace ~pattern ~replacement () replaces text matching a regex pattern.
Finds all matches of pattern and replaces them with replacement string. Useful for custom text transformations.
let normalizer = Normalizers.replace ~pattern:"[0-9]+" ~replacement:"<NUM>" () in
let result = Normalizers.normalize_str normalizer "I have 123 apples" in
(* result = "I have <NUM> apples" *)
prepend ~prepend prepends a string to text.
Adds fixed string to the beginning of text. Useful for adding prefixes or special markers.
let normalizer = Normalizers.prepend ~prepend:">> " in
let result = Normalizers.normalize_str normalizer "Hello" in
(* result = ">> Hello" *)
byte_level ~add_prefix_space ~use_regex () applies byte-level normalization.
Converts text to byte representation using special Unicode characters. Used by GPT-2 style models for robust handling of any byte sequence.
let normalizer = Normalizers.byte_level ~add_prefix_space:true () in
let result = Normalizers.normalize_str normalizer "Hello" in
(* Converts to byte representation, adds prefix space *)
sequence normalizers combines multiple normalizers into a sequence.
Applies normalizers left-to-right. Each normalizer processes the output of the previous one. Useful for building complex normalization pipelines.
let normalizer = Normalizers.sequence [
Normalizers.nfd ();
Normalizers.strip_accents ();
Normalizers.lowercase ();
] in
let result = Normalizers.normalize_str normalizer "Café" in
(* Applies: NFD decomposition → accent removal → lowercase *)
(* result = "cafe" *)
Operations
normalize t text applies normalization to a string, preserving alignment information.
normalize_str t text applies normalization to a string, returning only the normalized text.
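The difference between the two operations can be sketched as follows; this assumes normalize returns the normalized_string record documented above, and the exact field usage is an illustration rather than a verbatim API transcript:

```ocaml
(* normalize keeps alignment info mapping back into the original text;
   normalize_str is the text-only convenience wrapper. *)
let () =
  let n = Normalizers.lowercase () in
  (* Full result: normalized text, original text, and per-position spans. *)
  let ns = Normalizers.normalize n "Hello" in
  Printf.printf "normalized=%s original=%s spans=%d\n"
    ns.normalized ns.original (Array.length ns.alignments);
  (* Just the normalized string. *)
  print_endline (Normalizers.normalize_str n "Hello")
```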
Serialization
to_json t converts normalizer to JSON representation.
of_json json creates normalizer from JSON representation.
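A round-trip sketch of serialization, assuming only the to_json/of_json pair documented here (the concrete JSON type and of_json's error behavior are not specified in this page and are assumptions):

```ocaml
(* Persist a normalizer pipeline's configuration and restore it.
   The restored normalizer should behave identically to the original. *)
let () =
  let n =
    Normalizers.sequence
      [ Normalizers.nfd ();
        Normalizers.strip_accents ();
        Normalizers.lowercase () ]
  in
  let json = Normalizers.to_json n in
  let restored = Normalizers.of_json json in
  assert (Normalizers.normalize_str restored "Café" = "cafe")
```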