Module Saga_tokenizers.Normalizers

Text normalization (lowercase, NFD/NFC, accent stripping, etc.).
Text normalization module matching HuggingFace tokenizers.
Normalizers are responsible for cleaning and transforming text before tokenization. This includes operations like lowercasing, accent removal, Unicode normalization, and handling special characters.
Normalization is the first stage in tokenization pipelines, applied before pre-tokenization and vocabulary-based encoding. It ensures consistent text representation, handles Unicode quirks, and removes irrelevant variations.
Common normalization operations: lowercasing, Unicode normalization (NFC/NFD/NFKC/NFKD), accent stripping, whitespace trimming, regex replacement, and byte-level mapping.
Apply normalization when you want case- and accent-insensitive matching, a consistent Unicode representation, or clean whitespace before pre-tokenization.

Skip normalization when you need to preserve the original text exactly, for example for case-sensitive tasks or lossless reconstruction of the input.
Simple lowercasing:
let normalizer = Normalizers.lowercase () in
let result = Normalizers.normalize_str normalizer "Hello World!" in
(* result = "hello world!" *)BERT-style normalization:
let normalizer = Normalizers.bert ~lowercase:true () in
let result = Normalizers.normalize_str normalizer " Héllo\tWorld! " in
(* Cleans whitespace, removes accents, lowercases *)

Combining multiple normalizers:
let normalizer = Normalizers.sequence [
  Normalizers.nfd ();                           (* Decompose accented chars *)
  Normalizers.strip_accents ();                 (* Remove accent marks *)
  Normalizers.lowercase ();                     (* Convert to lowercase *)
  Normalizers.strip ~left:true ~right:true ();  (* Trim whitespace *)
] in
let result = Normalizers.normalize_str normalizer " Café " in
(* result = "cafe" *)Unicode provides four normalization forms for canonical representation:
Typical usage:
type normalized_string = {
  normalized : string;  (** The normalized text *)
  original : string;  (** The original text *)
  alignments : (int * int) array;  (** Alignment mappings from normalized to original positions *)
}

Type representing a normalized string with alignment information.
type t

Main normalizer type.
val bert :
  ?clean_text:bool ->
  ?handle_chinese_chars:bool ->
  ?strip_accents:bool option ->
  ?lowercase:bool ->
  unit ->
  t

bert ~clean_text ~handle_chinese_chars ~strip_accents ~lowercase () creates a BERT normalizer.
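A hedged sketch with all options spelled out; the defaults are assumed here to mirror HuggingFace's BertNormalizer (clean_text and handle_chinese_chars enabled, strip_accents unset so it follows lowercase, lowercase enabled):

let normalizer =
  Normalizers.bert
    ~clean_text:true             (* drop control characters, normalize whitespace *)
    ~handle_chinese_chars:true   (* add spaces around CJK characters *)
    ~strip_accents:(Some true)   (* force accent removal; None follows lowercase *)
    ~lowercase:true
    ()
in
let result = Normalizers.normalize_str normalizer "Héllo\tWorld" in
(* result = "hello world" (tab cleaned to a space, accents stripped, lowercased) *)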
strip ~left ~right () removes whitespace from text boundaries.
Trims whitespace characters from the beginning and/or end of text. Does not affect internal whitespace.
let normalizer = Normalizers.strip ~left:true ~right:true () in
let result = Normalizers.normalize_str normalizer " Hello " in
(* result = "Hello" *)strip_accents () removes accent marks from characters.
Converts accented characters to their base forms (é → e, ñ → n). Uses Unicode NFD decomposition followed by removal of combining marks. Typically applied after NFD normalization.
let normalizer = Normalizers.strip_accents () in
let result = Normalizers.normalize_str normalizer "Café résumé" in
(* result = "Cafe resume" *)nfc () applies Unicode NFC normalization.
Canonical Decomposition followed by Canonical Composition. Decomposes characters (é → e + ́), then recomposes them into precomposed forms (e + ́ → é). Produces canonical composed representation.
Use for: Standard text storage, ensuring consistent representation.
let normalizer = Normalizers.nfc () in
let result = Normalizers.normalize_str normalizer "e\u{0301}" in
(* Combining e + accent → composed é *)

nfd () applies Unicode NFD normalization.
Canonical Decomposition. Splits precomposed characters into a base character plus combining marks (é → e + combining accent).
Use for: Accent removal pipelines, character-level analysis.
let normalizer = Normalizers.nfd () in
let result = Normalizers.normalize_str normalizer "é" in
(* Composed é → e + combining accent *)

nfkc () applies Unicode NFKC normalization.
Compatibility Decomposition followed by Canonical Composition. Replaces compatibility characters with canonical equivalents, then composes. Converts ligatures (ﬁ → fi), full-width characters (Ａ → A), and subscripts. Lossy transformation.
Use for: Fuzzy search, aggressive text normalization.
let normalizer = Normalizers.nfkc () in
let result = Normalizers.normalize_str normalizer "\u{FB01}le" in
(* Ligature ﬁ (U+FB01) → fi, result = "file" *)

nfkd () applies Unicode NFKD normalization.
Compatibility Decomposition. Most aggressive Unicode normalization. Decomposes compatibility characters and canonical characters. Useful for maximum normalization and search applications.
Use for: Aggressive fuzzy matching, search indexing.
let normalizer = Normalizers.nfkd () in
let result = Normalizers.normalize_str normalizer "\u{FB01}" in
(* Decomposes ligatures and other compatibility forms, result = "fi" *)

lowercase () converts text to lowercase.
Applies Unicode lowercase transformation. Language-agnostic but may not handle all language-specific casing rules correctly (e.g., Turkish i).
let normalizer = Normalizers.lowercase () in
let result = Normalizers.normalize_str normalizer "Hello World!" in
(* result = "hello world!" *)replace ~pattern ~replacement () replaces text matching regex pattern.
Finds all matches of pattern and replaces them with replacement string. Useful for custom text transformations.
let normalizer = Normalizers.replace ~pattern:"[0-9]+" ~replacement:"<NUM>" () in
let result = Normalizers.normalize_str normalizer "I have 123 apples" in
(* result = "I have <NUM> apples" *)prepend ~prepend prepends string to text.
Adds fixed string to the beginning of text. Useful for adding prefixes or special markers.
let normalizer = Normalizers.prepend ~prepend:">> " in
let result = Normalizers.normalize_str normalizer "Hello" in
(* result = ">> Hello" *)byte_level ~add_prefix_space ~use_regex () applies byte-level normalization.
Converts text to byte representation using special Unicode characters. Used by GPT-2 style models for robust handling of any byte sequence.
let normalizer = Normalizers.byte_level ~add_prefix_space:true () in
let result = Normalizers.normalize_str normalizer "Hello" in
(* Converts to byte representation, adds prefix space *)

sequence normalizers combines multiple normalizers into a sequence.
Applies normalizers left-to-right. Each normalizer processes the output of the previous one. Useful for building complex normalization pipelines.
let normalizer = Normalizers.sequence [
  Normalizers.nfd ();
  Normalizers.strip_accents ();
  Normalizers.lowercase ();
] in
let result = Normalizers.normalize_str normalizer "Café" in
(* Applies: NFD decomposition → accent removal → lowercase *)
(* result = "cafe" *)normalize t text applies normalization to a string, preserving alignment information.
normalize_str t text applies normalization to a string, returning only the normalized text.
to_json t converts normalizer to JSON representation.
of_json json creates normalizer from JSON representation.
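A round-trip sketch, assuming of_json returns a normalizer directly (the concrete JSON type and any error handling are not shown in these signatures):

let original = Normalizers.sequence [ Normalizers.nfd (); Normalizers.lowercase () ] in
let json = Normalizers.to_json original in   (* serialize the pipeline *)
let restored = Normalizers.of_json json in   (* rebuild it from JSON *)
let result = Normalizers.normalize_str restored "Café" in
(* restored behaves like original: "Café" is NFD-decomposed and lowercased *)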