package saga
Module Saga_tokenizers.Normalizers
Text normalization module mirroring the normalizers of the HuggingFace tokenizers library.
Normalizers are responsible for cleaning and transforming text before tokenization. This includes operations like lowercasing, accent removal, Unicode normalization, and handling special characters.
type normalized_string = {
  normalized : string;  (* The normalized text *)
  original : string;  (* The original text *)
  alignments : (int * int) array;  (* Alignment mappings from normalized to original positions *)
}
Type representing a normalized string with alignment information
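The role of the alignments field can be illustrated with a minimal sketch. It redeclares the record locally so that it compiles without the library, and it assumes that each alignment entry is a (start, stop) byte range in the original text, one entry per byte of the normalized text. That interpretation, and the helpers `lowercase_drop_spaces` and `original_span`, are hypothetical and not part of the documented API.

```ocaml
(* Local copy of the documented record, so the sketch is self-contained. *)
type normalized_string = {
  normalized : string;
  original : string;
  alignments : (int * int) array;
}

(* Hypothetical normalizer: lowercases ASCII letters and drops spaces.
   For every byte it keeps, it records the (start, stop) byte range that
   byte came from in the original string (assumed alignment convention). *)
let lowercase_drop_spaces (s : string) : normalized_string =
  let buf = Buffer.create (String.length s) in
  let spans = ref [] in
  String.iteri
    (fun i c ->
      if c <> ' ' then begin
        Buffer.add_char buf (Char.lowercase_ascii c);
        spans := (i, i + 1) :: !spans
      end)
    s;
  { normalized = Buffer.contents buf;
    original = s;
    alignments = Array.of_list (List.rev !spans) }

(* Map a byte position in the normalized text back to the original text. *)
let original_span (ns : normalized_string) (pos : int) : int * int =
  ns.alignments.(pos)

let () =
  let ns = lowercase_drop_spaces "Hi There" in
  let start, stop = original_span ns 2 in
  Printf.printf "%S: byte 2 comes from original range [%d, %d)\n"
    ns.normalized start stop
```

Keeping the original text alongside the alignments is what lets a later stage report token offsets against the user's input rather than against the normalized text.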
type t
Main normalizer type
Constructors
val bert :
  ?clean_text:bool ->
  ?handle_chinese_chars:bool ->
  ?strip_accents:bool option ->
  ?lowercase:bool ->
  unit ->
  t
Create a BERT normalizer.
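The following is a rough, ASCII-only sketch of what the `clean_text` and `lowercase` options typically do in a BERT-style normalizer, based on the behaviour of the HuggingFace BertNormalizer that this module says it matches. It is not this library's implementation: `bert_ascii` and `resolve_strip_accents` are hypothetical names, the default values are assumptions, and accent stripping and Chinese character handling are left out because they need Unicode tables.

```ocaml
(* Assumed rule, taken from HuggingFace: when strip_accents is None,
   accent stripping follows the lowercase setting. *)
let resolve_strip_accents ~(lowercase : bool) (strip_accents : bool option) : bool =
  Option.value strip_accents ~default:lowercase

(* Hypothetical ASCII-only stand-in for the clean_text and lowercase steps.
   clean_text turns whitespace into a plain space and drops control
   characters; lowercase folds ASCII letters to lower case. *)
let bert_ascii ?(clean_text = true) ?(lowercase = true) (s : string) : string =
  let is_space c = c = ' ' || c = '\t' || c = '\n' || c = '\r' in
  let is_control c = Char.code c < 32 || Char.code c = 127 in
  let buf = Buffer.create (String.length s) in
  String.iter
    (fun c ->
      if clean_text && is_space c then Buffer.add_char buf ' '
      else if clean_text && is_control c then ()
      else Buffer.add_char buf (if lowercase then Char.lowercase_ascii c else c))
    s;
  Buffer.contents buf

let () =
  print_endline (bert_ascii "Hello\tWORLD!");
  Printf.printf "%b\n" (resolve_strip_accents ~lowercase:true None)
```

Because `strip_accents` is an optional argument whose type is itself `bool option`, the signature above implies that an explicit value is passed as `~strip_accents:(Some false)`, for example `bert ~lowercase:true ~strip_accents:(Some false) ()`.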
Unicode NFC (Canonical Decomposition, followed by Canonical Composition) normalizer
Unicode NFKC (Compatibility Decomposition, followed by Canonical Composition) normalizer
Create a byte-level normalizer.
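As an illustration of byte-level normalization, the sketch below implements the GPT-2 style byte-to-character table used by the HuggingFace ByteLevel components: every input byte is replaced by a printable Unicode character, so the output contains no raw whitespace or control bytes. Whether this library uses exactly this table is an assumption, and `is_printable`, `byte_table`, and `byte_level` are hypothetical names.

```ocaml
(* Bytes in these ranges are kept as the code point with the same value. *)
let is_printable (b : int) : bool =
  (b >= 33 && b <= 126) || (b >= 161 && b <= 172) || (b >= 174 && b <= 255)

(* 256-entry table: printable bytes map to themselves, all other bytes
   are assigned the code points 256, 257, ... in increasing byte order. *)
let byte_table : int array =
  let table = Array.make 256 0 in
  let next = ref 256 in
  for b = 0 to 255 do
    if is_printable b then table.(b) <- b
    else begin
      table.(b) <- !next;
      incr next
    end
  done;
  table

(* Replace every byte of the input by its mapped character, encoded as UTF-8. *)
let byte_level (s : string) : string =
  let buf = Buffer.create (2 * String.length s) in
  String.iter
    (fun c -> Buffer.add_utf_8_uchar buf (Uchar.of_int byte_table.(Char.code c)))
    s;
  Buffer.contents buf

let () = print_endline (byte_level "a b")
```

Under this table the space byte (32) maps to U+0120, which is why byte-level vocabularies show a "Ġ" at the start of words that follow a space.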
Operations
Apply normalization to a string, preserving alignment information
Apply normalization to a string, returning only the normalized text
Serialization
Convert normalizer to JSON representation
Create normalizer from JSON representation
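For orientation, here is a sketch of the JSON shape that HuggingFace tokenizers uses for a BERT normalizer. It assumes this library emits the same fields, which the page does not confirm, and it builds the text with `Printf` rather than the library's own JSON type. `bert_to_json_string` is a hypothetical helper.

```ocaml
(* Hypothetical helper: renders BERT normalizer settings in the HuggingFace
   JSON layout. An unset strip_accents is written as JSON null. *)
let bert_to_json_string ~clean_text ~handle_chinese_chars ~strip_accents ~lowercase =
  let strip =
    match strip_accents with
    | None -> "null"
    | Some b -> string_of_bool b
  in
  Printf.sprintf
    {|{"type":"BertNormalizer","clean_text":%b,"handle_chinese_chars":%b,"strip_accents":%s,"lowercase":%b}|}
    clean_text handle_chinese_chars strip lowercase

let () =
  print_endline
    (bert_to_json_string ~clean_text:true ~handle_chinese_chars:true
       ~strip_accents:None ~lowercase:true)
```

The "type" field is what allows a deserializer to pick the right constructor when reading a normalizer back from JSON.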