package saga

in-package search v0.2.0

Module `Saga_tokenizers.Normalizers`

Text normalization module modeled on the normalizers in Hugging Face tokenizers.

Normalizers are responsible for cleaning and transforming text before tokenization. This includes operations like lowercasing, accent removal, Unicode normalization, and handling special characters.

type normalized_string = {
  normalized : string;  (* The normalized text *)
  original : string;  (* The original text *)
  alignments : (int * int) array;  (* Alignment mappings from normalized to original positions *)
}

Type representing a normalized string with alignment information
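A small sketch of how the record fields might be inspected. `describe` is a hypothetical helper, not part of the library, and reading each alignment entry as a pair of positions in the original text is an assumption: this page only says that the array maps normalized positions to original positions.

```ocaml
module N = Saga_tokenizers.Normalizers

(* Hypothetical helper: print both texts and every alignment pair.
   How a pair should be interpreted (for example as a start/end span)
   is an assumption to check against the implementation. *)
let describe (ns : N.normalized_string) : unit =
  Printf.printf "original:   %S\nnormalized: %S\n" ns.N.original ns.N.normalized;
  Array.iteri
    (fun i (a, b) -> Printf.printf "  alignments.(%d) = (%d, %d)\n" i a b)
    ns.N.alignments

let () = describe (N.normalize (N.lowercase ()) "Hello")
```

The fields are qualified with `N.` so the snippet compiles without opening the module.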

type t

Main normalizer type

Constructors

val bert :
  ?clean_text:bool ->
  ?handle_chinese_chars:bool ->
  ?strip_accents:bool option ->
  ?lowercase:bool ->
  unit ->
  t

Create a BERT normalizer.

parameter clean_text
Remove control characters and normalize whitespace (default: true)

parameter handle_chinese_chars
Add spaces around CJK characters (default: true)

parameter strip_accents
Strip accents; None means the choice follows the lowercase setting (default: None)

parameter lowercase
Convert to lowercase (default: true)
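The optional arguments above could be combined as in the following sketch. The names `uncased` and `cased` are illustrative only. Note that `strip_accents` is declared as `?strip_accents:bool option`, so an explicit value has to be wrapped in `Some`.

```ocaml
module N = Saga_tokenizers.Normalizers

(* All defaults: clean_text, handle_chinese_chars and lowercase are true,
   and strip_accents is None. *)
let uncased = N.bert ()

(* A cased variant that keeps case and accents.
   strip_accents has type bool option, hence the explicit Some. *)
let cased = N.bert ~lowercase:false ~strip_accents:(Some false) ()

let () =
  let input = "H\u{00E9}llo  WORLD" in
  print_endline (N.normalize_str uncased input);
  print_endline (N.normalize_str cased input)
```

The trailing `()` is required because every other argument is optional.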

val strip : ?left:bool -> ?right:bool -> unit -> t

Create a strip normalizer.

parameter left
Strip whitespace from left (default: false)

parameter right
Strip whitespace from right (default: true)
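A minimal usage sketch of the two flags, assuming the documented defaults (`left` false, `right` true). The binding names are illustrative.

```ocaml
module N = Saga_tokenizers.Normalizers

(* Documented defaults: left = false, right = true. *)
let trim_right = N.strip ()

(* Opt in to trimming both ends. *)
let trim_both = N.strip ~left:true ~right:true ()

let () =
  let input = "  hello  " in
  (* %S prints the string quoted, which makes remaining spaces visible. *)
  Printf.printf "%S\n%S\n"
    (N.normalize_str trim_right input)
    (N.normalize_str trim_both input)
```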

val strip_accents : unit -> t

Create an accent stripping normalizer

val nfc : unit -> t

Unicode NFC (Canonical Decomposition, followed by Canonical Composition) normalizer

val nfd : unit -> t

Unicode NFD (Canonical Decomposition) normalizer

val nfkc : unit -> t

Unicode NFKC (Compatibility Decomposition, followed by Canonical Composition) normalizer

val nfkd : unit -> t

Unicode NFKD (Compatibility Decomposition) normalizer
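To illustrate what the normalization forms are for: U+00E9 (precomposed "é") and the sequence U+0065 U+0301 ("e" followed by a combining acute accent) are canonically equivalent, so a conforming NFC or NFD normalizer should map both spellings to the same string. The sketch below checks this through the documented constructors; it prints the comparison rather than assuming the result.

```ocaml
module N = Saga_tokenizers.Normalizers

(* Two canonically equivalent spellings of the same word. *)
let composed = "caf\u{00E9}"      (* precomposed e-acute, 5 bytes in UTF-8 *)
let decomposed = "cafe\u{0301}"   (* e + combining acute, 6 bytes in UTF-8 *)

let () =
  let nfc = N.nfc () and nfd = N.nfd () in
  (* Under NFC both spellings should compose to the same string;
     under NFD both should decompose to the same string. *)
  Printf.printf "NFC equal: %b\n"
    (N.normalize_str nfc composed = N.normalize_str nfc decomposed);
  Printf.printf "NFD equal: %b\n"
    (N.normalize_str nfd composed = N.normalize_str nfd decomposed)
```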

val lowercase : unit -> t

Simple lowercase normalizer

val nmt : unit -> t

NMT normalizer - handles special spacing around punctuation

val precompiled : bytes -> t

Create a normalizer from precompiled data

val replace : pattern:string -> replacement:string -> unit -> t

Create a replace normalizer.

parameter pattern
Regex pattern to match

parameter replacement
Replacement string

val prepend : prepend:string -> t

Create a prepend normalizer.

parameter prepend
String to prepend
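A usage sketch for `replace` and `prepend`. This page does not say which regex dialect `pattern` uses, so the example sticks to a single literal space, which means the same thing in common dialects. The binding names are illustrative.

```ocaml
module N = Saga_tokenizers.Normalizers

(* A single space is a literal in common regex dialects, so this pattern
   does not depend on any particular syntax. *)
let spaces_to_underscores = N.replace ~pattern:" " ~replacement:"_" ()

(* prepend takes only its labelled argument, with no trailing unit. *)
let with_marker = N.prepend ~prepend:"_"

let () =
  print_endline (N.normalize_str spaces_to_underscores "hello big world");
  print_endline (N.normalize_str with_marker "hello")
```

Unlike most constructors on this page, `prepend` has no `unit` parameter, so `N.prepend ~prepend:"_" ()` would not type-check.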

val byte_level : ?add_prefix_space:bool -> ?use_regex:bool -> unit -> t

Create a byte-level normalizer.

parameter add_prefix_space
Add space prefix to first word (default: false)

parameter use_regex
Use regex for splitting (default: false)

val sequence : t list -> t

Combine multiple normalizers into a sequence
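A sketch of a common composition: decompose, drop accents, then lowercase. It assumes that `sequence` applies its elements in list order, which this page does not state explicitly. `fold_accents_and_case` is an illustrative name.

```ocaml
module N = Saga_tokenizers.Normalizers

(* Assumed order of application: NFD first, so accents become separate
   combining marks, then accent stripping, then lowercasing. *)
let fold_accents_and_case =
  N.sequence [ N.nfd (); N.strip_accents (); N.lowercase () ]

let () =
  print_endline (N.normalize_str fold_accents_and_case "Cr\u{00E8}me Br\u{00FB}l\u{00E9}e")
```

Since `sequence` returns a plain `t`, a sequence can itself be nested inside another sequence.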

Operations

val normalize : t -> string -> normalized_string

Apply normalization to a string, preserving alignment information

val normalize_str : t -> string -> string

Apply normalization to a string, returning only the normalized text
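The two operations can be compared side by side as in the sketch below. It relies on the descriptions above, under which the `normalized` field of `normalize` and the result of `normalize_str` should be the same text; the snippet prints the comparison rather than assuming it.

```ocaml
module N = Saga_tokenizers.Normalizers

let () =
  let n = N.lowercase () in
  let input = "Hello World" in
  (* Full result, including the original text and alignments. *)
  let full = N.normalize n input in
  (* Text-only result, for callers that do not need offsets. *)
  let text_only = N.normalize_str n input in
  Printf.printf "same text: %b\n" (full.N.normalized = text_only);
  Printf.printf "alignment entries: %d\n" (Array.length full.N.alignments)
```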

Serialization

val to_json : t -> Yojson.Basic.t

Convert normalizer to JSON representation

val of_json : Yojson.Basic.t -> t

Create normalizer from JSON representation
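A round-trip sketch using the two functions above together with `Yojson.Basic.pretty_to_string` from the yojson library. The shape of the JSON and the behaviour of `of_json` on malformed input are not described on this page, so the example only checks that a restored normalizer behaves like the original on one sample input.

```ocaml
module N = Saga_tokenizers.Normalizers

let () =
  let n = N.sequence [ N.nfc (); N.lowercase () ] in
  (* Serialize and show the JSON form. *)
  let json = N.to_json n in
  print_endline (Yojson.Basic.pretty_to_string json);
  (* Rebuild a normalizer from the JSON and compare behaviour. *)
  let restored = N.of_json json in
  let input = "Hello" in
  Printf.printf "round trip agrees: %b\n"
    (N.normalize_str n input = N.normalize_str restored input)
```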
