package saga

On This Page

Core Types
Model Creation
Configuration Builder
Tokenization
Vocabulary Management
Cache Management
Serialization
Training

Module `Saga_tokenizers.Bpe`

Byte Pair Encoding (BPE) tokenization module

Core Types

type t

The BPE model.

type vocab = (string, int) Hashtbl.t

Vocabulary mapping tokens to their integer IDs.

type merges = (string * string) list

List of merge operations, each given as a pair of tokens.

type config = {
  vocab : vocab;
  merges : merges;
  cache_capacity : int;
  dropout : float option;
  unk_token : string option;
  continuing_subword_prefix : string option;
  end_of_word_suffix : string option;
  fuse_unk : bool;
  byte_fallback : bool;
  ignore_merges : bool;
}

BPE configuration.
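The record above can be hard to picture without values in it. The following is a minimal sketch that redeclares the documented types locally so it compiles with the standard library alone; it does not call saga. The helper `vocab_of_list`, the name `example_config`, and every field value are illustrative assumptions, not the library's defaults.

```ocaml
(* Local copies of the documented types, so this compiles without saga. *)
type vocab = (string, int) Hashtbl.t
type merges = (string * string) list

type config = {
  vocab : vocab;
  merges : merges;
  cache_capacity : int;
  dropout : float option;
  unk_token : string option;
  continuing_subword_prefix : string option;
  end_of_word_suffix : string option;
  fuse_unk : bool;
  byte_fallback : bool;
  ignore_merges : bool;
}

(* Hypothetical helper: build a vocabulary where IDs follow list order. *)
let vocab_of_list (tokens : string list) : vocab =
  let tbl = Hashtbl.create (List.length tokens) in
  List.iteri (fun id tok -> Hashtbl.replace tbl tok id) tokens;
  tbl

(* Illustrative values only; these are NOT the library's defaults. *)
let example_config : config = {
  vocab = vocab_of_list [ "<unk>"; "l"; "o"; "w"; "lo"; "low" ];
  merges = [ ("l", "o"); ("lo", "w") ];
  cache_capacity = 10_000;
  dropout = None;
  unk_token = Some "<unk>";
  continuing_subword_prefix = None;
  end_of_word_suffix = None;
  fuse_unk = false;
  byte_fallback = false;
  ignore_merges = false;
}
```

With the real library, a value of this shape would presumably be passed to `create`; the sketch stops short of that so it stays self-contained.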

Model Creation

val create : config -> t

`create config` creates a new BPE model from the given configuration.

val from_files : vocab_file:string -> merges_file:string -> t

`from_files ~vocab_file ~merges_file` loads a BPE model from `vocab.json` and `merges.txt` files.

val default : unit -> t

`default ()` creates a BPE model with the default configuration.

Configuration Builder

module Builder : sig ... end

Tokenization

type token = {
  id : int;
  value : string;
  offsets : int * int;
}

Token with ID, string value, and character offsets.

val tokenize : t -> string -> token list

`tokenize model text` splits `text` into a list of tokens.
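This page does not describe how `tokenize` applies merges internally. As a rough illustration of the general BPE idea only, the sketch below splits a word into single-byte symbols and applies each merge rule once, in list order. The names `merge_pair`, `symbols_of_word`, and `bpe_word` are hypothetical, and the sketch ignores IDs, offsets, caching, dropout, and unknown-token handling.

```ocaml
(* Merge every adjacent occurrence of the pair (a, b), scanning left to right. *)
let rec merge_pair (a, b) = function
  | x :: y :: rest when x = a && y = b -> (a ^ b) :: merge_pair (a, b) rest
  | x :: rest -> x :: merge_pair (a, b) rest
  | [] -> []

(* Split a word into one-byte symbols (adequate for ASCII input only). *)
let symbols_of_word (w : string) : string list =
  List.init (String.length w) (fun i -> String.make 1 w.[i])

(* Apply the merge rules in list order, so earlier rules are applied first.
   This is a simplification, not the library's implementation. *)
let bpe_word (merges : (string * string) list) (w : string) : string list =
  List.fold_left (fun syms pair -> merge_pair pair syms) (symbols_of_word w) merges
```

For example, with the rules `[("l", "o"); ("lo", "w")]`, the word `"lower"` first becomes `lo w e r` and then `low e r`.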

val token_to_id : t -> string -> int option

`token_to_id model token` returns the ID for `token`, or `None` if it is not in the vocabulary.

val id_to_token : t -> int -> string option

`id_to_token model id` returns the token for `id`, or `None` if no token has that ID.
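Both lookups return an option, which suggests a plain table lookup in each direction. The sketch below shows that pattern on a standalone `(string, int) Hashtbl.t` using only the standard library. The names `token_to_id_sketch`, `invert`, and `id_to_token_sketch` are hypothetical, and whether the library keeps a reverse table is an assumption.

```ocaml
(* A small stand-in vocabulary; IDs follow list order. *)
let vocab : (string, int) Hashtbl.t = Hashtbl.create 8
let () =
  List.iteri (fun id tok -> Hashtbl.replace vocab tok id) [ "<unk>"; "lo"; "w" ]

(* Forward lookup: token to ID, None when the token is absent. *)
let token_to_id_sketch vocab tok = Hashtbl.find_opt vocab tok

(* Build the reverse table once so ID lookups do not scan the vocabulary. *)
let invert vocab =
  let rev = Hashtbl.create (Hashtbl.length vocab) in
  Hashtbl.iter (fun tok id -> Hashtbl.replace rev id tok) vocab;
  rev

(* Reverse lookup: ID to token, None when the ID is unused. *)
let id_to_token_sketch rev id = Hashtbl.find_opt rev id
```

Callers would typically pattern-match on the result, falling back to the unknown token when the lookup yields `None`.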

Vocabulary Management

val get_vocab : t -> (string * int) list

`get_vocab model` returns the vocabulary as a list of `(token, id)` pairs.
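The page does not say in what order the pairs are returned. If a stable order matters, one option is to sort the list by ID after retrieving it. The sketch below does this for a standalone hash table; `vocab_to_sorted_list` is a hypothetical helper, not part of the documented API.

```ocaml
(* Flatten a vocabulary table into (token, id) pairs sorted by ascending ID. *)
let vocab_to_sorted_list (vocab : (string, int) Hashtbl.t) : (string * int) list =
  Hashtbl.fold (fun tok id acc -> (tok, id) :: acc) vocab []
  |> List.sort (fun (_, a) (_, b) -> compare a b)
```

The same `List.sort` call can be applied directly to the result of `get_vocab`, since it has the same `(string * int) list` type.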

val get_vocab_size : t -> int

`get_vocab_size model` returns the number of entries in the vocabulary.

val get_unk_token : t -> string option

`get_unk_token model` returns the unknown token, if one is configured.

val get_continuing_subword_prefix : t -> string option

`get_continuing_subword_prefix model` returns the continuing subword prefix, if one is configured.

val get_end_of_word_suffix : t -> string option

`get_end_of_word_suffix model` returns the end-of-word suffix, if one is configured.

Cache Management

val clear_cache : t -> unit

`clear_cache model` clears the tokenization cache.

val resize_cache : t -> int -> unit

`resize_cache model capacity` resizes the tokenization cache to `capacity`.

Serialization

val save : t -> path:string -> ?name:string -> unit -> unit

`save model ~path ?name ()` saves the model to `vocab.json` and `merges.txt` files.

val read_files : vocab_file:string -> merges_file:string -> vocab * merges

`read_files ~vocab_file ~merges_file` reads the vocabulary and merges from the given files without building a model.
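The on-disk layout of `merges.txt` is not specified on this page. Assuming the layout common in BPE tooling (one space-separated pair per line, with optional `#` comment lines), the merges half of such a reader could be sketched as follows. `parse_merge_line` and `parse_merges` are hypothetical names, and the sketch parses an in-memory string rather than a file.

```ocaml
(* Parse one line into a merge pair. Blank lines, '#' comment lines, and
   lines that are not exactly two space-separated fields are skipped. *)
let parse_merge_line (line : string) : (string * string) option =
  let line = String.trim line in
  if line = "" || line.[0] = '#' then None
  else
    match String.split_on_char ' ' line with
    | [ a; b ] -> Some (a, b)
    | _ -> None

(* Parse the whole text, preserving the order of the merge rules. *)
let parse_merges (text : string) : (string * string) list =
  String.split_on_char '\n' text |> List.filter_map parse_merge_line
```

The result has the documented `merges` type, `(string * string) list`, with rules in file order.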

Training

module Trainer : sig ... end
