package saga

On This Page

Core Types
Model Creation
Configuration Builder
Tokenization
Vocabulary Management
Serialization
Training
Conversion

Module `Saga_tokenizers.Wordpiece`

WordPiece tokenization module

WordPiece is the subword tokenization algorithm used by BERT. It uses a greedy longest-match-first algorithm to tokenize text.
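The greedy longest-match-first idea can be sketched in a few lines of standalone OCaml. This is an illustrative sketch, not Saga's implementation: `wordpiece_word` and its labelled arguments are hypothetical names, the `"##"` prefix and `"[UNK]"` token follow the BERT convention, and lengths are counted in bytes for simplicity.

```ocaml
(* Standalone sketch of greedy longest-match-first on a single word.
   Not the Saga implementation; all names here are hypothetical. *)
let wordpiece_word ~vocab ~unk_token ~prefix ~max_chars word =
  let n = String.length word in
  (* Over-long words are mapped straight to the unknown token. *)
  if n > max_chars then [ unk_token ]
  else
    let rec go start acc =
      if start >= n then List.rev acc
      else
        (* Try the longest candidate first, shrinking from the right. *)
        let rec longest stop =
          if stop <= start then None
          else
            let sub = String.sub word start (stop - start) in
            (* Pieces after the first carry the continuation prefix. *)
            let piece = if start > 0 then prefix ^ sub else sub in
            if Hashtbl.mem vocab piece then Some (piece, stop)
            else longest (stop - 1)
        in
        match longest n with
        | None -> [ unk_token ] (* no piece matches: the whole word is unknown *)
        | Some (piece, stop) -> go stop (piece :: acc)
    in
    go 0 []

let () =
  let vocab = Hashtbl.create 8 in
  List.iteri
    (fun i tok -> Hashtbl.replace vocab tok i)
    [ "[UNK]"; "un"; "##aff"; "##able"; "aff" ];
  let tok =
    wordpiece_word ~vocab ~unk_token:"[UNK]" ~prefix:"##" ~max_chars:100
  in
  assert (tok "unaffable" = [ "un"; "##aff"; "##able" ]);
  assert (tok "xyz" = [ "[UNK]" ])
```

Note that a single unmatched position turns the whole word into the unknown token; already matched pieces are discarded, which is how the BERT reference tokenizer behaves.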

Core Types

type t

WordPiece model

type vocab = (string, int) Hashtbl.t

Vocabulary mapping tokens to indices

type config = {
  vocab : vocab;
  unk_token : string;
  continuing_subword_prefix : string;
  max_input_chars_per_word : int;
}

WordPiece configuration
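As a rough illustration of how a configuration value might be assembled, the sketch below mirrors the two documented types locally so that it compiles on its own. The helper `vocab_of_list` is hypothetical, and the field values (`"[UNK]"`, `"##"`, `100`) follow the BERT convention; they are assumptions, not Saga's documented defaults.

```ocaml
(* Local mirrors of the documented types, so the sketch compiles on its own. *)
type vocab = (string, int) Hashtbl.t

type config = {
  vocab : vocab;
  unk_token : string;
  continuing_subword_prefix : string;
  max_input_chars_per_word : int;
}

(* Hypothetical helper: assign IDs in list order, starting at 0. *)
let vocab_of_list tokens : vocab =
  let tbl = Hashtbl.create (List.length tokens) in
  List.iteri (fun id tok -> Hashtbl.replace tbl tok id) tokens;
  tbl

(* Values follow the BERT convention; they are assumptions, not
   Saga's documented defaults. *)
let example_config =
  {
    vocab = vocab_of_list [ "[UNK]"; "play"; "##ing" ];
    unk_token = "[UNK]";
    continuing_subword_prefix = "##";
    max_input_chars_per_word = 100;
  }
```

With the library itself, a record of this shape is what `create` takes as its argument.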

Model Creation

val create : config -> t

create config creates a new WordPiece model with the given configuration

val from_file : vocab_file:string -> t

from_file ~vocab_file loads a WordPiece model from a vocab.txt file with default settings
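The page does not spell out the vocab.txt layout. Assuming the usual BERT convention (one token per line, with the zero-based line number as the ID), the parsing step could look like the sketch below. `vocab_of_string` is a hypothetical name, and the function works on in-memory contents rather than a path so that it stays self-contained.

```ocaml
(* Assumed format: one token per line; the zero-based line number is the ID.
   This is the BERT convention, not something this page confirms. *)
let vocab_of_string (contents : string) : (string, int) Hashtbl.t =
  let tbl = Hashtbl.create 16 in
  String.split_on_char '\n' contents
  |> List.iteri (fun line_no line ->
         (* Trim stray whitespace such as a trailing '\r'. *)
         let tok = String.trim line in
         (* Blank lines add no entry but still use up their line number. *)
         if tok <> "" then Hashtbl.replace tbl tok line_no);
  tbl
```

Keeping the line number as the ID, even across blank lines, means the IDs stay aligned with what other tools reading the same file would assign.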

val from_file_with_config :
  vocab_file:string ->
  unk_token:string ->
  continuing_subword_prefix:string ->
  max_input_chars_per_word:int ->
  t

from_file_with_config ~vocab_file ~unk_token ~continuing_subword_prefix ~max_input_chars_per_word loads a WordPiece model from a vocab.txt file with custom configuration

val default : unit -> t

default () creates a WordPiece model with default configuration

Configuration Builder

module Builder : sig ... end

Tokenization

type token = {
  id : int;
  value : string;
  offsets : int * int;
}

Token with ID, string value, and character offsets

val tokenize : t -> string -> token list

tokenize model text tokenizes text into tokens using greedy longest-match-first

val token_to_id : t -> string -> int option

token_to_id model token returns the ID for a token

val id_to_token : t -> int -> string option

id_to_token model id returns the token for an ID
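Both lookups return an `option`, so callers have to decide what a miss means. The standalone sketch below works on a plain `(string, int) Hashtbl.t` instead of the opaque model type; `reverse` and `id_or_unk` are hypothetical helpers, and falling back to the unknown token's ID is one possible policy, not necessarily what Saga does internally.

```ocaml
(* Hypothetical standalone equivalents of the two lookups. *)
let token_to_id vocab token = Hashtbl.find_opt vocab token

(* Build the reverse table once rather than scanning for every ID. *)
let reverse vocab =
  let rev = Hashtbl.create (Hashtbl.length vocab) in
  Hashtbl.iter (fun tok id -> Hashtbl.replace rev id tok) vocab;
  rev

let id_to_token rev id = Hashtbl.find_opt rev id

(* One possible policy: map missing tokens to the unknown token's ID,
   or to 0 if the vocabulary has no unknown token at all. *)
let id_or_unk vocab ~unk_token tok =
  match token_to_id vocab tok with
  | Some id -> id
  | None -> Option.value (token_to_id vocab unk_token) ~default:0

let () =
  let vocab = Hashtbl.create 4 in
  List.iteri (fun i t -> Hashtbl.replace vocab t i) [ "[UNK]"; "play"; "##ing" ];
  let rev = reverse vocab in
  assert (id_or_unk vocab ~unk_token:"[UNK]" "play" = 1);
  assert (id_or_unk vocab ~unk_token:"[UNK]" "zzz" = 0);
  assert (id_to_token rev 2 = Some "##ing");
  assert (id_to_token rev 99 = None)
```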

Vocabulary Management

val get_vocab : t -> (string * int) list

get_vocab model returns the vocabulary as a list of (token, id) pairs
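The order of the returned pairs is not documented here, so code that needs the tokens in ID order (for example to print a vocab.txt-style listing) may want to sort explicitly. A minimal sketch, with the hypothetical helper names `sort_by_id` and `to_listing`:

```ocaml
(* Sort (token, id) pairs by ascending ID. *)
let sort_by_id (pairs : (string * int) list) : (string * int) list =
  List.sort (fun (_, a) (_, b) -> compare a b) pairs

(* Render the tokens one per line, in ID order. *)
let to_listing pairs =
  sort_by_id pairs |> List.map fst |> String.concat "\n"

let () =
  let pairs = [ ("##ing", 2); ("[UNK]", 0); ("play", 1) ] in
  assert (sort_by_id pairs = [ ("[UNK]", 0); ("play", 1); ("##ing", 2) ]);
  assert (to_listing pairs = "[UNK]\nplay\n##ing")
```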

val get_vocab_size : t -> int

get_vocab_size model returns the size of the vocabulary

val get_unk_token : t -> string

get_unk_token model returns the unknown token

val get_continuing_subword_prefix : t -> string

get_continuing_subword_prefix model returns the continuing subword prefix

val get_max_input_chars_per_word : t -> int

get_max_input_chars_per_word model returns the maximum input characters per word

Serialization

val save : t -> path:string -> ?name:string -> unit -> string

save model ~path ?name () saves the model's vocabulary to a vocab.txt file and returns the path of the written file

val read_file : vocab_file:string -> vocab

read_file ~vocab_file reads vocabulary from file

val read_bytes : bytes -> vocab

read_bytes bytes reads vocabulary from bytes

val to_yojson : t -> Yojson.Basic.t

to_yojson model converts model to JSON

val of_yojson : Yojson.Basic.t -> t

of_yojson json creates model from JSON

val from_bytes : bytes -> t

from_bytes bytes creates model from serialized bytes

Training

module Trainer : sig ... end

Conversion

val from_bpe : Bpe.t -> t

from_bpe bpe creates a WordPiece model from a BPE model
