Module Saga_tokenizers.Wordpiece

WordPiece tokenization module

WordPiece is the subword tokenization algorithm used by BERT. It uses a greedy longest-match-first algorithm to tokenize text.

Core Types

type t

WordPiece model

type vocab = (string, int) Hashtbl.t

Vocabulary mapping tokens to indices

type config = {
  vocab : vocab;
  unk_token : string;
  continuing_subword_prefix : string;
  max_input_chars_per_word : int;
}

WordPiece configuration
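As an illustration of how a configuration value might be assembled, the sketch below redeclares the two types locally so that it compiles on its own. The helper vocab_of_list and the field values ("[UNK]", "##", 100) are assumptions borrowed from the usual BERT conventions, not defaults confirmed by this module.

```ocaml
(* Local copies of the documented types, so the sketch is self-contained. *)
type vocab = (string, int) Hashtbl.t

type config = {
  vocab : vocab;
  unk_token : string;
  continuing_subword_prefix : string;
  max_input_chars_per_word : int;
}

(* Hypothetical helper: assign IDs by position in the list. *)
let vocab_of_list (tokens : string list) : vocab =
  let tbl = Hashtbl.create (List.length tokens) in
  List.iteri (fun id tok -> Hashtbl.replace tbl tok id) tokens;
  tbl

(* Assumed BERT-style values; the library's own defaults may differ. *)
let example_config =
  {
    vocab = vocab_of_list [ "[UNK]"; "play"; "##ing"; "##ed" ];
    unk_token = "[UNK]";
    continuing_subword_prefix = "##";
    max_input_chars_per_word = 100;
  }
```

With the real module, a record of this shape would presumably be passed to create.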

Model Creation

val create : config -> t

create config creates a new WordPiece model with the given configuration

val from_file : vocab_file:string -> t

from_file ~vocab_file loads a WordPiece model from a vocab.txt file with default settings

val from_file_with_config : vocab_file:string -> unk_token:string -> continuing_subword_prefix:string -> max_input_chars_per_word:int -> t

from_file_with_config ~vocab_file ~unk_token ~continuing_subword_prefix ~max_input_chars_per_word loads a WordPiece model from a vocab.txt file with custom configuration

val default : unit -> t

default () creates a WordPiece model with default configuration

Configuration Builder

module Builder : sig ... end

Tokenization

type token = {
  id : int;
  value : string;
  offsets : int * int;
}

Token with ID, string value, and character offsets

val tokenize : t -> string -> token list

tokenize model text tokenizes text into tokens using greedy longest-match-first
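The greedy longest-match-first strategy can be sketched for a single, already pre-split word as follows. This is a minimal stand-alone illustration, not this module's implementation: the function wordpiece_word and its labelled parameters are hypothetical, offsets are byte positions within the word, and a word with no complete match (or longer than max_chars bytes) collapses to the unknown token, as in the reference BERT tokenizer.

```ocaml
(* Local copy of the documented token type. *)
type token = { id : int; value : string; offsets : int * int }

(* Hypothetical sketch of greedy longest-match-first on one word. *)
let wordpiece_word ~vocab ~unk_token ~prefix ~max_chars word =
  let n = String.length word in
  (* The whole word becomes a single unknown token. *)
  let unk () =
    let id = Option.value (Hashtbl.find_opt vocab unk_token) ~default:0 in
    [ { id; value = unk_token; offsets = (0, n) } ]
  in
  if n > max_chars then unk ()
  else
    let rec loop start acc =
      if start >= n then List.rev acc
      else
        (* Try the longest candidate first, shrinking from the right. *)
        let rec find stop =
          if stop <= start then None
          else
            let piece = String.sub word start (stop - start) in
            (* Non-initial pieces carry the continuation prefix. *)
            let piece = if start > 0 then prefix ^ piece else piece in
            match Hashtbl.find_opt vocab piece with
            | Some id -> Some { id; value = piece; offsets = (start, stop) }
            | None -> find (stop - 1)
        in
        match find n with
        | None -> unk ()
        | Some tok -> loop (snd tok.offsets) (tok :: acc)
    in
    loop 0 []
```

With a vocabulary containing "un", "##aff" and "##able", the word "unaffable" splits into those three pieces, because at each position the longest entry present in the vocabulary wins.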

val token_to_id : t -> string -> int option

token_to_id model token returns the ID for token, or None if the token is not in the vocabulary

val id_to_token : t -> int -> string option

id_to_token model id returns the token for id, or None if no token has that ID
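The two lookups are inverses of each other. A minimal sketch over a plain vocab table, assuming the reverse direction is served by a second table built once (the names invert, lookup_id and lookup_token are hypothetical and say nothing about how t stores its data):

```ocaml
(* Build an ID -> token table from a token -> ID table. *)
let invert (vocab : (string, int) Hashtbl.t) : (int, string) Hashtbl.t =
  let inv = Hashtbl.create (Hashtbl.length vocab) in
  Hashtbl.iter (fun token id -> Hashtbl.replace inv id token) vocab;
  inv

(* Both lookups return an option, so a miss is a value rather than an exception. *)
let lookup_id (vocab : (string, int) Hashtbl.t) token = Hashtbl.find_opt vocab token
let lookup_token (inv : (int, string) Hashtbl.t) id = Hashtbl.find_opt inv id
```

Callers then pattern-match on Some and None instead of catching Not_found.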

Vocabulary Management

val get_vocab : t -> (string * int) list

get_vocab model returns the vocabulary as a list of (token, id) pairs
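The documentation does not state the order of the returned pairs, so code that needs a stable order may want to sort them. A small sketch on a plain vocab table (vocab_to_sorted_list is a hypothetical helper, not part of this module):

```ocaml
(* Flatten a token -> ID table into (token, id) pairs, ordered by ID. *)
let vocab_to_sorted_list (vocab : (string, int) Hashtbl.t) : (string * int) list =
  Hashtbl.fold (fun token id acc -> (token, id) :: acc) vocab []
  |> List.sort (fun (_, a) (_, b) -> compare a b)
```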

val get_vocab_size : t -> int

get_vocab_size model returns the size of the vocabulary

val get_unk_token : t -> string

get_unk_token model returns the unknown token

val get_continuing_subword_prefix : t -> string

get_continuing_subword_prefix model returns the continuing subword prefix

val get_max_input_chars_per_word : t -> int

get_max_input_chars_per_word model returns the maximum input characters per word

Serialization

val save : t -> path:string -> ?name:string -> unit -> string

save model ~path ?name () saves the model to a vocab.txt file and returns the path of the written file
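A rough sketch of what writing a vocab.txt file can look like, mirroring the shape of the signature above. Everything here is an assumption: save_vocab is a hypothetical helper on a plain table, path is treated as a directory, name as a file-name prefix, and tokens are written one per line in ascending ID order (so gaps in the IDs would not survive a round trip).

```ocaml
(* Hypothetical sketch: write one token per line, ordered by ID. *)
let save_vocab (vocab : (string, int) Hashtbl.t) ~path ?name () =
  (* Assumed naming scheme: "<name>-vocab.txt", or "vocab.txt" without a name. *)
  let file_name =
    match name with
    | Some n -> n ^ "-vocab.txt"
    | None -> "vocab.txt"
  in
  let file_path = Filename.concat path file_name in
  let entries =
    Hashtbl.fold (fun token id acc -> (id, token) :: acc) vocab []
    |> List.sort compare
  in
  let oc = open_out file_path in
  (* Close the channel even if a write fails. *)
  Fun.protect
    ~finally:(fun () -> close_out oc)
    (fun () ->
      List.iter (fun (_, token) -> output_string oc (token ^ "\n")) entries);
  file_path
```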

val read_file : vocab_file:string -> vocab

read_file ~vocab_file reads vocabulary from file

val read_bytes : bytes -> vocab

read_bytes bytes reads vocabulary from bytes
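For orientation, BERT-style vocab.txt files list one token per line, and a token's ID is its position in the file. The sketch below parses such content from a string or from bytes. The names vocab_of_string and vocab_of_bytes are hypothetical, and skipping blank lines and trimming surrounding whitespace are simplifying assumptions that the module's own reader may not share.

```ocaml
(* Hypothetical sketch: one token per line, IDs assigned in order of appearance. *)
let vocab_of_string (contents : string) : (string, int) Hashtbl.t =
  let tbl = Hashtbl.create 1024 in
  let next_id = ref 0 in
  String.split_on_char '\n' contents
  |> List.iter (fun line ->
         (* Trimming also removes a trailing '\r' from Windows line endings. *)
         let token = String.trim line in
         (* Assumption: blank lines are skipped and do not consume an ID. *)
         if token <> "" then begin
           Hashtbl.replace tbl token !next_id;
           incr next_id
         end);
  tbl

(* The bytes variant only needs a conversion first. *)
let vocab_of_bytes (b : bytes) : (string, int) Hashtbl.t =
  vocab_of_string (Bytes.to_string b)
```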

val to_yojson : t -> Yojson.Basic.t

to_yojson model converts model to JSON

val of_yojson : Yojson.Basic.t -> t

of_yojson json creates model from JSON

val from_bytes : bytes -> t

from_bytes bytes creates model from serialized bytes

Training

module Trainer : sig ... end

Conversion

val from_bpe : Bpe.t -> t

from_bpe bpe creates a WordPiece model from a BPE model