Module Saga_tokenizers.Wordpiece
WordPiece tokenization module
WordPiece is the subword tokenization algorithm used by BERT. It uses a greedy longest-match-first algorithm to tokenize text.
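The greedy longest-match-first loop can be sketched in plain OCaml. This is a standalone illustration using only the standard library, not Saga's implementation: starting at the beginning of a word, it repeatedly takes the longest vocabulary entry that matches, prefixing non-initial pieces with the continuing-subword marker, and falls back to the unknown token when no piece matches.

```ocaml
module SSet = Set.Make (String)

(* Greedy longest-match-first WordPiece over a set-backed vocabulary. *)
let wordpiece_tokenize ~vocab ~unk ~prefix word =
  let n = String.length word in
  let rec go start acc =
    if start >= n then List.rev acc
    else
      (* Try the longest candidate starting at [start] first. *)
      let rec longest stop =
        if stop <= start then None
        else
          let piece = String.sub word start (stop - start) in
          let piece = if start > 0 then prefix ^ piece else piece in
          if SSet.mem piece vocab then Some (piece, stop)
          else longest (stop - 1)
      in
      match longest n with
      | Some (piece, stop) -> go stop (piece :: acc)
      | None -> [ unk ]  (* no piece matches: the whole word becomes [unk] *)
  in
  go 0 []

let () =
  let vocab = SSet.of_list [ "un"; "##aff"; "##able" ] in
  wordpiece_tokenize ~vocab ~unk:"[UNK]" ~prefix:"##" "unaffable"
  |> List.iter print_endline
  (* prints: un / ##aff / ##able *)
```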
Core Types
type t

WordPiece model
type config = {
vocab : vocab;  (* vocabulary mapping tokens to ids *)
unk_token : string;  (* token emitted for unmatchable words, typically "[UNK]" *)
continuing_subword_prefix : string;  (* marker for non-initial subwords, typically "##" *)
max_input_chars_per_word : int;  (* words longer than this map to unk_token *)
}
WordPiece configuration
Model Creation
create config
creates a new WordPiece model with the given configuration
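A hypothetical construction from the configuration fields above. How a `vocab` value is built is not shown on this page, so `vocab` below is a placeholder; the field values are BERT's conventional choices, not documented Saga defaults.

```ocaml
open Saga_tokenizers

(* [vocab] is assumed to be a previously built vocabulary value. *)
let model : Wordpiece.t =
  Wordpiece.create
    {
      vocab;
      unk_token = "[UNK]";               (* BERT's conventional unknown token *)
      continuing_subword_prefix = "##";  (* BERT's conventional subword marker *)
      max_input_chars_per_word = 100;
    }
```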
from_file ~vocab_file
loads a WordPiece model from a vocab.txt file with default settings
val from_file_with_config :
vocab_file:string ->
unk_token:string ->
continuing_subword_prefix:string ->
max_input_chars_per_word:int ->
t
from_file_with_config ~vocab_file ~unk_token ~continuing_subword_prefix ~max_input_chars_per_word
loads a WordPiece model from a vocab.txt file with custom configuration
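For example, loading a model from a vocabulary file. The path is illustrative, and the explicit settings shown are BERT's conventional values, given here only as an assumption:

```ocaml
open Saga_tokenizers

(* With default settings: *)
let model = Wordpiece.from_file ~vocab_file:"vocab.txt"

(* Or with explicit configuration: *)
let model' =
  Wordpiece.from_file_with_config
    ~vocab_file:"vocab.txt"
    ~unk_token:"[UNK]"
    ~continuing_subword_prefix:"##"
    ~max_input_chars_per_word:100
```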
Configuration Builder
Tokenization
Token with ID, string value, and character offsets
tokenize model text
tokenizes text into tokens using greedy longest-match-first
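A sketch of a call, assuming a `model` value created as above. The returned tokens carry an id, string value, and offsets as described earlier, but their field names are not shown on this page, so the result is only bound here:

```ocaml
let tokens = Saga_tokenizers.Wordpiece.tokenize model "unaffable"
(* For a BERT-style vocabulary this would yield pieces such as
   "un", "##aff", "##able". *)
```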
Vocabulary Management
get_vocab model
returns the vocabulary as a list of (token, id) pairs
get_continuing_subword_prefix model
returns the continuing subword prefix
get_max_input_chars_per_word model
returns the maximum input characters per word
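The accessors compose naturally; a hedged sketch assuming a `model` value from the examples above:

```ocaml
let () =
  let open Saga_tokenizers in
  let vocab = Wordpiece.get_vocab model in
  let prefix = Wordpiece.get_continuing_subword_prefix model in
  Printf.printf "%d vocabulary entries; subword prefix %S\n"
    (List.length vocab) prefix
```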
Serialization
save model ~path ?name ()
saves the model to a vocab.txt file and returns the file path
to_yojson model
converts model to JSON
of_yojson json
creates model from JSON
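A round-trip sketch, again assuming `model` from above. The directory path is a placeholder, and whether `of_yojson` returns a plain `t` or a `result` is not shown on this page, so the final binding is an assumption:

```ocaml
let () =
  let open Saga_tokenizers in
  (* Write a vocab.txt under the given directory; [save] returns the path. *)
  let path = Wordpiece.save model ~path:"./tokenizer" () in
  print_endline path;
  (* JSON round-trip via to_yojson / of_yojson. *)
  let json = Wordpiece.to_yojson model in
  let _model' = Wordpiece.of_yojson json in
  ()
```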