Module Saga_tokenizers.Wordpiece
WordPiece tokenization module
WordPiece is the subword tokenization algorithm used by BERT. It uses a greedy longest-match-first algorithm to tokenize text.
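The greedy longest-match-first loop can be sketched in plain OCaml. This is a standalone illustration using only the standard library, not Saga's implementation: starting at the beginning of a word, it repeatedly takes the longest vocabulary entry that matches, prefixing non-initial pieces with the continuing-subword marker, and falls back to the unknown token when no piece matches.

```ocaml
module SSet = Set.Make (String)

(* Greedy longest-match-first WordPiece over a set-backed vocabulary. *)
let wordpiece_tokenize ~vocab ~unk ~prefix word =
  let n = String.length word in
  let rec go start acc =
    if start >= n then List.rev acc
    else
      (* Try the longest candidate starting at [start] first. *)
      let rec longest stop =
        if stop <= start then None
        else
          let piece = String.sub word start (stop - start) in
          let piece = if start > 0 then prefix ^ piece else piece in
          if SSet.mem piece vocab then Some (piece, stop)
          else longest (stop - 1)
      in
      match longest n with
      | Some (piece, stop) -> go stop (piece :: acc)
      | None -> [ unk ]  (* no piece matches: the whole word becomes [unk] *)
  in
  go 0 []

let () =
  let vocab = SSet.of_list [ "un"; "##aff"; "##able" ] in
  wordpiece_tokenize ~vocab ~unk:"[UNK]" ~prefix:"##" "unaffable"
  |> List.iter print_endline
  (* prints: un / ##aff / ##able *)
```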
Core Types
type t

WordPiece model
type config = {
vocab : vocab;  (* vocabulary mapping tokens to ids *)
unk_token : string;  (* token emitted for unmatchable words, typically "[UNK]" *)
continuing_subword_prefix : string;  (* marker for non-initial subwords, typically "##" *)
max_input_chars_per_word : int;  (* words longer than this map to unk_token *)
}
WordPiece configuration
Model Creation
create config
creates a new WordPiece model with the given configuration
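A hypothetical construction from the configuration fields above. How a `vocab` value is built is not shown on this page, so `vocab` below is a placeholder; the field values are BERT's conventional choices, not documented Saga defaults.

```ocaml
open Saga_tokenizers

(* [vocab] is assumed to be a previously built vocabulary value. *)
let model : Wordpiece.t =
  Wordpiece.create
    {
      vocab;
      unk_token = "[UNK]";               (* BERT's conventional unknown token *)
      continuing_subword_prefix = "##";  (* BERT's conventional subword marker *)
      max_input_chars_per_word = 100;
    }
```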
from_file ~vocab_file
loads a WordPiece model from a vocab.txt file with default settings
val from_file_with_config :
vocab_file:string ->
unk_token:string ->
continuing_subword_prefix:string ->
max_input_chars_per_word:int ->
t
from_file_with_config ~vocab_file ~unk_token ~continuing_subword_prefix ~max_input_chars_per_word
loads a WordPiece model from a vocab.txt file with custom configuration
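For example, loading a model from a vocabulary file. The path is illustrative, and the explicit settings shown are BERT's conventional values, given here only as an assumption:

```ocaml
open Saga_tokenizers

(* With default settings: *)
let model = Wordpiece.from_file ~vocab_file:"vocab.txt"

(* Or with explicit configuration: *)
let model' =
  Wordpiece.from_file_with_config
    ~vocab_file:"vocab.txt"
    ~unk_token:"[UNK]"
    ~continuing_subword_prefix:"##"
    ~max_input_chars_per_word:100
```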
Configuration Builder
Tokenization
Token with ID, string value, and character offsets
tokenize model text
tokenizes text into tokens using greedy longest-match-first
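A sketch of a call, assuming a `model` value created as above. The returned tokens carry an id, string value, and offsets as described earlier, but their field names are not shown on this page, so the result is only bound here:

```ocaml
let tokens = Saga_tokenizers.Wordpiece.tokenize model "unaffable"
(* For a BERT-style vocabulary this would yield pieces such as
   "un", "##aff", "##able". *)
```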
Vocabulary Management
get_vocab model
returns the vocabulary as a list of (token, id) pairs
get_continuing_subword_prefix model
returns the continuing subword prefix
get_max_input_chars_per_word model
returns the maximum input characters per word
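The accessors compose naturally; a hedged sketch assuming a `model` value from the examples above:

```ocaml
let () =
  let open Saga_tokenizers in
  let vocab = Wordpiece.get_vocab model in
  let prefix = Wordpiece.get_continuing_subword_prefix model in
  Printf.printf "%d vocabulary entries; subword prefix %S\n"
    (List.length vocab) prefix
```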
Serialization
save model ~path ?name ()
saves the model to a vocab.txt file and returns the file path
to_yojson model
converts model to JSON
of_yojson json
creates model from JSON
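A round-trip sketch, again assuming `model` from above. The directory path is a placeholder, and whether `of_yojson` returns a plain `t` or a `result` is not shown on this page, so the final binding is an assumption:

```ocaml
let () =
  let open Saga_tokenizers in
  (* Write a vocab.txt under the given directory; [save] returns the path. *)
  let path = Wordpiece.save model ~path:"./tokenizer" () in
  print_endline path;
  (* JSON round-trip via to_yojson / of_yojson. *)
  let json = Wordpiece.to_yojson model in
  let _model' = Wordpiece.of_yojson json in
  ()
```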