package saga
Text processing and NLP extensions for Nx
Module Saga_tokenizers.Bpe
Byte Pair Encoding (BPE) tokenization module
Core Types
BPE model
List of merge operations
type config = {
  vocab : vocab;
  merges : merges;
  cache_capacity : int;
  dropout : float option;
  unk_token : string option;
  continuing_subword_prefix : string option;
  end_of_word_suffix : string option;
  fuse_unk : bool;
  byte_fallback : bool;
  ignore_merges : bool;
}
BPE configuration
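To make the fields concrete, the following is a minimal stand-alone sketch of a record with the same field names. The types chosen here for vocab and merges (a hash table and a list of pairs) and every value in example_config are assumptions made for illustration; the module's own definitions and defaults may differ.

```ocaml
(* Hypothetical stand-ins: the real [vocab] and [merges] types are defined by
   Saga_tokenizers.Bpe and may differ from these. *)
type vocab = (string, int) Hashtbl.t
type merges = (string * string) list

type config = {
  vocab : vocab;
  merges : merges;
  cache_capacity : int;
  dropout : float option;
  unk_token : string option;
  continuing_subword_prefix : string option;
  end_of_word_suffix : string option;
  fuse_unk : bool;
  byte_fallback : bool;
  ignore_merges : bool;
}

(* Illustrative values only; these are not the library's defaults. *)
let example_config : config =
  let vocab = Hashtbl.create 16 in
  (* Assign ids 0, 1, 2, ... in list order. *)
  List.iteri
    (fun id tok -> Hashtbl.replace vocab tok id)
    [ "<unk>"; "l"; "o"; "w"; "lo"; "low" ];
  {
    vocab;
    merges = [ ("l", "o"); ("lo", "w") ];
    cache_capacity = 10_000;
    dropout = None;
    unk_token = Some "<unk>";
    continuing_subword_prefix = None;
    end_of_word_suffix = None;
    fuse_unk = false;
    byte_fallback = false;
    ignore_merges = false;
  }
```

The option-typed fields are left as None when the corresponding feature is not wanted, which is why the sketch only sets unk_token.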
Model Creation
from_files ~vocab_file ~merges_file loads a BPE model from vocab.json and merges.txt files.
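As an illustration of what loading a merges file involves, here is a minimal sketch of a parser that uses only the standard library. It assumes the common convention of one merge per line written as "left right", with an optional "#version" header line; this page does not state the exact format the module accepts, and parse_merges is a hypothetical helper, not part of Saga_tokenizers.Bpe.

```ocaml
(* Parse the text of a merges file into (left, right) pairs, in file order.
   Assumed format: one merge per line as "left right"; blank lines and a
   "#version" header line are skipped. *)
let parse_merges (contents : string) : (string * string) list =
  String.split_on_char '\n' contents
  |> List.filter_map (fun line ->
         let line = String.trim line in
         let is_header =
           String.length line >= 8 && String.sub line 0 8 = "#version"
         in
         if line = "" || is_header then None
         else
           match String.split_on_char ' ' line with
           | [ left; right ] -> Some (left, right)
           | _ -> failwith ("malformed merge line: " ^ line))
```

The order of the resulting list matters, because the position of a pair is what determines its merge priority.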
Configuration Builder
Tokenization
Token with ID, string value, and character offsets
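The sketch below shows how such tokens can be produced by a basic BPE merge loop. It is a simplified stand-alone illustration, not the module's implementation: the token record, the association-list vocabulary, the unk_id parameter, and all function names are hypothetical, and offsets are byte offsets, which coincide with character offsets only for ASCII input. Dropout, caching, prefixes, and suffixes are ignored.

```ocaml
(* Hypothetical token record; field names are illustrative. *)
type token = { id : int; value : string; offsets : int * int }

(* Rank of a pair = its index in the merge list; lower ranks merge first. *)
let rank merges pair =
  let rec go i = function
    | [] -> None
    | p :: rest -> if p = pair then Some i else go (i + 1) rest
  in
  go 0 merges

(* Merge every adjacent occurrence of (a, b) in one left-to-right pass. *)
let rec apply_merge (a, b) = function
  | x :: y :: rest when x = a && y = b -> (a ^ b) :: apply_merge (a, b) rest
  | x :: rest -> x :: apply_merge (a, b) rest
  | [] -> []

(* Best-ranked adjacent pair currently present in the symbol list, if any. *)
let best_pair merges symbols =
  let rec go best = function
    | x :: (y :: _ as rest) ->
        let best =
          match (rank merges (x, y), best) with
          | Some r, Some (r', _) when r < r' -> Some (r, (x, y))
          | Some r, None -> Some (r, (x, y))
          | _ -> best
        in
        go best rest
    | _ -> best
  in
  Option.map snd (go None symbols)

(* Split a word into single-byte symbols, apply merges until none applies,
   then attach ids and offsets. Symbols missing from [vocab] get [unk_id]. *)
let tokenize ~vocab ~merges ~unk_id (word : string) : token list =
  let symbols =
    List.init (String.length word) (fun i -> String.make 1 word.[i])
  in
  let rec loop symbols =
    match best_pair merges symbols with
    | None -> symbols
    | Some pair -> loop (apply_merge pair symbols)
  in
  let _, tokens =
    List.fold_left
      (fun (start, acc) value ->
        let stop = start + String.length value in
        let id = Option.value (List.assoc_opt value vocab) ~default:unk_id in
        (stop, { id; value; offsets = (start, stop) } :: acc))
      (0, []) (loop symbols)
  in
  List.rev tokens
```

With merges [("l", "o"); ("lo", "w")], the word "lowx" is first merged to "lo", "w", "x" and then to "low", "x"; the trailing "x" is not in the example vocabulary used in the test, so it receives unk_id.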
Vocabulary Management
get_vocab model returns the vocabulary as a list of (token, id) pairs.
get_unk_token model returns the unknown token, if configured.
get_continuing_subword_prefix model returns the continuing subword prefix, if configured.
get_end_of_word_suffix model returns the end-of-word suffix, if configured.
Cache Management
Serialization
save model ~path ?name () saves the model to vocab.json and merges.txt files.
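For the merges half of serialization, a minimal sketch of rendering a merge list as text might look like the following. The one-pair-per-line "left right" layout is an assumption based on the common merges.txt convention, and render_merges is a hypothetical helper; this page does not specify what save actually writes.

```ocaml
(* Render merges as text, one "left right" pair per line (assumed format).
   Pairs are written in list order so merge priority is preserved. *)
let render_merges (merges : (string * string) list) : string =
  merges
  |> List.map (fun (left, right) -> left ^ " " ^ right ^ "\n")
  |> String.concat ""
```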
read_files ~vocab_file ~merges_file reads vocabulary and merges from files.
Training