package saga
Module Saga_tokenizers.Tokenizer
Main tokenizer type.
type padding_config = {
  direction : direction;
  pad_id : int;
  pad_type_id : int;
  pad_token : string;
  length : int option;
  pad_to_multiple_of : int option;
}
Record for padding config.
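As a rough illustration of how the length and pad_to_multiple_of fields might interact, the sketch below pads a list of token ids using only the standard library. The direction constructors (Left, Right), the helper names target_length and pad_ids, and the rounding rule are assumptions borrowed from common tokenizer libraries, not confirmed behaviour of Saga.

```ocaml
(* Hypothetical stand-in for the library's direction type. *)
type direction = Left | Right

(* Same fields as the record documented above. *)
type padding_config = {
  direction : direction;
  pad_id : int;
  pad_type_id : int;
  pad_token : string;
  length : int option;
  pad_to_multiple_of : int option;
}

(* Assumed rule: start from the fixed length if one is set (never below the
   current length), then round up to the next multiple when requested. *)
let target_length cfg current =
  let base =
    match cfg.length with Some n -> max n current | None -> current
  in
  match cfg.pad_to_multiple_of with
  | Some m when m > 0 -> (base + m - 1) / m * m
  | _ -> base

(* Append or prepend pad_id until the target length is reached. *)
let pad_ids cfg ids =
  let n = List.length ids in
  let fill = List.init (target_length cfg n - n) (fun _ -> cfg.pad_id) in
  match cfg.direction with
  | Left -> fill @ ids
  | Right -> ids @ fill
```

Under these assumptions, five ids padded on the right to a multiple of 4 gain three trailing pad ids.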
Record for truncation config.
Load a pretrained tokenizer. Returns a result, and optional arguments fall back to their defaults.
Configuration
Set normalizer.
Get normalizer.
Set pre-tokenizer.
Get pre-tokenizer.
Set post-processor.
Get post-processor.
Set decoder.
Get decoder.
Padding and Truncation
Enable padding using a padding config record.
Get the current padding config.
Enable truncation using a truncation config record.
Get the current truncation config.
Vocabulary Management
Add tokens and return the number of tokens added.
Add special tokens.
Get the vocabulary as a list; the optional argument has a default.
Get added tokens.
Training
Train from files.
val train_from_iterator :
  t ->
  string Seq.t ->
  ?trainer:Trainers.t ->
  ?length:int ->
  unit ->
  unit
Train from a sequence of text strings.
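Since train_from_iterator takes its corpus as a string Seq.t, a small standard-library sketch of building one may help. The helper name lines_seq is hypothetical, the reading of ?length as an item count is an assumption, and the commented-out call only mirrors the signature shown above.

```ocaml
(* Yield the non-empty lines of an in-memory text as a string Seq.t. *)
let lines_seq (text : string) : string Seq.t =
  String.split_on_char '\n' text
  |> List.to_seq
  |> Seq.filter (fun line -> String.trim line <> "")

let corpus = lines_seq "the quick brown fox\n\njumps over the lazy dog\n"

(* Number of items in the sequence; assumed to be a sensible ?length value. *)
let corpus_length = Seq.fold_left (fun n _ -> n + 1) 0 corpus

(* With a tokenizer value tok : Tokenizer.t, the call would look like:
   Tokenizer.train_from_iterator tok corpus ~length:corpus_length () *)
```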
Encoding and Decoding
val encode :
  t ->
  sequence:(string, string list) Either.t ->
  ?pair:(string, string list) Either.t ->
  ?is_pretokenized:bool ->
  ?add_special_tokens:bool ->
  unit ->
  Encoding.t
Encode a single sequence, or a pair when ?pair is given. Each input is either a raw string (Either.Left) or a pretokenized string list (Either.Right).
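The two input forms accepted by encode can be built with the standard library's Either module (OCaml 4.12 or later). This is a minimal sketch: the type alias sequence and the helper describe are hypothetical names introduced for illustration, and the commented-out call only follows the signature above.

```ocaml
(* Raw text goes on the Left, pretokenized words on the Right. *)
type sequence = (string, string list) Either.t

let raw : sequence = Either.Left "hello world"
let pretokenized : sequence = Either.Right [ "hello"; "world" ]

(* Hypothetical helper: report which form an input takes. *)
let describe (s : sequence) : string =
  match s with
  | Either.Left text -> Printf.sprintf "raw(%d chars)" (String.length text)
  | Either.Right words ->
      Printf.sprintf "pretokenized(%d words)" (List.length words)

(* With a tokenizer value tok : Tokenizer.t, the call would look like:
   Tokenizer.encode tok ~sequence:raw () *)
```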
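val encode_batch :
  t ->
  input:
    ((string, string list) Either.t,
     (string, string list) Either.t * (string, string list) Either.t)
    Either.t
    list ->
  ?is_pretokenized:bool ->
  ?add_special_tokens:bool ->
  unit ->
  Encoding.t list
Encode a batch of inputs. Each item is either a single sequence (Either.Left) or a pair of sequences (Either.Right), and each sequence is either a raw string or a pretokenized string list.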
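The nested Either type of the input argument is easier to read once it is named. The sketch below uses hypothetical aliases (sequence, item) and hypothetical constructors (single, pair) to build a mixed batch with the standard library only; the commented-out call follows the signature above.

```ocaml
type sequence = (string, string list) Either.t

(* A batch item: one sequence (Left) or a pair of sequences (Right). *)
type item = (sequence, sequence * sequence) Either.t

(* Hypothetical constructors for raw-text inputs. *)
let single (text : string) : item = Either.Left (Either.Left text)
let pair (a : string) (b : string) : item =
  Either.Right (Either.Left a, Either.Left b)

let batch : item list =
  [ single "What is OCaml?"; pair "question text" "context text" ]

(* Count how many items in a batch are pairs. *)
let count_pairs (items : item list) : int =
  List.length (List.filter Either.is_right items)

(* With a tokenizer value tok : Tokenizer.t, the call would look like:
   Tokenizer.encode_batch tok ~input:batch () *)
```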
val decode :
  t ->
  int list ->
  ?skip_special_tokens:bool ->
  ?clean_up_tokenization_spaces:bool ->
  unit ->
  string
Decode a list of token ids into a string. Both optional flags have defaults.
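The optional arguments of decode sit before a final unit argument, so a call must end with () for the defaults to apply. The toy function below only imitates that argument shape over a four-word vocabulary; it is not the library's decoder, and the vocabulary, the is_special test, and the default value chosen here are all assumptions.

```ocaml
(* Toy vocabulary and special-token test, for illustration only. *)
let vocab = [| "[CLS]"; "hello"; "world"; "[SEP]" |]
let is_special tok = String.length tok > 1 && tok.[0] = '['

(* Stand-in with the same calling convention: ids, optional flag, then (). *)
let decode ids ?(skip_special_tokens = true) () =
  ids
  |> List.map (fun id -> vocab.(id))
  |> List.filter (fun tok -> not (skip_special_tokens && is_special tok))
  |> String.concat " "

(* Passing () applies the default; a label overrides it. *)
let with_default = decode [ 0; 1; 2; 3 ] ()
let with_specials = decode [ 0; 1; 2; 3 ] ~skip_special_tokens:false ()
```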
val decode_batch :
  t ->
  int list list ->
  ?skip_special_tokens:bool ->
  ?clean_up_tokenization_spaces:bool ->
  unit ->
  string list
Decode a batch of token id lists into strings. Both optional flags have defaults.
val post_process :
  t ->
  encoding:Encoding.t ->
  ?pair:Encoding.t ->
  ?add_special_tokens:bool ->
  unit ->
  Encoding.t
Apply post-processing to an encoding manually, optionally together with a paired encoding.
Serialization
Save the tokenizer to a file, using the default pretty setting unless overridden.