Module Saga_tokenizers.Tokenizer
type t

Main tokenizer type.
type padding_config = {
  direction : direction;
  pad_id : int;
  pad_type_id : int;
  pad_token : string;
  length : int option;
  pad_to_multiple_of : int option;
}

Record for padding config.
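A minimal sketch of constructing this record. The `direction` variant is an assumption (its constructors are not shown on this page), and the type is redeclared locally only so the sketch is self-contained:

```ocaml
(* ASSUMPTION: the [direction] variant's constructors are not documented
   here; [Left]/[Right] are illustrative guesses. *)
type direction = Left | Right

(* Local copy of the record documented above, so this sketch compiles
   on its own. *)
type padding_config = {
  direction : direction;
  pad_id : int;
  pad_type_id : int;
  pad_token : string;
  length : int option;
  pad_to_multiple_of : int option;
}

(* Fixed-length padding to 128 tokens with a "[PAD]" token of id 0. *)
let cfg =
  {
    direction = Right;
    pad_id = 0;
    pad_type_id = 0;
    pad_token = "[PAD]";
    length = Some 128;
    pad_to_multiple_of = None;
  }
```

Setting `length = None` would instead pad each batch to its longest sequence, which is the usual dynamic-padding setup.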
Record for truncation config.
Load from pretrained, returning a result; optional arguments have defaults.
Configuration
Set normalizer.
Get normalizer.
Set pre-tokenizer.
Get pre-tokenizer.
Set post-processor.
Get post-processor.
Set decoder.
Get decoder.
Padding and Truncation
Enable padding with record config.
Get padding config.
Enable truncation with record config.
Get truncation config.
Vocabulary Management
Add tokens, return count added.
Add special tokens.
Get vocab list with default.
Get added tokens.
Training
Train from files.
val train_from_iterator : 
  t ->
  string Seq.t ->
  ?trainer:Trainers.t ->
  ?length:int ->
  unit ->
  unit

Train from text sequence.
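Because training consumes a `string Seq.t`, any lazy producer can feed it. A sketch building one from an in-memory list; the training call itself is shown only as a comment, since it assumes a `tokenizer` value and trainer not constructed here:

```ocaml
(* Build the [string Seq.t] that [train_from_iterator] expects.
   Any lazy producer works: file lines, database rows, etc. *)
let corpus : string Seq.t =
  List.to_seq [ "hello world"; "tokenizers in ocaml"; "lazy sequences" ]

(* Illustrative call, assuming a [tokenizer] value:
     train_from_iterator tokenizer corpus ~length:3 () *)
let corpus_size = Seq.length corpus
```

Passing `?length` lets the trainer report progress even though the sequence itself does not know its size.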
Encoding and Decoding

val encode : 
  t ->
  sequence:(string, string list) Either.t ->
  ?pair:(string, string list) Either.t ->
  ?is_pretokenized:bool ->
  ?add_special_tokens:bool ->
  unit ->
  Encoding.t

Encode single or pair, allowing pretokenized lists.
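The `(string, string list) Either.t` argument accepts either raw text or an already-split word list. A sketch building both shapes with the stdlib `Either` module; the `encode` call is illustrative only, since it assumes a `tokenizer` value:

```ocaml
(* Raw text goes in as [Left]. *)
let raw : (string, string list) Either.t = Either.Left "Hello world"

(* A pretokenized word list goes in as [Right]; pair it with
   ~is_pretokenized:true. *)
let pretok : (string, string list) Either.t =
  Either.Right [ "Hello"; "world" ]

(* Illustrative call, assuming a [tokenizer] value:
     encode tokenizer ~sequence:raw ~add_special_tokens:true () *)
```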
val encode_batch : 
  t ->
  input:
    ((string, string list) Either.t,
      (string, string list) Either.t * (string, string list) Either.t)
      Either.t
      list ->
  ?is_pretokenized:bool ->
  ?add_special_tokens:bool ->
  unit ->
  Encoding.t list

Batch encode with flexible inputs.
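Each batch element is itself an `Either`: `Left` for a single sequence, `Right` for a sequence pair. A sketch assembling a mixed batch with the stdlib `Either` module; the `encode_batch` call is illustrative only, since it assumes a `tokenizer` value:

```ocaml
(* One single raw-text sequence... *)
let single = Either.Left (Either.Left "a single sentence")

(* ...and one raw-text pair, matching the [input] type above. *)
let pair =
  Either.Right (Either.Left "premise text", Either.Left "hypothesis text")

let batch = [ single; pair ]

(* Illustrative call, assuming a [tokenizer] value:
     encode_batch tokenizer ~input:batch () *)
```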
val decode : 
  t ->
  int list ->
  ?skip_special_tokens:bool ->
  ?clean_up_tokenization_spaces:bool ->
  unit ->
  string

Decode with defaults.
val decode_batch : 
  t ->
  int list list ->
  ?skip_special_tokens:bool ->
  ?clean_up_tokenization_spaces:bool ->
  unit ->
  string list

Batch decode with defaults.
val post_process : 
  t ->
  encoding:Encoding.t ->
  ?pair:Encoding.t ->
  ?add_special_tokens:bool ->
  unit ->
  Encoding.t

Post-process manually.
Serialization
Save to a file; pretty-printing is enabled by default.