Module `Saga_tokenizers.Processors`

Post-processing module for tokenization output.

Post-processors handle special tokens and formatting after tokenization, such as adding CLS and SEP tokens for BERT, or handling sentence pairs.

type encoding = {
  ids : int array;
  type_ids : int array;
  tokens : string array;
  offsets : (int * int) array;
  special_tokens_mask : int array;
  attention_mask : int array;
  overflowing : encoding list;
  sequence_ranges : (int * int * int) list;
}

Type representing an encoding to be processed
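As an illustration, the sketch below copies the record locally so it compiles without the library, builds one value by hand, and checks that the per-token arrays line up. That the arrays are parallel (one entry per token) is an assumption, not something this page states, and the IDs and offsets are made up. `is_consistent` and `example` are hypothetical names.

```ocaml
(* Local copy of the record shown above, so the snippet compiles on its own. *)
type encoding = {
  ids : int array;
  type_ids : int array;
  tokens : string array;
  offsets : (int * int) array;
  special_tokens_mask : int array;
  attention_mask : int array;
  overflowing : encoding list;
  sequence_ranges : (int * int * int) list;
}

(* Assumption: every per-token array has one entry per token. *)
let is_consistent (e : encoding) : bool =
  let n = Array.length e.ids in
  Array.length e.type_ids = n
  && Array.length e.tokens = n
  && Array.length e.offsets = n
  && Array.length e.special_tokens_mask = n
  && Array.length e.attention_mask = n

(* Two ordinary tokens; IDs and character offsets are illustrative only. *)
let example : encoding = {
  ids = [| 7592; 2088 |];
  type_ids = [| 0; 0 |];
  tokens = [| "hello"; "world" |];
  offsets = [| (0, 5); (6, 11) |];
  special_tokens_mask = [| 0; 0 |];
  attention_mask = [| 1; 1 |];
  overflowing = [];
  sequence_ranges = [];
}
```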

type t

Main post-processor type

Constructors

val bert : sep:(string * int) -> cls:(string * int) -> unit -> t

Create a BERT post-processor.

parameter sep
Separator token and ID

parameter cls
Classification token and ID
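To make the effect concrete, here is a minimal stand-alone sketch of the conventional BERT layout: `[CLS] A [SEP]` for a single sequence and `[CLS] A [SEP] B [SEP]` for a pair, with type ID 0 for the first segment and 1 for the second. It does not call the library; `tok`, `mk` and `bert_layout` are hypothetical names, the token IDs are illustrative, and it is an assumption that `bert` produces exactly this layout.

```ocaml
(* Simplified token: surface form, vocabulary ID, and segment (type) ID. *)
type tok = { token : string; id : int; type_id : int }

let mk type_id (token, id) = { token; id; type_id }

(* [CLS] A [SEP] for one sequence, [CLS] A [SEP] B [SEP] for a pair.
   The first segment gets type ID 0, the second gets type ID 1. *)
let bert_layout ~sep ~cls ?pair (a : (string * int) list) : tok list =
  let first = (mk 0 cls :: List.map (mk 0) a) @ [ mk 0 sep ] in
  match pair with
  | None -> first
  | Some b -> first @ List.map (mk 1) b @ [ mk 1 sep ]
```

Under this layout a single sequence gains two tokens and a pair gains three.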

val roberta : 
  sep:(string * int) ->
  cls:(string * int) ->
  ?trim_offsets:bool ->
  ?add_prefix_space:bool ->
  unit ->
  t

Create a RoBERTa post-processor.

parameter sep
Separator token and ID

parameter cls
Classification token and ID

parameter trim_offsets
Whether to trim offsets (default: true)

parameter add_prefix_space
Whether to add a prefix space (default: true)

val byte_level : ?trim_offsets:bool -> unit -> t

Create a byte-level post-processor.

parameter trim_offsets
Whether to trim offsets (default: true)

val template : 
  single:string ->
  ?pair:string ->
  ?special_tokens:(string * int) list ->
  unit ->
  t

Create a template post-processor.

parameter single
Template for single sequences (e.g., "CLS $A SEP")

parameter pair
Template for sequence pairs (e.g., "CLS $A SEP $B SEP")

parameter special_tokens
List of special tokens with their IDs
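The following sketch shows one way a template of this shape could be expanded into token IDs. It is a stand-alone illustration, not the library's implementation: `expand_template` is a hypothetical helper, the space-separated syntax is inferred from the examples above, and the IDs are made up.

```ocaml
(* Expand a space-separated template such as "CLS $A SEP" into token IDs.
   "$A" and "$B" stand for the first and second sequence; every other piece
   must be listed in [special_tokens]. *)
let expand_template ~template ~special_tokens ~a ?(b = []) () : int list =
  String.split_on_char ' ' template
  |> List.filter (fun piece -> piece <> "")
  |> List.concat_map (fun piece ->
         match piece with
         | "$A" -> a
         | "$B" -> b
         | special -> (
             match List.assoc_opt special special_tokens with
             | Some id -> [ id ]
             | None -> invalid_arg ("unknown special token: " ^ special)))

(* Single-sequence and pair templates taken from the parameter docs above. *)
let special_tokens = [ ("CLS", 101); ("SEP", 102) ]
let single_ids = expand_template ~template:"CLS $A SEP" ~special_tokens ~a:[ 7; 8 ] ()
let pair_ids =
  expand_template ~template:"CLS $A SEP $B SEP" ~special_tokens ~a:[ 7 ] ~b:[ 9 ] ()
```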

val sequence : t list -> t

Combine multiple post-processors into one that applies them in order.
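Conceptually, chaining processors is a left fold: the output of each step becomes the input of the next. The sketch below models a post-processor as a plain function to show the idea. It is an analogy under that assumption and does not use the library's `t`; `add_cls`, `add_sep` and `both` are hypothetical names.

```ocaml
(* Model a processing step as a function and chain steps left to right. *)
let sequence (steps : ('a -> 'a) list) : 'a -> 'a =
 fun input -> List.fold_left (fun acc step -> step acc) input steps

(* Two toy steps operating on a list of token strings. *)
let add_cls tokens = "[CLS]" :: tokens
let add_sep tokens = tokens @ [ "[SEP]" ]

(* Applies add_cls first, then add_sep. *)
let both = sequence [ add_cls; add_sep ]
```

An empty list of steps yields the identity function in this model.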

Operations

val process : t -> encoding list -> add_special_tokens:bool -> encoding list

Process encodings with the post-processor.

parameter t
The post-processor

parameter encodings
List of encodings to process

parameter add_special_tokens
Whether to add special tokens

returns
Processed encodings

val added_tokens : t -> is_pair:bool -> int

Get the number of tokens added by this post-processor.

parameter t
The post-processor

parameter is_pair
Whether processing a pair of sequences

returns
Number of added tokens
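A typical use for such a count is reserving room for special tokens before truncating. The sketch below is a stand-alone illustration of that arithmetic: `content_budget` and `truncate_to` are hypothetical helpers, and the `added` value would come from `added_tokens` in real code.

```ocaml
(* Tokens left for actual content once special tokens are accounted for. *)
let content_budget ~max_length ~added = max 0 (max_length - added)

(* Keep at most [budget] tokens from the front of the list. *)
let truncate_to ~budget tokens = List.filteri (fun i _ -> i < budget) tokens

(* With a maximum length of 4 and 2 added tokens, 2 content tokens remain. *)
let kept =
  truncate_to ~budget:(content_budget ~max_length:4 ~added:2) [ "a"; "b"; "c" ]
```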

Serialization

val to_json : t -> Yojson.Basic.t

Convert post-processor to JSON representation

val of_json : Yojson.Basic.t -> t

Create post-processor from JSON representation
