package saga

Module Saga_tokenizers.Processors

Post-processing module for tokenization output.

Post-processors handle special tokens and formatting after tokenization, such as adding CLS and SEP tokens for BERT, or handling sentence pairs.

type encoding = {
  ids : int array;
  type_ids : int array;
  tokens : string array;
  offsets : (int * int) array;
  special_tokens_mask : int array;
  attention_mask : int array;
  overflowing : encoding list;
  sequence_ranges : (int * int * int) list;
}

The encoding record that a post-processor consumes and produces.
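As a rough illustration of how the fields line up, the sketch below builds one encoding value by hand. The record is redeclared locally only so the snippet compiles on its own. The token IDs, the reading of offsets as (start, end) positions in the input, and the `is_aligned` helper are all assumptions made for the example, not part of the module. The meaning of `sequence_ranges` is not documented here, so it is left empty.

```ocaml
(* Local copy of the record so the snippet compiles on its own;
   real code would use Saga_tokenizers.Processors.encoding instead. *)
type encoding = {
  ids : int array;
  type_ids : int array;
  tokens : string array;
  offsets : (int * int) array;
  special_tokens_mask : int array;
  attention_mask : int array;
  overflowing : encoding list;
  sequence_ranges : (int * int * int) list;
}

(* Hypothetical two-token encoding of "hello world"; the IDs are made up. *)
let example : encoding = {
  ids = [| 7592; 2088 |];
  type_ids = [| 0; 0 |];
  tokens = [| "hello"; "world" |];
  offsets = [| (0, 5); (6, 11) |];   (* assumed (start, end) positions *)
  special_tokens_mask = [| 0; 0 |];  (* no special tokens added yet *)
  attention_mask = [| 1; 1 |];       (* every token is attended to *)
  overflowing = [];
  sequence_ranges = [];
}

(* Hypothetical helper: the per-token arrays are assumed to share one length. *)
let is_aligned (e : encoding) =
  let n = Array.length e.ids in
  Array.length e.type_ids = n
  && Array.length e.tokens = n
  && Array.length e.offsets = n
  && Array.length e.special_tokens_mask = n
  && Array.length e.attention_mask = n
```

A check such as `is_aligned` can be handy in tests when encodings are assembled by hand.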

type t

The main post-processor type.

Constructors

val bert : sep:(string * int) -> cls:(string * int) -> unit -> t

Create a BERT post-processor.

  • parameter sep

    Separator token and ID

  • parameter cls

    Classification token and ID
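Going by the signature, a call would look like `bert ~sep:("[SEP]", 102) ~cls:("[CLS]", 101) ()`. To make the effect concrete, the sketch below reproduces the conventional BERT layout on plain lists of IDs: `[CLS] A [SEP]` for one sequence and `[CLS] A [SEP] B [SEP]` for a pair. It is a standalone illustration, not this module's implementation. The function `bert_layout`, the IDs 101 and 102, and the type-ID assignment (0 for the first segment, 1 for the second) are assumptions.

```ocaml
(* Illustrative special tokens; a real vocabulary defines its own IDs. *)
let cls = ("[CLS]", 101)
let sep = ("[SEP]", 102)

(* Hypothetical helper: lay out [CLS] A [SEP], or [CLS] A [SEP] B [SEP]
   when [pair] is given. Returns (ids, type_ids, special_tokens_mask). *)
let bert_layout ?pair (a : int list) =
  let special id ty = (id, ty, 1) in   (* mask = 1 marks a special token *)
  let plain ty id = (id, ty, 0) in     (* mask = 0 marks an ordinary token *)
  let first =
    (special (snd cls) 0 :: List.map (plain 0) a) @ [ special (snd sep) 0 ]
  in
  let second =
    match pair with
    | None -> []
    | Some b -> List.map (plain 1) b @ [ special (snd sep) 1 ]
  in
  let all = first @ second in
  ( List.map (fun (id, _, _) -> id) all,
    List.map (fun (_, ty, _) -> ty) all,
    List.map (fun (_, _, m) -> m) all )
```

With `a = [10; 20]` and `pair = [30; 40]`, the IDs come out as `[101; 10; 20; 102; 30; 40; 102]`, with the second segment and its closing separator carrying type ID 1.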

val roberta : sep:(string * int) -> cls:(string * int) -> ?trim_offsets:bool -> ?add_prefix_space:bool -> unit -> t

Create a RoBERTa post-processor.

  • parameter sep

    Separator token and ID

  • parameter cls

    Classification token and ID

  • parameter trim_offsets

    Whether to trim offsets (default: true)

  • parameter add_prefix_space

    Whether to add prefix space (default: true)

val byte_level : ?trim_offsets:bool -> unit -> t

Create a byte-level post-processor.

  • parameter trim_offsets

    Whether to trim offsets (default: true)

val template : single:string -> ?pair:string -> ?special_tokens:(string * int) list -> unit -> t

Create a template post-processor.

  • parameter single

    Template for single sequences (e.g., "CLS $A SEP")

  • parameter pair

    Template for sequence pairs (e.g., "CLS $A SEP $B SEP")

  • parameter special_tokens

    List of special tokens with their IDs
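The idea behind a template can be sketched without the library: split the template into pieces, substitute the input sequences for `$A` and `$B`, and look every other word up in the special-token list. The code below is a minimal illustration under that assumption. The `piece` type, `parse_template`, and `expand` are hypothetical names, and the exact template grammar accepted by `template` may differ (this sketch only handles space-separated words).

```ocaml
(* One piece of a parsed template. *)
type piece = Seq_a | Seq_b | Special of string

(* Split a template such as "CLS $A SEP" on spaces and classify each word. *)
let parse_template (s : string) : piece list =
  String.split_on_char ' ' s
  |> List.filter (fun w -> w <> "")
  |> List.map (function "$A" -> Seq_a | "$B" -> Seq_b | w -> Special w)

(* Expand the pieces into token IDs. Special tokens are looked up in
   [special_tokens]; List.assoc raises Not_found for an unknown name. *)
let expand ~special_tokens ~a ?(b = []) pieces =
  List.concat_map
    (function
      | Seq_a -> a
      | Seq_b -> b
      | Special name -> [ List.assoc name special_tokens ])
    pieces
```

For example, with `special_tokens = [("CLS", 101); ("SEP", 102)]`, expanding `"CLS $A SEP"` over `a = [10; 20]` gives `[101; 10; 20; 102]`.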

val sequence : t list -> t

Combine multiple post-processors into one that applies them in sequence.
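One plausible reading of "in sequence" is that each post-processor receives the output of the previous one, in list order. The sketch below models that with plain functions over token IDs. The `step` type, `run_in_sequence`, and the two example steps are hypothetical stand-ins, and the left-to-right order is an assumption rather than documented behavior.

```ocaml
(* Stand-in for a post-processor: a function over token IDs. *)
type step = int list -> int list

(* Apply each step to the output of the previous one, left to right. *)
let run_in_sequence (steps : step list) (ids : int list) : int list =
  List.fold_left (fun acc step -> step acc) ids steps

(* Two toy steps with illustrative IDs. *)
let add_cls ids = 101 :: ids
let add_sep ids = ids @ [ 102 ]
```

Under this model, `run_in_sequence [ add_cls; add_sep ] [ 10; 20 ]` yields `[101; 10; 20; 102]`, and an empty list of steps leaves the input unchanged.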

Operations

val process : t -> encoding list -> add_special_tokens:bool -> encoding list

Process encodings with the post-processor.

  • parameter t

    The post-processor

  • parameter encodings

    List of encodings to process

  • parameter add_special_tokens

    Whether to add special tokens

  • returns

    Processed encodings

val added_tokens : t -> is_pair:bool -> int

Get the number of tokens added by this post-processor.

  • parameter t

    The post-processor

  • parameter is_pair

    Whether processing a pair of sequences

  • returns

    Number of added tokens
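A typical use for this count is budgeting: when a model accepts at most a fixed number of tokens, the special tokens have to be subtracted before truncating the input. The sketch below works this out for the conventional BERT layout, where a single sequence gains 2 tokens and a pair gains 3. Both functions are hypothetical helpers written for the example. With the library, the count would come from `added_tokens` instead of the hard-coded values.

```ocaml
(* Special tokens in the conventional BERT layout:
   [CLS] A [SEP] adds 2, [CLS] A [SEP] B [SEP] adds 3. *)
let bert_added_tokens ~is_pair = if is_pair then 3 else 2

(* Room left for ordinary tokens once special tokens are accounted for;
   never negative, even for a very small [max_length]. *)
let content_budget ~max_length ~is_pair =
  max 0 (max_length - bert_added_tokens ~is_pair)
```

With a limit of 512 tokens this leaves 510 for a single sequence and 509 for a pair.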

Serialization

val to_json : t -> Yojson.Basic.t

Convert a post-processor to its JSON representation.

val of_json : Yojson.Basic.t -> t

Create a post-processor from its JSON representation.