package saga

Module Saga_tokenizers.Processors

Post-processing of tokenization output: adding special tokens (e.g. CLS/SEP), setting type IDs, and related formatting.

Post-processors add special tokens and formatting to tokenized sequences after the core tokenization step. They handle model-specific requirements like CLS and SEP tokens for BERT, sentence pair formatting, and type IDs.

Post-processing occurs after tokenization but before returning results to the user. The typical flow is:

  1. Core tokenization produces token IDs and strings.
  2. The post-processor adds special tokens (e.g., CLS, SEP).
  3. The post-processor sets type IDs and attention masks.
  4. The result is the final encoding, ready for model input.
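As a concrete illustration, here is a minimal sketch of that flow, using the bert constructor and process function documented below. The hand-built encoding stands in for the output of the core tokenizer; the token strings and IDs are made up.

  module P = Saga_tokenizers.Processors

  (* A hand-built encoding standing in for core tokenizer output (step 1). *)
  let raw : P.encoding = {
    P.ids = [| 7592; 2088 |];              (* hypothetical token IDs *)
    type_ids = [| 0; 0 |];
    tokens = [| "hello"; "world" |];
    offsets = [| (0, 5); (6, 11) |];
    special_tokens_mask = [| 0; 0 |];
    attention_mask = [| 1; 1 |];
    overflowing = [];
    sequence_ranges = [];
  }

  let () =
    (* Steps 2-4: add CLS/SEP, set type IDs and masks. *)
    let processor = P.bert ~sep:("SEP", 102) ~cls:("CLS", 101) () in
    match P.process processor [ raw ] ~add_special_tokens:true with
    | [ enc ] -> print_endline (String.concat " " (Array.to_list enc.P.tokens))
    | _ -> ()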

type encoding = {
  ids : int array;
  type_ids : int array;
  tokens : string array;
  offsets : (int * int) array;
  special_tokens_mask : int array;
  attention_mask : int array;
  overflowing : encoding list;
  sequence_ranges : (int * int * int) list;
}

Encoding representation for post-processing.

Contains all information needed for model input: token IDs, type IDs (segment IDs), token strings, character offsets, special token mask, attention mask, overflowing tokens (from truncation), and sequence range markers.
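For example, a small helper (a sketch, not part of the module) can walk these fields in parallel to inspect a processed encoding:

  (* Print each token with its type ID and attention-mask bit. *)
  let dump (e : Saga_tokenizers.Processors.encoding) =
    let open Saga_tokenizers.Processors in
    Array.iteri
      (fun i tok ->
        Printf.printf "%-10s type_id=%d mask=%d\n"
          tok e.type_ids.(i) e.attention_mask.(i))
      e.tokens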

type t

Post-processor that adds special tokens and formatting to encodings.

Processor Types

val bert : sep:(string * int) -> cls:(string * int) -> unit -> t

bert ~sep ~cls () creates BERT-style post-processor.

Formats sequences as: [CLS] sequence [SEP]
Formats pairs as: [CLS] sequence_a [SEP] sequence_b [SEP]

Sets type IDs: 0 for first sequence (including CLS and first SEP), 1 for second sequence.

  • parameter sep

    Separator token and ID (typically ("SEP", 102)).

  • parameter cls

    Classification token and ID (typically ("CLS", 101)).
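A minimal construction sketch, using the typical token/ID pairs listed above:

  let bert_processor =
    Saga_tokenizers.Processors.bert ~sep:("SEP", 102) ~cls:("CLS", 101) ()
  (* Singles become:  [CLS] sequence [SEP]
     Pairs become:    [CLS] sequence_a [SEP] sequence_b [SEP],
     with type ID 1 on the second sequence. *)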

val roberta : sep:(string * int) -> cls:(string * int) -> ?trim_offsets:bool -> ?add_prefix_space:bool -> unit -> t

roberta ~sep ~cls ?trim_offsets ?add_prefix_space () creates RoBERTa-style post-processor.

Similar to BERT but with different special token placement:
Formats sequences as: <s> sequence </s>
Formats pairs as: <s> sequence_a </s> </s> sequence_b </s>

  • parameter sep

    Separator/end token and ID (typically ("</s>", 2)).

  • parameter cls

    Start token and ID (typically ("<s>", 0)).

  • parameter trim_offsets

    Adjust offsets for byte-level tokenization (default: true).

  • parameter add_prefix_space

    Whether prefix space handling is enabled (default: true).
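A construction sketch with the defaults written out explicitly:

  let roberta_processor =
    Saga_tokenizers.Processors.roberta
      ~sep:("</s>", 2) ~cls:("<s>", 0)
      ~trim_offsets:true ~add_prefix_space:true ()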

val byte_level : ?trim_offsets:bool -> unit -> t

byte_level ?trim_offsets () creates byte-level post-processor.

Adjusts character offsets to account for byte-level encoding transformations.

  • parameter trim_offsets

    Remove leading/trailing spaces from offsets (default: true).
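A brief sketch; the only knob is offset trimming:

  let bl = Saga_tokenizers.Processors.byte_level ()                          (* trim_offsets defaults to true *)
  let bl_keep = Saga_tokenizers.Processors.byte_level ~trim_offsets:false ()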

val template : single:string -> ?pair:string -> ?special_tokens:(string * int) list -> unit -> t

template ~single ?pair ?special_tokens () creates template-based post-processor.

Flexible processor using templates to define special token placement. Templates use placeholders:

  • $A: First sequence
  • $B: Second sequence (for pairs)
  • Special tokens by name (e.g., "CLS")

  • parameter single

    Template for single sequences (e.g., "CLS $A SEP").

  • parameter pair

    Template for sequence pairs (e.g., "CLS $A SEP $B SEP"). If not provided, pairs are rejected.

  • parameter special_tokens

    List of (token_string, token_id) pairs for special tokens used in templates.
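For instance, the BERT-style layout from above can be expressed as templates (a sketch; token names in the templates are resolved through ~special_tokens):

  let templated =
    Saga_tokenizers.Processors.template
      ~single:"CLS $A SEP"
      ~pair:"CLS $A SEP $B SEP"
      ~special_tokens:[ ("CLS", 101); ("SEP", 102) ]
      ()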

val sequence : t list -> t

sequence processors chains multiple post-processors.

Applies processors left-to-right. Each processor modifies the encoding before passing it to the next. Useful for combining transformations.
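A sketch that chains byte-level offset fixing with template formatting:

  let chained =
    Saga_tokenizers.Processors.(
      sequence
        [ byte_level ~trim_offsets:true ();
          template ~single:"CLS $A SEP"
            ~special_tokens:[ ("CLS", 101); ("SEP", 102) ] () ])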

Operations

val process : t -> encoding list -> add_special_tokens:bool -> encoding list

process processor encodings ~add_special_tokens applies post-processing.

Adds special tokens, sets type IDs, and updates masks according to processor configuration.

  • parameter processor

    Post-processor to apply.

  • parameter encodings

    List of encodings (typically one for single sequence, two for pairs).

  • parameter add_special_tokens

    Whether to add special tokens; pass false to return encodings without them.

  • returns

    Processed encodings with special tokens and updated fields.
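A sketch of processing a sentence pair, with a small helper to build bare encodings (in practice these come from the core tokenizer; the token strings and IDs are invented):

  module P = Saga_tokenizers.Processors

  let mk tokens ids : P.encoding =
    let n = Array.length ids in
    { P.ids = ids; type_ids = Array.make n 0; tokens;
      offsets = Array.make n (0, 0);
      special_tokens_mask = Array.make n 0;
      attention_mask = Array.make n 1;
      overflowing = []; sequence_ranges = [] }

  let () =
    let processor = P.bert ~sep:("SEP", 102) ~cls:("CLS", 101) () in
    let a = mk [| "how"; "are"; "you" |] [| 2129; 2024; 2017 |] in
    let b = mk [| "fine"; "thanks" |] [| 2986; 4283 |] in
    (* Pass both halves of the pair; disabling add_special_tokens skips CLS/SEP. *)
    let with_specials = P.process processor [ a; b ] ~add_special_tokens:true in
    let without = P.process processor [ a; b ] ~add_special_tokens:false in
    Printf.printf "with: %d encoding(s), without: %d encoding(s)\n"
      (List.length with_specials) (List.length without)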

val added_tokens : t -> is_pair:bool -> int

added_tokens processor ~is_pair counts special tokens added by processor.

Useful for calculating maximum sequence length before truncation.

  • parameter processor

    Post-processor to query.

  • parameter is_pair

    Whether processing a pair (affects token count).

  • returns

    Number of special tokens that will be added.
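For example, to budget for special tokens before truncating (a sketch, assuming a 512-token model limit):

  let () =
    let p = Saga_tokenizers.Processors.bert ~sep:("SEP", 102) ~cls:("CLS", 101) () in
    let model_max = 512 in
    let budget ~is_pair =
      model_max - Saga_tokenizers.Processors.added_tokens p ~is_pair
    in
    Printf.printf "room for content tokens: single=%d pair=%d\n"
      (budget ~is_pair:false) (budget ~is_pair:true)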

Serialization

val to_json : t -> Yojson.Basic.t

to_json processor serializes processor to HuggingFace JSON format.

val of_json : Yojson.Basic.t -> t

of_json json deserializes processor from HuggingFace JSON format.
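A round-trip sketch (the exact JSON layout follows the HuggingFace tokenizers format):

  let () =
    let p = Saga_tokenizers.Processors.bert ~sep:("SEP", 102) ~cls:("CLS", 101) () in
    let json = Saga_tokenizers.Processors.to_json p in
    print_endline (Yojson.Basic.pretty_to_string json);
    let _restored : Saga_tokenizers.Processors.t =
      Saga_tokenizers.Processors.of_json json
    in
    ()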