Post-processing (adding CLS/SEP, setting type IDs, etc.).
Post-processing tokenization output with special tokens.
Post-processors add special tokens and formatting to tokenized sequences after the core tokenization step. They handle model-specific requirements like CLS and SEP tokens for BERT, sentence pair formatting, and type IDs.
Post-processing occurs after tokenization but before returning results to the user. The typical flow is:
1. Core tokenization produces token IDs and strings.
2. The post-processor adds special tokens (e.g., CLS, SEP).
3. The post-processor sets type IDs and attention masks.
4. The result is the final encoding, ready for model input.
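The flow above can be sketched in a few lines of self-contained OCaml. This is only an illustration under stated assumptions: the piece record, core_tokenize, post_process, and every token ID are hypothetical stand-ins, not this library's API, and the whitespace split replaces a real tokenizer.

```ocaml
(* Hypothetical per-token record; not the library's actual encoding type. *)
type piece = { token : string; id : int; type_id : int; attention : int }

(* Step 1 stand-in: whitespace split plus a toy vocabulary lookup. *)
let core_tokenize (vocab : (string * int) list) (text : string)
    : (string * int) list =
  String.split_on_char ' ' text
  |> List.filter (fun w -> w <> "")
  |> List.map (fun w ->
         match List.assoc_opt w vocab with
         | Some id -> (w, id)
         | None -> ("[UNK]", 0))

(* Steps 2 and 3: wrap with special tokens, then attach type ID 0 and mask 1. *)
let post_process ~(cls : string * int) ~(sep : string * int) pieces
    : piece list =
  (cls :: pieces) @ [ sep ]
  |> List.map (fun (token, id) -> { token; id; type_id = 0; attention = 1 })

(* Step 4: the finished result for a single sequence (IDs are made up). *)
let example =
  core_tokenize [ ("hello", 10); ("world", 11) ] "hello world"
  |> post_process ~cls:("[CLS]", 1) ~sep:("[SEP]", 2)
```

In this sketch, example holds four pieces: [CLS], hello, world, [SEP], each with type ID 0 and attention mask 1.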
An encoding contains all the information needed for model input: token IDs, type IDs (segment IDs), token strings, character offsets, special token mask, attention mask, overflowing tokens (from truncation), and sequence range markers.
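One way to picture these fields is as a record of parallel per-token arrays. The sketch below is an assumption for illustration only: the field names, the array representation, and the reading of sequence range markers as "sequence index mapped to a token range" are hypothetical and may differ from the library's actual type.

```ocaml
(* Hypothetical record; field names are illustrative, not the library's API. *)
type encoding = {
  ids : int array;                  (* token IDs *)
  type_ids : int array;             (* segment IDs *)
  tokens : string array;            (* token strings *)
  offsets : (int * int) array;      (* span of each token in the input text *)
  special_tokens_mask : int array;  (* 1 for added special tokens, else 0 *)
  attention_mask : int array;       (* 1 for tokens the model should attend to *)
  overflowing : encoding list;      (* pieces cut off by truncation *)
  sequence_ranges : (int * (int * int)) list;
      (* assumed meaning: sequence index -> (first, last-exclusive) token *)
}

(* All per-token arrays should have the same length as [ids]. *)
let is_consistent (e : encoding) : bool =
  let n = Array.length e.ids in
  List.for_all
    (fun len -> len = n)
    [ Array.length e.type_ids; Array.length e.tokens; Array.length e.offsets;
      Array.length e.special_tokens_mask; Array.length e.attention_mask ]

(* A toy encoding of "hi" wrapped in start/end tokens (IDs are made up). *)
let example =
  {
    ids = [| 0; 42; 2 |];
    type_ids = [| 0; 0; 0 |];
    tokens = [| "<s>"; "hi"; "</s>" |];
    offsets = [| (0, 0); (0, 2); (0, 0) |];
    special_tokens_mask = [| 1; 0; 1 |];
    attention_mask = [| 1; 1; 1 |];
    overflowing = [];
    sequence_ranges = [ (0, (1, 2)) ];
  }
```

The is_consistent check captures the one invariant the sketch relies on: every per-token field is indexed by the same token positions.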
roberta ~sep ~cls ?trim_offsets ?add_prefix_space () creates a RoBERTa-style post-processor.
Similar to BERT but with different special token placement:
Formats sequences as: <s> sequence </s>
Formats pairs as: <s> sequence_a </s> </s> sequence_b </s>
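The two layouts can be written out as plain list operations. This is a minimal sketch of the placement rule only; format_single and format_pair are hypothetical helper names, and the IDs given to the ordinary tokens are made up.

```ocaml
(* A token is a (string, ID) pair, mirroring the ~cls and ~sep arguments. *)
type token = string * int

(* <s> sequence </s> *)
let format_single ~(cls : token) ~(sep : token) (seq : token list)
    : token list =
  (cls :: seq) @ [ sep ]

(* <s> sequence_a </s> </s> sequence_b </s> *)
let format_pair ~(cls : token) ~(sep : token) (a : token list)
    (b : token list) : token list =
  (cls :: a) @ [ sep; sep ] @ b @ [ sep ]

let cls = ("<s>", 0)
let sep = ("</s>", 2)

let single = format_single ~cls ~sep [ ("Hello", 100) ]
let pair = format_pair ~cls ~sep [ ("Hello", 100) ] [ ("world", 101) ]
```

Note the doubled separator between the two sequences of a pair, which is the visible difference from the BERT layout.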
parameter sep: Separator/end token and ID (typically ("</s>", 2)).
parameter cls: Start token and ID (typically ("<s>", 0)).
parameter trim_offsets: Adjust offsets for byte-level tokenization (default: true).
parameter add_prefix_space: Whether prefix space handling is enabled (default: true).
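To give a feel for what trim_offsets adjusts: byte-level tokens usually carry their leading space, so a token's raw span can include whitespace. The sketch below shows one plausible trimming rule, shrinking a span until it no longer starts or ends on whitespace. It is an assumption about the behavior, not the library's implementation; trim_offset is a hypothetical helper, and offsets are treated as byte indices into an ASCII string.

```ocaml
let is_space c = c = ' ' || c = '\t' || c = '\n' || c = '\r'

(* Shrink a (start, stop) span, stop exclusive, past surrounding whitespace.
   Assumes 0 <= start <= stop <= String.length text. *)
let trim_offset (text : string) ((start, stop) : int * int) : int * int =
  let rec skip_left i =
    if i < stop && is_space text.[i] then skip_left (i + 1) else i
  in
  let start' = skip_left start in
  let rec skip_right j =
    if j > start' && is_space text.[j - 1] then skip_right (j - 1) else j
  in
  (start', skip_right stop)

let text = "Hello world"

(* The second token covers " world", i.e. the raw span (5, 11). *)
let trimmed = trim_offset text (5, 11)
```

Here trimmed is (6, 11), which covers "world" without the leading space; a span with no surrounding whitespace, such as (0, 5), is returned unchanged.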