Module Saga_tokenizers.Processors
Post-processing (adding CLS/SEP, setting type IDs, etc.).
Post-processing tokenization output with special tokens.
Post-processors add special tokens and formatting to tokenized sequences after the core tokenization step. They handle model-specific requirements like CLS and SEP tokens for BERT, sentence pair formatting, and type IDs.
Post-processing occurs after tokenization but before returning results to the user. The typical flow is:
1. Core tokenization produces token IDs and strings
2. Post-processor adds special tokens (e.g., CLS, SEP)
3. Post-processor sets type IDs and attention masks
4. Result is final encoding ready for model input
type encoding = {
  ids : int array;
  type_ids : int array;
  tokens : string array;
  offsets : (int * int) array;
  special_tokens_mask : int array;
  attention_mask : int array;
  overflowing : encoding list;
  sequence_ranges : (int * int * int) list;
}
Encoding representation for post-processing.
Contains all information needed for model input: token IDs, type IDs (segment IDs), token strings, character offsets, special token mask, attention mask, overflowing tokens (from truncation), and sequence range markers.
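To make the record concrete, here is a minimal sketch that builds an encoding value before any special tokens are added. The type is copied locally so the snippet stands alone; the helper of_tokens, the token IDs, and the offsets are hypothetical. The mask conventions (1 marks a special token, 1 marks an attended position) follow the usual HuggingFace convention and are an assumption here, as the page does not state them.

```ocaml
(* Local copy of the record documented above, so the example stands alone. *)
type encoding = {
  ids : int array;
  type_ids : int array;
  tokens : string array;
  offsets : (int * int) array;
  special_tokens_mask : int array;
  attention_mask : int array;
  overflowing : encoding list;
  sequence_ranges : (int * int * int) list;
}

(* Hypothetical helper: build an encoding from (token, id, offsets) triples.
   No special tokens yet, so special_tokens_mask is all 0 (assumed convention)
   and every position is attended (attention_mask all 1). *)
let of_tokens (toks : (string * int * (int * int)) list) : encoding =
  let n = List.length toks in
  {
    ids = Array.of_list (List.map (fun (_, id, _) -> id) toks);
    type_ids = Array.make n 0;
    tokens = Array.of_list (List.map (fun (s, _, _) -> s) toks);
    offsets = Array.of_list (List.map (fun (_, _, o) -> o) toks);
    special_tokens_mask = Array.make n 0;
    attention_mask = Array.make n 1;
    overflowing = [];
    sequence_ranges = [];
  }

(* Made-up IDs and character offsets for the text "hello world". *)
let hello_world = of_tokens [ ("hello", 7592, (0, 5)); ("world", 2088, (6, 11)) ]
```

All per-token arrays have the same length, which is the invariant a post-processor has to preserve when it inserts special tokens.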
Post-processor that adds special tokens and formatting to encodings.
Processor Types
bert ~sep ~cls () creates BERT-style post-processor.
Formats sequences as: [CLS] sequence [SEP]
Formats pairs as: [CLS] sequence_a [SEP] sequence_b [SEP]
Sets type IDs: 0 for first sequence (including CLS and first SEP), 1 for second sequence.
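The layout and type IDs described above can be sketched with plain lists. This is an illustration of the formatting rule only, not the library's implementation: the simple record and bert_format are hypothetical names, and assigning type ID 1 to the final [SEP] is an assumption borrowed from the usual BERT convention.

```ocaml
(* Simplified stand-in for an encoding: parallel lists of tokens and type IDs. *)
type simple = { tokens : string list; type_ids : int list }

(* Lay out one sequence, or a pair when [b] is [Some _], BERT-style. *)
let bert_format ~cls ~sep (a : string list) (b : string list option) : simple =
  (* First segment: [CLS] a [SEP], all with type ID 0. *)
  let first = (cls :: a) @ [ sep ] in
  let first_ids = List.map (fun _ -> 0) first in
  match b with
  | None -> { tokens = first; type_ids = first_ids }
  | Some b ->
      (* Second segment: b [SEP], all with type ID 1 (assumed for the final SEP). *)
      let second = b @ [ sep ] in
      {
        tokens = first @ second;
        type_ids = first_ids @ List.map (fun _ -> 1) second;
      }

let single = bert_format ~cls:"[CLS]" ~sep:"[SEP]" [ "hello"; "world" ] None

let pair =
  bert_format ~cls:"[CLS]" ~sep:"[SEP]" [ "how"; "are"; "you" ] (Some [ "fine" ])
```

For pair, the first five positions ([CLS], three tokens, [SEP]) carry type ID 0 and the last two carry type ID 1.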
val roberta :
sep:(string * int) ->
cls:(string * int) ->
?trim_offsets:bool ->
?add_prefix_space:bool ->
unit ->
t
roberta ~sep ~cls ?trim_offsets ?add_prefix_space () creates RoBERTa-style post-processor.
Similar to BERT but with different special token placement:
Formats sequences as: <s> sequence </s>
Formats pairs as: <s> sequence_a </s> </s> sequence_b </s>
byte_level ?trim_offsets () creates byte-level post-processor.
Adjusts character offsets to account for byte-level encoding transformations.
val template :
single:string ->
?pair:string ->
?special_tokens:(string * int) list ->
unit ->
t
template ~single ?pair ?special_tokens () creates template-based post-processor.
Flexible processor using templates to define special token placement. Templates use placeholders:
- $A: First sequence
- $B: Second sequence (for pairs)
- Special tokens by name (e.g., "CLS")
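As a rough illustration of how placeholder substitution could work, the sketch below expands a template over plain token lists. It assumes templates are whitespace-separated pieces and that any piece other than $A or $B is a special token name kept verbatim; the exact template syntax accepted by template is not spelled out on this page, and expand_template is a hypothetical helper, not part of the module.

```ocaml
(* Expand a whitespace-separated template. "$A" and "$B" are replaced by the
   corresponding sequences; every other piece is treated as a special token
   name and kept as-is. (Assumed syntax, for illustration only.) *)
let expand_template (template : string) ~(a : string list) ~(b : string list)
    : string list =
  String.split_on_char ' ' template
  |> List.filter (fun piece -> piece <> "")
  |> List.concat_map (function
       | "$A" -> a
       | "$B" -> b
       | special -> [ special ])

(* A BERT-like single-sequence template. *)
let bert_like = expand_template "[CLS] $A [SEP]" ~a:[ "hi" ] ~b:[]

(* A RoBERTa-like pair template with the doubled separator. *)
let roberta_like =
  expand_template "<s> $A </s> </s> $B </s>" ~a:[ "x" ] ~b:[ "y" ]
```

The same expansion function covers both layouts, which is the flexibility the template-based processor is meant to provide.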
sequence processors chains multiple post-processors.
Applies processors left-to-right. Each processor modifies the encoding before passing it to the next one. Useful for combining transformations.
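Left-to-right chaining amounts to a left fold over the list of processors. The sketch below models a processor as a function on token lists, which is a simplification: the real type t is abstract, and processor, add_cls, and lower are hypothetical names used only for illustration.

```ocaml
(* A processor is modelled here as a function from tokens to tokens. *)
type processor = string list -> string list

(* Apply processors left to right: the output of each one feeds the next. *)
let sequence (processors : processor list) : processor =
 fun tokens -> List.fold_left (fun acc p -> p acc) tokens processors

let add_cls : processor = fun toks -> "[CLS]" :: toks
let lower : processor = List.map String.lowercase_ascii

(* Order matters: lowercasing after adding [CLS] also lowercases the marker. *)
let cls_then_lower = sequence [ add_cls; lower ] [ "Hi" ]
let lower_then_cls = sequence [ lower; add_cls ] [ "Hi" ]
```

Here cls_then_lower yields a lowercased marker while lower_then_cls keeps it intact, so the position of each processor in the list is significant.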
Operations
process processor encodings ~add_special_tokens applies post-processing.
Adds special tokens, sets type IDs, and updates masks according to processor configuration.
added_tokens processor ~is_pair counts special tokens added by processor.
Useful for calculating maximum sequence length before truncation.
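A small sketch of that calculation, assuming a BERT-style layout: a single sequence gains two special tokens ([CLS] and [SEP]) and a pair gains three. The functions below are hypothetical stand-ins that mirror the counting idea; they do not call the library.

```ocaml
(* Special tokens added by a BERT-style layout:
   [CLS] a [SEP] adds 2, [CLS] a [SEP] b [SEP] adds 3. *)
let bert_added_tokens ~is_pair = if is_pair then 3 else 2

(* Room left for real tokens once special tokens are accounted for. *)
let token_budget ~max_length ~is_pair = max_length - bert_added_tokens ~is_pair

(* Keep at most [n] tokens from the front of the list. *)
let truncate n tokens = List.filteri (fun i _ -> i < n) tokens

(* With a model limit of 6 positions, a single sequence may keep 4 tokens. *)
let kept =
  truncate (token_budget ~max_length:6 ~is_pair:false)
    [ "a"; "b"; "c"; "d"; "e"; "f" ]
```

Truncating to the budget first guarantees that the encoding still fits the model limit after the special tokens are added.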
Serialization
to_json processor serializes processor to HuggingFace JSON format.
of_json json deserializes processor from HuggingFace JSON format.