Module Saga_tokenizers.Encoding
Encoding representation (output of tokenization).
Encodings are the result of tokenization, containing all information needed for model input: token IDs, type IDs, token strings, character offsets, attention masks, and metadata for alignment and debugging.
This module provides both construction (for internal use by tokenizers) and access methods (for users extracting information from tokenized text).
type t
Encoding representing tokenized text.
Contains:
- Token IDs for model input
- Type IDs (segment IDs) for distinguishing sequences
- Token strings for debugging and display
- Character offsets for alignment with original text
- Special token mask identifying special tokens
- Attention mask for padding
- Overflowing tokens from truncation
- Sequence ranges for multi-sequence inputs
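For illustration, a minimal sketch of reading these fields; tokenizer and the input string are placeholders, with the encoding obtained via Saga_tokenizers.Tokenizer.encode as recommended below.
(* Sketch: inspect an encoding produced by a tokenizer *)
let encoding = Tokenizer.encode tokenizer "hello world" in
let ids = Encoding.get_ids encoding in
let tokens = Encoding.get_tokens encoding in
Printf.printf "%d tokens\n" (Encoding.length encoding);
Array.iteri (fun i tok -> Printf.printf "%d: %s -> id %d\n" i tok ids.(i)) tokens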
val create :
ids:int array ->
type_ids:int array ->
tokens:string array ->
words:int option array ->
offsets:(int * int) array ->
special_tokens_mask:int array ->
attention_mask:int array ->
overflowing:t list ->
sequence_ranges:(int, int * int) Hashtbl.t ->
t
create ~ids ~type_ids ~tokens ~words ~offsets ~special_tokens_mask ~attention_mask ~overflowing ~sequence_ranges constructs encoding.
For internal use by tokenizers. Most users should obtain encodings via Saga_tokenizers.Tokenizer.encode.
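As a sketch of the labelled arguments above, a hand-built two-token encoding; the values are illustrative only, and ordinary code should go through the tokenizer instead.
(* Sketch: build a tiny encoding by hand (testing/internal use only) *)
let enc =
  Encoding.create
    ~ids:[| 101; 2023 |]
    ~type_ids:[| 0; 0 |]
    ~tokens:[| "[CLS]"; "hello" |]
    ~words:[| None; Some 0 |]
    ~offsets:[| (0, 0); (0, 5) |]
    ~special_tokens_mask:[| 1; 0 |]
    ~attention_mask:[| 1; 1 |]
    ~overflowing:[]
    ~sequence_ranges:(Hashtbl.create 1)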
with_capacity capacity creates empty encoding with preallocated capacity.
For internal use during encoding construction.
from_tokens tokens ~type_id creates encoding from token list.
Useful for testing or simple cases. Sets all tokens to same type_id.
is_empty encoding checks whether the encoding is empty.
Returns true if the token array contains no tokens.
length encoding returns number of tokens.
Includes special tokens added by post-processing.
n_sequences encoding returns number of input sequences.
1 for single sequence, 2 for pairs.
set_sequence_id encoding id assigns sequence ID to all tokens.
For internal use when constructing encodings. Returns new encoding.
Accessors
get_ids encoding retrieves token IDs.
These are the primary model inputs.
get_type_ids encoding retrieves type IDs (segment IDs).
Used to distinguish sequences in models like BERT. Typically 0 for first sequence, 1 for second sequence.
set_type_ids encoding type_ids replaces type IDs.
Returns new encoding with updated type IDs. Array length must match token count.
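A small sketch, assuming an existing encoding enc: reset every type ID to 0 using set_type_ids together with length.
(* Sketch: assign segment ID 0 to all tokens of [enc] *)
let enc' = Encoding.set_type_ids enc (Array.make (Encoding.length enc) 0) in
assert (Array.for_all (fun id -> id = 0) (Encoding.get_type_ids enc'))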
get_tokens encoding retrieves token strings.
Useful for debugging and displaying tokenization results.
get_word_ids encoding retrieves word IDs.
Maps each token to its source word index in original text. None indicates special tokens. Useful for word-level alignment.
get_sequence_ids encoding retrieves sequence ID for each token.
0 for first sequence, 1 for second sequence (in pairs). None for special tokens not belonging to either sequence.
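A sketch of word-level alignment, assuming an encoding enc: pair each token with its source word index, treating None as a special token.
(* Sketch: print each token with the word it came from *)
let tokens = Encoding.get_tokens enc in
let word_ids = Encoding.get_word_ids enc in
Array.iteri
  (fun i tok ->
    match word_ids.(i) with
    | Some w -> Printf.printf "%s\tword %d\n" tok w
    | None -> Printf.printf "%s\tspecial token\n" tok)
  tokens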
get_offsets encoding retrieves character offsets.
Each tuple (start, end) indicates token's span in original text. Offsets are character-based (not byte-based).
get_special_tokens_mask encoding retrieves special token mask.
1 indicates special token (e.g., CLS, SEP), 0 indicates regular token. Useful for filtering special tokens in processing.
get_attention_mask encoding retrieves attention mask.
1 indicates real token, 0 indicates padding. Used in model attention mechanisms to ignore padding.
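A sketch combining the two masks, assuming an encoding enc: count non-padding tokens and keep only the regular (non-special) ones.
(* Sketch: real (non-padding) length and non-special tokens of [enc] *)
let attention = Encoding.get_attention_mask enc in
let special = Encoding.get_special_tokens_mask enc in
let real_length = Array.fold_left ( + ) 0 attention in
let regular =
  Encoding.get_tokens enc
  |> Array.to_list
  |> List.filteri (fun i _ -> special.(i) = 0)
in
Printf.printf "%d real tokens, %d of them regular\n" real_length (List.length regular)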
get_overflowing encoding retrieves overflowing tokens.
When truncation is enabled, tokens exceeding max length are stored here as separate encodings. Empty list if no truncation occurred.
set_overflowing encoding overflowing replaces overflowing encodings.
Returns new encoding with updated overflowing list.
take_overflowing encoding extracts and removes overflowing encodings.
Returns (encoding without overflowing, overflowing list). Useful for processing overflowing tokens separately.
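A sketch of handling truncation output, assuming a truncated encoding named truncated and a hypothetical process_window function that consumes one window of token IDs.
(* Sketch: handle the main window and each overflowing window separately *)
let main, overflow = Encoding.take_overflowing truncated in
process_window (Encoding.get_ids main);
List.iter (fun e -> process_window (Encoding.get_ids e)) overflow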
Alignment and Mapping
token_to_sequence encoding token_index finds sequence containing token.
Returns Some 0 for first sequence, Some 1 for second sequence, None for special tokens or out of bounds.
token_to_word encoding token_index finds word containing token.
Returns Some (sequence_id, word_index) or None for special tokens.
token_to_chars encoding token_index retrieves token's character span.
Returns Some (sequence_id, (start, end)) where offsets are relative to that sequence, or None for special tokens.
word_to_tokens encoding ~word ~sequence_id finds tokens for word.
Returns Some (start_token, end_token) (exclusive end), or None if word not found.
word_to_chars encoding ~word ~sequence_id finds character span for word.
Returns Some (start, end) or None if word not found.
char_to_token encoding ~pos ~sequence_id finds token at character position.
Returns Some token_index, or None if the position falls outside any token (e.g., in whitespace).
char_to_word encoding ~pos ~sequence_id finds word at character position.
Returns Some word_index, or None if the position falls outside any word.
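A sketch of round-tripping a character position through the functions above, assuming an encoding enc of a single sequence (sequence_id 0) and an arbitrary character index pos.
(* Sketch: which token covers character [pos], and what span does it have? *)
let pos = 7 in
match Encoding.char_to_token enc ~pos ~sequence_id:0 with
| None -> print_endline "position is not covered by any token"
| Some idx ->
  (match Encoding.token_to_chars enc idx with
   | Some (_seq, (start_, stop)) ->
     Printf.printf "token %d covers characters [%d, %d)\n" idx start_ stop
   | None -> print_endline "special token with no character span")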
Operations
truncate encoding ~max_length ~stride ~direction limits encoding length.
Tokens beyond max_length are moved to overflowing encodings.
(* Example: Process long document with overlapping windows *)
let encoding = Tokenizer.encode tokenizer long_text in
let truncated = Encoding.truncate encoding
~max_length:512 ~stride:128 ~direction:Right in
let main_tokens = Encoding.get_ids truncated in
let overflow_encodings = Encoding.get_overflowing truncated in
(* main_tokens: [0..511], first overflow: [384..895], etc. *)
merge encodings ~growing_offsets combines encodings into one.
Concatenates token arrays and adjusts metadata.
merge_with encoding other ~growing_offsets merges two encodings.
Similar to merge but for exactly two encodings. Returns new encoding.
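A sketch of joining two encodings, assuming enc_a and enc_b are single-sequence encodings and that growing_offsets is a boolean flag controlling whether offsets are shifted past the preceding encoding (an assumption, not confirmed by this page).
(* Sketch: concatenate two encodings; lengths simply add up *)
let pair = Encoding.merge_with enc_a enc_b ~growing_offsets:false in
assert (Encoding.length pair = Encoding.length enc_a + Encoding.length enc_b)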
val pad :
t ->
target_length:int ->
pad_id:int ->
pad_type_id:int ->
pad_token:string ->
direction:padding_direction ->
t
pad encoding ~target_length ~pad_id ~pad_type_id ~pad_token ~direction extends encoding to target length.
Adds padding tokens until length reaches target_length. Pads attention mask with zeros for padding positions.
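A sketch of batch padding, assuming a list of encodings named batch, illustrative pad_id and pad_token values, and the same Right direction constructor used in the truncate example above.
(* Sketch: pad every encoding in [batch] to the length of the longest one *)
let max_len = List.fold_left (fun m e -> max m (Encoding.length e)) 0 batch in
let padded =
  List.map
    (fun e ->
      Encoding.pad e ~target_length:max_len ~pad_id:0 ~pad_type_id:0
        ~pad_token:"[PAD]" ~direction:Right)
    batch
in
(* all encodings in [padded] now have length [max_len] *)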