Module Saga_tokenizers.Decoders
Decoding tokens back to text.
Decoders reverse the tokenization process, converting token strings back into natural text. They handle removing special markers (prefixes, suffixes), reversing byte-level encoding, normalizing whitespace, and other post-processing needed to reconstruct readable text.
Decoders operate on token strings (not IDs). The typical flow is:
1. Convert IDs to token strings via the vocabulary.
2. Apply the decoder to the token string list.
3. The result is the final decoded text.
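A minimal sketch of this flow, assuming the decoder type is Decoders.t, that decode has type t -> string list -> string, and using a hypothetical id_to_token : int -> string function to stand in for the vocabulary lookup (it is not part of this module):

    module D = Saga_tokenizers.Decoders

    (* [id_to_token] stands in for a real vocabulary lookup. *)
    let decode_ids (id_to_token : int -> string) (decoder : D.t) ids =
      let tokens = List.map id_to_token ids in  (* 1. IDs -> token strings *)
      D.decode decoder tokens                   (* 2.-3. decode to final text *)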
Multiple decoders can be chained with sequence to compose transformations.
Decoder that transforms token strings into natural text.
Decoders are composable and can be chained.
Decoder Types
bpe ?suffix () creates a BPE decoder.
Removes end-of-word suffixes added during tokenization.
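For illustration, a hedged sketch that assumes ?suffix is a string and uses the common "</w>" end-of-word marker:

    module D = Saga_tokenizers.Decoders

    let bpe_dec = D.bpe ~suffix:"</w>" ()
    let _text = D.decode bpe_dec ["hel"; "lo</w>"; "wor"; "ld</w>"]
    (* expected roughly: "hello world" *)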
byte_level () creates a decoder for byte-level tokenization.
Reverses the byte-to-Unicode encoding used by GPT-2-style tokenizers, converting the special byte representations back to the original characters.
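A sketch of the intended behaviour ("Ġ" is the GPT-2 byte-level encoding of a leading space; the exact output depends on the vocabulary in use):

    module D = Saga_tokenizers.Decoders

    let bl = D.byte_level ()
    let _text = D.decode bl ["Hello"; "Ġworld"]
    (* expected roughly: "Hello world" *)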
byte_fallback () creates a decoder for byte fallback encoding.
Converts byte tokens (e.g., "<0x41>") back to characters.
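For example, decoding the byte tokens for 'A' and 'B' (a sketch; the non-byte token is assumed to pass through unchanged):

    module D = Saga_tokenizers.Decoders

    let bf = D.byte_fallback ()
    let _text = D.decode bf ["<0x41>"; "<0x42>"; "C"]
    (* 0x41 = 'A', 0x42 = 'B'; expected roughly: "ABC" *)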
wordpiece ?prefix ?cleanup () creates a WordPiece decoder.
Removes continuing subword prefixes and merges tokens into words.
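A hedged sketch assuming ?prefix is a string and ?cleanup a bool, with the conventional BERT-style "##" continuation prefix:

    module D = Saga_tokenizers.Decoders

    let wp = D.wordpiece ~prefix:"##" ~cleanup:true ()
    let _text = D.decode wp ["un"; "##break"; "##able"]
    (* expected roughly: "unbreakable" *)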
metaspace ?replacement ?add_prefix_space () creates a metaspace decoder.
Converts metaspace markers back to regular spaces.
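A sketch with defaults, assuming the default replacement is the SentencePiece-style "▁" (U+2581) marker:

    module D = Saga_tokenizers.Decoders

    let ms = D.metaspace ()
    let _text = D.decode ms ["▁Hello"; "▁world"]
    (* expected roughly: "Hello world" *)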
ctc ?pad_token ?word_delimiter_token ?cleanup () creates a CTC decoder for speech recognition models.
Removes CTC blank tokens and formats word boundaries.
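A hedged sketch assuming ?pad_token is a string; consecutive repeats collapse and the pad/blank token is dropped:

    module D = Saga_tokenizers.Decoders

    let ctc_dec = D.ctc ~pad_token:"<pad>" ()
    let _text = D.decode ctc_dec ["h"; "e"; "l"; "<pad>"; "l"; "o"; "o"]
    (* expected roughly: "hello" *)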
sequence decoders chains multiple decoders.
Applies decoders left to right; the output of each decoder feeds into the next. Useful for combining transformations (e.g., byte-level + WordPiece + whitespace cleanup).
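A sketch of a two-stage pipeline (byte-level decoding followed by a literal cleanup pass; the token content is illustrative only):

    module D = Saga_tokenizers.Decoders

    let pipeline =
      D.sequence [ D.byte_level (); D.replace ~pattern:"  " ~content:" " () ]
    let _text = D.decode pipeline ["Hello"; "Ġ"; "Ġworld"]
    (* expected roughly: "Hello world" *)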
replace ~pattern ~content () creates a pattern-replacement decoder.
Replaces all occurrences of pattern with content in decoded text. Uses literal string matching (not regex).
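For example, mapping a custom marker back to a space (a sketch, assuming ~pattern and ~content are plain strings):

    module D = Saga_tokenizers.Decoders

    let r = D.replace ~pattern:"_" ~content:" " ()
    let _text = D.decode r ["Hello_world"]
    (* expected roughly: "Hello world" *)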
strip ?left ?right ?content () creates a whitespace-stripping decoder.
Removes the specified characters from the edges of the text.
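A sketch using the defaults only, since the exact types of ?left, ?right and ?content are not shown here:

    module D = Saga_tokenizers.Decoders

    let st = D.strip ()
    let _text = D.decode st ["  padded text  "]
    (* strips the configured characters from the edges of the decoded text *)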
fuse () creates a decoder that merges all tokens without delimiters.
Concatenates token strings with no spaces. Useful when tokens already contain appropriate spacing.
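A small sketch of the expected behaviour:

    module D = Saga_tokenizers.Decoders

    let f = D.fuse ()
    let _text = D.decode f ["Hel"; "lo"; ","; " world"]
    (* expected roughly: "Hello, world" *)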
Operations
decode decoder tokens converts token strings to text.
Applies decoder transformations to reconstruct natural text from token list.
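A minimal call, assuming decode has type t -> string list -> string:

    module D = Saga_tokenizers.Decoders

    let text : string = D.decode (D.metaspace ()) ["▁Tokens"; "▁to"; "▁text"]
    (* expected roughly: "Tokens to text" *)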
Serialization
to_json decoder serializes decoder to HuggingFace JSON format.
of_json json deserializes decoder from HuggingFace JSON format.
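A round-trip sketch; the concrete JSON type is not shown here, so the example only threads the value from to_json into of_json:

    module D = Saga_tokenizers.Decoders

    let json = D.to_json (D.byte_level ())   (* HuggingFace-compatible representation *)
    let _reloaded = D.of_json json           (* reconstruct an equivalent decoder *)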