Module Saga_tokenizers.Decoders
Decoding tokens back to text.
Decoders reverse the tokenization process, converting token strings back into natural text. They handle removing special markers (prefixes, suffixes), reversing byte-level encoding, normalizing whitespace, and other post-processing needed to reconstruct readable text.
Decoders operate on token strings (not IDs). The typical flow is:
1. Convert IDs to token strings via the vocabulary.
2. Apply the decoder to the token string list.
3. The result is the final decoded text.
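The three steps above can be sketched in plain OCaml. This is an illustration only: the toy vocabulary and the helper names tokens_of_ids and decode_tokens are hypothetical and are not part of this module.

```ocaml
(* Hypothetical toy vocabulary: the array index is the token ID. *)
let vocab = [| "Hello"; ","; " world"; "!" |]

(* Step 1: map each ID to its token string. *)
let tokens_of_ids ids = List.map (fun id -> vocab.(id)) ids

(* Step 2: a stand-in decoder that simply concatenates the tokens. *)
let decode_tokens tokens = String.concat "" tokens

(* Step 3: the result is the decoded text. *)
let text = decode_tokens (tokens_of_ids [ 0; 1; 2; 3 ])

let () = print_endline text (* prints "Hello, world!" *)
```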
Multiple decoders can be chained with sequence to compose transformations.
Decoder that transforms token strings into natural text.
Decoders are composable and can be chained.
Decoder Types
val bpe : ?suffix:string -> unit -> t
bpe ?suffix () creates BPE decoder.
Removes end-of-word suffixes added during tokenization.
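A minimal standalone sketch of suffix removal, not the library's implementation. It assumes the end-of-word marker is "</w>" and that a removed suffix becomes a space except on the final token; both are assumptions for illustration, and bpe_decode and strip_suffix are hypothetical names.

```ocaml
(* Return the token without [suffix], or None if it does not end with it. *)
let strip_suffix ~suffix token =
  if String.ends_with ~suffix token then
    Some (String.sub token 0 (String.length token - String.length suffix))
  else None

(* Assumed default: "</w>" marks the end of a word. *)
let bpe_decode ?(suffix = "</w>") tokens =
  let last = List.length tokens - 1 in
  tokens
  |> List.mapi (fun i token ->
         match strip_suffix ~suffix token with
         | Some word when i < last -> word ^ " " (* word ends: add a space *)
         | Some word -> word (* final token: no trailing space *)
         | None -> token (* mid-word piece: keep as is *))
  |> String.concat ""

let () = print_endline (bpe_decode [ "hel"; "lo</w>"; "wor"; "ld</w>" ])
(* prints "hello world" *)
```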
byte_level () creates decoder for byte-level tokenization.
Reverses byte-to-Unicode encoding used by GPT-2 style tokenizers. Converts special byte representations back to original characters.
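The reversal can be sketched as follows, assuming the GPT-2 byte-to-Unicode table: bytes 33-126, 161-172 and 174-255 map to the code point of the same value, and the remaining bytes map to 256, 257, ... in increasing byte order. This is a standalone illustration that needs OCaml 4.14 or later for String.get_utf_8_uchar; byte_level_decode is a hypothetical name, not this module's function.

```ocaml
(* Bytes that the GPT-2 table maps to the code point with the same value. *)
let is_direct b =
  (b >= 33 && b <= 126) || (b >= 161 && b <= 172) || (b >= 174 && b <= 255)

(* Reverse table: code point -> original byte. Bytes outside the direct
   ranges are assigned code points 256, 257, ... in increasing order. *)
let byte_of_code_point : (int, char) Hashtbl.t =
  let table = Hashtbl.create 256 in
  let next = ref 256 in
  for b = 0 to 255 do
    if is_direct b then Hashtbl.add table b (Char.chr b)
    else begin
      Hashtbl.add table !next (Char.chr b);
      incr next
    end
  done;
  table

(* Walk the UTF-8 code points of the joined tokens and emit the byte each
   one stands for. Code points outside the table are skipped in this sketch. *)
let byte_level_decode tokens =
  let input = String.concat "" tokens in
  let out = Buffer.create (String.length input) in
  let rec loop i =
    if i < String.length input then begin
      let d = String.get_utf_8_uchar input i in
      let cp = Uchar.to_int (Uchar.utf_decode_uchar d) in
      (match Hashtbl.find_opt byte_of_code_point cp with
       | Some byte -> Buffer.add_char out byte
       | None -> ());
      loop (i + Uchar.utf_decode_length d)
    end
  in
  loop 0;
  Buffer.contents out

(* "\xC4\xA0" is U+0120, the table's stand-in for the space byte. *)
let () = print_endline (byte_level_decode [ "Hello"; "\xC4\xA0world" ])
(* prints "Hello world" *)
```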
val byte_fallback : unit -> t
byte_fallback () creates decoder for byte fallback encoding.
Converts byte tokens (e.g., "<0x41>") back to characters.
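As an illustration, the conversion of "<0xHH>" tokens can be sketched in plain OCaml. The helper names are hypothetical, and returning a single string (rather than a token list) is a simplification of this sketch, not a statement about the library.

```ocaml
(* Parse a token of the exact form "<0xHH>" into the byte it encodes. *)
let byte_of_token token =
  if String.length token = 6
     && String.starts_with ~prefix:"<0x" token
     && token.[5] = '>'
  then Option.map Char.chr (int_of_string_opt ("0x" ^ String.sub token 3 2))
  else None

(* Byte tokens contribute one raw byte each; other tokens pass through. *)
let byte_fallback_decode tokens =
  let buf = Buffer.create 16 in
  List.iter
    (fun token ->
      match byte_of_token token with
      | Some byte -> Buffer.add_char buf byte
      | None -> Buffer.add_string buf token)
    tokens;
  Buffer.contents buf

let () = print_endline (byte_fallback_decode [ "<0x41>"; "<0x42>"; "c" ])
(* prints "ABc" *)
```

Consecutive byte tokens can form a multi-byte UTF-8 character: in this sketch, "<0xC3>" followed by "<0xA9>" yields the two bytes that encode "é".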
val wordpiece : ?prefix:string -> ?cleanup:bool -> unit -> t
wordpiece ?prefix ?cleanup () creates WordPiece decoder.
Removes continuing subword prefixes and merges tokens into words.
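A rough standalone sketch of the merging step, assuming the continuation prefix is "##" (the default is not stated above). The cleanup option is left out, and wordpiece_decode is a hypothetical name.

```ocaml
(* Assumed default: "##" marks a piece that continues the previous word. *)
let wordpiece_decode ?(prefix = "##") tokens =
  let plen = String.length prefix in
  tokens
  |> List.mapi (fun i token ->
         if String.starts_with ~prefix token then
           (* Continuation piece: drop the prefix, glue to the previous token. *)
           String.sub token plen (String.length token - plen)
         else if i = 0 then token
         else " " ^ token (* new word: separate with a space *))
  |> String.concat ""

let () =
  print_endline (wordpiece_decode [ "un"; "##believ"; "##able"; "story" ])
(* prints "unbelievable story" *)
```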
metaspace ?replacement ?add_prefix_space () creates metaspace decoder.
Converts metaspace markers back to regular spaces.
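A sketch of the idea, under several assumptions not confirmed above: the marker is U+2581 ("▁"), it is modeled as a Uchar.t, and add_prefix_space means one leading space is dropped from the result. It needs OCaml 4.14 or later, and metaspace_decode is a hypothetical standalone function.

```ocaml
(* Replace each marker code point with a space, copying everything else. *)
let metaspace_decode ?(replacement = Uchar.of_int 0x2581)
    ?(add_prefix_space = true) tokens =
  let input = String.concat "" tokens in
  let buf = Buffer.create (String.length input) in
  let rec loop i =
    if i < String.length input then begin
      let d = String.get_utf_8_uchar input i in
      let u = Uchar.utf_decode_uchar d in
      if Uchar.equal u replacement then Buffer.add_char buf ' '
      else Buffer.add_utf_8_uchar buf u;
      loop (i + Uchar.utf_decode_length d)
    end
  in
  loop 0;
  let text = Buffer.contents buf in
  (* Assumption: a prefix space added at encoding time is removed again. *)
  if add_prefix_space && String.starts_with ~prefix:" " text then
    String.sub text 1 (String.length text - 1)
  else text

(* "\xE2\x96\x81" is the UTF-8 encoding of U+2581. *)
let () =
  print_endline (metaspace_decode [ "\xE2\x96\x81Hello"; "\xE2\x96\x81world" ])
(* prints "Hello world" *)
```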
val ctc :
  ?pad_token:string ->
  ?word_delimiter_token:string ->
  ?cleanup:bool ->
  unit ->
  t
ctc ?pad_token ?word_delimiter_token ?cleanup () creates CTC decoder for speech recognition models.
Removes CTC blank tokens and formats word boundaries.
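For illustration, a standalone sketch of greedy CTC post-processing: collapse consecutive repeats, drop the blank token, and turn the word delimiter into a space. The defaults "<pad>" and "|" are assumptions, the cleanup option is reduced to trimming, and ctc_decode is a hypothetical name.

```ocaml
(* Drop consecutive duplicates: ["a"; "a"; "b"] becomes ["a"; "b"]. *)
let rec dedup = function
  | a :: (b :: _ as rest) when a = b -> dedup rest
  | a :: rest -> a :: dedup rest
  | [] -> []

(* Assumed defaults: "<pad>" is the blank token, "|" separates words. *)
let ctc_decode ?(pad_token = "<pad>") ?(word_delimiter_token = "|") tokens =
  dedup tokens
  |> List.filter (fun token -> token <> pad_token)
  |> List.map (fun token -> if token = word_delimiter_token then " " else token)
  |> String.concat ""
  |> String.trim

let () =
  print_endline
    (ctc_decode
       [ "h"; "h"; "<pad>"; "e"; "l"; "<pad>"; "l"; "o"; "|"; "|"; "w"; "o";
         "r"; "l"; "d" ])
(* prints "hello world" *)
```

The blank between the two "l" tokens is what keeps the double letter: duplicates are collapsed before blanks are removed.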
sequence decoders chains multiple decoders.
Applies decoders left-to-right. Output of each decoder feeds into next. Useful for combining transformations (e.g., byte-level + wordpiece + whitespace cleanup).
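Left-to-right chaining can be modeled as a fold over a list of steps. In this sketch a step is a function from token list to token list, which is an assumption about the internals; the step type and the two example steps are hypothetical.

```ocaml
(* In this sketch a decoding step maps a token list to a token list. *)
type step = string list -> string list

(* Apply the steps left to right, then join whatever tokens remain. *)
let decode_sequence (steps : step list) tokens =
  List.fold_left (fun acc step -> step acc) tokens steps |> String.concat ""

(* Hypothetical step 1: drop "##" prefixes, put a space before other tokens. *)
let strip_hashes : step =
  List.map (fun t ->
      if String.starts_with ~prefix:"##" t then
        String.sub t 2 (String.length t - 2)
      else " " ^ t)

(* Hypothetical step 2: trim whitespace around the first token only. *)
let trim_first : step = function
  | first :: rest -> String.trim first :: rest
  | [] -> []

let () =
  print_endline (decode_sequence [ strip_hashes; trim_first ] [ "un"; "##do"; "it" ])
(* prints "undo it" *)
```

Order matters: trim_first only has a leading space to remove because strip_hashes ran before it.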
val replace : pattern:string -> content:string -> unit -> t
replace ~pattern ~content () creates pattern replacement decoder.
Replaces all occurrences of pattern with content in decoded text. Uses literal string matching (not regex).
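To illustrate what literal matching means, here is a standalone helper (replace_all is a hypothetical name, not this module's function). The pattern "." matches only an actual dot, whereas a regex "." would match any character.

```ocaml
(* Replace every occurrence of [pattern] in [s] with [by], matching the
   pattern byte for byte with no regex interpretation. *)
let replace_all ~pattern ~by s =
  let plen = String.length pattern and n = String.length s in
  if plen = 0 then s
  else begin
    let buf = Buffer.create n in
    let rec loop i =
      if i >= n then ()
      else if i + plen <= n && String.sub s i plen = pattern then begin
        Buffer.add_string buf by;
        loop (i + plen)
      end
      else begin
        Buffer.add_char buf s.[i];
        loop (i + 1)
      end
    in
    loop 0;
    Buffer.contents buf
  end

let () = print_endline (replace_all ~pattern:"." ~by:"!" "a.b.c")
(* prints "a!b!c" *)
```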
val strip : ?left:bool -> ?right:bool -> ?content:char -> unit -> t
strip ?left ?right ?content () creates whitespace stripping decoder.
Removes specified characters from text edges.
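Edge stripping can be sketched as below. The defaults (both sides enabled, space as the character) and the choice to strip the whole decoded string rather than each token are assumptions of this sketch; strip here is a standalone function, not the module's constructor.

```ocaml
(* Remove runs of [content] from the left and/or right edge of [text]. *)
let strip ?(left = true) ?(right = true) ?(content = ' ') text =
  let n = String.length text in
  (* Index of the first character to keep. *)
  let rec first i =
    if left && i < n && text.[i] = content then first (i + 1) else i
  in
  (* Index one past the last character to keep. *)
  let rec last j =
    if right && j > 0 && text.[j - 1] = content then last (j - 1) else j
  in
  let i = first 0 in
  let j = max i (last n) in
  String.sub text i (j - i)

let () = print_endline (strip ~content:'_' "__a_b__")
(* prints "a_b" *)
```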
fuse () creates decoder that merges all tokens without delimiters.
Concatenates token strings with no spaces. Useful when tokens already contain appropriate spacing.
Operations
val decode : t -> string list -> string
decode decoder tokens converts token strings to text.
Applies decoder transformations to reconstruct natural text from token list.
Serialization
to_json decoder serializes decoder to HuggingFace JSON format.
of_json json deserializes decoder from HuggingFace JSON format.