Module Saga_tokenizers.Decoders

Decoding tokens back to text.

Decoders reverse the tokenization process, converting token strings back into natural text. They handle removing special markers (prefixes, suffixes), reversing byte-level encoding, normalizing whitespace, and other post-processing needed to reconstruct readable text.

Decoders operate on token strings (not IDs). The typical flow is:

  1. Convert IDs to token strings via the vocabulary.
  2. Apply the decoder to the token string list.
  3. The result is the final decoded text.

Multiple decoders can be chained with sequence to compose transformations.
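As a minimal sketch of this flow (assuming the token strings below were already looked up from IDs; the exact output depends on the tokenizer that produced them):

  open Saga_tokenizers

  (* Token strings, already converted from IDs via the vocabulary. *)
  let tokens = ["play"; "##ing"; "together"]

  (* Chain a WordPiece decoder with a left-strip of any leading space. *)
  let decoder =
    Decoders.sequence [ Decoders.wordpiece (); Decoders.strip ~left:true () ]

  (* Expected to read roughly "playing together". *)
  let text = Decoders.decode decoder tokens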

type t

Decoder that transforms token strings into natural text.

Decoders are composable and can be chained.

Decoder Types

val bpe : ?suffix:string -> unit -> t

bpe ?suffix () creates a BPE decoder.

Removes end-of-word suffixes added during tokenization.

  • parameter suffix

    Suffix to strip from tokens (default: empty string).
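For illustration, a hedged sketch (the "</w>" suffix is just an assumed example of an end-of-word marker):

  let d = Saga_tokenizers.Decoders.bpe ~suffix:"</w>" ()
  (* Suffixes are stripped and word boundaries restored; expected to read
     roughly "hello world". *)
  let s = Saga_tokenizers.Decoders.decode d ["hel"; "lo</w>"; "wor"; "ld</w>"]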

val byte_level : unit -> t

byte_level () creates a decoder for byte-level tokenization.

Reverses byte-to-Unicode encoding used by GPT-2 style tokenizers. Converts special byte representations back to original characters.
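A hedged sketch (GPT-2 style tokenizers commonly encode a leading space as "Ġ"; the exact byte-to-character mapping is tokenizer-specific):

  let d = Saga_tokenizers.Decoders.byte_level ()
  (* Expected to read roughly "Hello world". *)
  let s = Saga_tokenizers.Decoders.decode d ["Hello"; "Ġworld"]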

val byte_fallback : unit -> t

byte_fallback () creates a decoder for byte-fallback encoding.

Converts byte tokens (e.g., "<0x41>") back to characters.
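A hedged sketch (0x48 and 0x69 are the UTF-8 bytes for 'H' and 'i'):

  let d = Saga_tokenizers.Decoders.byte_fallback ()
  (* Byte tokens are decoded back to characters; expected roughly "Hi",
     though the exact joining behavior may vary. *)
  let s = Saga_tokenizers.Decoders.decode d ["<0x48>"; "<0x69>"]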

val wordpiece : ?prefix:string -> ?cleanup:bool -> unit -> t

wordpiece ?prefix ?cleanup () creates a WordPiece decoder.

Removes continuing subword prefixes and merges tokens into words.

  • parameter prefix

    Prefix to remove from non-initial subwords (default: "##").

  • parameter cleanup

    Normalize whitespace and remove artifacts (default: true).
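A minimal sketch using the default "##" prefix:

  let d = Saga_tokenizers.Decoders.wordpiece ()
  (* "##" prefixes are removed and subwords merged; expected "unbreakable". *)
  let s = Saga_tokenizers.Decoders.decode d ["un"; "##break"; "##able"]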

val metaspace : ?replacement:char -> ?add_prefix_space:bool -> unit -> t

metaspace ?replacement ?add_prefix_space () creates a metaspace decoder.

Converts metaspace markers back to regular spaces.

  • parameter replacement

    Metaspace character used during tokenization (default: '▁').

  • parameter add_prefix_space

    Whether the tokenizer added a prefix space, which affects decoding (default: true).
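A hedged sketch (the "▁" marker is U+2581, the default metaspace character):

  let d = Saga_tokenizers.Decoders.metaspace ()
  (* Metaspace markers turn back into spaces; with the default prefix
     handling this should read roughly "Hello world". *)
  let s = Saga_tokenizers.Decoders.decode d ["▁Hello"; "▁world"]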

val ctc : ?pad_token:string -> ?word_delimiter_token:string -> ?cleanup:bool -> unit -> t

ctc ?pad_token ?word_delimiter_token ?cleanup () creates a CTC decoder for speech recognition models.

Removes CTC blank tokens and formats word boundaries.

  • parameter pad_token

    Padding token to remove (default: "<pad>").

  • parameter word_delimiter_token

    Word boundary marker (default: "|").

  • parameter cleanup

    Remove extra whitespace and artifacts (default: true).
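A hedged sketch of decoding a CTC output sequence with the default pad and delimiter tokens:

  let d = Saga_tokenizers.Decoders.ctc ()
  (* Pad tokens are dropped and "|" marks a word boundary; expected to
     read roughly "hello". *)
  let s =
    Saga_tokenizers.Decoders.decode d
      ["<pad>"; "h"; "e"; "l"; "<pad>"; "l"; "o"; "|"]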

val sequence : t list -> t

sequence decoders chains multiple decoders.

Applies decoders left-to-right: the output of each decoder feeds into the next. Useful for combining transformations (e.g., byte-level + wordpiece + whitespace cleanup).
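For example, a hedged sketch combining WordPiece merging with a literal replacement pass:

  let d =
    Saga_tokenizers.Decoders.(
      sequence [ wordpiece (); replace ~pattern:"  " ~content:" " () ])
  (* WordPiece runs first; the replace pass then collapses any double
     spaces left in the decoded text. *)
  let s = Saga_tokenizers.Decoders.decode d ["token"; "##ized"; "text"]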

val replace : pattern:string -> content:string -> unit -> t

replace ~pattern ~content () creates a pattern-replacement decoder.

Replaces all occurrences of pattern with content in decoded text. Uses literal string matching (not regex).

  • parameter pattern

    String to find.

  • parameter content

    Replacement string.
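A minimal sketch replacing underscores with spaces:

  let d = Saga_tokenizers.Decoders.replace ~pattern:"_" ~content:" " ()
  (* Literal matching, not regex: every "_" becomes a space, giving
     "hello world". *)
  let s = Saga_tokenizers.Decoders.decode d ["hello_world"]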

val strip : ?left:bool -> ?right:bool -> ?content:char -> unit -> t

strip ?left ?right ?content () creates a character-stripping decoder.

Removes specified characters from text edges.

  • parameter left

    Strip from start of text (default: false).

  • parameter right

    Strip from end of text (default: false).

  • parameter content

    Character to strip (default: space ' ').
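A hedged sketch stripping spaces from both edges:

  let d = Saga_tokenizers.Decoders.strip ~left:true ~right:true ()
  (* Leading and trailing spaces are removed; expected roughly "padded". *)
  let s = Saga_tokenizers.Decoders.decode d ["  padded  "]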

val fuse : unit -> t

fuse () creates a decoder that merges all tokens without delimiters.

Concatenates token strings with no spaces. Useful when tokens already contain appropriate spacing.
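A minimal sketch where the tokens already carry their own spacing:

  let d = Saga_tokenizers.Decoders.fuse ()
  (* Tokens are concatenated with no delimiter, giving "Hello world". *)
  let s = Saga_tokenizers.Decoders.decode d ["Hel"; "lo"; " wor"; "ld"]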

Operations

val decode : t -> string list -> string

decode decoder tokens converts token strings to text.

Applies decoder transformations to reconstruct natural text from token list.

  • parameter decoder

    Decoder to apply.

  • parameter tokens

    List of token strings (not IDs).

  • returns

    Decoded text.

Serialization

val to_json : t -> Yojson.Basic.t

to_json decoder serializes decoder to HuggingFace JSON format.

val of_json : Yojson.Basic.t -> t

of_json json deserializes decoder from HuggingFace JSON format.
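For illustration, a hedged round-trip sketch (assuming Yojson is available and the decoder serializes losslessly):

  let d = Saga_tokenizers.Decoders.wordpiece ()
  (* The JSON follows the HuggingFace decoder schema and can be stored
     alongside a tokenizer config. *)
  let json = Saga_tokenizers.Decoders.to_json d
  (* d' should decode the same way as d. *)
  let d' = Saga_tokenizers.Decoders.of_json json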