Module Saga_tokenizers.Decoders

Decoding tokens back to text.

Decoders reverse the tokenization process, converting token strings back into natural text. They handle removing special markers (prefixes, suffixes), reversing byte-level encoding, normalizing whitespace, and other post-processing needed to reconstruct readable text.

Decoders operate on token strings (not IDs). The typical flow is:

  1. Convert IDs to token strings via the vocabulary.
  2. Apply the decoder to the token string list.
  3. The result is the final decoded text.

Multiple decoders can be chained with sequence to compose transformations.
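As a minimal sketch of this flow (assuming the token strings below were already looked up from IDs; the exact output depends on the tokenizer that produced them):

  open Saga_tokenizers

  (* Token strings, already converted from IDs via the vocabulary. *)
  let tokens = ["play"; "##ing"; "together"]

  (* Chain a WordPiece decoder with a left-strip of any leading space. *)
  let decoder =
    Decoders.sequence [ Decoders.wordpiece (); Decoders.strip ~left:true () ]

  (* Expected to read roughly "playing together". *)
  let text = Decoders.decode decoder tokens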

type t

Decoder that transforms token strings into natural text.

Decoders are composable and can be chained.

Decoder Types

val bpe : ?suffix:string -> unit -> t

bpe ?suffix () creates a BPE decoder.

Removes end-of-word suffixes added during tokenization.

  • parameter suffix

    Suffix to strip from tokens (default: empty string).
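For illustration, a hedged sketch (the "</w>" suffix is just an assumed example of an end-of-word marker):

  let d = Saga_tokenizers.Decoders.bpe ~suffix:"</w>" ()
  (* Suffixes are stripped and word boundaries restored; expected to read
     roughly "hello world". *)
  let s = Saga_tokenizers.Decoders.decode d ["hel"; "lo</w>"; "wor"; "ld</w>"]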

val byte_level : unit -> t

byte_level () creates a decoder for byte-level tokenization.

Reverses byte-to-Unicode encoding used by GPT-2 style tokenizers. Converts special byte representations back to original characters.
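A hedged sketch (GPT-2 style tokenizers commonly encode a leading space as "Ġ"; the exact byte-to-character mapping is tokenizer-specific):

  let d = Saga_tokenizers.Decoders.byte_level ()
  (* Expected to read roughly "Hello world". *)
  let s = Saga_tokenizers.Decoders.decode d ["Hello"; "Ġworld"]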

val byte_fallback : unit -> t

byte_fallback () creates a decoder for byte-fallback encoding.

Converts byte tokens (e.g., "<0x41>") back to characters.
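A hedged sketch (0x48 and 0x69 are the UTF-8 bytes for 'H' and 'i'):

  let d = Saga_tokenizers.Decoders.byte_fallback ()
  (* Byte tokens are decoded back to characters; expected roughly "Hi",
     though the exact joining behavior may vary. *)
  let s = Saga_tokenizers.Decoders.decode d ["<0x48>"; "<0x69>"]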

val wordpiece : ?prefix:string -> ?cleanup:bool -> unit -> t

wordpiece ?prefix ?cleanup () creates a WordPiece decoder.

Removes continuing subword prefixes and merges tokens into words.

  • parameter prefix

    Prefix to remove from non-initial subwords (default: "##").

  • parameter cleanup

    Normalize whitespace and remove artifacts (default: true).
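A minimal sketch using the default "##" prefix:

  let d = Saga_tokenizers.Decoders.wordpiece ()
  (* "##" prefixes are removed and subwords merged; expected "unbreakable". *)
  let s = Saga_tokenizers.Decoders.decode d ["un"; "##break"; "##able"]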

val metaspace : ?replacement:char -> ?add_prefix_space:bool -> unit -> t

metaspace ?replacement ?add_prefix_space () creates a metaspace decoder.

Converts metaspace markers back to regular spaces.

  • parameter replacement

    Metaspace character used during tokenization (default: '▁').

  • parameter add_prefix_space

    Whether the tokenizer added a prefix space, which affects decoding (default: true).
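A hedged sketch (the "▁" marker is U+2581, the default metaspace character):

  let d = Saga_tokenizers.Decoders.metaspace ()
  (* Metaspace markers turn back into spaces; with the default prefix
     handling this should read roughly "Hello world". *)
  let s = Saga_tokenizers.Decoders.decode d ["▁Hello"; "▁world"]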

val ctc : ?pad_token:string -> ?word_delimiter_token:string -> ?cleanup:bool -> unit -> t

ctc ?pad_token ?word_delimiter_token ?cleanup () creates a CTC decoder for speech recognition models.

Removes CTC blank tokens and formats word boundaries.

  • parameter pad_token

    Padding token to remove (default: "<pad>").

  • parameter word_delimiter_token

    Word boundary marker (default: "|").

  • parameter cleanup

    Remove extra whitespace and artifacts (default: true).
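A hedged sketch of decoding a CTC output sequence with the default pad and delimiter tokens:

  let d = Saga_tokenizers.Decoders.ctc ()
  (* Pad tokens are dropped and "|" marks a word boundary; expected to
     read roughly "hello". *)
  let s =
    Saga_tokenizers.Decoders.decode d
      ["<pad>"; "h"; "e"; "l"; "<pad>"; "l"; "o"; "|"]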

val sequence : t list -> t

sequence decoders chains multiple decoders.

Applies decoders left-to-right: the output of each decoder feeds into the next. Useful for combining transformations (e.g., byte-level + wordpiece + whitespace cleanup).
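For example, a hedged sketch combining WordPiece merging with a literal replacement pass:

  let d =
    Saga_tokenizers.Decoders.(
      sequence [ wordpiece (); replace ~pattern:"  " ~content:" " () ])
  (* WordPiece runs first; the replace pass then collapses any double
     spaces left in the decoded text. *)
  let s = Saga_tokenizers.Decoders.decode d ["token"; "##ized"; "text"]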

val replace : pattern:string -> content:string -> unit -> t

replace ~pattern ~content () creates a pattern-replacement decoder.

Replaces all occurrences of pattern with content in decoded text. Uses literal string matching (not regex).

  • parameter pattern

    String to find.

  • parameter content

    Replacement string.
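A minimal sketch replacing underscores with spaces:

  let d = Saga_tokenizers.Decoders.replace ~pattern:"_" ~content:" " ()
  (* Literal matching, not regex: every "_" becomes a space, giving
     "hello world". *)
  let s = Saga_tokenizers.Decoders.decode d ["hello_world"]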

val strip : ?left:bool -> ?right:bool -> ?content:char -> unit -> t

strip ?left ?right ?content () creates a character-stripping decoder.

Removes specified characters from text edges.

  • parameter left

    Strip from start of text (default: false).

  • parameter right

    Strip from end of text (default: false).

  • parameter content

    Character to strip (default: space ' ').
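A hedged sketch stripping spaces from both edges:

  let d = Saga_tokenizers.Decoders.strip ~left:true ~right:true ()
  (* Leading and trailing spaces are removed; expected roughly "padded". *)
  let s = Saga_tokenizers.Decoders.decode d ["  padded  "]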

val fuse : unit -> t

fuse () creates a decoder that merges all tokens without delimiters.

Concatenates token strings with no spaces. Useful when tokens already contain appropriate spacing.
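A minimal sketch where the tokens already carry their own spacing:

  let d = Saga_tokenizers.Decoders.fuse ()
  (* Tokens are concatenated with no delimiter, giving "Hello world". *)
  let s = Saga_tokenizers.Decoders.decode d ["Hel"; "lo"; " wor"; "ld"]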

Operations

val decode : t -> string list -> string

decode decoder tokens converts token strings to text.

Applies decoder transformations to reconstruct natural text from token list.

  • parameter decoder

    Decoder to apply.

  • parameter tokens

    List of token strings (not IDs).

  • returns

    Decoded text.

Serialization

val to_json : t -> Yojson.Basic.t

to_json decoder serializes decoder to HuggingFace JSON format.

val of_json : Yojson.Basic.t -> t

of_json json deserializes decoder from HuggingFace JSON format.
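For illustration, a hedged round-trip sketch (assuming Yojson is available and the decoder serializes losslessly):

  let d = Saga_tokenizers.Decoders.wordpiece ()
  (* The JSON follows the HuggingFace decoder schema and can be stored
     alongside a tokenizer config. *)
  let json = Saga_tokenizers.Decoders.to_json d
  (* d' should decode the same way as d. *)
  let d' = Saga_tokenizers.Decoders.of_json json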