package saga

Text processing and NLP extensions for Nx

Sources

raven-1.0.0.alpha1.tbz
sha256=8e277ed56615d388bc69c4333e43d1acd112b5f2d5d352e2453aef223ff59867
sha512=369eda6df6b84b08f92c8957954d107058fb8d3d8374082e074b56f3a139351b3ae6e3a99f2d4a4a2930dd950fd609593467e502368a13ad6217b571382da28c


Module Saga_tokenizers.Decoders

Decoders for converting tokens back into text. Note that decode takes the token strings (a string list), not integer token IDs.

type t

Main decoder type

Constructors

val bpe : ?suffix:string -> unit -> t

Create a BPE decoder.

  • parameter suffix

    Suffix to remove (default: "")
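To make the role of suffix concrete, here is a minimal standalone sketch that uses only the OCaml standard library. It is not Saga's implementation: bpe_decode_sketch and ends_with are hypothetical names, and the rule that an end-of-word suffix becomes a space (except on the last token) is an assumption borrowed from common BPE decoders.

```ocaml
(* Standalone sketch, NOT Saga's implementation.
   Assumption: a token ending in [suffix] closes a word, so the suffix is
   replaced by a space, except on the last token where it is just dropped. *)
let ends_with ~suffix s =
  let ls = String.length s and lx = String.length suffix in
  lx > 0 && ls >= lx && String.sub s (ls - lx) lx = suffix

let bpe_decode_sketch ?(suffix = "") (tokens : string list) : string =
  let last = List.length tokens - 1 in
  tokens
  |> List.mapi (fun i tok ->
         if ends_with ~suffix tok then
           let stem =
             String.sub tok 0 (String.length tok - String.length suffix)
           in
           if i = last then stem else stem ^ " "
         else tok)
  |> String.concat ""

let () =
  (* prints "hello world" *)
  print_endline
    (bpe_decode_sketch ~suffix:"</w>" [ "he"; "llo</w>"; "wor"; "ld</w>" ])
```

With the default empty suffix the sketch simply concatenates the tokens.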

val byte_level : unit -> t

Create a byte-level decoder

val byte_fallback : unit -> t

Create a byte fallback decoder
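The page does not spell out what byte fallback means. The sketch below illustrates the usual idea under stated assumptions: tokens written as <0xHH> stand for a single raw byte, and every other token is passed through unchanged. The token format, and the names hex_value, byte_of_token and byte_fallback_sketch, are assumptions for illustration, not Saga's API.

```ocaml
(* Standalone sketch, NOT Saga's implementation.
   Assumption: byte tokens have the exact form "<0xHH>" (two hex digits). *)
let hex_value c =
  match c with
  | '0' .. '9' -> Some (Char.code c - Char.code '0')
  | 'a' .. 'f' -> Some (Char.code c - Char.code 'a' + 10)
  | 'A' .. 'F' -> Some (Char.code c - Char.code 'A' + 10)
  | _ -> None

(* Return the byte encoded by [tok], or None if [tok] is an ordinary token. *)
let byte_of_token (tok : string) : char option =
  if String.length tok = 6 && String.sub tok 0 3 = "<0x" && tok.[5] = '>' then
    match (hex_value tok.[3], hex_value tok.[4]) with
    | Some hi, Some lo -> Some (Char.chr ((hi * 16) + lo))
    | _ -> None
  else None

let byte_fallback_sketch (tokens : string list) : string =
  tokens
  |> List.map (fun tok ->
         match byte_of_token tok with
         | Some c -> String.make 1 c
         | None -> tok)
  |> String.concat ""

let () =
  (* 0x21 is '!', so this prints "Hi!" *)
  print_endline (byte_fallback_sketch [ "Hi"; "<0x21>" ])
```

The sketch does not validate that the resulting bytes form well-formed UTF-8; a real decoder may handle invalid sequences differently.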

val wordpiece : ?prefix:string -> ?cleanup:bool -> unit -> t

Create a WordPiece decoder.

  • parameter prefix

    Prefix to remove (default: "##")

  • parameter cleanup

    Whether to clean up tokenization artifacts (default: true)
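As an illustration of what removing the prefix achieves, the following standalone sketch glues continuation pieces onto the preceding token and separates all other tokens with a space. It is a sketch of the general WordPiece convention, not Saga's code: wordpiece_decode_sketch and starts_with are hypothetical names, and the cleanup step is deliberately not modelled.

```ocaml
(* Standalone sketch, NOT Saga's implementation.
   Assumption: a token starting with [prefix] continues the previous word;
   any other token starts a new word. The cleanup pass is not modelled. *)
let starts_with ~prefix s =
  let lp = String.length prefix in
  lp > 0 && String.length s >= lp && String.sub s 0 lp = prefix

let wordpiece_decode_sketch ?(prefix = "##") (tokens : string list) : string =
  let lp = String.length prefix in
  let buf = Buffer.create 64 in
  List.iteri
    (fun i tok ->
      if starts_with ~prefix tok then
        (* Continuation piece: append without the prefix and without a space. *)
        Buffer.add_string buf (String.sub tok lp (String.length tok - lp))
      else begin
        (* New word: separate from the previous one with a single space. *)
        if i > 0 then Buffer.add_char buf ' ';
        Buffer.add_string buf tok
      end)
    tokens;
  Buffer.contents buf

let () =
  (* prints "unbelievable results" *)
  print_endline
    (wordpiece_decode_sketch [ "un"; "##believ"; "##able"; "results" ])
```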

val metaspace : ?replacement:char -> ?add_prefix_space:bool -> unit -> t

Create a Metaspace decoder.

  • parameter replacement

    Character that stands in for a space in the tokens (default: '▁')

  • parameter add_prefix_space

    Whether a prefix space was added during tokenization (default: true)
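The two parameters can be pictured with a small standalone sketch: concatenate the tokens, turn every marker back into a space, and drop the leading space that the prefix introduced. This is an illustration under assumptions, not Saga's implementation. The names replace_all and metaspace_decode_sketch are hypothetical, and the marker is passed as a string here because U+2581 occupies three bytes in UTF-8, whereas an OCaml char holds one byte.

```ocaml
(* Standalone sketch, NOT Saga's implementation. *)

(* Replace every occurrence of [marker] in [s] with [by]. *)
let replace_all ~marker ~by s =
  let lm = String.length marker and ls = String.length s in
  let buf = Buffer.create ls in
  let rec go i =
    if i >= ls then ()
    else if lm > 0 && i + lm <= ls && String.sub s i lm = marker then begin
      Buffer.add_string buf by;
      go (i + lm)
    end
    else begin
      Buffer.add_char buf s.[i];
      go (i + 1)
    end
  in
  go 0;
  Buffer.contents buf

(* Assumption: the marker is given as a UTF-8 string rather than a char. *)
let metaspace_decode_sketch ?(marker = "▁") ?(add_prefix_space = true)
    (tokens : string list) : string =
  let text = replace_all ~marker ~by:" " (String.concat "" tokens) in
  (* If a prefix space was added when encoding, remove it again. *)
  if add_prefix_space && String.length text > 0 && text.[0] = ' ' then
    String.sub text 1 (String.length text - 1)
  else text

let () =
  (* prints "Hello world" *)
  print_endline (metaspace_decode_sketch [ "▁Hello"; "▁wor"; "ld" ])
```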

val ctc : ?pad_token:string -> ?word_delimiter_token:string -> ?cleanup:bool -> unit -> t

Create a CTC decoder.

  • parameter pad_token

    Padding token (default: "<pad>")

  • parameter word_delimiter_token

    Word delimiter token (default: "|")

  • parameter cleanup

    Whether to clean up artifacts (default: true)
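For readers unfamiliar with CTC output, the sketch below shows the conventional post-processing that the parameters above suggest: collapse consecutive repeats, drop padding tokens, and turn the word delimiter into a space. It is a standalone illustration, not Saga's implementation. The names dedup and ctc_decode_sketch are hypothetical, and the final String.trim is an assumption standing in for whatever cleanup actually does.

```ocaml
(* Standalone sketch, NOT Saga's implementation. *)

(* Collapse runs of identical consecutive tokens into a single token. *)
let rec dedup = function
  | a :: (b :: _ as rest) -> if a = b then dedup rest else a :: dedup rest
  | l -> l

let ctc_decode_sketch ?(pad_token = "<pad>") ?(word_delimiter_token = "|")
    (tokens : string list) : string =
  dedup tokens
  (* Padding (blank) tokens only separate repeats; they carry no text. *)
  |> List.filter (fun tok -> tok <> pad_token)
  (* The word delimiter marks a word boundary. *)
  |> List.map (fun tok -> if tok = word_delimiter_token then " " else tok)
  |> String.concat ""
  (* Assumption: trimming stands in for the real cleanup step. *)
  |> String.trim

let () =
  (* prints "hello hi" *)
  print_endline
    (ctc_decode_sketch
       [ "h"; "h"; "<pad>"; "e"; "l"; "<pad>"; "l"; "o"; "|"; "|"; "h"; "i" ])
```

Repeats are collapsed before padding is removed, which is why the pad between the two "l" tokens preserves the double letter.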

val sequence : t list -> t

Combine multiple decoders in sequence
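One way to picture a sequence of decoders is as a pipeline of stages, each mapping a token list to a token list, with the text joined only at the end. The sketch below models that idea with plain functions. The stage type and every name in it are hypothetical, and whether Saga composes decoders this way internally is an assumption.

```ocaml
(* Standalone sketch, NOT Saga's implementation.
   Assumption: each stage rewrites the token list and passes it on. *)
type stage = string list -> string list

(* Apply the stages left to right. *)
let sequence_sketch (stages : stage list) : stage =
 fun tokens -> List.fold_left (fun acc stage -> stage acc) tokens stages

(* Three toy stages used only for this illustration. *)
let drop_pad : stage = List.filter (fun tok -> tok <> "<pad>")

let underscores_to_spaces : stage =
  List.map (String.map (fun c -> if c = '_' then ' ' else c))

let fuse_all : stage = fun tokens -> [ String.concat "" tokens ]

(* Run the pipeline, then join whatever tokens remain. *)
let decode_sketch (stages : stage list) (tokens : string list) : string =
  String.concat "" (sequence_sketch stages tokens)

let () =
  (* prints "Hello world" *)
  print_endline
    (decode_sketch
       [ drop_pad; underscores_to_spaces; fuse_all ]
       [ "Hello"; "<pad>"; "_wor"; "ld" ])
```

Order matters in such a pipeline: a stage that merges all tokens placed first would hide the individual "<pad>" token from a later filtering stage.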

val replace : pattern:string -> content:string -> unit -> t

Create a replace decoder.

  • parameter pattern

    Pattern to match

  • parameter content

    Replacement string

val strip : ?left:bool -> ?right:bool -> ?content:char -> unit -> t

Create a strip decoder.

  • parameter left

    Strip from left (default: false)

  • parameter right

    Strip from right (default: false)

  • parameter content

    Character to strip (default: ' ')

val fuse : unit -> t

Create a fuse decoder that merges tokens

Operations

val decode : t -> string list -> string

Decode a list of tokens back to text

Serialization

val to_json : t -> Yojson.Basic.t
val of_json : Yojson.Basic.t -> t