package saga

Text processing and NLP extensions for Nx



Module Saga_tokenizers.Bpe

Byte Pair Encoding (BPE) tokenization module

Core Types

type t

BPE model

type vocab = (string, int) Hashtbl.t

Vocabulary mapping tokens to indices

type merges = (string * string) list

List of merge operations

type config = {
  vocab : vocab;
  merges : merges;
  cache_capacity : int;
  dropout : float option;
  unk_token : string option;
  continuing_subword_prefix : string option;
  end_of_word_suffix : string option;
  fuse_unk : bool;
  byte_fallback : bool;
  ignore_merges : bool;
}

BPE configuration

Model Creation

val create : config -> t

create config creates a new BPE model with the given configuration
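
For illustration, a minimal sketch of building a config by hand and creating a model from it. The toy vocabulary and merges are illustrative only, not a trained tokenizer.

module Bpe = Saga_tokenizers.Bpe

(* Toy vocabulary mapping tokens to ids; a real vocabulary would come
   from training or from a vocab.json file. *)
let vocab : Bpe.vocab = Hashtbl.create 16

let () =
  List.iteri (fun i tok -> Hashtbl.add vocab tok i)
    [ "l"; "o"; "w"; "e"; "r"; "lo"; "low"; "er" ]

let config : Bpe.config =
  Bpe.
    {
      vocab;
      merges = [ ("l", "o"); ("lo", "w"); ("e", "r") ];
      cache_capacity = 10_000;
      dropout = None;
      unk_token = Some "<unk>";
      continuing_subword_prefix = None;
      end_of_word_suffix = None;
      fuse_unk = false;
      byte_fallback = false;
      ignore_merges = false;
    }

let model = Bpe.create config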

val from_files : vocab_file:string -> merges_file:string -> t

from_files ~vocab_file ~merges_file loads a BPE model from vocab.json and merges.txt files
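
A sketch of loading a pre-trained tokenizer distributed as vocab.json and merges.txt; the file paths below are placeholders.

module Bpe = Saga_tokenizers.Bpe

(* Point these at a real vocab.json / merges.txt pair. *)
let model =
  Bpe.from_files ~vocab_file:"path/to/vocab.json"
    ~merges_file:"path/to/merges.txt"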

val default : unit -> t

default () creates a BPE model with default configuration

Configuration Builder

module Builder : sig ... end

Tokenization

type token = {
  id : int;
  value : string;
  offsets : int * int;
}

Token with ID, string value, and character offsets

val tokenize : t -> string -> token list

tokenize model text tokenizes text into tokens
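
A sketch of tokenizing a string and inspecting the resulting token records (id, value, and character offsets). Any Bpe.t works here; default () is used only to keep the sketch standalone.

module Bpe = Saga_tokenizers.Bpe

let () =
  let model = Bpe.default () in
  let tokens = Bpe.tokenize model "lower" in
  List.iter
    (fun { Bpe.id; value; offsets = (start_, stop) } ->
      Printf.printf "%d\t%s\t(%d, %d)\n" id value start_ stop)
    tokens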

val token_to_id : t -> string -> int option

token_to_id model token returns the ID for a token

val id_to_token : t -> int -> string option

id_to_token model id returns the token for an ID
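
A round-trip sketch between tokens and ids; both lookups return options because a token or id may be absent from the vocabulary. The helper name describe is illustrative, and model is any Bpe.t built as in the sketches above.

module Bpe = Saga_tokenizers.Bpe

let describe model =
  match Bpe.token_to_id model "low" with
  | None -> print_endline {|"low" is not in the vocabulary|}
  | Some id ->
      let back =
        Option.value (Bpe.id_to_token model id) ~default:"<missing>"
      in
      Printf.printf "low -> %d -> %s\n" id back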

Vocabulary Management

val get_vocab : t -> (string * int) list

get_vocab model returns the vocabulary as a list of (token, id) pairs

val get_vocab_size : t -> int

get_vocab_size model returns the size of the vocabulary
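
A small sketch that prints the vocabulary sorted by id, followed by its size; the helper name dump_vocab is illustrative.

module Bpe = Saga_tokenizers.Bpe

let dump_vocab model =
  Bpe.get_vocab model
  |> List.sort (fun (_, a) (_, b) -> compare a b)
  |> List.iter (fun (tok, id) -> Printf.printf "%5d  %s\n" id tok);
  Printf.printf "vocab size: %d\n" (Bpe.get_vocab_size model)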

val get_unk_token : t -> string option

get_unk_token model returns the unknown token if configured

val get_continuing_subword_prefix : t -> string option

get_continuing_subword_prefix model returns the continuing subword prefix if configured

val get_end_of_word_suffix : t -> string option

get_end_of_word_suffix model returns the end-of-word suffix if configured

Cache Management

val clear_cache : t -> unit

clear_cache model clears the tokenization cache

val resize_cache : t -> int -> unit

resize_cache model capacity resizes the cache
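
A sketch of both cache operations; the two calls are independent and are shown together only for illustration, and the capacity of 1_000 is arbitrary.

module Bpe = Saga_tokenizers.Bpe

let trim_cache model =
  Bpe.resize_cache model 1_000;
  Bpe.clear_cache model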

Serialization

val save : t -> path:string -> ?name:string -> unit -> unit

save model ~path ?name () saves the model to vocab.json and merges.txt files
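
A sketch of saving a model; the output directory is a placeholder, and the assumption that ?name acts as a prefix for the written file names is not confirmed by this page.

module Bpe = Saga_tokenizers.Bpe

let save_model model =
  (* ~name is optional; omit it to use the default file names. *)
  Bpe.save model ~path:"./bpe-out" ~name:"my_bpe" ()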

val read_files : vocab_file:string -> merges_file:string -> vocab * merges

read_files ~vocab_file ~merges_file reads vocabulary and merges from files
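
A sketch that reads the raw vocabulary and merges and then builds a model with an explicit configuration (here enabling byte_fallback) instead of calling from_files directly; the file paths are placeholders.

module Bpe = Saga_tokenizers.Bpe

let model_with_byte_fallback =
  let vocab, merges =
    Bpe.read_files ~vocab_file:"path/to/vocab.json"
      ~merges_file:"path/to/merges.txt"
  in
  let config : Bpe.config =
    Bpe.
      {
        vocab;
        merges;
        cache_capacity = 10_000;
        dropout = None;
        unk_token = Some "<unk>";
        continuing_subword_prefix = None;
        end_of_word_suffix = None;
        fuse_unk = false;
        byte_fallback = true;
        ignore_merges = false;
      }
  in
  Bpe.create config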

Training

module Trainer : sig ... end