package saga

Module Saga_models.Ngram

N-gram language models (unigram, bigram, trigram) for text generation.

Types

type t

An n-gram model

type vocab_stats = {
  vocab_size : int;
  total_tokens : int;
  unique_ngrams : int;
}

Statistics about the trained model

N-gram

type smoothing =
  | Add_k of float
  | Stupid_backoff of float
Smoothing strategies:

  • Add_k k: classic add-k (Laplace) smoothing
  • Stupid_backoff alpha: back off to lower orders scaled by alpha
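As a rough sketch of what these names usually denote (standard formulations; the module's exact normalization and back-off details are not documented here), for a token w following context c over a vocabulary of size V:

  P_add_k(w | c)   = (count(c, w) + k) / (count(c) + k * V)

  S_backoff(w | c) = count(c, w) / count(c)       if count(c, w) > 0
                   = alpha * S_backoff(w | c')    otherwise, where c' is c with
                                                  its oldest token dropped

Note that stupid backoff produces relative scores rather than a normalized probability distribution.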
val create : n:int -> ?smoothing:smoothing -> ?cache_capacity:int -> int array -> t

create ~n ?smoothing ?cache_capacity tokens builds an n-gram model of order n from the training token ids tokens, with configurable smoothing and an optional logits cache.
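A minimal construction sketch, assuming the module is reachable as Saga_models.Ngram and that tokens are integer ids produced by whatever tokenizer you use (the ids below are arbitrary):

  module Ngram = Saga_models.Ngram

  (* Toy stream of integer token ids; a real corpus would be far larger. *)
  let tokens = [| 0; 1; 2; 0; 1; 3; 0; 1; 2; 0; 1 |]

  (* Trigram model with add-k smoothing and a bounded logits cache. *)
  let model =
    Ngram.create ~n:3 ~smoothing:(Ngram.Add_k 0.5) ~cache_capacity:1024 tokens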

val logits : t -> context:int array -> float array

logits model ~context returns the next-token log probabilities given context. The context should contain the n-1 most recent token ids for an n-gram model.
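Continuing the sketch above (and assuming the returned array is indexed by token id, which is not stated explicitly here):

  (* For a trigram model (n = 3), the context is the two preceding token ids. *)
  let scores = Ngram.logits model ~context:[| 0; 1 |]
  let () = Printf.printf "log P(2 | 0 1) = %f\n" scores.(2)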

val perplexity : t -> int array -> float

perplexity model tokens computes the model's perplexity on the test tokens.
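For example, continuing the sketch above:

  (* Held-out evaluation; lower perplexity indicates a better fit. *)
  let test_tokens = [| 0; 1; 2; 0; 1; 3 |]
  let () = Printf.printf "perplexity: %f\n" (Ngram.perplexity model test_tokens)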

val log_prob : t -> int array -> float

log_prob model tokens returns the sum of log-probabilities of the observed tokens under the model.
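Continuing the sketch above; conventionally, perplexity is the exponential of the negative per-token average of this quantity (assuming natural logarithms, which the docs do not specify):

  let () = Printf.printf "log prob: %f\n" (Ngram.log_prob model test_tokens)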

val generate : t -> ?max_tokens:int -> ?temperature:float -> ?seed:int -> ?start:int array -> unit -> int array

generate model ?max_tokens ?temperature ?seed ?start () samples a sequence of tokens from the model; the optional arguments control the maximum length, sampling temperature, random seed, and an optional starting context.
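A sampling sketch continuing from the model above; the argument values are illustrative, and all four optional arguments can be omitted:

  let sampled =
    Ngram.generate model ~max_tokens:20 ~temperature:0.8 ~seed:42
      ~start:[| 0; 1 |] ()

  let () =
    Array.iter (Printf.printf "%d ") sampled;
    print_newline ()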

val stats : t -> vocab_stats

stats model returns statistics about the highest-order n-grams.
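For example:

  let () =
    let { Ngram.vocab_size; total_tokens; unique_ngrams } = Ngram.stats model in
    Printf.printf "vocab size: %d, training tokens: %d, distinct n-grams: %d\n"
      vocab_size total_tokens unique_ngrams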

val save : t -> string -> unit

save model filename serializes the model to a file.

val load : string -> t

load filename deserializes the model from a file.
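A round-trip sketch continuing from the model above; the filename and its extension are arbitrary:

  let () = Ngram.save model "tiny.ngram"
  let restored = Ngram.load "tiny.ngram"

  (* Sanity check: the restored model has the same n-gram order. *)
  let () = assert (Ngram.n restored = Ngram.n model)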

val save_text : t -> string -> unit

save_text model filename serializes the model to a text file.

val load_text : string -> t

load_text filename deserializes the model from a text file.
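The same round trip through the text format (filename again arbitrary); presumably the text form is easier to inspect than the binary one:

  let () = Ngram.save_text model "tiny.ngram.txt"
  let restored_from_text = Ngram.load_text "tiny.ngram.txt"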

val n : t -> int

n model returns the n-gram order of the model.