Kaun.Dataset

Efficient dataset handling for machine learning pipelines. This module provides composable dataset transformations.
All datasets are unified under the polymorphic 'a t type, with specializations via type aliases where helpful (e.g., for tensors). Text handling uses string t directly for better composability.
type 'a t

A dataset of elements of type 'a. Datasets are lazy, composable, and abstract. Use creation functions to build them and transformations to modify them.
Generalized dataset of tensors, parameterized over element, kind, and device
type element_spec =
  | Unknown
  | Scalar of string  (* e.g., "string" or "int" *)
  | Tensor of int array * string  (* shape * dtype *)
  | Tuple of element_spec list
  | Array of element_spec

Structured description of dataset element types, similar to TF's element_spec. Use for type-safe downstream processing.
val whitespace_tokenizer : tokenizer

Built-in whitespace tokenizer.
val from_array : 'a array -> 'a t

from_array arr creates a dataset from an in-memory array.
val from_list : 'a list -> 'a t

from_list lst creates a dataset from a list.
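A minimal usage sketch of these constructors (the element values are arbitrary):

let xs = from_array [| 1; 2; 3 |]    (* int t *)
let ys = from_list [ "a"; "b"; "c" ] (* string t *)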
from_tensor tensor creates a dataset where each element is a slice along the first dimension.
val from_tensors :
(('elt, 'kind) Rune.t * ('elt, 'kind) Rune.t) ->
(('elt, 'kind) Rune.t * ('elt, 'kind) Rune.t) t

from_tensors (x, y) creates a dataset of (input, target) pairs.
val from_file : (string -> 'a) -> string -> 'a t

from_file parser path creates a dataset from a file, parsing each line with parser.
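For example, parsing each line of a file of integers (the file name is illustrative):

let numbers = from_file int_of_string "numbers.txt" (* int t, one element per line *)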
val from_text_file :
?encoding:[ `UTF8 | `ASCII ] ->
?chunk_size:int ->
string ->
string t

from_text_file ?encoding ?chunk_size path creates a memory-mapped text dataset yielding lines as strings.

encoding: Text encoding (default: UTF8)
chunk_size: Size of chunks to read at once (default: 64KB)

The file is memory-mapped and read lazily in chunks.
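A minimal sketch (the path and chunk size are illustrative):

let lines = from_text_file ~chunk_size:(128 * 1024) "corpus.txt" (* string t of lines *)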
val from_text_files :
?encoding:[ `UTF8 | `ASCII ] ->
?chunk_size:int ->
string list ->
string t

from_text_files paths creates a dataset from multiple text files. Files are processed sequentially without loading them all into memory.
val from_jsonl : ?field:string -> string -> string t

from_jsonl ?field path reads a JSONL file where each line is a JSON object.

field: Extract text from this field (default: "text")

Example JSONL format:
{"text": "First document", "label": 0}
{"text": "Second document", "label": 1}
val sliding_window :
block_size:int ->
tokenize:(string -> int list) ->
string list ->
((float, Rune.float32_elt) Rune.t * (float, Rune.float32_elt) Rune.t) t

sliding_window ~block_size ~tokenize texts creates a dataset of sliding-window context/target pairs for language modeling.
Creates all possible sliding windows of size block_size from the input texts, where each window predicts the next token. Automatically handles padding with a special token.
Example:
let dataset =
sliding_window ~block_size:3
~tokenize:(fun s -> encode_chars ~vocab s)
[ "hello"; "world" ]
(* Generates windows like: "...h" -> "e" "..he" -> "l" ".hel" -> "l"
   "hell" -> "o" etc. *)
val from_csv :
?separator:char ->
?text_column:int ->
?label_column:int option ->
?has_header:bool ->
string ->
string t

from_csv ?separator ?text_column ?label_column ?has_header path reads CSV data.

separator: Field separator (default: ',')
text_column: Column index for text (default: 0)
label_column: Optional column index for labels
has_header: Skip first row if true (default: true)

from_text ~tokenizer path reads a text file and returns a dataset of token ID arrays. The file is read as a single document and tokenized. This is useful for language modeling tasks where you want the entire document as a sequence of tokens.
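As an illustration of from_csv above, reading a CSV of text and labels (the file name and column indices are illustrative):

let texts = from_csv ~separator:',' ~text_column:0 ~label_column:(Some 1) "reviews.csv"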
zip ds1 ds2 pairs corresponding elements. Stops at the shorter dataset.
interleave datasets alternates between datasets in round-robin fashion
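For instance, pairing inputs with labels (a sketch based on the zip description above; the values are arbitrary):

let pairs = zip (from_list [ "cat"; "dog" ]) (from_list [ 0; 1 ]) (* (string * int) t *)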
val tokenize :
tokenizer ->
?max_length:int ->
?padding:[ `None | `Max of int | `Dynamic ] ->
?truncation:bool ->
?add_special_tokens:bool ->
string t ->
int array t

tokenize tokenizer ?max_length ?padding ?truncation dataset tokenizes text data using the provided tokenizer.
max_length: Maximum sequence length
padding: Padding strategy
truncation: Whether to truncate long sequences
add_special_tokens: Add <bos>, <eos> tokens
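A usage sketch with the built-in tokenizer (the padding strategy and length are chosen for illustration):

let token_ids =
  from_list [ "hello world"; "kaun datasets" ]
  |> tokenize whitespace_tokenizer ~max_length:16 ~padding:(`Max 16)
(* int array t, one token-id array per input string *)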
val normalize :
?lowercase:bool ->
?remove_punctuation:bool ->
?collapse_whitespace:bool ->
string t ->
string t

normalize ?lowercase ?remove_punctuation ?collapse_whitespace dataset applies text normalization.
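For example, a quick normalization pass (a sketch; the expected output follows from the documented flags):

let clean =
  from_list [ "Hello,   World!" ]
  |> normalize ~lowercase:true ~remove_punctuation:true ~collapse_whitespace:true
(* expected to yield "hello world" *)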
val batch :
?drop_remainder:bool ->
int ->
((float, 'layout) Rune.t * (float, 'layout) Rune.t) t ->
((float, 'layout) Rune.t * (float, 'layout) Rune.t) t

batch ?drop_remainder size dataset groups tensor pairs into batches and automatically stacks them along the batch dimension.
drop_remainder: Drop final batch if incomplete (default: false)

This is the primary batching function for ML workflows where datasets contain (input, target) tensor pairs. The tensors are automatically stacked using Rune.stack ~axis:0.
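A sketch that batches the (context, target) pairs produced by sliding_window (the toy character tokenizer and sizes are illustrative):

let batched =
  sliding_window ~block_size:4
    ~tokenize:(fun s -> List.init (String.length s) (fun i -> Char.code s.[i]))
    [ "hello world" ]
  |> batch ~drop_remainder:true 8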
batch_map ?drop_remainder size f dataset groups elements into batches and applies function f to each batch.
This is useful for custom batching logic that can't be handled by batch or batch_array.
val bucket_by_length :
?boundaries:int list ->
?batch_sizes:int list ->
('a -> int) ->
'a t ->
'a array t

bucket_by_length ?boundaries ?batch_sizes length_fn dataset groups elements into buckets by length for efficient padding. Example:
bucket_by_length ~boundaries:[ 10; 20; 30 ] ~batch_sizes:[ 32; 16; 8; 4 ]
(fun text -> String.length text)
  dataset

Creates 4 buckets: <10, 10-20, 20-30, >30 with different batch sizes.
val shuffle : ?rng:Rune.Rng.key -> ?buffer_size:int -> 'a t -> 'a t

shuffle ?rng ?buffer_size dataset randomly shuffles elements.

rng: Random state for reproducibility (default: self-init)
buffer_size: Size of shuffle buffer (default: 10000)

Uses a buffer to shuffle without loading the entire dataset into memory.

val sample : ?rng:Rune.Rng.key -> ?replacement:bool -> int -> 'a t -> 'a t

sample ?rng ?replacement n dataset randomly samples n elements.
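A sketch combining the two (buffer size and sample count are arbitrary):

let subset =
  from_list [ 1; 2; 3; 4; 5; 6; 7; 8 ]
  |> shuffle ~buffer_size:100
  |> sample 3
(* int t containing 3 randomly chosen elements *)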
val weighted_sample :
?rng:Rune.Rng.key ->
weights:float array ->
int ->
'a t ->
'a t

weighted_sample ?rng ~weights n dataset samples n elements with the given weights.
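For example (a sketch; the weights are illustrative and assumed to correspond to the dataset's elements):

let biased =
  weighted_sample ~weights:[| 0.7; 0.2; 0.1 |] 5 (from_list [ "a"; "b"; "c" ])
(* samples 5 elements, favoring "a" *)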
repeat ?count dataset repeats the dataset, infinitely if count is not specified.
window ?shift ?stride ?drop_remainder size dataset creates sliding windows.
shift: How many elements to shift the window (default: size)
stride: Stride within window (default: 1)

Example: window ~shift:1 3 dataset creates overlapping windows of size 3.

cache ?directory dataset caches dataset elements.

directory: Directory for file cache; in-memory if not specified

prefetch ?buffer_size dataset pre-fetches elements in the background.

buffer_size: Number of elements to prefetch (default: 2)

Uses a separate thread to prepare the next elements while the current one is processed.

parallel_map ?num_workers f dataset applies f using multiple workers.

num_workers: Number of parallel workers (default: CPU count)

parallel_interleave ?num_workers ?block_length f dataset applies f in parallel and interleaves results.
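A sketch chaining these stages (the path, cache directory, worker count, and buffer size are illustrative):

let pipeline =
  from_text_file "corpus.txt"
  |> parallel_map ~num_workers:4 String.lowercase_ascii
  |> cache ~directory:"/tmp/kaun-cache"
  |> prefetch ~buffer_size:4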
val prepare :
?shuffle_buffer:int ->
?batch_size:int ->
?prefetch:int ->
?cache:bool ->
?drop_remainder:bool ->
((float, 'layout) Rune.t * (float, 'layout) Rune.t) t ->
((float, 'layout) Rune.t * (float, 'layout) Rune.t) t

prepare ?shuffle_buffer ?batch_size ?prefetch ?cache ?drop_remainder dataset applies the common preprocessing pipeline for tensor datasets:

1. Cache (if enabled)
2. Shuffle (if buffer size provided)
3. Batch with automatic tensor stacking (if batch size provided)
4. Prefetch (if prefetch count provided)
This is the primary pipeline function for ML training data.
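A sketch of an end-to-end training input pipeline built from sliding_window output (the toy character tokenizer and sizes are illustrative):

let train_batches =
  sliding_window ~block_size:8
    ~tokenize:(fun s -> List.init (String.length s) (fun i -> Char.code s.[i]))
    [ "the quick brown fox"; "jumps over the lazy dog" ]
  |> prepare ~shuffle_buffer:1024 ~batch_size:32 ~prefetch:2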
val iter : ('a -> unit) -> 'a t -> unit

iter f dataset applies f to each element for side effects.

val fold : ('acc -> 'a -> 'acc) -> 'acc -> 'a t -> 'acc

fold f init dataset folds over dataset elements.

val to_list : 'a t -> 'a list

to_list dataset materializes the dataset as a list. Warning: loads everything into memory.

val to_array : 'a t -> 'a array

to_array dataset materializes the dataset as an array. Warning: loads everything into memory.
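For instance, reducing with fold and materializing with to_list (a sketch; the values are arbitrary):

let total = fold ( + ) 0 (from_list [ 10; 20; 30 ])  (* 60 *)
let words = to_list (from_array [| "a"; "b" |])      (* ["a"; "b"] *)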
val cardinality : 'a t -> cardinality

cardinality dataset returns the cardinality (finite length, unknown, or infinite).

val element_spec : 'a t -> element_spec

element_spec dataset returns a structured description of element types.
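A sketch that renders an element_spec as a short string (purely illustrative; what a given dataset reports depends on how it was constructed):

let rec describe = function
  | Unknown -> "unknown"
  | Scalar kind -> "scalar " ^ kind
  | Tensor (shape, dtype) ->
      Printf.sprintf "tensor [%s] %s"
        (String.concat "; " (Array.to_list (Array.map string_of_int shape)))
        dtype
  | Tuple specs -> "(" ^ String.concat " * " (List.map describe specs) ^ ")"
  | Array spec -> describe spec ^ " array"

let () = print_endline (describe (element_spec (from_list [ "a"; "b" ])))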
val reset : 'a t -> unit

reset dataset resets the dataset to its initial state if supported. This makes it possible to iterate a dataset multiple times (e.g., across training epochs). If the dataset does not support reset, this is a no-op.
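For example, re-traversing a dataset across epochs (a sketch):

let ds = from_list [ 1; 2; 3 ]

let () =
  for _epoch = 1 to 2 do
    iter (Printf.printf "%d ") ds;
    reset ds (* rewind so the next epoch sees the data again *)
  done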
val text_classification_pipeline :
?tokenizer:tokenizer ->
?max_length:int ->
?batch_size:int ->
?shuffle_buffer:int ->
?num_workers:int ->
string t ->
(int32, Rune.int32_elt) Rune.t t

Pre-configured pipeline for text classification tasks. Returns batched token tensors ready for embedding layers.
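A usage sketch (the file name and hyperparameters are illustrative):

let batches =
  from_jsonl ~field:"text" "reviews.jsonl"
  |> text_classification_pipeline ~max_length:128 ~batch_size:32 ~shuffle_buffer:10_000
(* (int32, Rune.int32_elt) Rune.t t of batched token ids *)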
val language_model_pipeline :
?tokenizer:tokenizer ->
?sequence_length:int ->
?batch_size:int ->
?shuffle_buffer:int ->
?num_workers:int ->
string t ->
((int32, Rune.int32_elt) Rune.t * (int32, Rune.int32_elt) Rune.t) t

Pre-configured pipeline for language modeling. Returns batched (input, target) tensor pairs ready for training.
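And a corresponding sketch for language modeling (the file name and sizes are illustrative):

let lm_batches =
  from_text_file "corpus.txt"
  |> language_model_pipeline ~sequence_length:256 ~batch_size:16
(* ((int32, Rune.int32_elt) Rune.t * (int32, Rune.int32_elt) Rune.t) t of (input, target) pairs *)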
(* Load and process text data *)
let dataset =
from_text_file "data/corpus.txt"
|> tokenize whitespace_tokenizer ~max_length:512
|> shuffle ~buffer_size:10000
|> batch 32
|> prefetch ~buffer_size:2
(* Iterate through batches *)
dataset
|> iter (fun batch ->
let tensor = process_batch batch in
train_step model tensor)
(* Multi-file dataset with bucketing *)
let dataset =
from_text_files [ "shard1.txt"; "shard2.txt"; "shard3.txt" ]
|> normalize ~lowercase:true
|> tokenize whitespace_tokenizer
|> bucket_by_length ~boundaries:[ 100; 200; 300 ]
~batch_sizes:[ 64; 32; 16; 8 ] Array.length
|> prefetch
(* Parallel processing *)
let dataset =
from_jsonl "data.jsonl"
|> parallel_map ~num_workers:4 preprocess
|> cache ~directory:"/tmp/cache"
|> shuffle ~buffer_size:50000
|> batch 128
(* Custom tokenizer and tensor batching *)
let custom_tok = fun s -> (* ... *) [|1;2;3|] in
let tensor_ds =
from_text_file "texts.txt"
|> tokenize custom_tok
|> batch_map 32 (Rune.stack ~axis:0)