package talon

  1. Overview
  2. Docs

Module TalonSource

Talon - A dataframe library for OCaml

Talon provides efficient tabular data manipulation with heterogeneous column types, inspired by pandas and polars. Built on top of the Nx tensor library, it offers type-safe operations with comprehensive null handling.

Dataframes are immutable collections of named columns with equal length. Each column can contain different types of data with explicit null semantics:

  • Numeric tensors (via Nx): float32/float64, int32/int64
  • String option arrays with explicit null support
  • Boolean option arrays with explicit null support

Key Concepts

Null Handling

Talon provides first-class null semantics with explicit null masks for numeric columns, ensuring accurate tracking of missing values:

Null Representation

  • Numeric columns: Optional boolean mask tracks null entries explicitly. When no mask exists, sentinel values (NaN for floats, Int32.min_int/Int64.min_int for integers) indicate nulls. The mask takes precedence when present.
  • String/Boolean columns: None values represent nulls explicitly

Creating Nullable Columns

Use the _opt constructors to create columns with explicit null support:

  (* Nullable numeric columns *)
  Col.float64_opt
    [| Some 1.0; None; Some 3.0 |]
    Col.int32_opt
    [| Some 42l; None; Some 100l |]
    (* String/bool columns preserve None directly *)
    Col.string_opt
    [| Some "hello"; None; Some "world" |]
    Col.bool_opt
    [| Some true; None; Some false |]

Accessing Values with Null Awareness

Use option-based accessors to distinguish nulls from sentinel values:

  (* Row-wise option accessors *)
  Row.float64_opt "score" (* Returns None for nulls *) Row.int32_opt
    "count" (* Distinguishes None from Int32.min_int *)
    (* Extract as option arrays *)
    to_float64_options df "score" (* float option array *)

Null Propagation

  • Operations propagate nulls: null + x = null
  • Aggregations skip nulls by default (configurable with skipna parameter)
  • Mask-aware aggregations properly exclude masked entries from computations

Type Safety

The library maintains type information through GADTs and provides type-specific aggregation modules (Agg.Float, Agg.Int, etc.) that ensure operations are only applied to compatible column types.

Performance

Operations leverage vectorized Nx tensor computations where possible. Row-wise operations use an applicative interface that compiles to efficient loops. Use with_columns_map for computing multiple columns in a single pass.

Quick Start

  open Talon

  (* Create a dataframe from columns *)
  let df =
    create
      [
        ("name", Col.string_list [ "Alice"; "Bob"; "Charlie" ]);
        ("age", Col.int32_list [ 25l; 30l; 35l ]);
        ("score", Col.float64_list [ 85.5; 92.0; 78.5 ]);
        ("active", Col.bool_list [ true; false; true ]);
      ]

  (* Filter rows where age > 25 *)
  let adults =
    filter_by df Row.(map (int32 "age") ~f:(fun age -> age > 25l))

  (* Aggregations - explicit about expected types *)
  let total_score =
    Agg.Float.sum df "score" (* 256.0 - works on any numeric *)

  let avg_age = Agg.Int.mean df "age" (* 30.0 - returns float *)
  let max_name = Agg.String.max df "name" (* Some "Charlie" *)

  (* Column operations preserve dtype *)
  let cumulative =
    Agg.cumsum df "score" (* Returns packed_column with float32 *)

  let age_diff = Agg.diff df "age" () (* Returns packed_column with int32 *)

  (* Extract column as array for external processing *)
  let scores_array = to_float32_array df "score"

  (* Group by a computed key *)
  let by_category =
    group_by df
      Row.(
        map (float32 "score") ~f:(fun s ->
            if s >= 90.0 then "A" else if s >= 80.0 then "B" else "C"))
Sourcetype t

Type of dataframe.

Dataframes are immutable tabular data structures with named, typed columns. All columns in a dataframe have the same length (number of rows).

Implementation: Internally uses a list of (name, column) pairs for ordering and a hash table for O(1) column lookup by name.

Sourcetype 'a row

Type for row-wise computations.

This abstract type represents a computation that can be applied to each row of a dataframe to produce a value of type 'a. Row computations are lazy and only executed when the dataframe is processed.

Row computations form an applicative functor, allowing composition of independent computations from multiple columns.

Column Operations

Columns are the fundamental data containers in Talon dataframes. Each column stores homogeneous data with consistent null handling.

Sourcemodule Col : sig ... end

Column creation and manipulation for heterogeneous data types.

DataFrame Creation

Functions for creating dataframes from various data sources.

Sourceval empty : t

empty creates an empty dataframe with no rows or columns.

This is the neutral element for operations like concat. Useful as a starting point for building dataframes incrementally.

Example:

  let df = empty in
  let df' = add_column df "first" (Col.int32 [| 1l; 2l |]) in
  assert (shape df' = (2, 1))
Sourceval create : (string * Col.t) list -> t

create pairs creates a dataframe from (column_name, column) pairs.

This is the primary constructor for dataframes. Each pair specifies a column name and its data.

Invariants:

  • Column names must be unique (case-sensitive)
  • All columns must have exactly the same length
  • Numeric columns must be 1D tensors (checked automatically by Col module)
  • parameter pairs

    List of (column_name, column_data) pairs

  • raises Invalid_argument

    if duplicate column names exist, column lengths differ, or any column has invalid structure.

Example:

  let df =
    create
      [
        ("name", Col.string [| "Alice"; "Bob" |]);
        ("age", Col.int32 [| 25l; 30l |]);
        ("score", Col.float64 [| 85.5; 92.0 |]);
      ]
  in
  assert (shape df = (2, 3))
Sourceval of_tensors : ?names:string list -> ('a, 'b) Nx.t list -> t

of_tensors ?names tensors creates dataframe from 1D Nx tensors.

All tensors must have the same shape and dtype. This is efficient for creating dataframes from pre-computed tensor data.

  • parameter names

    Column names (default: "col0", "col1", etc.)

  • parameter tensors

    List of 1D tensors with identical shapes and dtypes

  • raises Invalid_argument

    if tensors have inconsistent shapes, any tensor is not 1D, names are not unique, or wrong number of names provided.

Example:

  let t1 = Nx.create Nx.float64 [| 3 |] [| 1.0; 2.0; 3.0 |] in
  let t2 = Nx.create Nx.float64 [| 3 |] [| 4.0; 5.0; 6.0 |] in
  let df = of_tensors [ t1; t2 ] ~names:[ "x"; "y" ] in
  assert (shape df = (3, 2))
Sourceval from_nx : ?names:string list -> ('a, 'b) Nx.t -> t

from_nx ?names tensor creates dataframe from 2D tensor.

Each column of the tensor becomes a dataframe column. This is useful for converting tensor data from machine learning operations back to tabular format.

  • parameter names

    Column names (default: "col0", "col1", etc.)

  • parameter tensor

    2D tensor where rows are observations, columns are variables

Example:

  let data =
    Nx.create Nx.float64 [| 2; 3 |] [| 1.0; 2.0; 3.0; 4.0; 5.0; 6.0 |]
  in
  let df = from_nx data ~names:[ "x"; "y"; "z" ] in
  (* Result: 2 rows × 3 columns dataframe *)
  assert (shape df = (2, 3))

Shape and Inspection

Functions for examining dataframe structure and metadata.

Sourceval shape : t -> int * int

shape df returns (num_rows, num_columns).

This is the fundamental size information for the dataframe.

Time complexity: O(1) for non-empty dataframes.

Sourceval num_rows : t -> int

num_rows df returns number of rows.

Equivalent to fst (shape df) but more convenient when you only need row count.

Time complexity: O(1) for non-empty dataframes.

Sourceval num_columns : t -> int

num_columns df returns number of columns.

Equivalent to snd (shape df) but more convenient when you only need column count.

Time complexity: O(1).

Sourceval column_names : t -> string list

column_names df returns column names in their current order.

The order matches the column order for operations like print and to_nx.

Time complexity: O(k) where k is the number of columns.

Sourceval column_types : t -> (string * [ `Float32 | `Float64 | `Int32 | `Int64 | `Bool | `String | `Other ]) list

column_types df returns column names with their detected types.

Type detection:

  • `Float32, `Float64, `Int32, `Int64: Numeric Nx tensor columns
  • `Bool: Boolean option array columns
  • `String: String option array columns
  • `Other: Any other Nx tensor types (e.g., uint8)

Useful for programmatic dataframe inspection and type-based operations.

Time complexity: O(k) where k is the number of columns.

Sourceval is_empty : t -> bool

is_empty df returns true if dataframe has no rows.

Note that a dataframe can have columns but zero rows, which is still considered empty by this function.

Time complexity: O(1).

Column Selection Utilities

Sourcemodule Cols : sig ... end

Column selection utilities for working with subsets of columns.

Column Access and Manipulation

Functions for working with individual columns within dataframes.

Sourceval get_column : t -> string -> Col.t option

get_column df name returns column data or None.

Returns the packed column if it exists, None otherwise. Use get_column_exn if you want an exception on missing columns.

Time complexity: O(1) - uses internal hash table lookup.

Sourceval get_column_exn : t -> string -> Col.t

get_column_exn df name returns packed column.

Use this when you know the column should exist and want to fail fast if it doesn't.

Time complexity: O(1) - uses internal hash table lookup.

Sourceval to_float32_array : t -> string -> float array option

to_float32_array df name extracts column as float array if it's float32.

Returns Some array if the column exists and is float32 type, None otherwise. Null values in the column become NaN in the array.

  • parameter name

    Column name to extract

Example:

  let df = create [("values", Col.float32 [|1.0; 2.0; Float.nan|])] in
  match to_float32_array df "values" with
  | Some arr -> (* arr = [|1.0; 2.0; nan|] *)
  | None -> (* column doesn't exist or wrong type *)
Sourceval to_float64_array : t -> string -> float array option

to_float64_array df name extracts column as float array if it's float64.

Returns Some array if the column exists and is float64 type, None otherwise. Null values become NaN in the array.

  • parameter name

    Column name to extract

Sourceval to_int32_array : t -> string -> int32 array option

to_int32_array df name extracts column as int32 array if it's int32.

Returns Some array if the column exists and is int32 type, None otherwise. Null values become Int32.min_int in the array.

  • parameter name

    Column name to extract

Sourceval to_int64_array : t -> string -> int64 array option

to_int64_array df name extracts column as int64 array if it's int64.

Returns Some array if the column exists and is int64 type, None otherwise. Null values become Int64.min_int in the array.

  • parameter name

    Column name to extract

Sourceval to_bool_array : t -> string -> bool array option

to_bool_array df name extracts column as bool array if it's bool.

Returns Some array if the column exists and is bool type, None otherwise. Null values become false in the array.

  • parameter name

    Column name to extract

Sourceval to_string_array : t -> string -> string array option

to_string_array df name extracts column as string array if it's string.

Returns Some array if the column exists and is string type, None otherwise. Null values become empty strings in the array.

  • parameter name

    Column name to extract

Sourceval to_float32_options : t -> string -> float option array option

to_float32_options df name extracts column as float option array.

Returns Some array if the column exists and is float32 type, None otherwise. Null values (NaN or masked) become None in the array.

  • parameter name

    Column name to extract

Sourceval to_float64_options : t -> string -> float option array option

to_float64_options df name extracts column as float option array.

Returns Some array if the column exists and is float64 type, None otherwise. Null values (NaN or masked) become None in the array.

  • parameter name

    Column name to extract

Sourceval to_int32_options : t -> string -> int32 option array option

to_int32_options df name extracts column as int32 option array.

Returns Some array if the column exists and is int32 type, None otherwise. Null values (Int32.min_int or masked) become None in the array.

  • parameter name

    Column name to extract

Sourceval to_int64_options : t -> string -> int64 option array option

to_int64_options df name extracts column as int64 option array.

Returns Some array if the column exists and is int64 type, None otherwise. Null values (Int64.min_int or masked) become None in the array.

  • parameter name

    Column name to extract

Sourceval to_bool_options : t -> string -> bool option array option

to_bool_options df name extracts column as bool option array.

Returns Some array if the column exists and is bool type, None otherwise. Null values are represented as None in the array.

  • parameter name

    Column name to extract

Sourceval to_string_options : t -> string -> string option array option

to_string_options df name extracts column as string option array.

Returns Some array if the column exists and is string type, None otherwise. Null values are represented as None in the array.

  • parameter name

    Column name to extract

Sourceval has_column : t -> string -> bool

has_column df name returns true if column exists.

Useful for conditional logic when working with dataframes of unknown structure.

Time complexity: O(1) - uses internal hash table lookup.

Sourceval add_column : t -> string -> Col.t -> t

add_column df name col adds or replaces a column.

If a column with the same name already exists, it is replaced. Otherwise, a new column is added to the dataframe.

  • parameter name

    Column name

  • parameter col

    Column data

Example:

  let df = create [("x", Col.int32 [|1l; 2l|])] in
  let df' = add_column df "y" (Col.float64 [|3.0; 4.0|]) in
  (* df' now has both "x" and "y" columns *)
Sourceval drop_column : t -> string -> t

drop_column df name removes a column.

Returns the dataframe unchanged if the column doesn't exist (no error). This makes it safe to use in pipelines where column existence is uncertain.

  • parameter name

    Column name to remove

Example:

  let df = create [("x", Col.int32 [|1l; 2l|]); ("y", Col.float64 [|3.0; 4.0|])] in
  let df' = drop_column df "y" in
  (* df' now has only "x" column *)
  let df'' = drop_column df' "nonexistent" in
  (* df'' is unchanged (no error) *)
Sourceval drop_columns : t -> string list -> t

drop_columns df names removes multiple columns.

Equivalent to applying drop_column for each name in the list. Non-existent columns are silently ignored.

  • parameter names

    List of column names to remove

Example:

  let df = create [("a", Col.int32 [|1l|]); ("b", Col.int32 [|2l|]); ("c", Col.int32 [|3l|])] in
  let df' = drop_columns df ["a"; "c"] in
  (* df' now has only "b" column *)
Sourceval rename_column : t -> old_name:string -> new_name:string -> t

rename_column df ~old_name ~new_name renames a column.

Changes the name of an existing column. The column data remains unchanged.

  • parameter old_name

    Current column name

  • parameter new_name

    Desired column name

Example:

  let df = create [("old_name", Col.int32 [|1l; 2l|])] in
  let df' = rename_column df ~old_name:"old_name" ~new_name:"new_name" in
  (* df' has column "new_name" instead of "old_name" *)
Sourceval select : t -> string list -> t

select df names returns dataframe with only specified columns.

The resulting dataframe has columns in the order specified by names. This allows both column filtering and reordering in one operation.

  • parameter names

    List of column names to include (in desired order)

  • raises Not_found

    if any column name doesn't exist.

Example:

  let df =
    create
      [
        ("a", Col.int32 [| 1l |]);
        ("b", Col.int32 [| 2l |]);
        ("c", Col.int32 [| 3l |]);
      ]
  in
  let df' = select df [ "c"; "a" ] in
  (* df' has columns "c" and "a" in that order *)
  assert (column_names df' = [ "c"; "a" ])
Sourceval select_loose : t -> string list -> t

select_loose df names returns dataframe with specified columns that exist.

Like select, but silently ignores column names that don't exist. Useful when working with dataframes that may have varying column sets.

  • parameter names

    List of column names to include if they exist

Example:

  let df =
    create [ ("a", Col.int32 [| 1l |]); ("b", Col.int32 [| 2l |]) ]
  in
  let df' = select_loose df [ "a"; "nonexistent"; "b" ] in
  (* df' has columns "a" and "b" only *)
  assert (column_names df' = [ "a"; "b" ])
Sourceval reorder_columns : t -> string list -> t

reorder_columns df names reorders columns according to the specified list.

Columns listed in names appear first in that order. Any existing columns not mentioned in names are appended at the end in their original relative order.

  • parameter names

    List specifying the desired order for some/all columns

  • raises Not_found

    if any name in the list doesn't exist.

Example:

  let df =
    create
      [
        ("a", Col.int32 [| 1l |]);
        ("b", Col.int32 [| 2l |]);
        ("c", Col.int32 [| 3l |]);
      ]
  in
  let df' = reorder_columns df [ "c"; "a" ] in
  (* df' has columns in order: "c", "a", "b" *)
  assert (column_names df' = [ "c"; "a"; "b" ])

Row-wise Operations

The Row module provides a functional interface for computations that operate across multiple columns within each row. This is the primary way to create derived columns and perform row-level filtering.

Sourcemodule Row : sig ... end

Row-wise computations using an applicative interface.

Row Filtering and Transformation

Functions that operate on entire rows, including filtering, sampling, and creating new columns from row computations.

Sourceval head : ?n:int -> t -> t

head ?n df returns the first n rows.

Useful for quick inspection of dataframe contents. If n is larger than the number of rows, returns the entire dataframe.

  • parameter n

    Number of rows to return (default: 5)

Time complexity: O(n * k) where k is the number of columns.

Sourceval tail : ?n:int -> t -> t

tail ?n df returns the last n rows.

Useful for quick inspection of dataframe contents. If n is larger than the number of rows, returns the entire dataframe.

  • parameter n

    Number of rows to return (default: 5)

Time complexity: O(n * k) where k is the number of columns.

Sourceval slice : t -> start:int -> stop:int -> t

slice df ~start ~stop returns rows from start (inclusive) to stop (exclusive).

Uses Python-style slicing semantics. Negative indices are not supported.

  • parameter start

    Starting row index (inclusive, 0-based)

  • parameter stop

    Ending row index (exclusive, 0-based)

  • raises Invalid_argument

    if start < 0, stop < start, or indices are out of bounds.

Example:

  let df = create [("id", Col.int32 [|1l; 2l; 3l; 4l; 5l|])] in
  let middle = slice df ~start:1 ~stop:4 in
  (* Result: rows with ids 2, 3, 4 *)
Sourceval sample : ?n:int -> ?frac:float -> ?replace:bool -> ?seed:int -> t -> t

sample ?n ?frac ?replace ?seed df returns random sample of rows.

Samples rows randomly from the dataframe. Exactly one of n or frac must be specified.

  • parameter n

    Exact number of rows to sample

  • parameter frac

    Fraction of rows to sample (between 0.0 and 1.0)

  • parameter replace

    If true, sample with replacement (default: false)

  • parameter seed

    Random seed for reproducible sampling

  • raises Invalid_argument

    if both n and frac are specified, or neither, or if frac is outside 0, 1, or if n > rows when replace=false.

Example:

  let df = create [("id", Col.int32 [|1l; 2l; 3l; 4l; 5l|])] in
  let sample1 = sample df ~n:3 ~seed:42 () in (* 3 random rows *)
  let sample2 = sample df ~frac:0.6 () in     (* 60% of rows *)
Sourceval filter : t -> bool array -> t

filter df mask filters rows where mask is true.

Creates a new dataframe containing only rows where the corresponding mask element is true. The mask array must have exactly the same length as the number of dataframe rows.

  • parameter mask

    Boolean array indicating which rows to keep

Time complexity: O(n * k) where n is rows and k is columns.

Example:

  let df = create [("age", Col.int32 [|25l; 30l; 35l|])] in
  let mask = [|true; false; true|] in
  let filtered = filter df mask in
  (* Result contains rows 0 and 2 (age 25 and 35) *)
Sourceval filter_by : t -> bool row -> t

filter_by df pred filters rows where predicate returns true.

Sourceval drop_nulls : ?subset:string list -> t -> t

drop_nulls ?subset df removes rows containing any null values.

If subset is provided, only checks those columns for nulls. Otherwise checks all columns. A row is dropped if any value in the checked columns is null.

Null definitions:

  • Float columns: NaN values or entries with mask bit set
  • Integer columns: Int32.min_int/Int64.min_int or entries with mask bit set
  • String/Boolean columns: None values
  • parameter subset

    Columns to check for nulls (default: all columns)

Example:

  let df =
    create
      [
        ("a", Col.float64_opt [| Some 1.0; None; Some 3.0 |]);
        ("b", Col.int32 [| 10l; 20l; 30l |]);
      ]
  in
  let cleaned = drop_nulls df in
  (* Result: 2 rows (indices 0 and 2) *)

  let partial = drop_nulls df ~subset:[ "b" ] in
  (* Result: all 3 rows kept (no nulls in "b") *)
Sourceval fill_missing : t -> string -> with_value: [ `Float of float | `Int32 of int32 | `Int64 of int64 | `String of string | `Bool of bool ] -> t

fill_missing df col_name ~with_value replaces null values in a column.

Creates a new dataframe with null values in the specified column replaced by the given value. The value type must match the column type.

  • parameter col_name

    Column to fill

  • parameter with_value

    Replacement value (must match column type)

  • raises Invalid_argument

    if column doesn't exist or value type doesn't match column type.

Example:

  let df = create [ ("x", Col.float64_opt [| Some 1.0; None; Some 3.0 |]) ] in
  let filled = fill_missing df "x" ~with_value:(`Float 0.0) in
  (* "x" now contains [1.0; 0.0; 3.0] *)
Sourceval has_nulls : t -> string -> bool

has_nulls df col_name checks if a column contains any null values.

  • parameter col_name

    Column to check

Time complexity: O(n) in worst case.

Sourceval null_count : t -> string -> int

null_count df col_name returns the number of null values in a column.

  • parameter col_name

    Column to count nulls in

Time complexity: O(n).

Sourceval drop_duplicates : ?subset:string list -> t -> t

drop_duplicates ?subset df removes duplicate rows.

Keeps the first occurrence of each unique row. If subset is provided, only considers those columns when determining duplicates (but keeps all columns in the result).

  • parameter subset

    Columns to consider for duplicate detection (default: all columns)

  • raises Not_found

    if any column in subset doesn't exist.

Time complexity: O(n * k) where n is rows and k is columns in subset.

Example:

  let df = create [("name", Col.string [|"Alice"; "Bob"; "Alice"|]);
                  ("age", Col.int32 [|25l; 30l; 25l|])] in
  let deduped = drop_duplicates df in
  (* Result has 2 rows: ("Alice", 25) and ("Bob", 30) *)

  let deduped_by_name = drop_duplicates df ~subset:["name"] in
  (* Result has 2 rows: first Alice entry and Bob entry *)
Sourceval concat : axis:[ `Rows | `Columns ] -> t list -> t

concat ~axis dfs concatenates dataframes along the specified axis.

Row concatenation (`Rows):

  • All dataframes must have the same columns (order doesn't matter)
  • Combines all rows into a single dataframe
  • Column types must be compatible across dataframes

Column concatenation (`Columns):

  • All dataframes must have the same number of rows
  • Combines all columns into a single dataframe
  • Column names must be unique across dataframes
  • parameter axis

    Direction of concatenation

  • parameter dfs

    List of dataframes to concatenate (must be non-empty)

  • raises Invalid_argument

    if dataframes are incompatible for the chosen axis or if the list is empty.

Example:

  let df1 = create [("a", Col.int32 [|1l; 2l|])] in
  let df2 = create [("a", Col.int32 [|3l; 4l|])] in
  let rows = concat ~axis:`Rows [df1; df2] in
  (* Result: 4 rows with column "a" *)

  let df3 = create [("b", Col.string [|"x"; "y"|])] in
  let cols = concat ~axis:`Columns [df1; df3] in
  (* Result: 2 rows with columns "a" and "b" *)
Sourceval map : t -> ('a, 'b) Nx.dtype -> 'a row -> ('a, 'b) Nx.t

map df dtype f maps row-wise computation to create a new tensor.

Applies the row computation f to each row of the dataframe and collects the results into a 1D tensor of the specified dtype.

  • parameter dtype

    Target Nx dtype for the result tensor

  • parameter f

    Row computation that produces values of type compatible with dtype

Time complexity: O(n * k) where n is rows and k is complexity of computation f.

Example:

  let df = create [("x", Col.float64 [|1.0; 2.0; 3.0|]);
                  ("y", Col.float64 [|4.0; 5.0; 6.0|])] in
  let sums = map df Nx.float64
    (Row.map2 (Row.float64 "x") (Row.float64 "y") ~f:(+.)) in
  (* sums = tensor [5.0; 7.0; 9.0] *)
Sourceval with_column : t -> string -> ('a, 'b) Nx.dtype -> 'a row -> t

with_column df name dtype f creates new column from row-wise computation.

Applies the row computation f to each row and adds the results as a new column with the specified name and dtype. If a column with that name already exists, it is replaced.

  • parameter name

    Name for the new column

  • parameter dtype

    Nx dtype for the new column

  • parameter f

    Row computation that produces values of type compatible with dtype

Example:

  let df = create [("x", Col.float64 [|1.0; 2.0|]);
                  ("y", Col.float64 [|3.0; 4.0|])] in
  let df' = with_column df "sum" Nx.float64
    (Row.map2 (Row.float64 "x") (Row.float64 "y") ~f:(+.)) in
  (* df' now has columns "x", "y", and "sum" *)
Sourceval with_columns : t -> (string * Col.t) list -> t

with_columns df cols adds or replaces multiple columns at once.

This is an efficient way to add multiple pre-computed columns to a dataframe. Similar to Polars' with_columns or pandas' assign. All columns must have the same length as the dataframe.

  • parameter cols

    List of (column_name, column_data) pairs

Example:

  let df = create [("x", Col.float64 [|1.0; 2.0; 3.0|])] in
  let df' = with_columns df
    [
      ("y", Col.float64 [|4.0; 5.0; 6.0|]);
      ("sum", Col.float64 [|5.0; 7.0; 9.0|]);
    ] in
  (* df' now has columns "x", "y", and "sum" *)
Sourceval with_columns_map : t -> (string * ('a, 'b) Nx.dtype * 'a row) list -> t

with_columns_map df specs computes multiple row-wise columns in one pass.

This is more efficient than multiple with_column calls because it processes all computations in a single iteration over the dataframe rows. Similar to pandas' assign or Polars' with_columns.

Each specification is a tuple of:

  • Column name for the result
  • Nx dtype for the result column
  • Row computation that produces values of that type
  • parameter specs

    List of (name, dtype, computation) specifications

Time complexity: O(n * k) where n is rows and k is total complexity of all computations.

Example:

  let df = create [("x", Col.float64 [|1.0; 2.0|]);
                  ("y", Col.float64 [|3.0; 4.0|])] in
  let df' = with_columns_map df
    [
      ("sum", Nx.float64,
       Row.map2 (Row.float64 "x") (Row.float64 "y") ~f:(+.));
      ("ratio", Nx.float64,
       Row.map2 (Row.float64 "x") (Row.float64 "y") ~f:(/.));
    ] in
  (* df' has original columns plus "sum" and "ratio" *)
Sourceval iter : t -> unit row -> unit

iter df f iterates over rows for side effects.

Applies the row computation f to each row but discards the results. Useful for side effects like printing or accumulating external state.

  • parameter f

    Row computation that produces unit (typically for side effects)

Example:

  let df = create [ ("name", Col.string [| "Alice"; "Bob" |]) ] in
  iter df (Row.map (Row.string "name") ~f:(Printf.printf "Hello %s\n"))
  (* Prints: Hello Alice, Hello Bob *)
Sourceval fold : t -> init:'acc -> f:('acc -> 'acc) row -> 'acc

fold df ~init ~f folds over rows with an accumulator.

The row computation f receives the current accumulator value and should return the updated accumulator. This is useful for reductions that depend on previous row results.

  • parameter init

    Initial accumulator value

  • parameter f

    Row computation that takes and returns accumulator type

Example:

  let df = create [("value", Col.int32 [|1l; 2l; 3l|])] in
  let sum = fold df ~init:0l ~f:(Row.map (Row.int32 "value") ~f:(Int32.add)) in
  (* sum = 6l *)
Sourceval fold_left : t -> init:'acc -> f:('acc -> 'a) row -> ('acc -> 'a -> 'acc) -> 'acc

fold_left df ~init ~f combine folds with explicit combine function.

More flexible than fold because the row computation f can access the current accumulator and produce any type, which is then combined with the accumulator using the combine function.

  • parameter init

    Initial accumulator value

  • parameter f

    Row computation that takes accumulator and produces intermediate result

  • parameter combine

    Function to combine accumulator with intermediate result

Example:

  let df = create [("x", Col.int32 [|1l; 2l; 3l|])] in
  let product = fold_left df ~init:1l
    ~f:(Row.map (Row.int32 "x") ~f:(fun x -> x))
    ~combine:Int32.mul in
  (* product = 6l *)

Sorting and Grouping

Functions for reordering rows and grouping data by key values.

Sourceval sort : t -> 'a row -> compare:('a -> 'a -> int) -> t

sort df key ~compare sorts rows by computed key values.

The key computation is applied to each row to produce sort keys, which are then compared using the provided comparison function.

  • parameter key

    Row computation that produces sort keys

  • parameter compare

    Comparison function for sort keys (< 0, = 0, > 0)

Time complexity: O(n log n * k) where k is the complexity of key computation.

Example:

  let df = create [("first", Col.string [|"Bob"; "Alice"|]);
                  ("last", Col.string [|"Smith"; "Jones"|])] in
  let sorted = sort df
    (Row.map2 (Row.string "last") (Row.string "first")
     ~f:(fun l f -> l ^ ", " ^ f))
    ~compare:String.compare in
  (* Sorted by "last, first" *)
Sourceval sort_values : ?ascending:bool -> t -> string -> t

sort_values ?ascending df name sorts rows by column values.

Sorts the entire dataframe based on values in the specified column. Null values are always sorted to the end regardless of sort direction.

  • parameter ascending

    Sort direction (default: true for ascending)

  • parameter name

    Column to sort by

Time complexity: O(n log n) where n is the number of rows.

Example:

  let df = create [("age", Col.int32 [|30l; 25l; 35l|]);
                  ("name", Col.string [|"Bob"; "Alice"; "Charlie"|])] in
  let sorted = sort_values df "age" in
  (* Result: Alice (25), Bob (30), Charlie (35) *)

  let desc_sorted = sort_values df "age" ~ascending:false in
  (* Result: Charlie (35), Bob (30), Alice (25) *)
Sourceval group_by : t -> 'key row -> ('key * t) list

group_by df key groups rows by key values.

Applies the key computation to each row and groups rows with the same key value together. Returns a list of (key_value, sub_dataframe) pairs.

The order of groups is not guaranteed. Rows within each group maintain their original relative order.

  • parameter key

    Row computation that produces grouping keys

Time complexity: O(n * k) where n is rows and k is key computation complexity.

Example:

  let df = create [("age", Col.int32 [|25l; 30l; 25l; 35l|]);
                  ("name", Col.string [|"Alice"; "Bob"; "Charlie"; "Dave"|])] in
  let age_groups = group_by df (Row.int32 "age") in
  (* Result: [(25l, df_with_alice_charlie); (30l, df_with_bob); (35l, df_with_dave)] *)

  let adult_groups = group_by df
    (Row.map (Row.int32 "age") ~f:(fun age -> age >= 30l)) in
  (* Result: [(false, young_people_df); (true, adults_df)] *)
Sourceval group_by_column : t -> string -> (Col.t * t) list

group_by_column df name groups rows by values in the specified column.

This is a convenience function equivalent to group_by with appropriate column accessor. Returns (group_key_column, sub_dataframe) pairs where the key column contains the single unique value for that group.

  • parameter name

    Column to group by

Example:

  let df = create [("category", Col.string [|"A"; "B"; "A"; "C"|]);
                  ("value", Col.int32 [|1l; 2l; 3l; 4l|])] in
  let groups = group_by_column df "category" in
  (* Result: groups for "A" (rows 0,2), "B" (row 1), "C" (row 3) *)

Aggregations and Column Transformations

The Agg module provides efficient column-wise aggregations and transformations. Operations are organized by data type for type safety and performance.

Sourcemodule Agg : sig ... end

Column-wise aggregation and transformation operations.

Joins and Merges

Join operations combine dataframes based on shared key values. Talon provides SQL-style joins with explicit null handling semantics.

Sourceval join : t -> t -> on:string -> how:[ `Inner | `Left | `Right | `Outer ] -> ?suffixes:(string * string) -> unit -> t

join df1 df2 ~on ~how ?suffixes () joins two dataframes on a common column.

Join types:

  • `Inner: Returns only rows where key exists in both dataframes
  • `Left: Returns all rows from df1, null-filled for missing df2 rows
  • `Right: Returns all rows from df2, null-filled for missing df1 rows
  • `Outer: Returns all rows from both dataframes, null-filled where missing

Null semantics:

  • Null keys never match other null keys (null != null in join logic)
  • Inner joins exclude rows with null keys entirely
  • Outer joins preserve null key rows but don't match them to anything
  • Unmatched rows get null values for all columns from the other dataframe

Column naming:

  • Common key column appears once in result
  • Duplicate column names get suffixes (default: "_x" for df1, "_y" for df2)
  • Use suffixes parameter to customize the suffixes
  • parameter on

    Column name that must exist in both dataframes

  • parameter how

    Type of join to perform

  • parameter suffixes

    Tuple of (left_suffix, right_suffix) for duplicate columns

  • raises Not_found

    if the on column doesn't exist in either dataframe.

Example:

  let customers = create [("id", Col.int32 [|1l; 2l; 3l|]);
                         ("name", Col.string [|"Alice"; "Bob"; "Charlie"|])] in
  let orders = create [("id", Col.int32 [|1l; 1l; 2l|]);
                      ("amount", Col.float64 [|100.; 200.; 150.|])] in
  let result = join customers orders ~on:"id" ~how:`Inner () in
  (* Result has customers with their orders, Alice appears twice *)
Sourceval merge : t -> t -> left_on:string -> right_on:string -> how:[ `Inner | `Left | `Right | `Outer ] -> ?suffixes:(string * string) -> unit -> t

merge df1 df2 ~left_on ~right_on ~how ?suffixes () merges dataframes on different column names.

This is identical to join except it allows using different column names from each dataframe as the join keys. The columns must have compatible types for comparison.

The result contains both key columns (with suffixes if they have the same name).

  • parameter left_on

    Column name from the left dataframe (df1)

  • parameter right_on

    Column name from the right dataframe (df2)

  • parameter how

    Type of join to perform (same semantics as join)

  • parameter suffixes

    Tuple of (left_suffix, right_suffix) for duplicate columns

  • raises Not_found

    if either column doesn't exist in its respective dataframe.

Example:

  let products = create [("product_id", Col.int32 [|1l; 2l; 3l|]);
                        ("name", Col.string [|"Widget"; "Gadget"; "Tool"|])] in
  let sales = create [("item_id", Col.int32 [|1l; 1l; 2l|]);
                     ("quantity", Col.int32 [|10l; 5l; 3l|])] in
  let result = merge products sales
                 ~left_on:"product_id" ~right_on:"item_id"
                 ~how:`Inner () in
  (* Result links products to sales via the id mapping *)

Pivot and Reshape

Reshape operations transform dataframe structure between wide and long formats.

Sourceval pivot : t -> index:string -> columns:string -> values:string -> ?agg_func:[ `Sum | `Mean | `Count | `Min | `Max ] -> unit -> t

pivot df ~index ~columns ~values ?agg_func () creates a pivot table.

Transforms data from long format to wide format by: 1. Grouping by the index column (becomes row identifiers) 2. Using unique values from columns as new column names 3. Filling the table with values, aggregated by agg_func if needed

  • parameter index

    Column to use as row identifiers in the pivot table

  • parameter columns

    Column whose unique values become new column names

  • parameter values

    Column containing the data to fill the pivot table

  • parameter agg_func

    Aggregation function for handling duplicate combinations (default: `Sum for numeric, `Count for others)

  • raises Not_found

    if any specified column doesn't exist.

Example:

  let sales = create [("date", Col.string [|"2023-01"; "2023-01"; "2023-02"|]);
                     ("product", Col.string [|"A"; "B"; "A"|]);
                     ("amount", Col.float64 [|100.; 200.; 150.|])] in
  let pivot_table = pivot sales ~index:"date" ~columns:"product"
                         ~values:"amount" ~agg_func:`Sum () in
  (* Result: dates as rows, products as columns, amounts as values *)
Sourceval melt : t -> ?id_vars:string list -> ?value_vars:string list -> ?var_name:string -> ?value_name:string -> unit -> t

melt df ?id_vars ?value_vars ?var_name ?value_name () unpivots dataframe.

Transforms data from wide format to long format by: 1. Keeping id_vars columns as identifiers (repeated for each melted row) 2. Converting value_vars column names into a single "variable" column 3. Converting value_vars values into a single "value" column

  • parameter id_vars

    Columns to keep as identifiers (default: all non-value_vars)

  • parameter value_vars

    Columns to melt (default: all non-id_vars)

  • parameter var_name

    Name for the new "variable" column (default: "variable")

  • parameter value_name

    Name for the new "value" column (default: "value")

  • raises Not_found

    if any specified column doesn't exist.

Example:

  let wide = create [("id", Col.int32 [|1l; 2l|]);
                    ("A", Col.float64 [|1.0; 3.0|]);
                    ("B", Col.float64 [|2.0; 4.0|])] in
  let long = melt wide ~id_vars:["id"] ~value_vars:["A"; "B"] () in
  (* Result: 4 rows with id, variable ("A" or "B"), and value columns *)

Conversion

Functions for converting dataframes to and from other data structures.

Sourceval to_nx : t -> (float, Bigarray.float32_elt) Nx.t

to_nx df converts all numeric columns to a 2D float32 tensor.

Creates a tensor where:

  • Rows correspond to dataframe rows
  • Columns correspond to numeric dataframe columns (in order)
  • All numeric types are cast to float32
  • Null values become NaN in the result tensor

Only numeric columns (int32, int64, float32, float64) are included. String and boolean columns are ignored.

Example:

  let df =
    create
      [
        ("a", Col.int32 [| 1l; 2l |]);
        ("b", Col.float64 [| 3.0; 4.0 |]);
        ("c", Col.string [| "x"; "y" |]);
      ]
  in
  let tensor = to_nx df in
  (* Result: 2x2 float32 tensor with values [[1.0, 3.0], [2.0, 4.0]] *)
  assert (Nx.shape tensor = [| 2; 2 |])

Display and Inspection

Functions for examining and debugging dataframe contents.

Sourceval print : ?max_rows:int -> ?max_cols:int -> t -> unit

print ?max_rows ?max_cols df pretty-prints dataframe in tabular format.

Displays a formatted table showing column names and values. Large dataframes are truncated for readability.

  • parameter max_rows

    Maximum number of rows to display (default: 10)

  • parameter max_cols

    Maximum number of columns to display (default: 10)

Truncated output shows "..." to indicate hidden rows/columns.

Example output:

   name      age    score
0  Alice     25     85.5
1  Bob       30     92.0
2  Charlie   35     78.5
Sourceval describe : t -> t

describe df returns summary statistics for numeric columns.

Creates a new dataframe with statistical summaries as rows:

  • count: number of non-null values
  • mean: arithmetic mean
  • std: standard deviation
  • min: minimum value
  • 25%: first quartile
  • 50%: median
  • 75%: third quartile
  • max: maximum value

Only numeric columns are included in the result. String and boolean columns are ignored.

Time complexity: O(n * k * log n) where n is rows and k is numeric columns (due to quantile calculations).

Sourceval cast_column : t -> string -> ('a, 'b) Nx.dtype -> t

cast_column df name dtype converts column to specified numeric dtype.

Creates a new dataframe with the specified column converted to the target numeric type. Only works for numeric columns and numeric target types.

Type conversions:

  • int32 ↔ int64: direct conversion
  • int32/int64 → float32/float64: exact for small integers
  • float32/float64 → int32/int64: truncation (may lose precision)
  • float32 ↔ float64: precision change

Null values are preserved through the conversion.

  • parameter name

    Column to convert

  • parameter dtype

    Target Nx dtype (must be numeric)

Example:

  let df = create [("values", Col.int32 [|1l; 2l; 3l|])] in
  let df' = cast_column df "values" Nx.float64 in
  (* "values" column is now float64 type *)
Sourceval info : t -> unit

info df prints detailed dataframe information to stdout.

Displays:

  • Dataframe shape (rows × columns)
  • Column names and types
  • Null value counts per column
  • Memory usage estimates

Useful for debugging and understanding dataframe structure.

Example output:

Dataframe Info:
Shape: (1000, 3)
Columns:
  name (string): 0 nulls
  age (int32): 5 nulls  
  score (float64): 2 nulls
Memory usage: ~24KB