package talon

You can search for identifiers within the package.

in-package search v0.2.0

On This Page

Key Concepts
Quick Start
Column Operations
DataFrame Creation
Shape and Inspection
1. Column Selection Utilities
Column Access and Manipulation
Row-wise Operations
Row Filtering and Transformation
Sorting and Grouping
Aggregations and Column Transformations
Joins and Merges
Pivot and Reshape
Conversion
Display and Inspection

package talon

talon
- CHANGES
- README
- Library talon
  - Talon
    
    Col
    
    Cols
    
    Row
    
    Agg
    
    Agg
    
    Float
    
    Int
    
    String
    
    Bool
- Library talon.csv
  - Talon_csv
- Library talon.json
  - Talon_json
- Sources
  - talon
    
    talon.ml
  - talon.csv
    
    talon_csv.ml
  - talon.json
    
    talon_json.ml

Legend:
Page
Library
Module
Module type
Parameter
Class
Class type
Source

Module `Talon`Source

Talon - A dataframe library for OCaml

Talon provides efficient tabular data manipulation with heterogeneous column types, inspired by pandas and polars. Built on top of the Nx tensor library, it offers type-safe operations with comprehensive null handling.

Dataframes are immutable collections of named columns with equal length. Each column can contain different types of data with explicit null semantics:

Numeric tensors (via Nx): float32/float64, int32/int64
String option arrays with explicit null support
Boolean option arrays with explicit null support

Key Concepts

Null Handling

Talon provides first-class null semantics with explicit null masks for numeric columns, ensuring accurate tracking of missing values:

Null Representation

Numeric columns: Optional boolean mask tracks null entries explicitly. When no mask exists, sentinel values (NaN for floats, Int32.min_int/Int64.min_int for integers) indicate nulls. The mask takes precedence when present.
String/Boolean columns: None values represent nulls explicitly

Creating Nullable Columns

Use the _opt constructors to create columns with explicit null support:

  (* Nullable numeric columns *)
  Col.float64_opt
    [| Some 1.0; None; Some 3.0 |]
    Col.int32_opt
    [| Some 42l; None; Some 100l |]
    (* String/bool columns preserve None directly *)
    Col.string_opt
    [| Some "hello"; None; Some "world" |]
    Col.bool_opt
    [| Some true; None; Some false |]

Accessing Values with Null Awareness

Use option-based accessors to distinguish nulls from sentinel values:

  (* Row-wise option accessors *)
  Row.float64_opt "score" (* Returns None for nulls *) Row.int32_opt
    "count" (* Distinguishes None from Int32.min_int *)
    (* Extract as option arrays *)
    to_float64_options df "score" (* float option array *)

Null Propagation

Operations propagate nulls: null + x = null
Aggregations skip nulls by default (configurable with skipna parameter)
Mask-aware aggregations properly exclude masked entries from computations

Type Safety

The library maintains type information through GADTs and provides type-specific aggregation modules (Agg.Float, Agg.Int, etc.) that ensure operations are only applied to compatible column types.

Performance

Operations leverage vectorized Nx tensor computations where possible. Row-wise operations use an applicative interface that compiles to efficient loops. Use with_columns_map for computing multiple columns in a single pass.

Quick Start

  open Talon

  (* Create a dataframe from columns *)
  let df =
    create
      [
        ("name", Col.string_list [ "Alice"; "Bob"; "Charlie" ]);
        ("age", Col.int32_list [ 25l; 30l; 35l ]);
        ("score", Col.float64_list [ 85.5; 92.0; 78.5 ]);
        ("active", Col.bool_list [ true; false; true ]);
      ]

  (* Filter rows where age > 25 *)
  let adults =
    filter_by df Row.(map (int32 "age") ~f:(fun age -> age > 25l))

  (* Aggregations - explicit about expected types *)
  let total_score =
    Agg.Float.sum df "score" (* 256.0 - works on any numeric *)

  let avg_age = Agg.Int.mean df "age" (* 30.0 - returns float *)
  let max_name = Agg.String.max df "name" (* Some "Charlie" *)

  (* Column operations preserve dtype *)
  let cumulative =
    Agg.cumsum df "score" (* Returns packed_column with float32 *)

  let age_diff = Agg.diff df "age" () (* Returns packed_column with int32 *)

  (* Extract column as array for external processing *)
  let scores_array = to_float32_array df "score"

  (* Group by a computed key *)
  let by_category =
    group_by df
      Row.(
        map (float32 "score") ~f:(fun s ->
            if s >= 90.0 then "A" else if s >= 80.0 then "B" else "C"))

Sourcetype t

Type of dataframe.

Dataframes are immutable tabular data structures with named, typed columns. All columns in a dataframe have the same length (number of rows).

Implementation: Internally uses a list of (name, column) pairs for ordering and a hash table for O(1) column lookup by name.

Sourcetype 'a row

Type for row-wise computations.

This abstract type represents a computation that can be applied to each row of a dataframe to produce a value of type 'a. Row computations are lazy and only executed when the dataframe is processed.

Row computations form an applicative functor, allowing composition of independent computations from multiple columns.

Column Operations

Columns are the fundamental data containers in Talon dataframes. Each column stores homogeneous data with consistent null handling.

Sourcemodule Col : sig ... end

Column creation and manipulation for heterogeneous data types.

DataFrame Creation

Functions for creating dataframes from various data sources.

Sourceval empty : t

empty creates an empty dataframe with no rows or columns.

This is the neutral element for operations like concat. Useful as a starting point for building dataframes incrementally.

Example:

  let df = empty in
  let df' = add_column df "first" (Col.int32 [| 1l; 2l |]) in
  assert (shape df' = (2, 1))

Sourceval create : (string * Col.t) list -> t

create pairs creates a dataframe from (column_name, column) pairs.

This is the primary constructor for dataframes. Each pair specifies a column name and its data.

Invariants:

Column names must be unique (case-sensitive)
All columns must have exactly the same length
Numeric columns must be 1D tensors (checked automatically by Col module)

parameter pairs
List of (column_name, column_data) pairs

raises Invalid_argument
if duplicate column names exist, column lengths differ, or any column has invalid structure.

Example:

  let df =
    create
      [
        ("name", Col.string [| "Alice"; "Bob" |]);
        ("age", Col.int32 [| 25l; 30l |]);
        ("score", Col.float64 [| 85.5; 92.0 |]);
      ]
  in
  assert (shape df = (2, 3))

Sourceval of_tensors : ?names:string list -> ('a, 'b) Nx.t list -> t

of_tensors ?names tensors creates dataframe from 1D Nx tensors.

All tensors must have the same shape and dtype. This is efficient for creating dataframes from pre-computed tensor data.

parameter names
Column names (default: "col0", "col1", etc.)

parameter tensors
List of 1D tensors with identical shapes and dtypes

raises Invalid_argument
if tensors have inconsistent shapes, any tensor is not 1D, names are not unique, or wrong number of names provided.

Example:

  let t1 = Nx.create Nx.float64 [| 3 |] [| 1.0; 2.0; 3.0 |] in
  let t2 = Nx.create Nx.float64 [| 3 |] [| 4.0; 5.0; 6.0 |] in
  let df = of_tensors [ t1; t2 ] ~names:[ "x"; "y" ] in
  assert (shape df = (3, 2))

Sourceval from_nx : ?names:string list -> ('a, 'b) Nx.t -> t

from_nx ?names tensor creates dataframe from 2D tensor.

Each column of the tensor becomes a dataframe column. This is useful for converting tensor data from machine learning operations back to tabular format.

parameter names
Column names (default: "col0", "col1", etc.)

parameter tensor
2D tensor where rows are observations, columns are variables

raises Invalid_argument
if tensor is not 2D or names are not unique.

Example:

  let data =
    Nx.create Nx.float64 [| 2; 3 |] [| 1.0; 2.0; 3.0; 4.0; 5.0; 6.0 |]
  in
  let df = from_nx data ~names:[ "x"; "y"; "z" ] in
  (* Result: 2 rows × 3 columns dataframe *)
  assert (shape df = (2, 3))

Shape and Inspection

Functions for examining dataframe structure and metadata.

Sourceval shape : t -> int * int

shape df returns (num_rows, num_columns).

This is the fundamental size information for the dataframe.

Time complexity: O(1) for non-empty dataframes.

Sourceval num_rows : t -> int

num_rows df returns number of rows.

Equivalent to fst (shape df) but more convenient when you only need row count.

Time complexity: O(1) for non-empty dataframes.

Sourceval num_columns : t -> int

num_columns df returns number of columns.

Equivalent to snd (shape df) but more convenient when you only need column count.

Time complexity: O(1).

Sourceval column_names : t -> string list

column_names df returns column names in their current order.

The order matches the column order for operations like print and to_nx.

Time complexity: O(k) where k is the number of columns.

Source

val column_types : 
  t ->
  (string
   * [ `Float32 | `Float64 | `Int32 | `Int64 | `Bool | `String | `Other ])
    list

column_types df returns column names with their detected types.

Type detection:

`Float32, `Float64, `Int32, `Int64: Numeric Nx tensor columns
`Bool: Boolean option array columns
`String: String option array columns
`Other: Any other Nx tensor types (e.g., uint8)

Useful for programmatic dataframe inspection and type-based operations.

Time complexity: O(k) where k is the number of columns.

Sourceval is_empty : t -> bool

is_empty df returns true if dataframe has no rows.

Note that a dataframe can have columns but zero rows, which is still considered empty by this function.

Time complexity: O(1).

Column Selection Utilities

Sourcemodule Cols : sig ... end

Column selection utilities for working with subsets of columns.

Column Access and Manipulation

Functions for working with individual columns within dataframes.

Sourceval get_column : t -> string -> Col.t option

get_column df name returns column data or None.

Returns the packed column if it exists, None otherwise. Use get_column_exn if you want an exception on missing columns.

Time complexity: O(1) - uses internal hash table lookup.

Sourceval get_column_exn : t -> string -> Col.t

get_column_exn df name returns packed column.

Use this when you know the column should exist and want to fail fast if it doesn't.

raises Not_found
if column doesn't exist.

Time complexity: O(1) - uses internal hash table lookup.

Sourceval to_float32_array : t -> string -> float array option

to_float32_array df name extracts column as float array if it's float32.

Returns Some array if the column exists and is float32 type, None otherwise. Null values in the column become NaN in the array.

parameter name
Column name to extract

Example:

  let df = create [("values", Col.float32 [|1.0; 2.0; Float.nan|])] in
  match to_float32_array df "values" with
  | Some arr -> (* arr = [|1.0; 2.0; nan|] *)
  | None -> (* column doesn't exist or wrong type *)

Sourceval to_float64_array : t -> string -> float array option

to_float64_array df name extracts column as float array if it's float64.

Returns Some array if the column exists and is float64 type, None otherwise. Null values become NaN in the array.

parameter name
Column name to extract

Sourceval to_int32_array : t -> string -> int32 array option

to_int32_array df name extracts column as int32 array if it's int32.

Returns Some array if the column exists and is int32 type, None otherwise. Null values become Int32.min_int in the array.

parameter name
Column name to extract

Sourceval to_int64_array : t -> string -> int64 array option

to_int64_array df name extracts column as int64 array if it's int64.

Returns Some array if the column exists and is int64 type, None otherwise. Null values become Int64.min_int in the array.

parameter name
Column name to extract

Sourceval to_bool_array : t -> string -> bool array option

to_bool_array df name extracts column as bool array if it's bool.

Returns Some array if the column exists and is bool type, None otherwise. Null values become false in the array.

parameter name
Column name to extract

Sourceval to_string_array : t -> string -> string array option

to_string_array df name extracts column as string array if it's string.

Returns Some array if the column exists and is string type, None otherwise. Null values become empty strings in the array.

parameter name
Column name to extract

Sourceval to_float32_options : t -> string -> float option array option

to_float32_options df name extracts column as float option array.

Returns Some array if the column exists and is float32 type, None otherwise. Null values (NaN or masked) become None in the array.

parameter name
Column name to extract

Sourceval to_float64_options : t -> string -> float option array option

to_float64_options df name extracts column as float option array.

Returns Some array if the column exists and is float64 type, None otherwise. Null values (NaN or masked) become None in the array.

parameter name
Column name to extract

Sourceval to_int32_options : t -> string -> int32 option array option

to_int32_options df name extracts column as int32 option array.

Returns Some array if the column exists and is int32 type, None otherwise. Null values (Int32.min_int or masked) become None in the array.

parameter name
Column name to extract

Sourceval to_int64_options : t -> string -> int64 option array option

to_int64_options df name extracts column as int64 option array.

Returns Some array if the column exists and is int64 type, None otherwise. Null values (Int64.min_int or masked) become None in the array.

parameter name
Column name to extract

Sourceval to_bool_options : t -> string -> bool option array option

to_bool_options df name extracts column as bool option array.

Returns Some array if the column exists and is bool type, None otherwise. Null values are represented as None in the array.

parameter name
Column name to extract

Sourceval to_string_options : t -> string -> string option array option

to_string_options df name extracts column as string option array.

Returns Some array if the column exists and is string type, None otherwise. Null values are represented as None in the array.

parameter name
Column name to extract

Sourceval has_column : t -> string -> bool

has_column df name returns true if column exists.

Useful for conditional logic when working with dataframes of unknown structure.

Time complexity: O(1) - uses internal hash table lookup.

Sourceval add_column : t -> string -> Col.t -> t

add_column df name col adds or replaces a column.

If a column with the same name already exists, it is replaced. Otherwise, a new column is added to the dataframe.

parameter name
Column name

parameter col
Column data

raises Invalid_argument
if column length doesn't match num_rows df.

Example:

  let df = create [("x", Col.int32 [|1l; 2l|])] in
  let df' = add_column df "y" (Col.float64 [|3.0; 4.0|]) in
  (* df' now has both "x" and "y" columns *)

Sourceval drop_column : t -> string -> t

drop_column df name removes a column.

Returns the dataframe unchanged if the column doesn't exist (no error). This makes it safe to use in pipelines where column existence is uncertain.

parameter name
Column name to remove

Example:

  let df = create [("x", Col.int32 [|1l; 2l|]); ("y", Col.float64 [|3.0; 4.0|])] in
  let df' = drop_column df "y" in
  (* df' now has only "x" column *)
  let df'' = drop_column df' "nonexistent" in
  (* df'' is unchanged (no error) *)

Sourceval drop_columns : t -> string list -> t

drop_columns df names removes multiple columns.

Equivalent to applying drop_column for each name in the list. Non-existent columns are silently ignored.

parameter names
List of column names to remove

Example:

  let df = create [("a", Col.int32 [|1l|]); ("b", Col.int32 [|2l|]); ("c", Col.int32 [|3l|])] in
  let df' = drop_columns df ["a"; "c"] in
  (* df' now has only "b" column *)

Sourceval rename_column : t -> old_name:string -> new_name:string -> t

rename_column df ~old_name ~new_name renames a column.

Changes the name of an existing column. The column data remains unchanged.

parameter old_name
Current column name

parameter new_name
Desired column name

raises Not_found
if old_name doesn't exist.

raises Invalid_argument
if new_name already exists as a different column.

Example:

  let df = create [("old_name", Col.int32 [|1l; 2l|])] in
  let df' = rename_column df ~old_name:"old_name" ~new_name:"new_name" in
  (* df' has column "new_name" instead of "old_name" *)

Sourceval select : t -> string list -> t

select df names returns dataframe with only specified columns.

The resulting dataframe has columns in the order specified by names. This allows both column filtering and reordering in one operation.

parameter names
List of column names to include (in desired order)

raises Not_found
if any column name doesn't exist.

Example:

  let df =
    create
      [
        ("a", Col.int32 [| 1l |]);
        ("b", Col.int32 [| 2l |]);
        ("c", Col.int32 [| 3l |]);
      ]
  in
  let df' = select df [ "c"; "a" ] in
  (* df' has columns "c" and "a" in that order *)
  assert (column_names df' = [ "c"; "a" ])

Sourceval select_loose : t -> string list -> t

select_loose df names returns dataframe with specified columns that exist.

Like select, but silently ignores column names that don't exist. Useful when working with dataframes that may have varying column sets.

parameter names
List of column names to include if they exist

Example:

  let df =
    create [ ("a", Col.int32 [| 1l |]); ("b", Col.int32 [| 2l |]) ]
  in
  let df' = select_loose df [ "a"; "nonexistent"; "b" ] in
  (* df' has columns "a" and "b" only *)
  assert (column_names df' = [ "a"; "b" ])

Sourceval reorder_columns : t -> string list -> t

reorder_columns df names reorders columns according to the specified list.

Columns listed in names appear first in that order. Any existing columns not mentioned in names are appended at the end in their original relative order.

parameter names
List specifying the desired order for some/all columns

raises Not_found
if any name in the list doesn't exist.

Example:

  let df =
    create
      [
        ("a", Col.int32 [| 1l |]);
        ("b", Col.int32 [| 2l |]);
        ("c", Col.int32 [| 3l |]);
      ]
  in
  let df' = reorder_columns df [ "c"; "a" ] in
  (* df' has columns in order: "c", "a", "b" *)
  assert (column_names df' = [ "c"; "a"; "b" ])

Row-wise Operations

The Row module provides a functional interface for computations that operate across multiple columns within each row. This is the primary way to create derived columns and perform row-level filtering.

Sourcemodule Row : sig ... end

Row-wise computations using an applicative interface.

Row Filtering and Transformation

Functions that operate on entire rows, including filtering, sampling, and creating new columns from row computations.

Sourceval head : ?n:int -> t -> t

head ?n df returns the first n rows.

Useful for quick inspection of dataframe contents. If n is larger than the number of rows, returns the entire dataframe.

parameter n
Number of rows to return (default: 5)

Time complexity: O(n * k) where k is the number of columns.

Sourceval tail : ?n:int -> t -> t

tail ?n df returns the last n rows.

Useful for quick inspection of dataframe contents. If n is larger than the number of rows, returns the entire dataframe.

parameter n
Number of rows to return (default: 5)

Time complexity: O(n * k) where k is the number of columns.

Sourceval slice : t -> start:int -> stop:int -> t

slice df ~start ~stop returns rows from start (inclusive) to stop (exclusive).

Uses Python-style slicing semantics. Negative indices are not supported.

parameter start
Starting row index (inclusive, 0-based)

parameter stop
Ending row index (exclusive, 0-based)

raises Invalid_argument
if start < 0, stop < start, or indices are out of bounds.

Example:

  let df = create [("id", Col.int32 [|1l; 2l; 3l; 4l; 5l|])] in
  let middle = slice df ~start:1 ~stop:4 in
  (* Result: rows with ids 2, 3, 4 *)

Sourceval sample : ?n:int -> ?frac:float -> ?replace:bool -> ?seed:int -> t -> t

sample ?n ?frac ?replace ?seed df returns random sample of rows.

Samples rows randomly from the dataframe. Exactly one of n or frac must be specified.

parameter n
Exact number of rows to sample

parameter frac
Fraction of rows to sample (between 0.0 and 1.0)

parameter replace
If true, sample with replacement (default: false)

parameter seed
Random seed for reproducible sampling

raises Invalid_argument
if both n and frac are specified, or neither, or if frac is outside 0, 1, or if n > rows when replace=false.

Example:

  let df = create [("id", Col.int32 [|1l; 2l; 3l; 4l; 5l|])] in
  let sample1 = sample df ~n:3 ~seed:42 () in (* 3 random rows *)
  let sample2 = sample df ~frac:0.6 () in     (* 60% of rows *)

Sourceval filter : t -> bool array -> t

filter df mask filters rows where mask is true.

Creates a new dataframe containing only rows where the corresponding mask element is true. The mask array must have exactly the same length as the number of dataframe rows.

parameter mask
Boolean array indicating which rows to keep

raises Invalid_argument
if mask length doesn't match num_rows df.

Time complexity: O(n * k) where n is rows and k is columns.

Example:

  let df = create [("age", Col.int32 [|25l; 30l; 35l|])] in
  let mask = [|true; false; true|] in
  let filtered = filter df mask in
  (* Result contains rows 0 and 2 (age 25 and 35) *)

Sourceval filter_by : t -> bool row -> t

filter_by df pred filters rows where predicate returns true.

Sourceval drop_nulls : ?subset:string list -> t -> t

drop_nulls ?subset df removes rows containing any null values.

If subset is provided, only checks those columns for nulls. Otherwise checks all columns. A row is dropped if any value in the checked columns is null.

Null definitions:

Float columns: NaN values or entries with mask bit set
Integer columns: Int32.min_int/Int64.min_int or entries with mask bit set
String/Boolean columns: None values

parameter subset
Columns to check for nulls (default: all columns)

Example:

  let df =
    create
      [
        ("a", Col.float64_opt [| Some 1.0; None; Some 3.0 |]);
        ("b", Col.int32 [| 10l; 20l; 30l |]);
      ]
  in
  let cleaned = drop_nulls df in
  (* Result: 2 rows (indices 0 and 2) *)

  let partial = drop_nulls df ~subset:[ "b" ] in
  (* Result: all 3 rows kept (no nulls in "b") *)

Source

val fill_missing : 
  t ->
  string ->
  with_value:
    [ `Float of float
    | `Int32 of int32
    | `Int64 of int64
    | `String of string
    | `Bool of bool ] ->
  t

fill_missing df col_name ~with_value replaces null values in a column.

Creates a new dataframe with null values in the specified column replaced by the given value. The value type must match the column type.

parameter col_name
Column to fill

parameter with_value
Replacement value (must match column type)

raises Invalid_argument
if column doesn't exist or value type doesn't match column type.

Example:

  let df = create [ ("x", Col.float64_opt [| Some 1.0; None; Some 3.0 |]) ] in
  let filled = fill_missing df "x" ~with_value:(`Float 0.0) in
  (* "x" now contains [1.0; 0.0; 3.0] *)

Sourceval has_nulls : t -> string -> bool

has_nulls df col_name checks if a column contains any null values.

parameter col_name
Column to check

raises Invalid_argument
if column doesn't exist.

Time complexity: O(n) in worst case.

Sourceval null_count : t -> string -> int

null_count df col_name returns the number of null values in a column.

parameter col_name
Column to count nulls in

raises Invalid_argument
if column doesn't exist.

Time complexity: O(n).

Sourceval drop_duplicates : ?subset:string list -> t -> t

drop_duplicates ?subset df removes duplicate rows.

Keeps the first occurrence of each unique row. If subset is provided, only considers those columns when determining duplicates (but keeps all columns in the result).

parameter subset
Columns to consider for duplicate detection (default: all columns)

raises Not_found
if any column in subset doesn't exist.

Time complexity: O(n * k) where n is rows and k is columns in subset.

Example:

  let df = create [("name", Col.string [|"Alice"; "Bob"; "Alice"|]);
                  ("age", Col.int32 [|25l; 30l; 25l|])] in
  let deduped = drop_duplicates df in
  (* Result has 2 rows: ("Alice", 25) and ("Bob", 30) *)

  let deduped_by_name = drop_duplicates df ~subset:["name"] in
  (* Result has 2 rows: first Alice entry and Bob entry *)

Sourceval concat : axis:[ `Rows | `Columns ] -> t list -> t

concat ~axis dfs concatenates dataframes along the specified axis.

Row concatenation (`Rows):

All dataframes must have the same columns (order doesn't matter)
Combines all rows into a single dataframe
Column types must be compatible across dataframes

Column concatenation (`Columns):

All dataframes must have the same number of rows
Combines all columns into a single dataframe
Column names must be unique across dataframes

parameter axis
Direction of concatenation

parameter dfs
List of dataframes to concatenate (must be non-empty)

raises Invalid_argument
if dataframes are incompatible for the chosen axis or if the list is empty.

Example:

  let df1 = create [("a", Col.int32 [|1l; 2l|])] in
  let df2 = create [("a", Col.int32 [|3l; 4l|])] in
  let rows = concat ~axis:`Rows [df1; df2] in
  (* Result: 4 rows with column "a" *)

  let df3 = create [("b", Col.string [|"x"; "y"|])] in
  let cols = concat ~axis:`Columns [df1; df3] in
  (* Result: 2 rows with columns "a" and "b" *)

Sourceval map : t -> ('a, 'b) Nx.dtype -> 'a row -> ('a, 'b) Nx.t

map df dtype f maps row-wise computation to create a new tensor.

Applies the row computation f to each row of the dataframe and collects the results into a 1D tensor of the specified dtype.

parameter dtype
Target Nx dtype for the result tensor

parameter f
Row computation that produces values of type compatible with dtype

Time complexity: O(n * k) where n is rows and k is complexity of computation f.

Example:

  let df = create [("x", Col.float64 [|1.0; 2.0; 3.0|]);
                  ("y", Col.float64 [|4.0; 5.0; 6.0|])] in
  let sums = map df Nx.float64
    (Row.map2 (Row.float64 "x") (Row.float64 "y") ~f:(+.)) in
  (* sums = tensor [5.0; 7.0; 9.0] *)

Sourceval with_column : t -> string -> ('a, 'b) Nx.dtype -> 'a row -> t

with_column df name dtype f creates new column from row-wise computation.

Applies the row computation f to each row and adds the results as a new column with the specified name and dtype. If a column with that name already exists, it is replaced.

parameter name
Name for the new column

parameter dtype
Nx dtype for the new column

parameter f
Row computation that produces values of type compatible with dtype

Example:

  let df = create [("x", Col.float64 [|1.0; 2.0|]);
                  ("y", Col.float64 [|3.0; 4.0|])] in
  let df' = with_column df "sum" Nx.float64
    (Row.map2 (Row.float64 "x") (Row.float64 "y") ~f:(+.)) in
  (* df' now has columns "x", "y", and "sum" *)

Sourceval with_columns : t -> (string * Col.t) list -> t

with_columns df cols adds or replaces multiple columns at once.

This is an efficient way to add multiple pre-computed columns to a dataframe. Similar to Polars' with_columns or pandas' assign. All columns must have the same length as the dataframe.

parameter cols
List of (column_name, column_data) pairs

raises Invalid_argument
if any column length doesn't match the dataframe.

Example:

  let df = create [("x", Col.float64 [|1.0; 2.0; 3.0|])] in
  let df' = with_columns df
    [
      ("y", Col.float64 [|4.0; 5.0; 6.0|]);
      ("sum", Col.float64 [|5.0; 7.0; 9.0|]);
    ] in
  (* df' now has columns "x", "y", and "sum" *)

Sourceval with_columns_map : t -> (string * ('a, 'b) Nx.dtype * 'a row) list -> t

with_columns_map df specs computes multiple row-wise columns in one pass.

This is more efficient than multiple with_column calls because it processes all computations in a single iteration over the dataframe rows. Similar to pandas' assign or Polars' with_columns.

Each specification is a tuple of:

Column name for the result
Nx dtype for the result column
Row computation that produces values of that type

parameter specs
List of (name, dtype, computation) specifications

Time complexity: O(n * k) where n is rows and k is total complexity of all computations.

Example:

  let df = create [("x", Col.float64 [|1.0; 2.0|]);
                  ("y", Col.float64 [|3.0; 4.0|])] in
  let df' = with_columns_map df
    [
      ("sum", Nx.float64,
       Row.map2 (Row.float64 "x") (Row.float64 "y") ~f:(+.));
      ("ratio", Nx.float64,
       Row.map2 (Row.float64 "x") (Row.float64 "y") ~f:(/.));
    ] in
  (* df' has original columns plus "sum" and "ratio" *)

Sourceval iter : t -> unit row -> unit

iter df f iterates over rows for side effects.

Applies the row computation f to each row but discards the results. Useful for side effects like printing or accumulating external state.

parameter f
Row computation that produces unit (typically for side effects)

Example:

  let df = create [ ("name", Col.string [| "Alice"; "Bob" |]) ] in
  iter df (Row.map (Row.string "name") ~f:(Printf.printf "Hello %s\n"))
  (* Prints: Hello Alice, Hello Bob *)

Sourceval fold : t -> init:'acc -> f:('acc -> 'acc) row -> 'acc

fold df ~init ~f folds over rows with an accumulator.

The row computation f receives the current accumulator value and should return the updated accumulator. This is useful for reductions that depend on previous row results.

parameter init
Initial accumulator value

parameter f
Row computation that takes and returns accumulator type

Example:

  let df = create [("value", Col.int32 [|1l; 2l; 3l|])] in
  let sum = fold df ~init:0l ~f:(Row.map (Row.int32 "value") ~f:(Int32.add)) in
  (* sum = 6l *)

Source

val fold_left : 
  t ->
  init:'acc ->
  f:('acc -> 'a) row ->
  ('acc -> 'a -> 'acc) ->
  'acc

fold_left df ~init ~f combine folds with explicit combine function.

More flexible than fold because the row computation f can access the current accumulator and produce any type, which is then combined with the accumulator using the combine function.

parameter init
Initial accumulator value

parameter f
Row computation that takes accumulator and produces intermediate result

parameter combine
Function to combine accumulator with intermediate result

Example:

  let df = create [("x", Col.int32 [|1l; 2l; 3l|])] in
  let product = fold_left df ~init:1l
    ~f:(Row.map (Row.int32 "x") ~f:(fun x -> x))
    ~combine:Int32.mul in
  (* product = 6l *)

Sorting and Grouping

Functions for reordering rows and grouping data by key values.

Sourceval sort : t -> 'a row -> compare:('a -> 'a -> int) -> t

sort df key ~compare sorts rows by computed key values.

The key computation is applied to each row to produce sort keys, which are then compared using the provided comparison function.

parameter key
Row computation that produces sort keys

parameter compare
Comparison function for sort keys (< 0, = 0, > 0)

Time complexity: O(n log n * k) where k is the complexity of key computation.

Example:

  let df = create [("first", Col.string [|"Bob"; "Alice"|]);
                  ("last", Col.string [|"Smith"; "Jones"|])] in
  let sorted = sort df
    (Row.map2 (Row.string "last") (Row.string "first")
     ~f:(fun l f -> l ^ ", " ^ f))
    ~compare:String.compare in
  (* Sorted by "last, first" *)

Sourceval sort_values : ?ascending:bool -> t -> string -> t

sort_values ?ascending df name sorts rows by column values.

Sorts the entire dataframe based on values in the specified column. Null values are always sorted to the end regardless of sort direction.

parameter ascending
Sort direction (default: true for ascending)

parameter name
Column to sort by

raises Not_found
if column doesn't exist.

Time complexity: O(n log n) where n is the number of rows.

Example:

  let df = create [("age", Col.int32 [|30l; 25l; 35l|]);
                  ("name", Col.string [|"Bob"; "Alice"; "Charlie"|])] in
  let sorted = sort_values df "age" in
  (* Result: Alice (25), Bob (30), Charlie (35) *)

  let desc_sorted = sort_values df "age" ~ascending:false in
  (* Result: Charlie (35), Bob (30), Alice (25) *)

Sourceval group_by : t -> 'key row -> ('key * t) list

group_by df key groups rows by key values.

Applies the key computation to each row and groups rows with the same key value together. Returns a list of (key_value, sub_dataframe) pairs.

The order of groups is not guaranteed. Rows within each group maintain their original relative order.

parameter key
Row computation that produces grouping keys

Time complexity: O(n * k) where n is rows and k is key computation complexity.

Example:

  let df = create [("age", Col.int32 [|25l; 30l; 25l; 35l|]);
                  ("name", Col.string [|"Alice"; "Bob"; "Charlie"; "Dave"|])] in
  let age_groups = group_by df (Row.int32 "age") in
  (* Result: [(25l, df_with_alice_charlie); (30l, df_with_bob); (35l, df_with_dave)] *)

  let adult_groups = group_by df
    (Row.map (Row.int32 "age") ~f:(fun age -> age >= 30l)) in
  (* Result: [(false, young_people_df); (true, adults_df)] *)

Sourceval group_by_column : t -> string -> (Col.t * t) list

group_by_column df name groups rows by values in the specified column.

This is a convenience function equivalent to group_by with appropriate column accessor. Returns (group_key_column, sub_dataframe) pairs where the key column contains the single unique value for that group.

parameter name
Column to group by

raises Not_found
if column doesn't exist.

Example:

  let df = create [("category", Col.string [|"A"; "B"; "A"; "C"|]);
                  ("value", Col.int32 [|1l; 2l; 3l; 4l|])] in
  let groups = group_by_column df "category" in
  (* Result: groups for "A" (rows 0,2), "B" (row 1), "C" (row 3) *)

Aggregations and Column Transformations

The Agg module provides efficient column-wise aggregations and transformations. Operations are organized by data type for type safety and performance.

Sourcemodule Agg : sig ... end

Column-wise aggregation and transformation operations.

Joins and Merges

Join operations combine dataframes based on shared key values. Talon provides SQL-style joins with explicit null handling semantics.

Source

val join : 
  t ->
  t ->
  on:string ->
  how:[ `Inner | `Left | `Right | `Outer ] ->
  ?suffixes:(string * string) ->
  unit ->
  t

join df1 df2 ~on ~how ?suffixes () joins two dataframes on a common column.

Join types:

`Inner: Returns only rows where key exists in both dataframes
`Left: Returns all rows from df1, null-filled for missing df2 rows
`Right: Returns all rows from df2, null-filled for missing df1 rows
`Outer: Returns all rows from both dataframes, null-filled where missing

Null semantics:

Null keys never match other null keys (null != null in join logic)
Inner joins exclude rows with null keys entirely
Outer joins preserve null key rows but don't match them to anything
Unmatched rows get null values for all columns from the other dataframe

Column naming:

Common key column appears once in result
Duplicate column names get suffixes (default: "_x" for df1, "_y" for df2)
Use suffixes parameter to customize the suffixes

parameter on
Column name that must exist in both dataframes

parameter how
Type of join to perform

parameter suffixes
Tuple of (left_suffix, right_suffix) for duplicate columns

raises Not_found
if the on column doesn't exist in either dataframe.

raises Invalid_argument
if the on columns have incompatible types.

Example:

  let customers = create [("id", Col.int32 [|1l; 2l; 3l|]);
                         ("name", Col.string [|"Alice"; "Bob"; "Charlie"|])] in
  let orders = create [("id", Col.int32 [|1l; 1l; 2l|]);
                      ("amount", Col.float64 [|100.; 200.; 150.|])] in
  let result = join customers orders ~on:"id" ~how:`Inner () in
  (* Result has customers with their orders, Alice appears twice *)

Source

val merge : 
  t ->
  t ->
  left_on:string ->
  right_on:string ->
  how:[ `Inner | `Left | `Right | `Outer ] ->
  ?suffixes:(string * string) ->
  unit ->
  t

merge df1 df2 ~left_on ~right_on ~how ?suffixes () merges dataframes on different column names.

This is identical to join except it allows using different column names from each dataframe as the join keys. The columns must have compatible types for comparison.

The result contains both key columns (with suffixes if they have the same name).

parameter left_on
Column name from the left dataframe (df1)

parameter right_on
Column name from the right dataframe (df2)

parameter how
Type of join to perform (same semantics as join)

parameter suffixes
Tuple of (left_suffix, right_suffix) for duplicate columns

raises Not_found
if either column doesn't exist in its respective dataframe.

raises Invalid_argument
if the key columns have incompatible types.

Example:

  let products = create [("product_id", Col.int32 [|1l; 2l; 3l|]);
                        ("name", Col.string [|"Widget"; "Gadget"; "Tool"|])] in
  let sales = create [("item_id", Col.int32 [|1l; 1l; 2l|]);
                     ("quantity", Col.int32 [|10l; 5l; 3l|])] in
  let result = merge products sales
                 ~left_on:"product_id" ~right_on:"item_id"
                 ~how:`Inner () in
  (* Result links products to sales via the id mapping *)

Pivot and Reshape

Reshape operations transform dataframe structure between wide and long formats.

Source

val pivot : 
  t ->
  index:string ->
  columns:string ->
  values:string ->
  ?agg_func:[ `Sum | `Mean | `Count | `Min | `Max ] ->
  unit ->
  t

pivot df ~index ~columns ~values ?agg_func () creates a pivot table.

Transforms data from long format to wide format by: 1. Grouping by the index column (becomes row identifiers) 2. Using unique values from columns as new column names 3. Filling the table with values, aggregated by agg_func if needed

parameter index
Column to use as row identifiers in the pivot table

parameter columns
Column whose unique values become new column names

parameter values
Column containing the data to fill the pivot table

parameter agg_func
Aggregation function for handling duplicate combinations (default: `Sum for numeric, `Count for others)

raises Not_found
if any specified column doesn't exist.

raises Invalid_argument
if values column is incompatible with agg_func.

Example:

  let sales = create [("date", Col.string [|"2023-01"; "2023-01"; "2023-02"|]);
                     ("product", Col.string [|"A"; "B"; "A"|]);
                     ("amount", Col.float64 [|100.; 200.; 150.|])] in
  let pivot_table = pivot sales ~index:"date" ~columns:"product"
                         ~values:"amount" ~agg_func:`Sum () in
  (* Result: dates as rows, products as columns, amounts as values *)

Source

val melt : 
  t ->
  ?id_vars:string list ->
  ?value_vars:string list ->
  ?var_name:string ->
  ?value_name:string ->
  unit ->
  t

melt df ?id_vars ?value_vars ?var_name ?value_name () unpivots dataframe.

Transforms data from wide format to long format by: 1. Keeping id_vars columns as identifiers (repeated for each melted row) 2. Converting value_vars column names into a single "variable" column 3. Converting value_vars values into a single "value" column

parameter id_vars
Columns to keep as identifiers (default: all non-value_vars)

parameter value_vars
Columns to melt (default: all non-id_vars)

parameter var_name
Name for the new "variable" column (default: "variable")

parameter value_name
Name for the new "value" column (default: "value")

raises Not_found
if any specified column doesn't exist.

raises Invalid_argument
if id_vars and value_vars overlap or cover all columns.

Example:

  let wide = create [("id", Col.int32 [|1l; 2l|]);
                    ("A", Col.float64 [|1.0; 3.0|]);
                    ("B", Col.float64 [|2.0; 4.0|])] in
  let long = melt wide ~id_vars:["id"] ~value_vars:["A"; "B"] () in
  (* Result: 4 rows with id, variable ("A" or "B"), and value columns *)

Conversion

Functions for converting dataframes to and from other data structures.

Sourceval to_nx : t -> (float, Bigarray.float32_elt) Nx.t

to_nx df converts all numeric columns to a 2D float32 tensor.

Creates a tensor where:

Rows correspond to dataframe rows
Columns correspond to numeric dataframe columns (in order)
All numeric types are cast to float32
Null values become NaN in the result tensor

Only numeric columns (int32, int64, float32, float64) are included. String and boolean columns are ignored.

raises Invalid_argument
if the dataframe contains no numeric columns.

Example:

  let df =
    create
      [
        ("a", Col.int32 [| 1l; 2l |]);
        ("b", Col.float64 [| 3.0; 4.0 |]);
        ("c", Col.string [| "x"; "y" |]);
      ]
  in
  let tensor = to_nx df in
  (* Result: 2x2 float32 tensor with values [[1.0, 3.0], [2.0, 4.0]] *)
  assert (Nx.shape tensor = [| 2; 2 |])

Display and Inspection

Functions for examining and debugging dataframe contents.

Sourceval print : ?max_rows:int -> ?max_cols:int -> t -> unit

print ?max_rows ?max_cols df pretty-prints dataframe in tabular format.

Displays a formatted table showing column names and values. Large dataframes are truncated for readability.

parameter max_rows
Maximum number of rows to display (default: 10)

parameter max_cols
Maximum number of columns to display (default: 10)

Truncated output shows "..." to indicate hidden rows/columns.

Example output:

   name      age    score
0  Alice     25     85.5
1  Bob       30     92.0
2  Charlie   35     78.5

Sourceval describe : t -> t

describe df returns summary statistics for numeric columns.

Creates a new dataframe with statistical summaries as rows:

count: number of non-null values
mean: arithmetic mean
std: standard deviation
min: minimum value
25%: first quartile
50%: median
75%: third quartile
max: maximum value

Only numeric columns are included in the result. String and boolean columns are ignored.

Time complexity: O(n * k * log n) where n is rows and k is numeric columns (due to quantile calculations).

Sourceval cast_column : t -> string -> ('a, 'b) Nx.dtype -> t

cast_column df name dtype converts column to specified numeric dtype.

Creates a new dataframe with the specified column converted to the target numeric type. Only works for numeric columns and numeric target types.

Type conversions:

int32 ↔ int64: direct conversion
int32/int64 → float32/float64: exact for small integers
float32/float64 → int32/int64: truncation (may lose precision)
float32 ↔ float64: precision change

Null values are preserved through the conversion.

parameter name
Column to convert

parameter dtype
Target Nx dtype (must be numeric)

raises Not_found
if column doesn't exist.

raises Invalid_argument
if source column is not numeric or conversion fails.

Example:

  let df = create [("values", Col.int32 [|1l; 2l; 3l|])] in
  let df' = cast_column df "values" Nx.float64 in
  (* "values" column is now float64 type *)

Sourceval info : t -> unit

info df prints detailed dataframe information to stdout.

Displays:

Dataframe shape (rows × columns)
Column names and types
Null value counts per column
Memory usage estimates

Useful for debugging and understanding dataframe structure.

Example output:

Dataframe Info:
Shape: (1000, 3)
Columns:
  name (string): 0 nulls
  age (int32): 5 nulls  
  score (float64): 2 nulls
Memory usage: ~24KB

package talon

Module TalonSource

Key Concepts

Null Handling

Null Representation

Creating Nullable Columns

Accessing Values with Null Awareness

Null Propagation

Type Safety

Performance

Quick Start

Column Operations

DataFrame Creation

Shape and Inspection

Column Selection Utilities

Column Access and Manipulation

Row-wise Operations

Row Filtering and Transformation

Sorting and Grouping

Aggregations and Column Transformations

Joins and Merges

Pivot and Reshape

Conversion

Display and Inspection

Module `Talon`Source