Talon
Talon - A dataframe library for OCaml
Talon provides efficient tabular data manipulation with heterogeneous column types, inspired by pandas and polars. Built on top of the Nx tensor library, it offers type-safe operations with comprehensive null handling.
Dataframes are immutable collections of named columns with equal length. Each column can contain different types of data with explicit null semantics:
Talon provides first-class null semantics with explicit null masks for numeric columns, ensuring accurate tracking of missing values:
None values represent nulls explicitly.
Use the _opt constructors to create columns with explicit null support:
(* Nullable numeric columns *)
Col.float64_opt [| Some 1.0; None; Some 3.0 |]
Col.int32_opt [| Some 42l; None; Some 100l |]
(* String/bool columns preserve None directly *)
Col.string_opt [| Some "hello"; None; Some "world" |]
Col.bool_opt [| Some true; None; Some false |]
Use option-based accessors to distinguish nulls from sentinel values:
(* Row-wise option accessors *)
Row.float64_opt "score" (* Returns None for nulls *)
Row.int32_opt "count" (* Distinguishes None from Int32.min_int *)
(* Extract as option arrays *)
to_float64_options df "score" (* float option array *)
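To see why the option-based accessors matter, here is a minimal stdlib-only sketch. It does not use Talon: the int32_col record and the to_sentinel and to_options helpers are hypothetical stand-ins, and the mask layout is an assumption made for illustration. A sentinel-encoded array cannot tell a genuine Int32.min_int from a missing value, whereas an option array can.

```ocaml
(* Hypothetical model of a nullable int32 column: values plus a null mask.
   Illustration only, not Talon's internal representation. *)
type int32_col = { values : int32 array; is_null : bool array }

(* Sentinel view: nulls are replaced by Int32.min_int. *)
let to_sentinel c =
  Array.mapi (fun i v -> if c.is_null.(i) then Int32.min_int else v) c.values

(* Option view: nulls become None, every real value becomes Some v. *)
let to_options c =
  Array.mapi (fun i v -> if c.is_null.(i) then None else Some v) c.values

(* Position 1 is null; position 2 holds a genuine Int32.min_int. *)
let col =
  { values = [| 42l; 0l; Int32.min_int |];
    is_null = [| false; true; false |] }

let sentinel = to_sentinel col
let options = to_options col
```

In the sentinel view, positions 1 and 2 hold the same value and cannot be told apart; in the option view they remain distinct.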
null + x = null
skipna parameter)
The library maintains type information through GADTs and provides type-specific aggregation modules (Agg.Float, Agg.Int, etc.) that ensure operations are only applied to compatible column types.
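The null propagation rule (null + x = null) and the skipna switch mentioned above can be modelled with plain option values. This is a stdlib-only sketch, not Talon code: add_opt and sum are hypothetical names, and the default of skipna = true is an assumption.

```ocaml
(* Element-wise addition with null propagation: null + x = null. *)
let add_opt a b =
  match (a, b) with
  | Some x, Some y -> Some (x +. y)
  | _ -> None

(* Sum with a skipna-style switch. With ~skipna:true nulls are ignored;
   with ~skipna:false a single null makes the whole result null.
   The default value is an assumption of this sketch. *)
let sum ?(skipna = true) (xs : float option array) : float option =
  Array.fold_left
    (fun acc x ->
      match (acc, x) with
      | None, _ -> None
      | Some s, Some v -> Some (s +. v)
      | Some s, None -> if skipna then Some s else None)
    (Some 0.0) xs

let data = [| Some 1.0; None; Some 3.0 |]
let skipped = sum data
let strict = sum ~skipna:false data
```

Here skipped ignores the missing entry, while strict propagates it to the result.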
Operations leverage vectorized Nx tensor computations where possible. Row-wise operations use an applicative interface that compiles to efficient loops. Use with_columns_map for computing multiple columns in a single pass.
open Talon

(* Create a dataframe from columns *)
let df =
  create
    [
      ("name", Col.string_list [ "Alice"; "Bob"; "Charlie" ]);
      ("age", Col.int32_list [ 25l; 30l; 35l ]);
      ("score", Col.float64_list [ 85.5; 92.0; 78.5 ]);
      ("active", Col.bool_list [ true; false; true ]);
    ]

(* Filter rows where age > 25 *)
let adults =
  filter_by df Row.(map (int32 "age") ~f:(fun age -> age > 25l))

(* Aggregations - explicit about expected types *)
let total_score =
  Agg.Float.sum df "score" (* 256.0 - works on any numeric *)
let avg_age = Agg.Int.mean df "age" (* 30.0 - returns float *)
let max_name = Agg.String.max df "name" (* Some "Charlie" *)

(* Column operations preserve dtype *)
let cumulative =
  Agg.cumsum df "score" (* Returns packed_column with float64 *)
let age_diff = Agg.diff df "age" () (* Returns packed_column with int32 *)

(* Extract column as array for external processing;
   "score" was created as float64, so use the float64 accessor *)
let scores_array = to_float64_array df "score"

(* Group by a computed key *)
let by_category =
  group_by df
    Row.(
      map (float64 "score") ~f:(fun s ->
          if s >= 90.0 then "A" else if s >= 80.0 then "B" else "C"))
Type of dataframe.
Dataframes are immutable tabular data structures with named, typed columns. All columns in a dataframe have the same length (number of rows).
Implementation: Internally uses a list of (name, column) pairs for ordering and a hash table for O(1) column lookup by name.
Type for row-wise computations.
This abstract type represents a computation that can be applied to each row of a dataframe to produce a value of type 'a. Row computations are lazy and only executed when the dataframe is processed.
Row computations form an applicative functor, allowing composition of independent computations from multiple columns.
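As a rough mental model of the applicative structure, a row computation can be pictured as a function from a row index to a value. The sketch below is stdlib-only and hypothetical: Talon's real row type is abstract, and the names column, pure, and run are introduced here purely for illustration.

```ocaml
(* Hypothetical model: a row computation maps a row index to a value. *)
type 'a row = int -> 'a

(* A constant computation that ignores the row. *)
let pure (x : 'a) : 'a row = fun _ -> x

(* Transform the result of one computation. *)
let map (r : 'a row) ~(f : 'a -> 'b) : 'b row = fun i -> f (r i)

(* Combine two independent computations. *)
let map2 (a : 'a row) (b : 'b row) ~(f : 'a -> 'b -> 'c) : 'c row =
 fun i -> f (a i) (b i)

(* Column accessors close over plain arrays in this sketch. *)
let column (data : 'a array) : 'a row = fun i -> data.(i)

(* Nothing is evaluated until the computation is run over n rows. *)
let run (n : int) (r : 'a row) : 'a array = Array.init n r

let xs = column [| 1.0; 2.0; 3.0 |]
let ys = column [| 4.0; 5.0; 6.0 |]
let sums = run 3 (map2 xs ys ~f:( +. ))
let shifted = run 3 (map2 xs (pure 10.0) ~f:( +. ))
let doubled = run 3 (map xs ~f:(fun x -> 2.0 *. x))
```

Because each computation only reads its own inputs, map2 can combine columns without either side depending on the other's result, which is what makes the interface applicative rather than monadic.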
Columns are the fundamental data containers in Talon dataframes. Each column stores homogeneous data with consistent null handling.
Functions for creating dataframes from various data sources.
empty
creates an empty dataframe with no rows or columns.
This is the neutral element for operations like concat. Useful as a starting point for building dataframes incrementally.
Example:
let df = empty in
let df' = add_column df "first" (Col.int32 [| 1l; 2l |]) in
assert (shape df' = (2, 1))
create pairs
creates a dataframe from (column_name, column) pairs.
This is the primary constructor for dataframes. Each pair specifies a column name and its data.
Invariants:
Col module)
Example:
let df =
create
[
("name", Col.string [| "Alice"; "Bob" |]);
("age", Col.int32 [| 25l; 30l |]);
("score", Col.float64 [| 85.5; 92.0 |]);
]
in
assert (shape df = (2, 3))
of_tensors ?names tensors
creates dataframe from 1D Nx tensors.
All tensors must have the same shape and dtype. This is efficient for creating dataframes from pre-computed tensor data.
Example:
let t1 = Nx.create Nx.float64 [| 3 |] [| 1.0; 2.0; 3.0 |] in
let t2 = Nx.create Nx.float64 [| 3 |] [| 4.0; 5.0; 6.0 |] in
let df = of_tensors [ t1; t2 ] ~names:[ "x"; "y" ] in
assert (shape df = (3, 2))
from_nx ?names tensor
creates dataframe from 2D tensor.
Each column of the tensor becomes a dataframe column. This is useful for converting tensor data from machine learning operations back to tabular format.
Example:
let data =
Nx.create Nx.float64 [| 2; 3 |] [| 1.0; 2.0; 3.0; 4.0; 5.0; 6.0 |]
in
let df = from_nx data ~names:[ "x"; "y"; "z" ] in
(* Result: 2 rows × 3 columns dataframe *)
assert (shape df = (2, 3))
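To picture how a 2D tensor maps onto columns, the following stdlib-only sketch splits a flat buffer into per-column arrays. It assumes a row-major layout for the buffer, and columns_of_row_major is a hypothetical helper, not part of Talon or Nx.

```ocaml
(* Split a flat row-major buffer of shape (rows, cols) into column arrays.
   Row-major layout is an assumption of this sketch. *)
let columns_of_row_major ~rows ~cols (flat : float array) : float array array =
  assert (Array.length flat = rows * cols);
  Array.init cols (fun c -> Array.init rows (fun r -> flat.((r * cols) + c)))

(* Same data as the from_nx example: 2 rows, 3 columns. *)
let cols =
  columns_of_row_major ~rows:2 ~cols:3 [| 1.0; 2.0; 3.0; 4.0; 5.0; 6.0 |]
```

Under that assumption, the first column would hold 1.0 and 4.0, the second 2.0 and 5.0, and the third 3.0 and 6.0.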
Functions for examining dataframe structure and metadata.
shape df
returns (num_rows, num_columns).
This is the fundamental size information for the dataframe.
Time complexity: O(1) for non-empty dataframes.
num_rows df
returns number of rows.
Equivalent to fst (shape df) but more convenient when you only need row count.
Time complexity: O(1) for non-empty dataframes.
num_columns df
returns number of columns.
Equivalent to snd (shape df) but more convenient when you only need column count.
Time complexity: O(1).
column_names df
returns column names in their current order.
The order matches the column order for operations like print and to_nx.
Time complexity: O(k) where k is the number of columns.
val column_types :
t ->
(string
* [ `Float32 | `Float64 | `Int32 | `Int64 | `Bool | `String | `Other ])
list
column_types df
returns column names with their detected types.
Type detection:
`Float32, `Float64, `Int32, `Int64: Numeric Nx tensor columns
`Bool: Boolean option array columns
`String: String option array columns
`Other: Any other Nx tensor types (e.g., uint8)
Useful for programmatic dataframe inspection and type-based operations.
Time complexity: O(k) where k is the number of columns.
is_empty df
returns true if dataframe has no rows.
Note that a dataframe can have columns but zero rows, which is still considered empty by this function.
Time complexity: O(1).
Functions for working with individual columns within dataframes.
get_column df name
returns column data or None.
Returns the packed column if it exists, None otherwise. Use get_column_exn if you want an exception on missing columns.
Time complexity: O(1) - uses internal hash table lookup.
get_column_exn df name
returns packed column.
Use this when you know the column should exist and want to fail fast if it doesn't.
Time complexity: O(1) - uses internal hash table lookup.
to_float32_array df name
extracts column as float array if it's float32.
Returns Some array if the column exists and is float32 type, None otherwise. Null values in the column become NaN in the array.
Example:
let df = create [("values", Col.float32 [|1.0; 2.0; Float.nan|])] in
match to_float32_array df "values" with
| Some arr ->
  (* arr = [|1.0; 2.0; nan|] *)
  Array.iter (Printf.printf "%f\n") arr
| None ->
  (* column doesn't exist or wrong type *)
  prerr_endline "column missing or not float32"
to_float64_array df name
extracts column as float array if it's float64.
Returns Some array if the column exists and is float64 type, None otherwise. Null values become NaN in the array.
to_int32_array df name
extracts column as int32 array if it's int32.
Returns Some array if the column exists and is int32 type, None otherwise. Null values become Int32.min_int in the array.
to_int64_array df name
extracts column as int64 array if it's int64.
Returns Some array if the column exists and is int64 type, None otherwise. Null values become Int64.min_int in the array.
to_bool_array df name
extracts column as bool array if it's bool.
Returns Some array if the column exists and is bool type, None otherwise. Null values become false in the array.
to_string_array df name
extracts column as string array if it's string.
Returns Some array if the column exists and is string type, None otherwise. Null values become empty strings in the array.
to_float32_options df name
extracts column as float option array.
Returns Some array if the column exists and is float32 type, None otherwise. Null values (NaN or masked) become None in the array.
to_float64_options df name
extracts column as float option array.
Returns Some array if the column exists and is float64 type, None otherwise. Null values (NaN or masked) become None in the array.
to_int32_options df name
extracts column as int32 option array.
Returns Some array if the column exists and is int32 type, None otherwise. Null values (Int32.min_int or masked) become None in the array.
to_int64_options df name
extracts column as int64 option array.
Returns Some array if the column exists and is int64 type, None otherwise. Null values (Int64.min_int or masked) become None in the array.
to_bool_options df name
extracts column as bool option array.
Returns Some array if the column exists and is bool type, None otherwise. Null values are represented as None in the array.
to_string_options df name
extracts column as string option array.
Returns Some array if the column exists and is string type, None otherwise. Null values are represented as None in the array.
has_column df name
returns true if column exists.
Useful for conditional logic when working with dataframes of unknown structure.
Time complexity: O(1) - uses internal hash table lookup.
add_column df name col
adds or replaces a column.
If a column with the same name already exists, it is replaced. Otherwise, a new column is added to the dataframe.
Example:
let df = create [("x", Col.int32 [|1l; 2l|])] in
let df' = add_column df "y" (Col.float64 [|3.0; 4.0|]) in
(* df' now has both "x" and "y" columns *)
drop_column df name
removes a column.
Returns the dataframe unchanged if the column doesn't exist (no error). This makes it safe to use in pipelines where column existence is uncertain.
Example:
let df = create [("x", Col.int32 [|1l; 2l|]); ("y", Col.float64 [|3.0; 4.0|])] in
let df' = drop_column df "y" in
(* df' now has only "x" column *)
let df'' = drop_column df' "nonexistent" in
(* df'' is unchanged (no error) *)
drop_columns df names
removes multiple columns.
Equivalent to applying drop_column for each name in the list. Non-existent columns are silently ignored.
Example:
let df = create [("a", Col.int32 [|1l|]); ("b", Col.int32 [|2l|]); ("c", Col.int32 [|3l|])] in
let df' = drop_columns df ["a"; "c"] in
(* df' now has only "b" column *)
rename_column df ~old_name ~new_name
renames a column.
Changes the name of an existing column. The column data remains unchanged.
Example:
let df = create [("old_name", Col.int32 [|1l; 2l|])] in
let df' = rename_column df ~old_name:"old_name" ~new_name:"new_name" in
(* df' has column "new_name" instead of "old_name" *)
select df names
returns dataframe with only specified columns.
The resulting dataframe has columns in the order specified by names. This allows both column filtering and reordering in one operation.
Example:
let df =
create
[
("a", Col.int32 [| 1l |]);
("b", Col.int32 [| 2l |]);
("c", Col.int32 [| 3l |]);
]
in
let df' = select df [ "c"; "a" ] in
(* df' has columns "c" and "a" in that order *)
assert (column_names df' = [ "c"; "a" ])
select_loose df names
returns dataframe with specified columns that exist.
Like select, but silently ignores column names that don't exist. Useful when working with dataframes that may have varying column sets.
Example:
let df =
create [ ("a", Col.int32 [| 1l |]); ("b", Col.int32 [| 2l |]) ]
in
let df' = select_loose df [ "a"; "nonexistent"; "b" ] in
(* df' has columns "a" and "b" only *)
assert (column_names df' = [ "a"; "b" ])
reorder_columns df names
reorders columns according to the specified list.
Columns listed in names appear first in that order. Any existing columns not mentioned in names are appended at the end in their original relative order.
Example:
let df =
create
[
("a", Col.int32 [| 1l |]);
("b", Col.int32 [| 2l |]);
("c", Col.int32 [| 3l |]);
]
in
let df' = reorder_columns df [ "c"; "a" ] in
(* df' has columns in order: "c", "a", "b" *)
assert (column_names df' = [ "c"; "a"; "b" ])
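The ordering rule can be expressed on plain lists of names. The helper below is a stdlib-only sketch with a hypothetical name (reorder); skipping names that are not present in the dataframe is an assumption of the sketch, since the text above does not say how Talon treats them.

```ocaml
(* Listed names come first, in the given order; the remaining names keep
   their original relative order. Unknown names are skipped here, which is
   an assumption of this sketch. *)
let reorder (existing : string list) (names : string list) : string list =
  let listed = List.filter (fun n -> List.mem n existing) names in
  let rest = List.filter (fun n -> not (List.mem n listed)) existing in
  listed @ rest

let reordered = reorder [ "a"; "b"; "c" ] [ "c"; "a" ]
```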
The Row module provides a functional interface for computations that operate across multiple columns within each row. This is the primary way to create derived columns and perform row-level filtering.
Functions that operate on entire rows, including filtering, sampling, and creating new columns from row computations.
head ?n df
returns the first n rows.
Useful for quick inspection of dataframe contents. If n is larger than the number of rows, returns the entire dataframe.
Time complexity: O(n * k) where k is the number of columns.
tail ?n df
returns the last n rows.
Useful for quick inspection of dataframe contents. If n is larger than the number of rows, returns the entire dataframe.
Time complexity: O(n * k) where k is the number of columns.
slice df ~start ~stop
returns rows from start (inclusive) to stop (exclusive).
Uses Python-style slicing semantics. Negative indices are not supported.
Example:
let df = create [("id", Col.int32 [|1l; 2l; 3l; 4l; 5l|])] in
let middle = slice df ~start:1 ~stop:4 in
(* Result: rows with ids 2, 3, 4 *)
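The half-open interval convention (start inclusive, stop exclusive) can be modelled on a plain array. This is a stdlib-only sketch, not Talon's implementation; slice_rows is a hypothetical name, and out-of-range bounds raise Invalid_argument here simply because Array.sub does.

```ocaml
(* Rows from start (inclusive) to stop (exclusive): stop - start rows. *)
let slice_rows (rows : 'a array) ~start ~stop : 'a array =
  Array.sub rows start (stop - start)

let middle = slice_rows [| 1l; 2l; 3l; 4l; 5l |] ~start:1 ~stop:4
```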
sample ?n ?frac ?replace ?seed df
returns random sample of rows.
Samples rows randomly from the dataframe. Exactly one of n or frac must be specified.
Example:
let df = create [("id", Col.int32 [|1l; 2l; 3l; 4l; 5l|])] in
let sample1 = sample ~n:3 ~seed:42 df in (* 3 random rows *)
let sample2 = sample ~frac:0.6 df in (* 60% of rows *)
filter df mask
filters rows where mask is true.
Creates a new dataframe containing only rows where the corresponding mask element is true. The mask array must have exactly the same length as the number of dataframe rows.
Time complexity: O(n * k) where n is rows and k is columns.
Example:
let df = create [("age", Col.int32 [|25l; 30l; 35l|])] in
let mask = [|true; false; true|] in
let filtered = filter df mask in
(* Result contains rows 0 and 2 (age 25 and 35) *)
filter_by df pred
filters rows where predicate returns true.
drop_nulls ?subset df
removes rows containing any null values.
If subset is provided, only checks those columns for nulls. Otherwise checks all columns. A row is dropped if any value in the checked columns is null.
Null definitions:
None values
Example:
let df =
create
[
("a", Col.float64_opt [| Some 1.0; None; Some 3.0 |]);
("b", Col.int32 [| 10l; 20l; 30l |]);
]
in
let cleaned = drop_nulls df in
(* Result: 2 rows (indices 0 and 2) *)
let partial = drop_nulls df ~subset:[ "b" ] in
(* Result: all 3 rows kept (no nulls in "b") *)
val fill_missing :
t ->
string ->
with_value:
[ `Float of float
| `Int32 of int32
| `Int64 of int64
| `String of string
| `Bool of bool ] ->
t
fill_missing df col_name ~with_value
replaces null values in a column.
Creates a new dataframe with null values in the specified column replaced by the given value. The value type must match the column type.
Example:
let df = create [ ("x", Col.float64_opt [| Some 1.0; None; Some 3.0 |]) ] in
let filled = fill_missing df "x" ~with_value:(`Float 0.0) in
(* "x" now contains [1.0; 0.0; 3.0] *)
has_nulls df col_name
checks if a column contains any null values.
Time complexity: O(n) in worst case.
null_count df col_name
returns the number of null values in a column.
Time complexity: O(n).
drop_duplicates ?subset df
removes duplicate rows.
Keeps the first occurrence of each unique row. If subset is provided, only considers those columns when determining duplicates (but keeps all columns in the result).
Time complexity: O(n * k) where n is rows and k is columns in subset.
Example:
let df = create [("name", Col.string [|"Alice"; "Bob"; "Alice"|]);
("age", Col.int32 [|25l; 30l; 25l|])] in
let deduped = drop_duplicates df in
(* Result has 2 rows: ("Alice", 25) and ("Bob", 30) *)
let deduped_by_name = drop_duplicates df ~subset:["name"] in
(* Result has 2 rows: first Alice entry and Bob entry *)
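The "keep the first occurrence" rule can be sketched with a hash table of seen keys, which is also one way to reach the stated O(n * k) bound. This is a stdlib-only illustration over tuples, not Talon's implementation; drop_duplicates_by is a hypothetical helper whose key function plays the role of the subset argument.

```ocaml
(* Keep the first occurrence of each key, preserving original row order.
   [key] selects the part of the row used to detect duplicates. *)
let drop_duplicates_by (key : 'row -> 'k) (rows : 'row list) : 'row list =
  let seen = Hashtbl.create 16 in
  List.filter
    (fun row ->
      let k = key row in
      if Hashtbl.mem seen k then false
      else (
        Hashtbl.add seen k ();
        true))
    rows

(* Whole-row comparison, as in drop_duplicates without a subset. *)
let deduped =
  drop_duplicates_by (fun row -> row)
    [ ("Alice", 25l); ("Bob", 30l); ("Alice", 25l) ]

(* Comparison on the name only: the later Alice row is dropped even though
   its age differs. *)
let by_name =
  drop_duplicates_by fst [ ("Alice", 25l); ("Bob", 30l); ("Alice", 41l) ]
```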
concat ~axis dfs
concatenates dataframes along the specified axis.
Row concatenation (`Rows):
Column concatenation (`Columns):
Example:
let df1 = create [("a", Col.int32 [|1l; 2l|])] in
let df2 = create [("a", Col.int32 [|3l; 4l|])] in
let rows = concat ~axis:`Rows [df1; df2] in
(* Result: 4 rows with column "a" *)
let df3 = create [("b", Col.string [|"x"; "y"|])] in
let cols = concat ~axis:`Columns [df1; df3] in
(* Result: 2 rows with columns "a" and "b" *)
map df dtype f
maps row-wise computation to create a new tensor.
Applies the row computation f to each row of the dataframe and collects the results into a 1D tensor of the specified dtype.
Time complexity: O(n * k) where n is rows and k is complexity of computation f.
Example:
let df = create [("x", Col.float64 [|1.0; 2.0; 3.0|]);
("y", Col.float64 [|4.0; 5.0; 6.0|])] in
let sums = map df Nx.float64
(Row.map2 (Row.float64 "x") (Row.float64 "y") ~f:(+.)) in
(* sums = tensor [5.0; 7.0; 9.0] *)
with_column df name dtype f
creates new column from row-wise computation.
Applies the row computation f to each row and adds the results as a new column with the specified name and dtype. If a column with that name already exists, it is replaced.
Example:
let df = create [("x", Col.float64 [|1.0; 2.0|]);
("y", Col.float64 [|3.0; 4.0|])] in
let df' = with_column df "sum" Nx.float64
(Row.map2 (Row.float64 "x") (Row.float64 "y") ~f:(+.)) in
(* df' now has columns "x", "y", and "sum" *)
with_columns df cols
adds or replaces multiple columns at once.
This is an efficient way to add multiple pre-computed columns to a dataframe. Similar to Polars' with_columns or pandas' assign. All columns must have the same length as the dataframe.
Example:
let df = create [("x", Col.float64 [|1.0; 2.0; 3.0|])] in
let df' = with_columns df
[
("y", Col.float64 [|4.0; 5.0; 6.0|]);
("sum", Col.float64 [|5.0; 7.0; 9.0|]);
] in
(* df' now has columns "x", "y", and "sum" *)
with_columns_map df specs
computes multiple row-wise columns in one pass.
This is more efficient than multiple with_column calls because it processes all computations in a single iteration over the dataframe rows. Similar to pandas' assign or Polars' with_columns.
Each specification is a tuple of:
Time complexity: O(n * k) where n is rows and k is total complexity of all computations.
Example:
let df = create [("x", Col.float64 [|1.0; 2.0|]);
("y", Col.float64 [|3.0; 4.0|])] in
let df' = with_columns_map df
[
("sum", Nx.float64,
Row.map2 (Row.float64 "x") (Row.float64 "y") ~f:(+.));
("ratio", Nx.float64,
Row.map2 (Row.float64 "x") (Row.float64 "y") ~f:(/.));
] in
(* df' has original columns plus "sum" and "ratio" *)
iter df f
iterates over rows for side effects.
Applies the row computation f to each row but discards the results. Useful for side effects like printing or accumulating external state.
Example:
let df = create [ ("name", Col.string [| "Alice"; "Bob" |]) ] in
iter df (Row.map (Row.string "name") ~f:(Printf.printf "Hello %s\n"))
(* Prints: Hello Alice, Hello Bob *)
fold df ~init ~f
folds over rows with an accumulator.
The row computation f receives the current accumulator value and should return the updated accumulator. This is useful for reductions that depend on previous row results.
Example:
let df = create [("value", Col.int32 [|1l; 2l; 3l|])] in
let sum = fold df ~init:0l ~f:(Row.map (Row.int32 "value") ~f:(Int32.add)) in
(* sum = 6l *)
fold_left df ~init ~f combine
folds with explicit combine function.
More flexible than fold because the row computation f can produce a value of any type, which is then combined with the current accumulator using the combine function.
Example:
let df = create [("x", Col.int32 [|1l; 2l; 3l|])] in
let product = fold_left df ~init:1l
~f:(Row.int32 "x")
Int32.mul in
(* product = 6l *)
Functions for reordering rows and grouping data by key values.
sort df key ~compare
sorts rows by computed key values.
The key computation is applied to each row to produce sort keys, which are then compared using the provided comparison function.
Time complexity: O(n log n * k) where k is the complexity of key computation.
Example:
let df = create [("first", Col.string [|"Bob"; "Alice"|]);
("last", Col.string [|"Smith"; "Jones"|])] in
let sorted = sort df
(Row.map2 (Row.string "last") (Row.string "first")
~f:(fun l f -> l ^ ", " ^ f))
~compare:String.compare in
(* Sorted by "last, first" *)
sort_values ?ascending df name
sorts rows by column values.
Sorts the entire dataframe based on values in the specified column. Null values are always sorted to the end regardless of sort direction.
Time complexity: O(n log n) where n is the number of rows.
Example:
let df = create [("age", Col.int32 [|30l; 25l; 35l|]);
("name", Col.string [|"Bob"; "Alice"; "Charlie"|])] in
let sorted = sort_values df "age" in
(* Result: Alice (25), Bob (30), Charlie (35) *)
let desc_sorted = sort_values df "age" ~ascending:false in
(* Result: Charlie (35), Bob (30), Alice (25) *)
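The rule that nulls sort to the end in both directions can be captured in a comparator over option values. This is a stdlib-only sketch; compare_nulls_last is a hypothetical helper and not part of Talon.

```ocaml
(* Compare optional keys so that None sorts last regardless of direction.
   Only the comparison between two present values is reversed. *)
let compare_nulls_last ~ascending cmp a b =
  match (a, b) with
  | None, None -> 0
  | None, Some _ -> 1
  | Some _, None -> -1
  | Some x, Some y -> if ascending then cmp x y else cmp y x

let ages = [ Some 30l; None; Some 25l; Some 35l ]

let asc =
  List.stable_sort (compare_nulls_last ~ascending:true Int32.compare) ages

let desc =
  List.stable_sort (compare_nulls_last ~ascending:false Int32.compare) ages
```

Simply negating an ordinary comparison for the descending case would move nulls to the front, which is why the null cases are handled before the direction is applied.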
group_by df key
groups rows by key values.
Applies the key computation to each row and groups rows with the same key value together. Returns a list of (key_value, sub_dataframe) pairs.
The order of groups is not guaranteed. Rows within each group maintain their original relative order.
Time complexity: O(n * k) where n is rows and k is key computation complexity.
Example:
let df = create [("age", Col.int32 [|25l; 30l; 25l; 35l|]);
("name", Col.string [|"Alice"; "Bob"; "Charlie"; "Dave"|])] in
let age_groups = group_by df (Row.int32 "age") in
(* Result: [(25l, df_with_alice_charlie); (30l, df_with_bob); (35l, df_with_dave)] *)
let adult_groups = group_by df
(Row.map (Row.int32 "age") ~f:(fun age -> age >= 30l)) in
(* Result: [(false, young_people_df); (true, adults_df)] *)
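The grouping behaviour can be modelled over a list of tuples with a hash table. This is a stdlib-only sketch, not Talon's implementation: group_rows is a hypothetical name, and it returns groups in first-seen key order only to make the result deterministic, whereas the text above states that Talon does not guarantee any group order.

```ocaml
(* Group rows by a computed key. Rows keep their original relative order
   inside each group; groups come out in first-seen key order here. *)
let group_rows (key : 'row -> 'k) (rows : 'row list) : ('k * 'row list) list =
  let tbl = Hashtbl.create 16 in
  let order = ref [] in
  List.iter
    (fun row ->
      let k = key row in
      match Hashtbl.find_opt tbl k with
      | Some acc -> Hashtbl.replace tbl k (row :: acc)
      | None ->
          order := k :: !order;
          Hashtbl.replace tbl k [ row ])
    rows;
  (* Both the key order and each group were accumulated in reverse. *)
  List.rev_map (fun k -> (k, List.rev (Hashtbl.find tbl k))) !order

let people = [ (25l, "Alice"); (30l, "Bob"); (25l, "Charlie"); (35l, "Dave") ]
let groups = group_rows fst people
```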
group_by_column df name
groups rows by values in the specified column.
This is a convenience function equivalent to group_by with appropriate column accessor. Returns (group_key_column, sub_dataframe) pairs where the key column contains the single unique value for that group.
Example:
let df = create [("category", Col.string [|"A"; "B"; "A"; "C"|]);
("value", Col.int32 [|1l; 2l; 3l; 4l|])] in
let groups = group_by_column df "category" in
(* Result: groups for "A" (rows 0,2), "B" (row 1), "C" (row 3) *)
The Agg module provides efficient column-wise aggregations and transformations. Operations are organized by data type for type safety and performance.
Join operations combine dataframes based on shared key values. Talon provides SQL-style joins with explicit null handling semantics.
val join :
t ->
t ->
on:string ->
how:[ `Inner | `Left | `Right | `Outer ] ->
?suffixes:(string * string) ->
unit ->
t
join df1 df2 ~on ~how ?suffixes ()
joins two dataframes on a common column.
Join types:
`Inner: Returns only rows where key exists in both dataframes
`Left: Returns all rows from df1, null-filled for missing df2 rows
`Right: Returns all rows from df2, null-filled for missing df1 rows
`Outer: Returns all rows from both dataframes, null-filled where missing
Null semantics:
Column naming:
suffixes parameter to customize the suffixes
Example:
let customers = create [("id", Col.int32 [|1l; 2l; 3l|]);
("name", Col.string [|"Alice"; "Bob"; "Charlie"|])] in
let orders = create [("id", Col.int32 [|1l; 1l; 2l|]);
("amount", Col.float64 [|100.; 200.; 150.|])] in
let result = join customers orders ~on:"id" ~how:`Inner () in
(* Result has customers with their orders, Alice appears twice *)
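The difference between `Inner and `Left can be modelled on keyed lists, with None standing in for the null fill. This is a stdlib-only sketch using a quadratic nested scan for clarity; inner_join and left_join are hypothetical helpers and say nothing about how Talon implements joins.

```ocaml
(* Inner join: one output row per matching (left, right) pair. *)
let inner_join left right =
  List.concat_map
    (fun (k, l) ->
      List.filter_map
        (fun (k', r) -> if k = k' then Some (k, l, r) else None)
        right)
    left

(* Left join: unmatched left rows are kept, with None on the right side. *)
let left_join left right =
  List.concat_map
    (fun (k, l) ->
      match List.filter (fun (k', _) -> k = k') right with
      | [] -> [ (k, l, None) ]
      | matches -> List.map (fun (_, r) -> (k, l, Some r)) matches)
    left

let customers = [ (1l, "Alice"); (2l, "Bob"); (3l, "Charlie") ]
let orders = [ (1l, 100.); (1l, 200.); (2l, 150.) ]
let inner = inner_join customers orders
let left = left_join customers orders
```

In the inner result Alice appears twice and Charlie is absent; in the left result Charlie is kept with a missing amount.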
val merge :
t ->
t ->
left_on:string ->
right_on:string ->
how:[ `Inner | `Left | `Right | `Outer ] ->
?suffixes:(string * string) ->
unit ->
t
merge df1 df2 ~left_on ~right_on ~how ?suffixes ()
merges dataframes on different column names.
This is identical to join except it allows using different column names from each dataframe as the join keys. The columns must have compatible types for comparison.
The result contains both key columns (with suffixes if they have the same name).
Example:
let products = create [("product_id", Col.int32 [|1l; 2l; 3l|]);
("name", Col.string [|"Widget"; "Gadget"; "Tool"|])] in
let sales = create [("item_id", Col.int32 [|1l; 1l; 2l|]);
("quantity", Col.int32 [|10l; 5l; 3l|])] in
let result = merge products sales
~left_on:"product_id" ~right_on:"item_id"
~how:`Inner () in
(* Result links products to sales via the id mapping *)
Reshape operations transform dataframe structure between wide and long formats.
val pivot :
t ->
index:string ->
columns:string ->
values:string ->
?agg_func:[ `Sum | `Mean | `Count | `Min | `Max ] ->
unit ->
t
pivot df ~index ~columns ~values ?agg_func ()
creates a pivot table.
Transforms data from long format to wide format by:
1. Grouping by the index column (becomes row identifiers)
2. Using unique values from columns as new column names
3. Filling the table with values, aggregated by agg_func if needed
Example:
let sales = create [("date", Col.string [|"2023-01"; "2023-01"; "2023-02"|]);
("product", Col.string [|"A"; "B"; "A"|]);
("amount", Col.float64 [|100.; 200.; 150.|])] in
let pivot_table = pivot sales ~index:"date" ~columns:"product"
~values:"amount" ~agg_func:`Sum () in
(* Result: dates as rows, products as columns, amounts as values *)
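The three pivot steps can be sketched on a list of (index, column, value) triples with sum aggregation. This is a stdlib-only model, not Talon's implementation: pivot_sum is a hypothetical helper, and representing cells with no data as None is an assumption.

```ocaml
(* Long-to-wide pivot with sum aggregation. Returns the new column names
   and, per index key, one optional cell per column. *)
let pivot_sum (rows : (string * string * float) list) =
  (* Unique values in first-seen order. *)
  let uniq xs =
    List.fold_left
      (fun acc x -> if List.mem x acc then acc else acc @ [ x ])
      [] xs
  in
  let index = uniq (List.map (fun (i, _, _) -> i) rows) in
  let columns = uniq (List.map (fun (_, c, _) -> c) rows) in
  (* Sum every value that falls into cell (i, c); None if there is none. *)
  let cell i c =
    match List.filter (fun (i', c', _) -> i = i' && c = c') rows with
    | [] -> None
    | hits -> Some (List.fold_left (fun s (_, _, v) -> s +. v) 0.0 hits)
  in
  (columns, List.map (fun i -> (i, List.map (cell i) columns)) index)

let sales =
  [ ("2023-01", "A", 100.); ("2023-01", "B", 200.); ("2023-02", "A", 150.) ]

let header, table = pivot_sum sales
```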
val melt :
t ->
?id_vars:string list ->
?value_vars:string list ->
?var_name:string ->
?value_name:string ->
unit ->
t
melt df ?id_vars ?value_vars ?var_name ?value_name ()
unpivots dataframe.
Transforms data from wide format to long format by:
1. Keeping id_vars columns as identifiers (repeated for each melted row)
2. Converting value_vars column names into a single "variable" column
3. Converting value_vars values into a single "value" column
Example:
let wide = create [("id", Col.int32 [|1l; 2l|]);
("A", Col.float64 [|1.0; 3.0|]);
("B", Col.float64 [|2.0; 4.0|])] in
let long = melt wide ~id_vars:["id"] ~value_vars:["A"; "B"] () in
(* Result: 4 rows with id, variable ("A" or "B"), and value columns *)
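The unpivot steps can be modelled on rows that pair an identifier with a list of values. This is a stdlib-only sketch, not Talon's implementation: melt_rows is a hypothetical helper, and the output order (all variables of one id before the next id) is an assumption that may differ from Talon's.

```ocaml
(* Wide-to-long: each (id, values) row yields one (id, variable, value)
   row per value column. value_vars and values must have equal length. *)
let melt_rows ~(value_vars : string list) (rows : ('id * float list) list) =
  List.concat_map
    (fun (id, values) ->
      List.map2 (fun var v -> (id, var, v)) value_vars values)
    rows

let wide = [ (1l, [ 1.0; 2.0 ]); (2l, [ 3.0; 4.0 ]) ]
let long = melt_rows ~value_vars:[ "A"; "B" ] wide
```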
Functions for converting dataframes to and from other data structures.
to_nx df
converts all numeric columns to a 2D float32 tensor.
Creates a tensor where:
Only numeric columns (int32, int64, float32, float64) are included. String and boolean columns are ignored.
Example:
let df =
create
[
("a", Col.int32 [| 1l; 2l |]);
("b", Col.float64 [| 3.0; 4.0 |]);
("c", Col.string [| "x"; "y" |]);
]
in
let tensor = to_nx df in
(* Result: 2x2 float32 tensor with values [[1.0, 3.0], [2.0, 4.0]] *)
assert (Nx.shape tensor = [| 2; 2 |])
Functions for examining and debugging dataframe contents.
print ?max_rows ?max_cols df
pretty-prints dataframe in tabular format.
Displays a formatted table showing column names and values. Large dataframes are truncated for readability.
Truncated output shows "..." to indicate hidden rows/columns.
Example output:
   name     age  score
0  Alice    25   85.5
1  Bob      30   92.0
2  Charlie  35   78.5
describe df
returns summary statistics for numeric columns.
Creates a new dataframe with statistical summaries as rows:
Only numeric columns are included in the result. String and boolean columns are ignored.
Time complexity: O(n * k * log n) where n is rows and k is numeric columns (due to quantile calculations).
cast_column df name dtype
converts column to specified numeric dtype.
Creates a new dataframe with the specified column converted to the target numeric type. Only works for numeric columns and numeric target types.
Type conversions:
Null values are preserved through the conversion.
Example:
let df = create [("values", Col.int32 [|1l; 2l; 3l|])] in
let df' = cast_column df "values" Nx.float64 in
(* "values" column is now float64 type *)
info df
prints detailed dataframe information to stdout.
Displays:
Useful for debugging and understanding dataframe structure.
Example output:
Dataframe Info:
Shape: (1000, 3)
Columns:
  name (string): 0 nulls
  age (int32): 5 nulls
  score (float64): 2 nulls
Memory usage: ~24KB