package tdigest

OCaml implementation of the T-Digest algorithm

Sources

2.2.0.tar.gz
`md5=04bf837d0c57ecd9221a2b796c1123de`
`sha512=b6821d4da761f068a23421290ce8d222ed00df96fb7a0dc89694dc37eabc3b20ba3394b99a93a6e24b39b405673e6711589d9874d8dac250da686f474599b353`

Description

The T-Digest is a data structure and algorithm for constructing an approximate distribution for a collection of real numbers presented as a stream.

The T-Digest can estimate percentiles or quantiles extremely accurately even at the tails, while using a fraction of the space.

Additionally, the T-Digest is concatenable, making it a good fit for distributed systems. The internal state of a T-Digest can be exported as a binary string, and the concatenation of any number of those strings can then be imported to form a new T-Digest.

Tdigest

OCaml implementation of the T-Digest algorithm.

``````let td =
Tdigest.create ()
|> Tdigest.add_list [ 10.0; 11.0; 12.0; 13.0 ]
in

Tdigest.percentiles td [ 0.; 0.25; 0.5; 0.75; 1. ]
(* [ Some 10; Some 10.5; Some 11.5; Some 12.5; Some 13 ] *)

Tdigest.p_ranks td [ 9.; 10.; 11.; 12.; 13.; 14. ]
(* [ Some 0; Some 0.125; Some 0.375; Some 0.625; Some 0.875; Some 1 ] *)
``````

The T-Digest is a data structure and algorithm for constructing an approximate distribution for a collection of real numbers presented as a stream.

The T-Digest can estimate percentiles or quantiles extremely accurately even at the tails, while using a fraction of the space of the original data.

A median of medians is not equal to the median of the whole dataset. Percentiles are critical measures that are expensive to compute due to their requirement of having the entire sorted dataset present in one place. These downsides are addressed by using the T-Digest.

A T-Digest is concatenable, making it a good fit for distributed systems. The internal state of a T-Digest can be exported as a binary string, and the concatenation of any number of those strings can then be imported to form a new T-Digest.

``````let combined = Tdigest.merge [ td1; td2; td3 ] in
``````

A T-Digest's state can be stored in a database `VARCHAR`/`TEXT` column and multiple such states can be merged by concatenating strings:

``````-- Combine multiple states in the database
SELECT
STRING_AGG(M.tdigest_state) AS concat_state
FROM my_table AS M
``````
``````(* Then load this combined state into a single T-Digest *)
let combined = Tdigest.of_string concat_state in
``````

This library started off as a port of Will Welch's JavaScript implementation, down to the unit tests. However some modifications have been made to adapt it to OCaml, the most important one being immutability. As such, almost every function in the `Tdigest` module return a new `Tdigest.t`, including "reading" ones since they may trigger intermediate computations worth caching.

Usage

The API is well documented here.

``````opam install tdigest
``````

Marshal

The `Tdigest.t` type cannot be marshalled.

Use the functions in `Tdigest.Marshallable` if your application requires marshalling a T-Digest data structure. Note that `Tdigest.Marshallable.t` is approximately 5 times slower than `Tdigest.t`.

Performance

On an ancient 2015 MacBook Pro, this implementation can incorporate 1,000,000 random floating points in just 770ms.

Exporting and importing state (`to_string`/`of_string`) is cheap.

Dependencies (4)

1. ppx_sexp_conv `>= "v0.16.0"`
2. base `>= "v0.15.0" & < "v0.17.0"`
3. dune `>= "1.9.0"`
4. ocaml `>= "4.10.0"`

Dev Dependencies (2)

1. ppx_custom_printf `>= "v0.16.0" & with-test`
2. ppx_expect `>= "v0.16.0" & with-test`

None

None