Functional summary for a variable that has zero or more occurrences per period. accumulate is expected to be called in_period_count times before finalise is.
Stats Gathered
- Global (non-nan) max, argmax, min, argmin and mean of the variable.
- The very last non-nan sample encountered minus the very first non-nan sample encountered.
- Global histogram made of distribution_bin_count bins. Option: distribution_scale to control the spreading scale of the bins, either on a linear or a log scale. Computed using Bentov.
- A curve made of out_sample_count points. Options: evolution_smoothing to control the smoothing, either using EMA, or no smoothing at all.
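The global statistics listed above can be illustrated with a minimal, self-contained sketch. It is not the module's implementation: the stats type and the summarise function are hypothetical names, and pairing each extremum with the index of the period it was seen in is an assumption about what argmax and argmin mean here.

```ocaml
(* Hypothetical helper, not part of Variable_summary. *)
type stats = {
  max_value : float * int; (* value, index of the period it was seen in *)
  min_value : float * int;
  mean : float;
  diff : float; (* last non-NaN sample minus first non-NaN sample *)
}

(* [periods] holds one list of samples per period; a list may be empty. *)
let summarise (periods : float list list) : stats option =
  (* Tag each sample with its period index and drop the NaNs. *)
  let samples =
    List.concat
      (List.mapi
         (fun i xs ->
           List.filter_map
             (fun x -> if Float.is_nan x then None else Some (x, i))
             xs)
         periods)
  in
  match samples with
  | [] -> None (* no usable sample at all *)
  | first :: _ ->
      (* Keep [a] unless [b] is strictly better, so ties go to the earliest. *)
      let pick better a b = if better (fst b) (fst a) then b else a in
      let max_value = List.fold_left (pick ( > )) first samples in
      let min_value = List.fold_left (pick ( < )) first samples in
      let sum = List.fold_left (fun acc (x, _) -> acc +. x) 0. samples in
      let count = List.length samples in
      let last = List.nth samples (count - 1) in
      Some
        {
          max_value;
          min_value;
          mean = sum /. float_of_int count;
          diff = fst last -. fst first;
        }
```

For instance, summarise [ [ 1.; Float.nan ]; []; [ 5.; 2. ] ] ignores the NaN and the empty period, and reports a diff of 2. -. 1.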
Histograms
The histograms are all computed using https://github.com/barko/bentov. Bentov computes dynamic histograms without the need for a priori information on the distributions, while maintaining a constant memory space and a marginal CPU footprint.
The implementation of that library is pretty straightforward, but not perfect; the CPU footprint doesn't scale well with the number of bins.
The computed histograms depend on the order of the operations, so some marginal instabilities are to be expected.
Bentov is good at spreading the bins on the input space. Since some histograms will be shown on a log plot, the log10 of those values is passed to Bentov instead, but the JSON will store real seconds.
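The log10 round trip described above can be sketched without the library. The example below deliberately swaps Bentov's dynamic bins for plain fixed-width bins, so it is not the Bentov algorithm; it only shows samples being transformed before binning and bin centers being transformed back before they are reported. The function name and the lo/hi bounds are hypothetical.

```ocaml
(* Fixed-width binning in log10 space. NOT the Bentov algorithm: Bentov
   adapts its bins to the data, whereas this sketch needs [lo] and [hi]
   to be known in advance. All names are hypothetical. *)
let log_histogram ~bin_count ~lo ~hi (samples : float list) :
    (float * int) list =
  assert (bin_count > 0 && 0. < lo && lo < hi);
  let log_lo = log10 lo and log_hi = log10 hi in
  let width = (log_hi -. log_lo) /. float_of_int bin_count in
  let counts = Array.make bin_count 0 in
  List.iter
    (fun x ->
      (* NaNs and out-of-range samples fail this test and are skipped. *)
      if x >= lo && x <= hi then begin
        (* The bin index is computed on log10 x, not on x itself. *)
        let i = int_of_float ((log10 x -. log_lo) /. width) in
        let i = max 0 (min i (bin_count - 1)) in
        counts.(i) <- counts.(i) + 1
      end)
    samples;
  (* Report each bin by its center, converted back to the original unit. *)
  List.init bin_count (fun i ->
      let log_center = log_lo +. ((float_of_int i +. 0.5) *. width) in
      (10. ** log_center, counts.(i)))
```

With three bins between 0.001 and 1., each bin covers one decade, and the reported centers are back in the original unit rather than in log10 units.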
Log Scale
When a variable has to be displayed on a log scale, the scale option can be set to `Log in order for some adjustments to be made:
- In the histogram, the bins have to spread on a log scale.
- When smoothing the evolution, the EMA decay has to be calculated on a log scale.
Gotcha: All the input samples should be strictly greater than 0, so that they don't fail their conversion to log.
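To make the EMA adjustment concrete, here is a minimal sketch of an exponential moving average whose decay is applied either directly to the samples or to their logarithms. The single alpha weight is an assumption made for illustration; the source does not say how the two floats carried by `Ema are interpreted, and the ema function itself is hypothetical.

```ocaml
(* [alpha] in (0, 1] is a hypothetical weight given to the newest sample. *)
let ema ~scale ~alpha (xs : float list) : float list =
  (* On a log scale, average the logarithms and map the result back. *)
  let to_internal, of_internal =
    match scale with `Linear -> (Fun.id, Fun.id) | `Log -> (log, exp)
  in
  let step (prev, acc) x =
    let y = to_internal x in
    let s =
      match prev with
      | None -> y (* the first sample initialises the average *)
      | Some p -> (alpha *. y) +. ((1. -. alpha) *. p)
    in
    (Some s, of_internal s :: acc)
  in
  let _, rev = List.fold_left step (None, []) xs in
  List.rev rev
```

With alpha set to 0.5, the samples 1. and 100. smooth to about 50.5 on a linear scale but to about 10. on a log scale, which is the midpoint a reader sees on a log plot. A sample of 0. would turn into negative infinity in the `Log branch, which is why the inputs have to be strictly positive.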
Periods are Decoupled from Samples
When a Variable_summary (vs) is created, the number of periods has to be declared right away through the in_period_count parameter, but vs is very flexible when it comes to the number of samples shown to it on each period.
The simplest use case of vs is when there is exactly one sample for each period. In that case, accumulate acc samples is called using a list of length 1. For example: when a period corresponds to a cycle of an algorithm, and the variable is a timestamp.
The more advanced use case of vs is when there is a varying number of samples for each period. For example: when a period corresponds to a cycle of an algorithm, and the variable is the time taken by a buffer flush that may happen 0, 1 or more times per cycle.
In that latter case, the evolution curve may contain NaNs before and after sample points.
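One way to picture where those NaNs come from is the sketch below. It assumes that each period is reduced to the mean of its samples before any resampling or smoothing, which the source does not confirm; period_value and raw_curve are hypothetical names.

```ocaml
(* Assumption: a period is summarised by the mean of its samples, and a
   period without samples has no defined value, hence NaN. *)
let period_value (samples : float list) : float =
  match samples with
  | [] -> Float.nan
  | _ ->
      List.fold_left ( +. ) 0. samples /. float_of_int (List.length samples)

(* One value per period, in period order. *)
let raw_curve (periods : float list list) : float list =
  List.map period_value periods
```

Given the periods [ [ 1. ]; []; [ 2.; 4. ]; [] ], the second and fourth points of the resulting curve are NaN because no buffer flush, to reuse the example above, happened during those cycles.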
Possible Future Evolutions
- A period-wise histogram, similar to Grafana's heatmaps: "A heatmap is like a histogram, but over time where each time slice represents its own histogram.".
- Variance evolution. Either without smoothing or using exponential moving variance (see Wikipedia).
type t = {
  max_value : float * int;
  min_value : float * int;
  mean : float;
  diff : float;
  distribution : histo;
  evolution : curve;
}
val create_acc :
  evolution_smoothing:[ `Ema of float * float | `None ] ->
  evolution_resampling_mode:[ `Interpolate | `Next_neighbor | `Prev_neighbor ] ->
  distribution_bin_count:int ->
  scale:[ `Linear | `Log ] ->
  in_period_count:int ->
  out_sample_count:int ->
  acc
val accumulate : acc -> float list -> acc