sek 20200403 · OCaml Package

This library offers efficient implementations of ephemeral sequences and persistent sequences, together with efficient conversions between these two data structures.

Type Abbreviations

The following type abbreviations help give readable types to some operations on sequences.

type index = int

An index into a sequence is an integer. It is comprised between 0 (included) and the length of the sequence (excluded or included, depending on the circumstances).

type length = int

The length of a sequence is a nonnegative integer.

type capacity = int

The capacity of a chunk is a nonnegative integer.

type depth = int

The depth of a chunk is a nonnegative integer.

Library API

The signature SEK is the public API of the library. If you are a new user, you do not need to follow this link: the library's API appears below anyway. Just read on!

module type SEK = sig ... end

Ephemeral and Persistent Sequences

The submodules Ephemeral and Persistent offer implementations of ephemeral (mutable) and persistent (immutable) sequences.

type side

val front : side

val back : side

A side appears as a parameter to several operations, such as push and pop, which can operate at either end of a sequence.

type direction

val forward : direction

val backward : direction

A direction appears as a parameter to several operations, such as iter, which can traverse the sequence either forward (from front to back) or backward (from back to front).

exception Empty

The exception Empty is raised by pop and peek operations when applied to an empty sequence.

module Ephemeral : sig ... end

The submodule Ephemeral, also available under the name E, offers an implementation of ephemeral (mutable) sequences.

module Persistent : sig ... end

The submodule Persistent, also available under the name P, offers an implementation of persistent (immutable) sequences.

module E = Ephemeral

E is a short name for the submodule Ephemeral.

module P = Persistent

P is a short name for the submodule Persistent.

Conversion Functions

The following functions offer fast conversions between ephemeral and persistent sequences.

val snapshot : 'a Ephemeral.t -> 'a Persistent.t

snapshot s constructs and returns a persistent sequence whose elements are the elements of s. It is less efficient than snapshot_and_clear, whose use should be preferred, when possible.

val snapshot_and_clear : 'a Ephemeral.t -> 'a Persistent.t

snapshot_and_clear s constructs and returns a persistent sequence whose elements are the elements of s. As a side effect, it clears s.

val edit : 'a Persistent.t -> 'a Ephemeral.t

edit s constructs and returns a new ephemeral sequence whose elements are the elements of s.

Emulation Layers

module Queue : sig ... end

The submodule Queue is a replacement for OCaml's standard Queue module, where a queue is implemented as an ephemeral sequence. Elements are enqueued at the back end of the sequence and dequeued at the front end.

module Stack : sig ... end

The submodule Stack is a replacement for OCaml's standard Stack module, where a stack is implemented as an ephemeral sequence. Elements are pushed and popped at the front end of the sequence.

Miscellaneous

val released : unit -> unit

The function call released() does nothing if the library was compiled in release mode, and fails (with an assertion failure) if the library was compiled with assertions enabled.

Settings

The following settings can be controlled by passing parameters to the functor Make (below).

Chunk Capacity

A sequence is represented in memory by a complex data structure that involves chunks, that is, arrays of elements. The data structure is organized in several layers. In the outermost layer, at depth 0, chunks of elements are used. In the next layer, at depth 1, chunks of chunks of elements are used, and so on: at depth k+1, chunks whose elements are depth-k chunks are used.

The functor parameter C, whose signature is CAPACITY, determines the desired capacity of these chunks. This capacity may depend on the depth.

module type CAPACITY = sig ... end

Overwriting Empty Slots

The functor parameter O, whose signature is OVERWRITE_EMPTY_SLOTS, determines whether the content of a just-emptied slot in an ephemeral sequence should be overwritten with the default value that was supplied when the sequence was created.

Setting this parameter to DoOverwriteEmptySlots is safe.

Setting this parameter to DoNotOverwriteEmptySlots can save time but can also cause a memory leak, because the obsolete value stored in the slot remains reachable and cannot be collected.

module type OVERWRITE_EMPTY_SLOTS = sig ... end

Compact Persistent Sequence Threshold

A persistent sequence whose length is less than or equal to a certain threshold can be represented in a simple and compact way (for instance, using an immutable array).

The functor parameter T, whose signature is THRESHOLD, determines this threshold.

module type THRESHOLD = sig ... end

The Functor `Make`

module Make (C : CAPACITY) (O : OVERWRITE_EMPTY_SLOTS) (T : THRESHOLD) : SEK

The functor Make constructs an implementation of the signature SEK, and allows the user to choose the value of the parameters described above. Be warned, however, that the number and types of the parameters of this functor may change in the future. Users who want maximum forward compatibility should not use this functor.

The following are recommended default arguments for the functor Make.

module DefaultCapacity : CAPACITY

module DoOverwriteEmptySlots : OVERWRITE_EMPTY_SLOTS

module DoNotOverwriteEmptySlots : OVERWRITE_EMPTY_SLOTS

module DefaultOverwriteEmptySlots : OVERWRITE_EMPTY_SLOTS

module DefaultThreshold : THRESHOLD

Complexity Guide

This section offers a simplified guide to the complexity of the operations on sequences.

The elements of a sequence are stored internally in chunks, that is, arrays of a fixed capacity K. This is why this data structure is known as a chunk sequence. A larger value of K speeds up certain operations and reduces memory usage, but slows down other operations. A practical value of K is 128.

As long as no concatenation operations are performed, the space usage of a sequence of length n is (1+10/K) * n + O(K) words. If concatenation operations are involved, the worst-case space bound is doubled and becomes 2 * (1+10/K) * n + O(K) words. Yet, this bound is unlikely to be reached in practice.

Below a certain threshold T, persistent sequences are represented in a more compact form: a persistent sequence of length less than (or equal to) T is represented as an immutable array of length n. Thus, below this threshold T, all operations have cost O(n), where n denotes the length of the sequence, except P.create, P.default, P.length, P.is_empty, P.peek and P.get, which have complexity O(1). Observe that it costs O(T^2) to build a persistent sequence of length T through a series of persistent push operations; this is not recommended! Instead, one should first build an ephemeral sequence through a series of ephemeral push operations, then convert it to a persistent sequence. This way, the construction cost is only O(n+K).

In the remainder of this section, we focus on sequences that contain more than T elements, and review the asymptotic complexity of each operation.

We first review a number of operations whose complexity is the same in the ephemeral and persistent cases:

make, init, of_array_segment and of_array have complexity O(n), where n is the length of the sequence that is constructed. In the case of init, this does not count the cost of the calls to the user function f.

default, length, is_empty have complexity O(1).

peek has complexity O(1).

split and E.carve have complexity O(K + log n). More precisely, their complexity is O(log (min (i, n - i))), where i is the index where the sequence is split. This means that splitting near the beginning or end of the sequence is cheap, whereas splitting somewhere in the middle is more expensive.

concat and E.append have complexity O(K + log n), where n is the length of the result of the concatenation.

iter, iteri, fold_left, fold_right have complexity O(n), that is, O(1) per element, not counting the cost of the calls to the user function f.

to_list and to_array have complexity O(n). The conversion to an array is implemented efficiently using a series of blit operations that process O(K) items at a time.

We continue with a review of operations on ephemeral sequences:

E.create has complexity O(K).

E.clear has complexity O(K), unless DoNotOverwriteEmptySlots was passed as an argument to Make, in which case E.clear has complexity O(1).

E.assign has complexity O(K).

E.push and E.pop have amortized complexity O(1 + 1/K * log n), which can be understood as O(1) for all practical purposes. In a series of push operations, without any intervening pop operations, the amortized complexity of E.push is actually O(1). Although the worst-case complexity of E.push and E.pop is O(log n), this worst case is infrequent: it arises at most once every K operations. E.push and E.pop are carefully optimized so as to be competitive with push and pop operations on a vector (also known as a resizable array).

E.get has complexity O(log n). More precisely, E.get s i has complexity O(log (min (i, n - i))), which means that accessing an element near the beginning or end of the sequence is cheap, whereas accessing an element somewhere in the middle is more expensive.

When applied to a sequence whose chunks are not shared with another sequence, E.set costs O(log n). However, when a shared chunk is involved, a copy must be made, so the cost of the operation is O(K + log n). Subsequent set operations that fall in the same chunk will cost only O(log n) again.

E.copy has complexity O(K). However, it has a hidden cost, due to the fact that it causes all chunks to become shared between the original sequence and its copy. Thus, subsequent operations on the sequence and its copy are more costly. The last section of this guide offers some more explanations.

In the case of get and set operations, the O notation hides a fairly large constant factor. In the future, we intend to provide efficient iterators, so as to support performing a series of get and set operations in a more efficient way.

We move on to operations on persistent sequences:

P.create has complexity O(1).

P.push has worst-case complexity O(K + log n). However, this complexity is extremely unlikely to be observed. In fact, the total cost of k successive P.push operations is bounded by O(K + log n + k). Furthermore, in a series of push operations, starting from an empty sequence, the amortized cost of one push operation is only O(1). Indeed, in that scenario, no copying of chunks is required.

P.pop has worst-case complexity O(log n). In fact, the total cost for k successive P.pop operations is bounded by O(log n + k). A P.pop operation never requires copying a chunk.

P.get has complexity O(log n). More precisely, as in the case of E.get, its complexity is O(log (min (i, n - i))), which means that accessing an element near one end of the sequence is cheaper than accessing an element somewhere in the middle.

P.set has complexity O(K + log n), or O(log (min (i, n - i))). P.set always costs at least O(K), because a chunk must always be copied.

We end this review with a discussion of the conversion functions:

snapshot_and_clear has amortized complexity O(1), while edit and snapshot have complexity O(K). In other words, these conversions are cheap: their complexity is not O(n). Because snapshot_and_clear s is faster than snapshot s and does not cause s to become shared, it should be preferred to snapshot when possible.

To conclude this complexity guide, let us give some explanations about the representation of sequences in memory, which helps understand the cost of the operations.

The ephemeral data structure and the persistent data structure have the same representation, up to a few variations that need not be discussed here. This allows the main two conversion operations, namely snapshot_and_clear and edit, to be extremely efficient: their time complexity is O(K), regardless of the number of elements n in the data structure.

The operation E.copy, which creates a copy of an ephemeral sequence, exploits the same mechanism: the chunks are shared between the original sequence and the copy. Its time complexity is O(K).

Naturally, this efficiency comes at a cost. When a chunk is shared between several ephemeral or persistent data structures, its content cannot be modified in arbitrary ways. If one is not careful, an operation on a sequence s could have unintended effects on other sequences that share some chunks with s.

Thus, internally, the data structure keeps track of which chunks are definitely uniquely owned and which chunks are possibly shared.

The chunks that participate in a persistent sequence are always regarded as shared. Chunks that participate in an ephemeral sequence s may be either uniquely owned by s or shared with other (ephemeral or persistent) sequences.

Operations on uniquely owned chunks are performed in place, whereas operations on shared chunks may involve a copy-on-write operation.

It should now be clear why a copy operation, such as E.copy, has a hidden cost. Indeed, after the operation, both the original sequence and its copy lose the unique ownership of their chunks. This implies that many subsequent operations on the original sequence and on its copy will be slower than they could have been if no copy had taken place.

package sek