Library

Module

Module type

Parameter

Class

Class type

sectionYPositions = computeSectionYPositions($el), 10)" x-init="setTimeout(() => sectionYPositions = computeSectionYPositions($el), 10)">

On This Page

Legend:

Library

Module

Module type

Parameter

Class

Class type

Library

Module

Module type

Parameter

Class

Class type

This library offers efficient implementations of ephemeral sequences and persistent sequences, together with efficient conversions between these two data structures.

The following type abbreviations help give readable types to some operations on sequences.

An index into a sequence is an integer. It is comprised between 0 (included) and the length of the sequence (excluded or included, depending on the circumstances).

An array segment is described by a triple of an array `a`

, a start index `i`

, and a length `k`

.

`type 'a segments = ('a segment -> unit) -> unit`

A sequence of array segments is represented by a higher-order iterator.

A sequence of pairs of array segments of matching lengths is represented by a higher-order iterator.

The result of a comparison is encoded as an integer value. A negative value means "less than"; zero means "equal"; a positive value means "greater than".

`module type ITER = sig ... end`

The signature `ITER`

is the core iterator API. It is common to ephemeral and persistent sequences. Please follow the link for details.

`module type ITER_EPHEMERAL = sig ... end`

The signature `ITER_EPHEMERAL`

extends the iterator API with concepts and operations that exist only in the case of iterators over ephemeral sequences. Please follow the link for details.

`module type SEK = sig ... end`

The signature `SEK`

is the public API of the library. If you are a new user, you do *not* need to follow this link: the library's API appears below anyway. Just read on!

The submodules `Ephemeral`

and `Persistent`

offer implementations of ephemeral (mutable) and persistent (immutable) sequences.

`val front : side`

`val back : side`

A side appears as a parameter to several operations, such as `push`

and `pop`

, which can operate at either end of a sequence.

`val forward : direction`

`val backward : direction`

A direction appears as a parameter to several operations, such as `iter`

, which can traverse the sequence either forward (from front to back) or backward (from back to front).

`val sign : direction -> int`

`sign forward`

is `+1`

. `sign backward`

is `-1`

.

The exception `Empty`

is raised by `pop`

and `peek`

operations when applied to an empty sequence.

The exception `End`

is raised by the iterator operations `get`

, `get_segment`

, `get_and_move`

, and `get_segment_and_jump`

when the iterator has moved past the end of the sequence.

`module Ephemeral : sig ... end`

The submodule `Ephemeral`

, also available under the name `E`

, offers an implementation of ephemeral (mutable) sequences. Please follow the link for details.

`module Persistent : sig ... end`

The submodule `Persistent`

, also available under the name `P`

, offers an implementation of persistent (immutable) sequences. Please follow the link for details.

`module P = Persistent`

`P`

is a short name for the submodule `Persistent`

.

The following functions offer fast conversions between ephemeral and persistent sequences.

`val snapshot : 'a Ephemeral.t -> 'a Persistent.t`

`snapshot s`

constructs and returns a persistent sequence whose elements are the elements of `s`

. It is less efficient than `snapshot_and_clear`

, whose use should be preferred, when possible.

Time complexity: *O(K)*. Note that this operation introduces sharing: therefore, it may increase the cost of subsequent operations.

`val snapshot_and_clear : 'a Ephemeral.t -> 'a Persistent.t`

`snapshot_and_clear s`

constructs and returns a persistent sequence whose elements are the elements of an ephemeral sequence `s`

. As a side effect, it clears `s`

.

Time complexity: *O(min(K,n))*. In other words, the cost is always *O(K)* and, in the specific case of short sequences, it is only *O(n)*.

In the particular case where the ephemeral sequence `s`

has been constructed by applying a series of `push`

operations, the cost of `snapshot_and_clear`

may be charged to these `push`

operations, allowing one to consider that `snapshot_and_clear`

costs *O(1)*.

`val edit : 'a Persistent.t -> 'a Ephemeral.t`

`edit s`

constructs and returns a new ephemeral sequence whose elements are the elements of `s`

.

Time complexity: *O(K)*.

`module Emulated : sig ... end`

The submodule `Emulated`

contains Sek-based replacements for several modules in OCaml's standard library: `Array`

, `List`

, `Queue`

, `Stack`

.

The function call `released()`

does nothing if the library was compiled in release mode, and fails (with an assertion failure) if the library was compiled with assertions enabled.

`module Segment : sig ... end`

The module `Segment`

offers facilities for working with array segments. An array segment is a triple of an array, a start index, and a length.

The following settings can be controlled when the functor `Make`

is applied.

A sequence is represented in memory by a complex data structure that involves chunks, that is, arrays of elements. The data structure is organized in several layers. In the outermost layer, at depth `0`

, chunks of elements are used. In the next layer, at depth `1`

, chunks of chunks of elements are used, and so on: at depth `k+1`

, chunks whose elements are depth-`k`

chunks are used.

The function `capacity`

, whose type is given by the signature `CAPACITY`

, determines the desired capacity of these chunks. This capacity may depend on the depth.

`module type CAPACITY = sig ... end`

The Boolean flag `overwrite_empty_slots`

, described by the signature `OVERWRITE_EMPTY_SLOTS`

, determines whether the content of a just-emptied slot in an ephemeral sequence should be overwritten with the default value that was supplied when the sequence was created.

Setting this flag to `true`

is safe. This is its default value.

Setting this flag to `false`

can save time but can also cause a memory leak, because the obsolete value stored in the slot remains reachable and cannot be collected.

`module type OVERWRITE_EMPTY_SLOTS = sig ... end`

A persistent sequence whose length is less than or equal to a certain threshold can be represented in a simple and compact way (for instance, using an immutable array).

The integer value `threshold`

, described by the signature is `THRESHOLD`

, determines this threshold.

`module type THRESHOLD = sig ... end`

An iterator on an ephemeral sequence is invalidated by most of the operations that affect the sequence. An invalidated iterator can no longer be used.

The flag `check_iterator_validity`

determines whether a use of an invalidated iterator should be detected at runtime. When this flag is `true`

, appropriate runtime checks are enabled, so that an attempt to use an invalidated iterator is detected and gives rise to an `Invalid_argument`

exception. This is the default value. When this flag is `false`

, the runtime checks are disabled, so a violation is not detected, and may cause arbitrary behavior. Setting this flag to `false`

allows a small gain in performance, but is more dangerous.

`module type CHECK_ITERATOR_VALIDITY = sig ... end`

`Make`

The functor `Make`

constructs an implementation of the signature `SEK`

, and allows the user to choose the value of the settings described above. Be warned, however, that the number and the nature of the settings expected by this functor may change in the future. A relatively forward-compatible way of using this functor is to always pass it a structure that is built by including the structure `DefaultSettings`

and overriding the definition of one or more settings.

`module DefaultSettings : sig ... end`

The module `DefaultSettings`

provides a set of recommended settings for the functor `Make`

.

`SupplyDefault`

Every operation that constructs a sequence out of something other than a sequence requires the caller to supply a `default`

value, which serves no visible purpose, but is used internally to initialize empty slots and (optionally) to overwrite a slot that becomes empty.

Because the presence of this argument can be bothersome, we provide a way of removing it by supplying a `default`

value once and for all. To do so, apply the functor `SupplyDefault`

to `Sek`

(or to the result of `Make`

), like so:

```
module S =
SupplyDefault(Sek)(struct type element = int let default = 0 end)
```

`module SupplyDefault (S : sig ... end) (D : sig ... end) : sig ... end`

This section offers a high-level overview of the representation of sequences. We aim to give programmers a mental model of the time complexity of each operation and of the space usage of sequences in the heap.

The elements of a sequence are stored internally in *chunks*, that is, arrays of a fixed capacity *K*. This is why this data structure is known as a *chunk sequence*. A larger value of *K* speeds up certain operations and reduces memory usage, but slows down other operations. The default value of *K* is 128.

More precisely, a sequence is represented using a *front chunk*, which stores the first *K* elements of the sequence, a *back chunk*, which stores the last *K* elements of the sequence, and a *middle sequence*, a sequence of chunks, which stores the remaining elements of the sequence.

As long as no concatenation operations are performed, all of the chunks in the middle sequence are *full*, that is, they contain exactly *K* elements. If concatenation operations are used, then some chunks in the middle sequence may not be full, that is, may contain fewer than *K* elements. Still, these chunks cannot be too sparsely populated: it is guaranteed that two consecutive chunks in the middle sequence always contain more than *K* elements. Thus, it is never possible to fit the contents of two consecutive chunks into just one. This guarantees that these chunks are on average at least half full.

As long as no concatenation operations are performed, the worst-case space usage of a sequence of length *n* is *(1+10/K) * n + O(K)* words. If concatenation operations are used, this bound is doubled and becomes *2 * (1+10/K) * n + O(K)* words. Yet, this bound is unlikely to be reached in practice: one would need to repeatedly meet worst-case scenarios during a series of successive concatenation operations.

As explained above, the middle sequence is a sequence of chunks. How can one represent it efficiently? The answer, naturally, is to use the same idea as in the outermost layer, and to group these chunks together in *chunks of chunks*. Therefore, the middle sequence itself begins with a front chunk (a chunk of chunks), ends with a back chunk (a chunk of chunks), and in the middle, contains a middle sequence (of chunks of chunks). The construction continues down in the same way: this is a "polymorphic recursion" pattern.

Whereas the chunks of elements used in the outermost layer have capacity *K*, the chunks of chunks, chunks of chunks of chunks, and so on, used in the inner layers have capacity *K'*. To optimize the cost of concatenation and splitting, it is desirable for *K'* to be relatively small. The default value of *K'* is 16.

The number of layers in the data structure is logarithmic, and in practice, is typically very small. Because chunks are at least half-full on average, a data structure of depth *d* stores at least *(K/2) (K'/2)^d* elements. This means that the depth does not exceed *D = 1 + log_(K'/2) (n/(K/2))*. The depth *D* is thus *O(log n)*. For the default values *K = 128* and *K' = 16*, the depth of a sequence of up to 1 billion elements is at most 8, and the depth of a sequence of up to 18 million billion elements is at most 16.

In the presentation of the complexity bounds, we treat *K* and *n* as parameters, which means that *O(K)* and *O(n)* are not *O(1)*. However, because *K'* is small, we treat it as a constant: we view *O(K')* as *O(1)*.

Many operations that traverse the data structure have cost *O(D)*, which is also *O(log n)*. In several places, to simplify the statement of the complexity bounds, we make the hypothesis *D <= K*. This means that our analysis is restricted to a domain where the parameters *n* and *K* are correlated, and *n* is not allowed to grow unreasonably large relative to *K*. The numbers presented above show that, with the default values of *K* and *K'* and for all practical values of *n*, we are indeed within this domain.

Below a certain threshold *T*, persistent sequences are represented in a more compact form: a persistent sequence of length less than or equal to *T* is represented as an immutable array of length *n*. The default value of *T* is 64.

This representation is optimal in terms of space, but suffers from an important drawback in terms of time: it takes time *O(T^2)* to build a persistent sequence of length *T* through a series of persistent `push`

operations. For this reason, this representation may change in the future.

In most situations, the programmer can work around this issue by first building an ephemeral sequence through a series of ephemeral `push`

operations, then converting it to a persistent sequence, for a total cost of *O(T+K)*.

In order to simplify our time complexity bounds, we assume that *T* is *O(K)*. Besides, our space complexity bound for persistent sequences (given above) requires *K* to be *O(T)*. Thus, we need *T* and *K* to be reasonably close to each other.

The implementation of ephemeral sequences is carefully designed so as to minimize the constant factors associated with the operations `E.push`

and `E.pop`

and with the operations that allow iteration, such as `E.Iter.get`

and `E.Iter.move`

. As far as the performance of these operations is concerned, ephemeral sequences aim to be competitive with vectors (that is, resizable arrays).

The implementation of persistent sequences is carefully designed so as to minimize the average cost of the operations `P.push`

and `P.pop`

and the constant factors involved in iterating over elements. As far as the performance of these operations is concerned, persistent sequences aim to be as close as possible to linked lists, although it is quite likely that linked lists will always be more efficient when dealing with short-lived, short sequences.

The ephemeral data structure and the persistent data structure have the same representation of their middle sequence, allowing the main two conversion operations, namely `snapshot_and_clear`

and `edit`

, to be extremely efficient: their time complexity is *O(K)*, regardless of the number of elements *n* in the data structure.

The operation `E.copy`

, which creates a copy of an ephemeral sequence, exploits the same mechanism when its `mode`

parameter is ``Share`

. In this mode, the chunks are *shared* between the original sequence and the copy. The time complexity is then *O(K)*.

Naturally, this efficiency comes at a cost. When a chunk is shared between several ephemeral or persistent data structures, its content cannot be modified in arbitrary ways. If one were not careful, an operation on a sequence `s`

could have unintended effects on other sequences that share some chunks with `s`

.

Thus, internally, the data structure keeps track of which chunks are definitely *uniquely owned* and which chunks are possibly *shared*.

The chunks that participate in a persistent sequence are always regarded as shared. Chunks that participate in an ephemeral sequence `s`

may be either uniquely owned by `s`

or shared with other (ephemeral or persistent) sequences.

Operations on uniquely-owned chunks are performed in place, whereas operations on shared chunks may involve a copy-on-write operation.

Thus, a copy operation such as `let s' = E.copy ~mode:`Share s`

has a latent cost. Indeed, after the operation, both the original sequence `s`

and its copy `s'`

lose the unique ownership of their chunks. This implies that subsequent operations on `s`

and `s'`

will be slower than they would have been if no copy had taken place or if `~mode:`Copy`

had been used.

On This Page