package bstr

  1. Overview
  2. Docs

Module BstrSource

A small library for manipulating bigstrings.

A bigstring is a mutable data structure that contains a fixed-length sequence of bytes. Each byte can be indexed in constant time for reading and writing.

Given a byte sequence bstr of length len, we can access each of the len bytes of bstr via its index in the sequence. Indexes start at 0, and will call an index valid in bstr if it falls within the range [0...len-1] (inclusive). A position is the point between two bytes or at the beginning or end of the sequence. We call a position valid in bstr if it falls within the range [0...len] (inclusive). Note that byte at index n is between positions n and n+1.

Two parameters off and len are said to designate a valid range of bstr if len >= 0 and off and off+len are valid positions in bstr.

Byte sequences can be modified in place, for instance via the set and blit functions described below.

Bigstrings & Bigarrays.

Bigstring is a specialised version of Bigarray that not only handles bytes in the form of Character but also imposes a "C-like" (see Bigarray.c_layout) view as described above and allows common functions such as memcpy(3) or memmove(3) to be offered.

For more details about Bigstrings and Bigarrays, we invite you to read the Bigarray documentation, which offers more general functions that can be applied to Bigstrings.

Bigstrings & Bytes.

Like bytes, a bigstring is a mutable data structure that contains a fixed-length sequence of bytes. However, a bigstring has a few special features that can make it more interesting to use than bytes.

Bigstrings and the Garbage Collector.

A bigstring is not allocated in the same way as a standard OCaml value. In fact, the byte sequence that the bigstring refers to is found in the C heap (rather than the OCaml heap). This means that the byte sequence can come from a malloc(3) or a function requesting a particular memory area from the system such as Unix.map_file.

This particularity has an implication with the GC: the byte sequence is not relocatable. That is to say that during the cycle of the Garbage Collector, this byte sequence does not move — in contrast, a bytes can be moved by the GC (typically, from the minor heap to the major heap).

Thus, bigstrings have advantages and disadvantages compared to bytes due to this particularity:

  • Creating a bigstring can be expensive. Whether it is with malloc(3)/create or Unix.map_file, creating a bigstring will always be more expensive than creating bytes with OCaml. For small byte sequences, it is therefore preferable to use bytes.
  • Since a bigstring cannot be moved, its position can be shared by Threads and/or Domains without interacting with the GC. An example is being able to perform a complex computation in parallel from the bytes of this sequence without blocking the Garbage Collector during this computation.

Depending on these characteristics, it may be more advantageous to use a bigstring rather than bytes. This basically depends on your usage, and the special features of bigstrings can unlock opportunities to outperform byte calculations or analysis.

Bigstring and slice.

Another advantage of bigstrings is that copying is avoided when extracting part of a larger bigstring. This is because the sub function returns a "proxy" of the original bigstring.

In this respect, and to be very precise, sub avoids copying but the creation of this "proxy" remains costly. In addition, this library is distributed with a new Slice_bstr module. The latter offers a new type whose Slice_bstr.sub function is much less costly than sub.

Bigstrings.

Constructors.

Sourceval empty : t

empty is an empty bigstring.

Sourceval create : int -> t

create len returns a new byte sequence of length len. The sequence is unitialized and contains arbitrary bytes.

Sourceval make : int -> char -> t

make len chr is t of length len with each index holding the character chr.

Sourceval copy : t -> t

copy t returns a new byte sequence that contains the same bytes as the argument.

Sourceval init : int -> (int -> char) -> t

init len fn returns a fresh byte sequence of length len, with character idx initialized to the result of fn idx (in increasing index order).

Memory-safe Operations.

Sourceval of_string : string -> t

of_string str returns a new t that contains the contents of the given string str.

Sourceval string : ?off:int -> ?len:int -> string -> t

string ~off ~len str is the sub-buffer of str that starts at position off (defaults to 0) and stops at position off + len (defaults to String.length str). str is fully-replaced by a fresh allocated t.

Sourceval sub_string : t -> off:int -> len:int -> string

sub_string bstr ~off ~len returns a string of length len containing the bytes of bstr starting at off.

Sourceval to_string : t -> string

to_string bstr is equivalent to sub_string bstr ~off:0 ~len:(length bstr).

Sourceval length : t -> int

length bstr is the number of bytes in bstr.

Sourceval get : t -> int -> char

get bstr i is the byte of bstr' at index i. This is equivalent to the bstr.{i} notation.

Sourceval set : t -> int -> char -> unit

set t i chr modifies t in place, replacing the byte at index i with chr.

Sourceval unsafe_get : t -> int -> char

unsafe_get t idx is like get except no bounds checking is performed.

Sourceval unsafe_set : t -> int -> char -> unit

unsafe_set t idx chr is like set except no bounds checking is performed.

Sourceval chop : ?rev:bool -> t -> char option

chop bstr returns the first element of bstr or the last element if rev = true. If bstr is empty, it returns None.

Sourceval concat : string -> t list -> t

concat sep ts concatenates the list of bigstrings ts, inserting the separator string sep between each.

Sourceval extend : t -> int -> int -> t

extend bstr left right returns a new bigstring that contains the bytes of bstr, with left zero bytes prepended and right zero byte appended to it. If left or right is negative, then bytes are removed (instead of appended) from the corresponding side of bstr.

Copy operation from one byte sequence to another.

Sourceval blit : t -> src_off:int -> t -> dst_off:int -> len:int -> unit

blit src ~src_off dst ~dst_off ~len copies len bytes from byte sequence src, starting at index src_off, to byte sequence dst, starting at index dst_off. It works correctly even if src and dst are (physically) the same byte sequence, and the source and destination intervals overlap.

  • raises Invalid_argument

    if src_off and len do not designate a valid range of src, or if dst_off and len do not designate a valid range of dst.

Sourceval blit_from_string : string -> src_off:int -> t -> dst_off:int -> len:int -> unit

Just like blit, but with a string as source one.

Note: since it is impossible for src to overlap dst, memcpy is used to do the copy.

  • raises Invalid_argument

    if src_pos and len do not designate a valid range of src, or if dst_off and len do not designate a valid range of dst.

Sourceval blit_from_bytes : bytes -> src_off:int -> t -> dst_off:int -> len:int -> unit

Just like blit, but with a bytes as source one.

Note: since it is impossible for src to overlap dst, memcpy is used to do the copy.

  • raises Invalid_argument

    if src_pos and len do not designate a valid range of src, or if dst_off and len do not designate a valid range of dst.

Sourceval blit_to_bytes : t -> src_off:int -> bytes -> dst_off:int -> len:int -> unit

blit_to_bytes src ~src_off dst ~dst_off ~len copies len bytes from src, starting at index src_off, to byte sequence dst, starting at index dst_off.

Note: since it is impossible for src to overlap dst, memcpy is used to do the copy.

  • raises Invalid_argument

    if src_off and len do not designate a valid range of src, or if dst_off and len do not designate a valid range of dst.

Sourceval memcpy : t -> src_off:int -> t -> dst_off:int -> len:int -> unit

memcpy src ~src_off dst ~dst_off ~len copies len bytes from src to dst. src must not overlap dst. Use memmove if src & dst do overlap.

You can check whether two buffers overlap using overlap. If this returns None, the two values do not refer to a common memory area — and it is safe to use memcpy.

  • raises Invalid_argument

    if src_off and len do not designate a valid range of src, or if dst_off and len do not designate a valid range of dst.

Sourceval memcpy_mmaped : t -> src_off:int -> t -> dst_off:int -> len:int -> unit

memcpy_mmaped is like memcpy but src and dst can be a mmaped bigarray (from Unix.map_file). In this specific case, copying from one to the other can take some time because it involves reading/writing to disk. The operation can take longer than if the two bigarrays were allocated via malloc()/Bigarray.Array1.create.

It may therefore be worthwhile to release the GC lock so that this specific operation can be carried out in parallel (in a Thread) without interruption by the GC.

Note that the bigarrays do not necessarily need to be mmaped. This function also applies to "normal" bigarrays. It may also be worthwhile to use this function if you know that you are copying a large area and would like to do it in parallel (in a Thread).

Sourceval memmove : t -> src_off:int -> t -> dst_off:int -> len:int -> unit

memmove src ~src_off dst ~dst_off ~len copies len bytes from src to dst. src and dst may overlap: copying takes place as though the bytes in src are first copied into a temporary array that does not overlap src or dst, and the bytes are then copied from the temporary array to dst.

  • raises Invalid_argument

    if src_off and len do not designate a valid range of src, or if dst_off and len do not designate a valid range of dst.

Sourceval memmove_mmaped : t -> src_off:int -> t -> dst_off:int -> len:int -> unit

memmove_mmaped is like memmove but src and dst can be a mmaped bigarray (from Unix.map_file). In this specific case, copying from one to the other can take some time because it involves reading/writing to disk. The operation can take longer than if the two bigarrays were allocated via malloc()/Bigarray.Array1.create.

It may therefore be worthwhile to release the GC lock so that this specific operation can be carried out in parallel (in a Thread) without interruption by the GC.

Note that the bigarrays do not necessarily need to be mmaped. This function also applies to "normal" bigarrays. It may also be worthwhile to use this function if you know that you are copying a large area and would like to do it in parallel (in a Thread).

Sourceval memcmp : t -> src_off:int -> t -> dst_off:int -> len:int -> int

memcmp s1 ~src_off s2 ~dst_off ~len compares the first len bytes of the memory areas s1 (starting at src_off) and s2 (starting at dst_off).

memcmp returns 0 is s1 and s2 don't match.

  • raises Invalid_argument

    if src_off and len do not designate a valid range of src, or if dst_off and len do not designate a valid range of dst.

Sourceval memset : t -> off:int -> len:int -> char -> unit

memset t ~off ~len chr fills len bytes (starting at off) into t with the constant byte chr.

Sourceval fill : t -> ?off:int -> ?len:int -> char -> unit

fill t off len chr modifies t in place, replacing len characters with chr, starting at off.

Decode integers from a byte sequence.

Sourceval get_int8 : t -> int -> int

get_int8 bstr i is bstr's signed 8-bit integer starting at byte index i.

Sourceval get_uint8 : t -> int -> int

get_uint8 bstr i is bstr's unsigned 8-bit integer starting at byte index i.

Sourceval get_uint16_ne : t -> int -> int

get_int16_ne bstr i is bstr's native-endian unsigned 16-bit integer starting at byte index i.

Sourceval get_uint16_le : t -> int -> int

get_int16_le bstr i is bstr's little-endian unsigned 16-bit integer starting at byte index i.

Sourceval get_uint16_be : t -> int -> int

get_int16_be bstr i is bstr's big-endian unsigned 16-bit integer starting at byte index i.

Sourceval get_int16_ne : t -> int -> int

get_int16_ne bstr i is bstr's native-endian signed 16-bit integer starting at byte index i.

Sourceval get_int16_le : t -> int -> int

get_int16_le bstr i is bstr's little-endian signed 16-bit integer starting at byte index i.

Sourceval get_int16_be : t -> int -> int

get_int16_be bstr i is bstr's big-endian signed 16-bit integer starting at byte index i.

Sourceval get_int32_ne : t -> int -> int32

get_int32_ne bstr i is bstr's native-endian 32-bit integer starting at byte index i.

Sourceval get_int32_le : t -> int -> int32

get_int32_le bstr i is bstr's little-endian 32-bit integer starting at byte index i.

Sourceval get_int32_be : t -> int -> int32

get_int32_be bstr i is bstr's big-endian 32-bit integer starting at byte index i.

Sourceval get_int64_ne : t -> int -> int64

get_int64_ne bstr i is bstr's native-endian 64-bit integer starting at byte index i.

Sourceval get_int64_le : t -> int -> int64

get_int64_le bstr i is bstr's little-endian 64-bit integer starting at byte index i.

Sourceval get_int64_be : t -> int -> int64

get_int64_be bstr i is bstr's big-endian 64-bit integer starting at byte index i.

Sourceval set_int8 : t -> int -> int -> unit

set_int8 t i v sets t's signed 8-bit integer starting at byte index i to v.

Sourceval set_uint8 : t -> int -> int -> unit

set_uint8 t i v sets t's unsigned 8-bit integer starting at byte index i to v.

Sourceval set_uint16_ne : t -> int -> int -> unit

set_uint16_ne t i v sets t's native-endian unsigned 16-bit integer starting at byte index i to v.

Sourceval set_uint16_le : t -> int -> int -> unit

set_uint16_le t i v sets t's little-endian unsigned 16-bit integer starting at byte index i to v.

Sourceval set_uint16_be : t -> int -> int -> unit

set_uint16_le t i v sets t's big-endian unsigned 16-bit integer starting at byte index i to v.

Sourceval set_int16_ne : t -> int -> int -> unit

set_uint16_ne t i v sets t's native-endian signed 16-bit integer starting at byte index i to v.

Sourceval set_int16_le : t -> int -> int -> unit

set_uint16_le t i v sets t's little-endian signed 16-bit integer starting at byte index i to v.

Sourceval set_int16_be : t -> int -> int -> unit

set_uint16_le t i v sets t's big-endian signed 16-bit integer starting at byte index i to v.

Sourceval set_int32_ne : t -> int -> int32 -> unit

set_int32_ne t i v sets t's native-endian 32-bit integer starting at byte index i to v.

Sourceval set_int32_le : t -> int -> int32 -> unit

set_int32_ne t i v sets t's little-endian 32-bit integer starting at byte index i to v.

Sourceval set_int32_be : t -> int -> int32 -> unit

set_int32_ne t i v sets t's big-endian 32-bit integer starting at byte index i to v.

Sourceval set_int64_ne : t -> int -> int64 -> unit

set_int32_ne t i v sets t's native-endian 64-bit integer starting at byte index i to v.

Sourceval set_int64_le : t -> int -> int64 -> unit

set_int32_ne t i v sets t's little-endian 64-bit integer starting at byte index i to v.

Sourceval set_int64_be : t -> int -> int64 -> unit

set_int32_ne t i v sets t's big-endian 64-bit integer starting at byte index i to v.

Sourceval sub : t -> off:int -> len:int -> t

sub bstr ~off ~len does not allocate a bigstring, but instead returns a new view into bstr starting at off, and with length len.

Note sub does not allocate a new buffer, but instead shares the memory area of bstr with the newly-returned bigstring. This means that the changes (set{,_*} functions) made to the returned bigstring will also be reflected in the bstr bigstring given.

Note sub is more expensive than a Slice.sub (about 8 times slower). If you want to focus on performance while avoiding copying, it's best to use a Slice.

Sourceval shift : t -> int -> t

shift bstr n is sub bstr n (length bstr - n) (see sub for more details).

Sourceval overlap : t -> t -> (int * int * int) option

overlap x y returns the size (in bytes) of what is physically common between x and y, as well as the position of y in x and the position of x in y.

Predicates and comparaisons.

Sourceval is_empty : t -> bool

is_empty bstr is length bstr = 0.

Sourceval is_prefix : affix:string -> t -> bool

is_prefix ~affix bstr is true iff affix.[idx] = bstr.{idx} for all indices idx of affix.

Sourceval starts_with : prefix:t -> t -> bool

starts_with ~prefix t is like is_prefix but the prefix is a t (instead of a string).

Sourceval is_infix : affix:string -> t -> bool

is_infix ~affix bstr is true iff there exists an index j in bstr such that for all indices i of affix we have affix.[i] = bstr.{j + i}.

Sourceval is_suffix : affix:string -> t -> bool

is_suffix ~affix bstr is true iff affix.[n - idx] = bstr.{m - idx} for all indices idx of affix with n = String.length affix - 1 and m = length bstr - 1.

Sourceval ends_with : suffix:t -> t -> bool

ends_with ~suffix t is like is_suffix but the suffix is a t (instead of a string.

Sourceval for_all : (char -> bool) -> t -> bool

for_all p bstr is true iff for all indices idx of bstr, p bstr.{idx} = true.

Sourceval contains : t -> ?off:int -> ?len:int -> char -> bool

contains bstr ?off ?len chr is true if and only if chr appears in len byte(s)'s bstr after position off (defaults to 0).

Sourceval equal : t -> t -> bool

equal a b is a = b.

Sourceval constant_equal : t -> t -> bool

constant_equal gives the same result as equal but the execution time of the function, whether or not the two values are equivalent (as long as they have the same size) is the same.

Indeed, the equal function ends as soon as a difference exists. This function continues even if a difference exists. This function is useful when comparing passwords — and avoiding an timing attack.

Sourceval compare : t -> t -> int

compare bstr0 bstr1 sorts bstr0 and bstr1 in lexicographical order.

Sourceval index : t -> ?off:int -> ?len:int -> char -> int option

index bstr ?off ?len chr is the index of the first occurrence of chr in len byte(s)'s bstr after position off (defaults to 0). If chr does not occur in given range of bstr, we return None.

Sourceval memchr : t -> off:int -> len:int -> char -> int

memchr t ~off ~len chr scans len bytes (starting at off) of t for the first instance of chr. It returns the position in t where the first occurrence of chr is found. Otherwise, it returns -1.

Extracting substrings.

Sourceval with_range : ?first:int -> ?len:int -> t -> t

with_range ~first ~len bstr are the consecutive bytes of bstr whose indices exist in the range [first;first + len - 1].

first defaults to 0 and len to max_int. Note that first can be any integer and len any positive integer.

Sourceval with_index_range : ?first:int -> ?last:int -> t -> t

with_index_range ~first ~last bstr are the consecutive bytes of bstr whose indices exists in the range [first;last].

first defaults to 0 and last to length bstr - 1.

Note that both first and last can be any integer. If first > last the interval is empty and the empty bigstring is returned.

Sourceval trim : ?drop:(char -> bool) -> t -> t

trim ~drop bstr is bstr with prefix and suffix bytes satisfying drop in bstr removed. drop defaults to fun chr -> chr = ' '.

Sourceval span : ?rev:bool -> ?min:int -> ?max:int -> ?sat:(char -> bool) -> t -> t * t

span ~rev ~min ~max ~sat bstr is (l, r) where:

  • if rev is false (default), l is at least min and at most max consecutive sat satisfying initial bytes of bstr or empty if there are no such bytes. r are the remaining bytes of bstr.
  • if rev is true, r is at least min and at most max consecutive sat satisfying final bytes of bstr or empty if there are no such bytes. l are the remaining bytes of bstr.

If max is unspecified the span is unlimited. If min is unspecified it defaults to 0. If min > max the condition can't be satisfied and the left or right span, depending on rev, is always empty. sat defaults to Fun.const true.

Sourceval take : ?rev:bool -> ?min:int -> ?max:int -> ?sat:(char -> bool) -> t -> t

take ~rev ~min ~max ~sat bstr is the matching span of span without the remaining one. In other words:

  (if rev then snd else fst) (span ~rev ~min ~max ~sat bstr)
Sourceval drop : ?rev:bool -> ?min:int -> ?max:int -> ?sat:(char -> bool) -> t -> t

drop ~rev ~min ~max ~sat bstr is the remaining span of span without the matching span. In other words:

  (if rev then fst else snd) (span ~rev ~min ~max ~sat bstr)
Sourceval cut : ?rev:bool -> sep:string -> t -> (t * t) option

cut ~sep bstr is either the pair Some (l, r) of the two (possibly empty) sub-buffers of bstr that are delimited by the first match of the non empty separator string sep or None if sep can't be matched in bstr. Matching starts from the beginning of bstr (rev is false, default) or the end (rev is true).

The invariant l ^ sep ^ r = s holds.

For instance, the ABNF expression:

field_name := *PRINT
field_value := *ASCII
field := field_name ":" field_value

can be translated to:

  match Bstr.cut ~sep:":" value with
  | Some (field_name, field_value) -> ...
  | None -> invalid_arg "Invalid field"
Sourceval split_on_char : char -> t -> t list

split_on_char sep t is the list of all (possibly empty) sub-bigstrings of t that are delimited by the character sep. If t is empty, the result is the singleton list [empty].

The function's result is specified by the following invariant:

  • the list is not empty.
  • concatenating its elements using sep as a separator returns a bigstring equal to the input.
  • no bigstring in the result contains the sep character.

Traversing strings.

Sourceval iter : (char -> unit) -> t -> unit

iter fn t applies function fn in turn to all the characters of t. It is equivalent to fn t.{0}; fn t.{1}; ...; fn t.{length t - 1}; ().

Sourceval to_seq : t -> char Seq.t

Iterate on the bigstring, in increasing index order. Modifications of the bigstring during iteration will be reflected in the sequence.

Sourceval to_seqi : t -> (int * char) Seq.t

Iterate on the bigstring, in increasing order, yielding indices along chars.

Sourceval of_seq : char Seq.t -> t

Create a bigstring from the generator.