Page
Library
Module
Module type
Parameter
Class
Class type
Source
BstrSourceA small library for manipulating bigstrings.
A bigstring is a mutable data structure that contains a fixed-length sequence of bytes. Each byte can be indexed in constant time for reading and writing.
Given a byte sequence bstr of length len, we can access each of the len bytes of bstr via its index in the sequence. Indexes start at 0, and will call an index valid in bstr if it falls within the range [0...len-1] (inclusive). A position is the point between two bytes or at the beginning or end of the sequence. We call a position valid in bstr if it falls within the range [0...len] (inclusive). Note that byte at index n is between positions n and n+1.
Two parameters off and len are said to designate a valid range of bstr if len >= 0 and off and off+len are valid positions in bstr.
Byte sequences can be modified in place, for instance via the set and blit functions described below.
Bigstring is a specialised version of Bigarray that not only handles bytes in the form of Character but also imposes a "C-like" (see Bigarray.c_layout) view as described above and allows common functions such as memcpy(3) or memmove(3) to be offered.
For more details about Bigstrings and Bigarrays, we invite you to read the Bigarray documentation, which offers more general functions that can be applied to Bigstrings.
Like bytes, a bigstring is a mutable data structure that contains a fixed-length sequence of bytes. However, a bigstring has a few special features that can make it more interesting to use than bytes.
A bigstring is not allocated in the same way as a standard OCaml value. In fact, the byte sequence that the bigstring refers to is found in the C heap (rather than the OCaml heap). This means that the byte sequence can come from a malloc(3) or a function requesting a particular memory area from the system such as Unix.map_file.
This particularity has an implication with the GC: the byte sequence is not relocatable. That is to say that during the cycle of the Garbage Collector, this byte sequence does not move — in contrast, a bytes can be moved by the GC (typically, from the minor heap to the major heap).
Thus, bigstrings have advantages and disadvantages compared to bytes due to this particularity:
malloc(3)/create or Unix.map_file, creating a bigstring will always be more expensive than creating bytes with OCaml. For small byte sequences, it is therefore preferable to use bytes.Threads and/or Domains without interacting with the GC. An example is being able to perform a complex computation in parallel from the bytes of this sequence without blocking the Garbage Collector during this computation.Depending on these characteristics, it may be more advantageous to use a bigstring rather than bytes. This basically depends on your usage, and the special features of bigstrings can unlock opportunities to outperform byte calculations or analysis.
Another advantage of bigstrings is that copying is avoided when extracting part of a larger bigstring. This is because the sub function returns a "proxy" of the original bigstring.
In this respect, and to be very precise, sub avoids copying but the creation of this "proxy" remains costly. In addition, this library is distributed with a new Slice_bstr module. The latter offers a new type whose Slice_bstr.sub function is much less costly than sub.
create len returns a new byte sequence of length len. The sequence is unitialized and contains arbitrary bytes.
make len chr is t of length len with each index holding the character chr.
copy t returns a new byte sequence that contains the same bytes as the argument.
init len fn returns a fresh byte sequence of length len, with character idx initialized to the result of fn idx (in increasing index order).
of_string str returns a new t that contains the contents of the given string str.
string ~off ~len str is the sub-buffer of str that starts at position off (defaults to 0) and stops at position off + len (defaults to String.length str). str is fully-replaced by a fresh allocated t.
sub_string bstr ~off ~len returns a string of length len containing the bytes of bstr starting at off.
to_string bstr is equivalent to sub_string bstr ~off:0 ~len:(length bstr).
get bstr i is the byte of bstr' at index i. This is equivalent to the bstr.{i} notation.
set t i chr modifies t in place, replacing the byte at index i with chr.
unsafe_get t idx is like get except no bounds checking is performed.
unsafe_set t idx chr is like set except no bounds checking is performed.
chop bstr returns the first element of bstr or the last element if rev = true. If bstr is empty, it returns None.
concat sep ts concatenates the list of bigstrings ts, inserting the separator string sep between each.
extend bstr left right returns a new bigstring that contains the bytes of bstr, with left zero bytes prepended and right zero byte appended to it. If left or right is negative, then bytes are removed (instead of appended) from the corresponding side of bstr.
blit src ~src_off dst ~dst_off ~len copies len bytes from byte sequence src, starting at index src_off, to byte sequence dst, starting at index dst_off. It works correctly even if src and dst are (physically) the same byte sequence, and the source and destination intervals overlap.
blit_to_bytes src ~src_off dst ~dst_off ~len copies len bytes from src, starting at index src_off, to byte sequence dst, starting at index dst_off.
Note: since it is impossible for src to overlap dst, memcpy is used to do the copy.
memcpy_mmaped is like memcpy but src and dst can be a mmaped bigarray (from Unix.map_file). In this specific case, copying from one to the other can take some time because it involves reading/writing to disk. The operation can take longer than if the two bigarrays were allocated via malloc()/Bigarray.Array1.create.
It may therefore be worthwhile to release the GC lock so that this specific operation can be carried out in parallel (in a Thread) without interruption by the GC.
Note that the bigarrays do not necessarily need to be mmaped. This function also applies to "normal" bigarrays. It may also be worthwhile to use this function if you know that you are copying a large area and would like to do it in parallel (in a Thread).
memmove src ~src_off dst ~dst_off ~len copies len bytes from src to dst. src and dst may overlap: copying takes place as though the bytes in src are first copied into a temporary array that does not overlap src or dst, and the bytes are then copied from the temporary array to dst.
memmove_mmaped is like memmove but src and dst can be a mmaped bigarray (from Unix.map_file). In this specific case, copying from one to the other can take some time because it involves reading/writing to disk. The operation can take longer than if the two bigarrays were allocated via malloc()/Bigarray.Array1.create.
It may therefore be worthwhile to release the GC lock so that this specific operation can be carried out in parallel (in a Thread) without interruption by the GC.
Note that the bigarrays do not necessarily need to be mmaped. This function also applies to "normal" bigarrays. It may also be worthwhile to use this function if you know that you are copying a large area and would like to do it in parallel (in a Thread).
memcmp s1 ~src_off s2 ~dst_off ~len compares the first len bytes of the memory areas s1 (starting at src_off) and s2 (starting at dst_off).
memcmp returns 0 is s1 and s2 don't match.
memset t ~off ~len chr fills len bytes (starting at off) into t with the constant byte chr.
fill t off len chr modifies t in place, replacing len characters with chr, starting at off.
get_int8 bstr i is bstr's signed 8-bit integer starting at byte index i.
get_uint8 bstr i is bstr's unsigned 8-bit integer starting at byte index i.
get_int16_ne bstr i is bstr's native-endian unsigned 16-bit integer starting at byte index i.
get_int16_le bstr i is bstr's little-endian unsigned 16-bit integer starting at byte index i.
get_int16_be bstr i is bstr's big-endian unsigned 16-bit integer starting at byte index i.
get_int16_ne bstr i is bstr's native-endian signed 16-bit integer starting at byte index i.
get_int16_le bstr i is bstr's little-endian signed 16-bit integer starting at byte index i.
get_int16_be bstr i is bstr's big-endian signed 16-bit integer starting at byte index i.
get_int32_ne bstr i is bstr's native-endian 32-bit integer starting at byte index i.
get_int32_le bstr i is bstr's little-endian 32-bit integer starting at byte index i.
get_int32_be bstr i is bstr's big-endian 32-bit integer starting at byte index i.
get_int64_ne bstr i is bstr's native-endian 64-bit integer starting at byte index i.
get_int64_le bstr i is bstr's little-endian 64-bit integer starting at byte index i.
get_int64_be bstr i is bstr's big-endian 64-bit integer starting at byte index i.
set_int8 t i v sets t's signed 8-bit integer starting at byte index i to v.
set_uint8 t i v sets t's unsigned 8-bit integer starting at byte index i to v.
set_uint16_ne t i v sets t's native-endian unsigned 16-bit integer starting at byte index i to v.
set_uint16_le t i v sets t's little-endian unsigned 16-bit integer starting at byte index i to v.
set_uint16_le t i v sets t's big-endian unsigned 16-bit integer starting at byte index i to v.
set_uint16_ne t i v sets t's native-endian signed 16-bit integer starting at byte index i to v.
set_uint16_le t i v sets t's little-endian signed 16-bit integer starting at byte index i to v.
set_uint16_le t i v sets t's big-endian signed 16-bit integer starting at byte index i to v.
set_int32_ne t i v sets t's native-endian 32-bit integer starting at byte index i to v.
set_int32_ne t i v sets t's little-endian 32-bit integer starting at byte index i to v.
set_int32_ne t i v sets t's big-endian 32-bit integer starting at byte index i to v.
set_int32_ne t i v sets t's native-endian 64-bit integer starting at byte index i to v.
set_int32_ne t i v sets t's little-endian 64-bit integer starting at byte index i to v.
set_int32_ne t i v sets t's big-endian 64-bit integer starting at byte index i to v.
sub bstr ~off ~len does not allocate a bigstring, but instead returns a new view into bstr starting at off, and with length len.
Note sub does not allocate a new buffer, but instead shares the memory area of bstr with the newly-returned bigstring. This means that the changes (set{,_*} functions) made to the returned bigstring will also be reflected in the bstr bigstring given.
Note sub is more expensive than a Slice.sub (about 8 times slower). If you want to focus on performance while avoiding copying, it's best to use a Slice.
shift bstr n is sub bstr n (length bstr - n) (see sub for more details).
overlap x y returns the size (in bytes) of what is physically common between x and y, as well as the position of y in x and the position of x in y.
is_prefix ~affix bstr is true iff affix.[idx] = bstr.{idx} for all indices idx of affix.
is_infix ~affix bstr is true iff there exists an index j in bstr such that for all indices i of affix we have affix.[i] = bstr.{j + i}.
is_suffix ~affix bstr is true iff affix.[n - idx] = bstr.{m - idx} for all indices idx of affix with n = String.length affix - 1 and m = length bstr - 1.
for_all p bstr is true iff for all indices idx of bstr, p bstr.{idx} = true.
contains bstr ?off ?len chr is true if and only if chr appears in len byte(s)'s bstr after position off (defaults to 0).
constant_equal gives the same result as equal but the execution time of the function, whether or not the two values are equivalent (as long as they have the same size) is the same.
Indeed, the equal function ends as soon as a difference exists. This function continues even if a difference exists. This function is useful when comparing passwords — and avoiding an timing attack.
compare bstr0 bstr1 sorts bstr0 and bstr1 in lexicographical order.
index bstr ?off ?len chr is the index of the first occurrence of chr in len byte(s)'s bstr after position off (defaults to 0). If chr does not occur in given range of bstr, we return None.
memchr t ~off ~len chr scans len bytes (starting at off) of t for the first instance of chr. It returns the position in t where the first occurrence of chr is found. Otherwise, it returns -1.
with_range ~first ~len bstr are the consecutive bytes of bstr whose indices exist in the range [first;first + len - 1].
first defaults to 0 and len to max_int. Note that first can be any integer and len any positive integer.
with_index_range ~first ~last bstr are the consecutive bytes of bstr whose indices exists in the range [first;last].
first defaults to 0 and last to length bstr - 1.
Note that both first and last can be any integer. If first > last the interval is empty and the empty bigstring is returned.
trim ~drop bstr is bstr with prefix and suffix bytes satisfying drop in bstr removed. drop defaults to fun chr -> chr = ' '.
span ~rev ~min ~max ~sat bstr is (l, r) where:
rev is false (default), l is at least min and at most max consecutive sat satisfying initial bytes of bstr or empty if there are no such bytes. r are the remaining bytes of bstr.rev is true, r is at least min and at most max consecutive sat satisfying final bytes of bstr or empty if there are no such bytes. l are the remaining bytes of bstr.If max is unspecified the span is unlimited. If min is unspecified it defaults to 0. If min > max the condition can't be satisfied and the left or right span, depending on rev, is always empty. sat defaults to Fun.const true.
take ~rev ~min ~max ~sat bstr is the matching span of span without the remaining one. In other words:
  (if rev then snd else fst) (span ~rev ~min ~max ~sat bstr)drop ~rev ~min ~max ~sat bstr is the remaining span of span without the matching span. In other words:
  (if rev then fst else snd) (span ~rev ~min ~max ~sat bstr)cut ~sep bstr is either the pair Some (l, r) of the two (possibly empty) sub-buffers of bstr that are delimited by the first match of the non empty separator string sep or None if sep can't be matched in bstr. Matching starts from the beginning of bstr (rev is false, default) or the end (rev is true).
The invariant l ^ sep ^ r = s holds.
For instance, the ABNF expression:
field_name := *PRINT field_value := *ASCII field := field_name ":" field_value
can be translated to:
  match Bstr.cut ~sep:":" value with
  | Some (field_name, field_value) -> ...
  | None -> invalid_arg "Invalid field"split_on_char sep t is the list of all (possibly empty) sub-bigstrings of t that are delimited by the character sep. If t is empty, the result is the singleton list [empty].
The function's result is specified by the following invariant:
sep as a separator returns a bigstring equal to the input.sep character.iter fn t applies function fn in turn to all the characters of t. It is equivalent to fn t.{0}; fn t.{1}; ...; fn t.{length t - 1}; ().
Iterate on the bigstring, in increasing index order. Modifications of the bigstring during iteration will be reflected in the sequence.
Iterate on the bigstring, in increasing order, yielding indices along chars.