Page
Library
Module
Module type
Parameter
Class
Class type
Source
Bstr
SourceA small library for manipulating bigstrings.
A bigstring is a mutable data structure that contains a fixed-length sequence of bytes. Each byte can be indexed in constant time for reading and writing.
Given a byte sequence bstr
of length len
, we can access each of the len
bytes of bstr
via its index in the sequence. Indexes start at 0
, and will call an index valid in bstr
if it falls within the range [0...len-1]
(inclusive). A position is the point between two bytes or at the beginning or end of the sequence. We call a position valid in bstr
if it falls within the range [0...len]
(inclusive). Note that byte at index n
is between positions n
and n+1
.
Two parameters off
and len
are said to designate a valid range of bstr
if len >= 0
and off
and off+len
are valid positions in bstr
.
Byte sequences can be modified in place, for instance via the set
and blit
functions described below.
Bigstring is a specialised version of Bigarray
that not only handles bytes in the form of Char
acter but also imposes a "C-like" (see Bigarray.c_layout
) view as described above and allows common functions such as memcpy(3)
or memmove(3)
to be offered.
For more details about Bigstrings and Bigarrays, we invite you to read the Bigarray
documentation, which offers more general functions that can be applied to Bigstrings.
Like bytes, a bigstring is a mutable data structure that contains a fixed-length sequence of bytes. However, a bigstring has a few special features that can make it more interesting to use than bytes.
A bigstring is not allocated in the same way as a standard OCaml value. In fact, the byte sequence that the bigstring refers to is found in the C heap (rather than the OCaml heap). This means that the byte sequence can come from a malloc(3)
or a function requesting a particular memory area from the system such as Unix.map_file
.
This particularity has an implication with the GC: the byte sequence is not relocatable. That is to say that during the cycle of the Garbage Collector, this byte sequence does not move — in contrast, a bytes
can be moved by the GC (typically, from the minor heap to the major heap).
Thus, bigstrings have advantages and disadvantages compared to bytes due to this particularity:
malloc(3)
/create
or Unix.map_file
, creating a bigstring will always be more expensive than creating bytes with OCaml. For small byte sequences, it is therefore preferable to use bytes.Thread
s and/or Domain
s without interacting with the GC. An example is being able to perform a complex computation in parallel from the bytes of this sequence without blocking the Garbage Collector during this computation.Depending on these characteristics, it may be more advantageous to use a bigstring rather than bytes
. This basically depends on your usage, and the special features of bigstrings can unlock opportunities to outperform byte calculations or analysis.
Another advantage of bigstrings is that copying is avoided when extracting part of a larger bigstring. This is because the sub
function returns a "proxy" of the original bigstring.
In this respect, and to be very precise, sub
avoids copying but the creation of this "proxy" remains costly. In addition, this library is distributed with a new Slice_bstr
module. The latter offers a new type whose Slice_bstr.sub
function is much less costly than sub
.
create len
returns a new byte sequence of length len
. The sequence is unitialized and contains arbitrary bytes.
make len chr
is t
of length len
with each index holding the character chr
.
copy t
returns a new byte sequence that contains the same bytes as the argument.
init len fn
returns a fresh byte sequence of length len
, with character idx
initialized to the result of fn idx
(in increasing index order).
of_string str
returns a new t
that contains the contents of the given string str
.
string ~off ~len str
is the sub-buffer of str
that starts at position off
(defaults to 0
) and stops at position off + len
(defaults to String.length str
). str
is fully-replaced by a fresh allocated t
.
sub_string bstr ~off ~len
returns a string of length len
containing the bytes of bstr
starting at off
.
to_string bstr
is equivalent to sub_string bstr ~off:0 ~len:(length bstr)
.
get bstr i
is the byte of bstr
' at index i
. This is equivalent to the bstr.{i}
notation.
set t i chr
modifies t
in place, replacing the byte at index i
with chr
.
unsafe_get t idx
is like get
except no bounds checking is performed.
unsafe_set t idx chr
is like set
except no bounds checking is performed.
chop bstr
returns the first element of bstr
or the last element if rev = true
. If bstr
is empty, it returns None
.
concat sep ts
concatenates the list of bigstrings ts
, inserting the separator string sep
between each.
extend bstr left right
returns a new bigstring that contains the bytes of bstr
, with left
zero bytes prepended and right
zero byte appended to it. If left
or right
is negative, then bytes are removed (instead of appended) from the corresponding side of bstr
.
blit src ~src_off dst ~dst_off ~len
copies len
bytes from byte sequence src
, starting at index src_off
, to byte sequence dst
, starting at index dst_off
. It works correctly even if src
and dst
are (physically) the same byte sequence, and the source and destination intervals overlap.
blit_to_bytes src ~src_off dst ~dst_off ~len
copies len
bytes from src
, starting at index src_off
, to byte sequence dst
, starting at index dst_off
.
Note: since it is impossible for src
to overlap dst
, memcpy
is used to do the copy.
memcpy_mmaped
is like memcpy
but src
and dst
can be a mmaped bigarray (from Unix.map_file
). In this specific case, copying from one to the other can take some time because it involves reading/writing to disk. The operation can take longer than if the two bigarrays were allocated via malloc()
/Bigarray.Array1.create
.
It may therefore be worthwhile to release the GC lock so that this specific operation can be carried out in parallel (in a Thread
) without interruption by the GC.
Note that the bigarrays do not necessarily need to be mmaped. This function also applies to "normal" bigarrays. It may also be worthwhile to use this function if you know that you are copying a large area and would like to do it in parallel (in a Thread
).
memmove src ~src_off dst ~dst_off ~len
copies len
bytes from src
to dst
. src
and dst
may overlap: copying takes place as though the bytes in src
are first copied into a temporary array that does not overlap src
or dst
, and the bytes are then copied from the temporary array to dst
.
memmove_mmaped
is like memmove
but src
and dst
can be a mmaped bigarray (from Unix.map_file
). In this specific case, copying from one to the other can take some time because it involves reading/writing to disk. The operation can take longer than if the two bigarrays were allocated via malloc()
/Bigarray.Array1.create
.
It may therefore be worthwhile to release the GC lock so that this specific operation can be carried out in parallel (in a Thread
) without interruption by the GC.
Note that the bigarrays do not necessarily need to be mmaped. This function also applies to "normal" bigarrays. It may also be worthwhile to use this function if you know that you are copying a large area and would like to do it in parallel (in a Thread
).
memcmp s1 ~src_off s2 ~dst_off ~len
compares the first len
bytes of the memory areas s1
(starting at src_off
) and s2
(starting at dst_off
).
memcmp
returns 0
is s1
and s2
don't match.
memset t ~off ~len chr
fills len
bytes (starting at off
) into t
with the constant byte chr
.
fill t off len chr
modifies t
in place, replacing len
characters with chr
, starting at off
.
get_int8 bstr i
is bstr
's signed 8-bit integer starting at byte index i
.
get_uint8 bstr i
is bstr
's unsigned 8-bit integer starting at byte index i
.
get_int16_ne bstr i
is bstr
's native-endian unsigned 16-bit integer starting at byte index i
.
get_int16_le bstr i
is bstr
's little-endian unsigned 16-bit integer starting at byte index i
.
get_int16_be bstr i
is bstr
's big-endian unsigned 16-bit integer starting at byte index i
.
get_int16_ne bstr i
is bstr
's native-endian signed 16-bit integer starting at byte index i
.
get_int16_le bstr i
is bstr
's little-endian signed 16-bit integer starting at byte index i
.
get_int16_be bstr i
is bstr
's big-endian signed 16-bit integer starting at byte index i
.
get_int32_ne bstr i
is bstr
's native-endian 32-bit integer starting at byte index i
.
get_int32_le bstr i
is bstr
's little-endian 32-bit integer starting at byte index i
.
get_int32_be bstr i
is bstr
's big-endian 32-bit integer starting at byte index i
.
get_int64_ne bstr i
is bstr
's native-endian 64-bit integer starting at byte index i
.
get_int64_le bstr i
is bstr
's little-endian 64-bit integer starting at byte index i
.
get_int64_be bstr i
is bstr
's big-endian 64-bit integer starting at byte index i
.
set_int8 t i v
sets t
's signed 8-bit integer starting at byte index i
to v
.
set_uint8 t i v
sets t
's unsigned 8-bit integer starting at byte index i
to v
.
set_uint16_ne t i v
sets t
's native-endian unsigned 16-bit integer starting at byte index i
to v
.
set_uint16_le t i v
sets t
's little-endian unsigned 16-bit integer starting at byte index i
to v
.
set_uint16_le t i v
sets t
's big-endian unsigned 16-bit integer starting at byte index i
to v
.
set_uint16_ne t i v
sets t
's native-endian signed 16-bit integer starting at byte index i
to v
.
set_uint16_le t i v
sets t
's little-endian signed 16-bit integer starting at byte index i
to v
.
set_uint16_le t i v
sets t
's big-endian signed 16-bit integer starting at byte index i
to v
.
set_int32_ne t i v
sets t
's native-endian 32-bit integer starting at byte index i
to v
.
set_int32_ne t i v
sets t
's little-endian 32-bit integer starting at byte index i
to v
.
set_int32_ne t i v
sets t
's big-endian 32-bit integer starting at byte index i
to v
.
set_int32_ne t i v
sets t
's native-endian 64-bit integer starting at byte index i
to v
.
set_int32_ne t i v
sets t
's little-endian 64-bit integer starting at byte index i
to v
.
set_int32_ne t i v
sets t
's big-endian 64-bit integer starting at byte index i
to v
.
sub bstr ~off ~len
does not allocate a bigstring, but instead returns a new view into bstr
starting at off
, and with length len
.
Note sub
does not allocate a new buffer, but instead shares the memory area of bstr
with the newly-returned bigstring. This means that the changes (set{,_*}
functions) made to the returned bigstring will also be reflected in the bstr
bigstring given.
Note sub
is more expensive than a Slice.sub
(about 8 times slower). If you want to focus on performance while avoiding copying, it's best to use a Slice
.
shift bstr n
is sub bstr n (length bstr - n)
(see sub
for more details).
overlap x y
returns the size (in bytes) of what is physically common between x
and y
, as well as the position of y
in x
and the position of x
in y
.
is_prefix ~affix bstr
is true
iff affix.[idx] = bstr.{idx}
for all indices idx
of affix
.
is_infix ~affix bstr
is true
iff there exists an index j
in bstr
such that for all indices i
of affix
we have affix.[i] = bstr.{j + i}
.
is_suffix ~affix bstr
is true
iff affix.[n - idx] = bstr.{m - idx}
for all indices idx
of affix
with n = String.length affix - 1
and m = length bstr - 1
.
for_all p bstr
is true
iff for all indices idx
of bstr
, p bstr.{idx} = true
.
contains bstr ?off ?len chr
is true
if and only if chr
appears in len
byte(s)'s bstr
after position off
(defaults to 0
).
constant_equal
gives the same result as equal
but the execution time of the function, whether or not the two values are equivalent (as long as they have the same size) is the same.
Indeed, the equal
function ends as soon as a difference exists. This function continues even if a difference exists. This function is useful when comparing passwords — and avoiding an timing attack.
compare bstr0 bstr1
sorts bstr0
and bstr1
in lexicographical order.
index bstr ?off ?len chr
is the index of the first occurrence of chr
in len
byte(s)'s bstr
after position off
(defaults to 0
). If chr
does not occur in given range of bstr
, we return None
.
memchr t ~off ~len chr
scans len
bytes (starting at off
) of t
for the first instance of chr
. It returns the position in t
where the first occurrence of chr
is found. Otherwise, it returns -1
.
with_range ~first ~len bstr
are the consecutive bytes of bstr
whose indices exist in the range [first
;first + len - 1
].
first
defaults to 0
and len
to max_int
. Note that first
can be any integer and len
any positive integer.
with_index_range ~first ~last bstr
are the consecutive bytes of bstr
whose indices exists in the range [first
;last
].
first
defaults to 0
and last
to length bstr - 1
.
Note that both first
and last
can be any integer. If first > last
the interval is empty and the empty bigstring is returned.
trim ~drop bstr
is bstr
with prefix and suffix bytes satisfying drop
in bstr
removed. drop
defaults to fun chr -> chr = ' '
.
span ~rev ~min ~max ~sat bstr
is (l, r)
where:
rev
is false
(default), l
is at least min
and at most max
consecutive sat
satisfying initial bytes of bstr
or empty
if there are no such bytes. r
are the remaining bytes of bstr
.rev
is true
, r
is at least min
and at most max
consecutive sat
satisfying final bytes of bstr
or empty
if there are no such bytes. l
are the remaining bytes of bstr
.If max
is unspecified the span is unlimited. If min
is unspecified it defaults to 0
. If min > max
the condition can't be satisfied and the left or right span, depending on rev
, is always empty. sat
defaults to Fun.const true
.
take ~rev ~min ~max ~sat bstr
is the matching span of span
without the remaining one. In other words:
(if rev then snd else fst) (span ~rev ~min ~max ~sat bstr)
drop ~rev ~min ~max ~sat bstr
is the remaining span of span
without the matching span. In other words:
(if rev then fst else snd) (span ~rev ~min ~max ~sat bstr)
cut ~sep bstr
is either the pair Some (l, r)
of the two (possibly empty) sub-buffers of bstr
that are delimited by the first match of the non empty separator string sep
or None
if sep
can't be matched in bstr
. Matching starts from the beginning of bstr
(rev
is false
, default) or the end (rev
is true
).
The invariant l ^ sep ^ r = s
holds.
For instance, the ABNF expression:
field_name := *PRINT field_value := *ASCII field := field_name ":" field_value
can be translated to:
match Bstr.cut ~sep:":" value with
| Some (field_name, field_value) -> ...
| None -> invalid_arg "Invalid field"
split_on_char sep t
is the list of all (possibly empty) sub
-bigstrings of t
that are delimited by the character sep
. If t
is empty, the result is the singleton list [empty]
.
The function's result is specified by the following invariant:
sep
as a separator returns a bigstring equal to the input.sep
character.iter fn t
applies function fn
in turn to all the characters of t
. It is equivalent to fn t.{0}; fn t.{1}; ...; fn t.{length t - 1}; ()
.
Iterate on the bigstring, in increasing index order. Modifications of the bigstring during iteration will be reflected in the sequence.
Iterate on the bigstring, in increasing order, yielding indices along chars.