package b0
Module B0_std.String
Strings.
include module type of String
Strings
The type for strings.
make n c
is a string of length n
with each index holding the character c
.
init n f
is a string of length n
with index i
holding the character f i
(called in increasing index order).
length s
is the length (number of bytes/characters) of s
.
get s i
is the character at index i
in s
. This is the same as writing s.[i]
.
of_bytes b is a new string that contains the same bytes as the byte sequence b.
to_bytes s is a new byte sequence that contains the same bytes as the string s.
blit is the same as Bytes.blit_string
which should be preferred.
Concatenating
Note. The Stdlib.(^)
binary operator concatenates two strings.
concat sep ss
concatenates the list of strings ss
, inserting the separator string sep
between each.
cat s1 s2
concatenates s1 and s2 (s1 ^ s2
).
Predicates and comparisons
equal s0 s1
is true
if and only if s0
and s1
are character-wise equal.
compare s0 s1
sorts s0
and s1
in lexicographical order. compare
behaves like Stdlib.compare
on strings but may be more efficient.
starts_with
~prefix s
is true
if and only if s
starts with prefix
.
ends_with
~suffix s
is true
if and only if s
ends with suffix
.
contains_from s start c
is true
if and only if c
appears in s
after position start
.
rcontains_from s stop c
is true
if and only if c
appears in s
before position stop+1
.
contains s c
is String.contains_from
s 0 c
.
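As a brief illustration of the predicates above, here is a minimal sketch using only the standard String functions. The helper names is_ocaml_source and has_char_from are hypothetical and only serve the example; has_char_from guards the start position because contains_from raises Invalid_argument on an invalid one.

```ocaml
(* Hypothetical helper: tests a file name suffix with ends_with. *)
let is_ocaml_source file =
  String.ends_with ~suffix:".ml" file || String.ends_with ~suffix:".mli" file

(* Hypothetical helper: like contains_from, but returns false instead of
   raising when start is not a valid position of s. *)
let has_char_from s start c =
  start >= 0 && start <= String.length s && String.contains_from s start c
```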
Extracting substrings
sub s pos len
is a string of length len
, containing the substring of s
that starts at position pos
and has length len
.
split_on_char sep s
is the list of all (possibly empty) substrings of s
that are delimited by the character sep
. If s
is empty, the result is the singleton list [""]
.
The function's result is specified by the following invariants:
- The list is not empty.
- Concatenating its elements using sep as a separator returns a string equal to the input (concat (make 1 sep) (split_on_char sep s) = s).
- No string in the result contains the sep character.
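The invariants can be checked directly with the standard functions. In this small sketch the names fields and roundtrip are only illustrative.

```ocaml
(* Consecutive separators produce empty substrings. *)
let fields = String.split_on_char ',' "a,,b"

(* The concatenation invariant stated above, as a predicate. *)
let roundtrip sep s =
  String.concat (String.make 1 sep) (String.split_on_char sep s) = s
```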
Transforming
map f s
is the string resulting from applying f
to all the characters of s
in increasing order.
mapi f s
is like map
but the index of the character is also passed to f
.
fold_left f x s
computes f (... (f (f x s.[0]) s.[1]) ...) s.[n-1]
, where n
is the length of the string s
.
fold_right f s x
computes f s.[0] (f s.[1] ( ... (f s.[n-1] x) ...))
, where n
is the length of the string s
.
for_all p s
checks if all characters in s
satisfy the predicate p
.
exists p s
checks if at least one character of s
satisfies the predicate p
.
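The folds and predicates above cover most character-wise computations. A minimal sketch follows; count_char, to_char_list and is_digits are hypothetical helper names, not part of the module.

```ocaml
(* Counts the occurrences of c in s with a left fold. *)
let count_char c s =
  String.fold_left (fun n c' -> if c' = c then n + 1 else n) 0 s

(* Builds the list of characters in order with a right fold. *)
let to_char_list s = String.fold_right (fun c acc -> c :: acc) s []

(* for_all is true on the empty string, so emptiness is checked separately. *)
let is_digits s =
  s <> "" && String.for_all (function '0' .. '9' -> true | _ -> false) s
```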
trim s
is s
without leading and trailing whitespace. Whitespace characters are: ' '
, '\x0C'
(form feed), '\n'
, '\r'
, and '\t'
.
escaped s
is s
with special characters represented by escape sequences, following the lexical conventions of OCaml.
All characters outside the US-ASCII printable range [0x20;0x7E] are escaped, as well as backslash (0x5C) and double-quote (0x22).
The function Scanf.unescaped
is a left inverse of escaped
, i.e. Scanf.unescaped (escaped s) = s
for any string s
(unless escaped s
fails).
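A short sketch of the left-inverse property, on a sample string chosen for illustration. It contains one tab, two double quotes and one backslash, so the escaped form is four bytes longer.

```ocaml
let s = "tab\there \"quoted\" back\\slash"

(* Each of the four special characters becomes a two-byte escape. *)
let e = String.escaped s

(* Scanf.unescaped undoes the escaping. *)
let roundtrip = Scanf.unescaped e = s
```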
uppercase_ascii s
is s
with all lowercase letters translated to uppercase, using the US-ASCII character set.
lowercase_ascii s
is s
with all uppercase letters translated to lowercase, using the US-ASCII character set.
capitalize_ascii s
is s
with the first character set to uppercase, using the US-ASCII character set.
uncapitalize_ascii s
is s
with the first character set to lowercase, using the US-ASCII character set.
Traversing
iter f s
applies function f
in turn to all the characters of s
. It is equivalent to f s.[0]; f s.[1]; ...; f s.[length s - 1]; ()
.
iteri
is like iter
, but the function is also given the corresponding character index.
Searching
index_from s i c
is the index of the first occurrence of c
in s
after position i
.
index_from_opt s i c
is the index of the first occurrence of c
in s
after position i
(if any).
rindex_from s i c
is the index of the last occurrence of c
in s
before position i+1
.
rindex_from_opt s i c
is the index of the last occurrence of c
in s
before position i+1
(if any).
index s c
is String.index_from
s 0 c
.
index_opt s c
is String.index_from_opt
s 0 c
.
rindex s c
is String.rindex_from
s (length s - 1) c
.
rindex_opt s c
is String.rindex_from_opt
s (length s - 1) c
.
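The _opt variants make it easy to enumerate all occurrences without handling Not_found. The following is a minimal sketch; all_indices is a hypothetical helper that relies on index_from_opt accepting length s as a valid start position.

```ocaml
(* Hypothetical helper: all indices of c in s, in increasing order. *)
let all_indices s c =
  let rec loop acc i =
    match String.index_from_opt s i c with
    | None -> List.rev acc
    | Some j -> loop (j :: acc) (j + 1)
  in
  loop [] 0

let s = "abcabc"
let first_b = String.index s 'b'
let second_b = String.index_from s 2 'b'
let last_a = String.rindex s 'a'
let missing = String.index_opt s 'z'
```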
Strings and Sequences
to_seq s
is a sequence made of the string's characters in increasing order.
to_seqi s
is like to_seq
but also tuples the corresponding index.
UTF decoding and validations
UTF-8
get_utf_8_uchar b i
decodes a UTF-8 character at index i
in b
.
is_valid_utf_8 b
is true
if and only if b
contains valid UTF-8 data.
UTF-16BE
get_utf_16be_uchar b i
decodes a UTF-16BE character at index i
in b
.
is_valid_utf_16be b
is true
if and only if b
contains valid UTF-16BE data.
UTF-16LE
get_utf_16le_uchar b i
decodes a UTF-16LE character at index i
in b
.
is_valid_utf_16le b
is true
if and only if b
contains valid UTF-16LE data.
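The decoders are typically used in a loop that advances by the length of each decode. Here is a minimal sketch that counts decoded characters in a UTF-8 string. The name count_uchars is hypothetical, and the sketch assumes the Uchar.utf_decode_length accessor of the standard library (OCaml 4.14 or later). Each decoding error counts as one character, since the decode length is always strictly positive.

```ocaml
(* Hypothetical helper: number of UTF-8 decodes needed to traverse s. *)
let count_uchars s =
  let rec loop i n =
    if i >= String.length s then n else
    let d = String.get_utf_8_uchar s i in
    loop (i + Uchar.utf_decode_length d) (n + 1)
  in
  loop 0 0
```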
Binary decoding of integers
The functions in this section binary decode integers from strings.
All following functions raise Invalid_argument
if the characters needed at index i
to decode the integer are not available.
Little-endian (resp. big-endian) encoding means that least (resp. most) significant bytes are stored first. Big-endian is also known as network byte order. Native-endian encoding is either little-endian or big-endian depending on Sys.big_endian
.
32-bit and 64-bit integers are represented by the int32
and int64
types, which can be interpreted either as signed or unsigned numbers.
8-bit and 16-bit integers are represented by the int
type, which has more bits than the binary encoding. These extra bits are sign-extended (or zero-extended) for functions which decode 8-bit or 16-bit integers and represent them with int
values.
get_uint8 b i
is b
's unsigned 8-bit integer starting at character index i
.
get_int8 b i
is b
's signed 8-bit integer starting at character index i
.
get_uint16_ne b i
is b
's native-endian unsigned 16-bit integer starting at character index i
.
get_uint16_be b i
is b
's big-endian unsigned 16-bit integer starting at character index i
.
get_uint16_le b i
is b
's little-endian unsigned 16-bit integer starting at character index i
.
get_int16_ne b i
is b
's native-endian signed 16-bit integer starting at character index i
.
get_int16_be b i
is b
's big-endian signed 16-bit integer starting at character index i
.
get_int16_le b i
is b
's little-endian signed 16-bit integer starting at character index i
.
get_int32_ne b i
is b
's native-endian 32-bit integer starting at character index i
.
An unseeded hash function for strings, with the same output value as Hashtbl.hash
. This function allows this module to be passed as argument to the functor Hashtbl.Make
.
A seeded hash function for strings, with the same output value as Hashtbl.seeded_hash
. This function allows this module to be passed as argument to the functor Hashtbl.MakeSeeded
.
get_int32_be b i
is b
's big-endian 32-bit integer starting at character index i
.
get_int32_le b i
is b
's little-endian 32-bit integer starting at character index i
.
get_int64_ne b i
is b
's native-endian 64-bit integer starting at character index i
.
get_int64_be b i
is b
's big-endian 64-bit integer starting at character index i
.
get_int64_le b i
is b
's little-endian 64-bit integer starting at character index i
.
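To make the endianness and sign-extension rules concrete, here is a small sketch decoding a hand-written eight-byte string. The name header and its contents are made up for the example.

```ocaml
let header = "\x01\x02\xFF\xFE\x00\x00\x00\x2A"

(* The same two bytes read in both byte orders. *)
let u16_be = String.get_uint16_be header 0
let u16_le = String.get_uint16_le header 0

(* The byte 0xFF is zero-extended or sign-extended into an int. *)
let u8 = String.get_uint8 header 2
let i8 = String.get_int8 header 2
let i16_be = String.get_int16_be header 2

(* Four bytes decoded as an int32. *)
let i32_be = String.get_int32_be header 4

(* Decoding past the end raises Invalid_argument. *)
let out_of_bounds =
  try ignore (String.get_int32_be header 6); false
  with Invalid_argument _ -> true
```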
Strings
empty
is ""
.
head s
is Some s.[0]
if s <> ""
and None
otherwise.
of_char c
is c
as a string.
Predicates
is_empty s
is equal empty s
.
includes ~affix s
is true
iff there exists an index j
such that for all indices i
of affix
, affix.[i] = s.[j+i]
.
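The definition can be transcribed almost literally with standard library functions. The following naive sketch is not the library's implementation; includes_sketch is a hypothetical name, and the sketch assumes the empty affix is included in every string, which follows from the definition.

```ocaml
(* Hypothetical transcription: there is an index j such that
   affix.[i] = s.[j + i] for all indices i of affix. *)
let includes_sketch ~affix s =
  let alen = String.length affix and slen = String.length s in
  let rec matches_at j i =
    i >= alen || (affix.[i] = s.[j + i] && matches_at j (i + 1))
  in
  let rec try_from j =
    j + alen <= slen && (matches_at j 0 || try_from (j + 1))
  in
  try_from 0
```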
Finding indices
TODO. Harmonize indexing errors with find_first
. This never raises.
find_first_index ~start sat s
is the index of the first character of s
that satisfies sat
after or at start
(defaults to 0
).
find_last_index ~start sat s
is the index of the last character of s
that satisfies sat
before or at start
(defaults to String.length s - 1
).
Finding substrings
find_first ~start ~sub s
is the start position (if any) of the first occurrence of sub
in s
after or at position start
(which includes index start
if it exists, defaults to 0
). Note: if you need to search for sub
multiple times in s
, use find_sub_all
, which is more efficient.
find_last ~start ~sub s
is the start position (if any) of the last occurrence of sub
in s
before or at position start
(which includes index start
if it exists, defaults to String.length s
).
Note: if you need to search for sub
multiple times in s
, use rfind_sub_all
, which is more efficient.
find_all ~start f ~sub s acc
, starting with acc
, folds f
over all non-overlapping starting positions of sub
in s
after or at position start
(which includes index start
if it exists, defaults to 0
). This is acc
if sub
could not be found in s
.
rfind_all ~start f ~sub s acc
, starting with acc
, folds f
over all non-overlapping starting positions of sub
in s
before or at position start
(which includes index start
if it exists, defaults to String.length s
). This is acc
if sub
could not be found in s
.
Replacing substrings
replace_first ~start ~sub ~by s
replaces
in s
the first occurrence of sub
at or after position start
(which includes index start
if it exists, defaults to 0
) by by
.
replace_last ~start ~sub ~by s
replaces
in s
the last occurrence of sub
at or before position start
(which includes index start
if it exists, defaults to String.length s
) by by
.
replace_all ~start ~sub ~by s
replaces in s
all non-overlapping occurrences of sub
at or after position start
(defaults to 0
) by by
.
Extracting substrings
subrange ~first ~last s
are the consecutive bytes of s
whose indices exist in the range [first
;last
].
first
defaults to 0
and last to String.length s - 1
.
Note that both first
and last
can be any integer. If first > last
the interval is empty and the empty string is returned.
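Since first and last can be any integer, the range is effectively clamped to the valid indices of s. Here is a minimal sketch of that behaviour on top of String.sub. The name subrange_sketch is hypothetical and this is not the library's implementation.

```ocaml
(* Hypothetical sketch: clamp [first;last] to the indices of s, then
   extract. An empty interval yields the empty string. *)
let subrange_sketch ?(first = 0) ?last s =
  let max_idx = String.length s - 1 in
  let last = match last with None -> max_idx | Some l -> min l max_idx in
  let first = max 0 first in
  if first > last then "" else String.sub s first (last - first + 1)
```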
Breaking
Breaking with magnitudes
take_first n s
are the first n
bytes of s
. This is s
if n >= length s
and ""
if n <= 0
.
take_last n s
are the last n
bytes of s
. This is s
if n >= length s
and ""
if n <= 0
.
drop_first n s
is s
without the first n
bytes of s
. This is ""
if n >= length s
and s
if n <= 0
.
drop_last n s
is s
without the last n
bytes of s
. This is ""
if n >= length s
and s
if n <= 0
.
cut_first n v
is (take_first n v, drop_first n v)
.
cut_last n v
is (drop_last n v, take_last n v)
.
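The boundary cases above mean these functions never raise on out-of-range magnitudes. A minimal sketch of the first-side functions follows, written with String.sub. The _sketch names are hypothetical and the code only mirrors the documented behaviour.

```ocaml
(* Hypothetical sketches of the documented semantics. *)
let take_first_sketch n s =
  if n <= 0 then "" else
  if n >= String.length s then s else String.sub s 0 n

let drop_first_sketch n s =
  if n <= 0 then s else
  if n >= String.length s then "" else
  String.sub s n (String.length s - n)

(* The two halves always concatenate back to the input. *)
let cut_first_sketch n s = (take_first_sketch n s, drop_first_sketch n s)
```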
Breaking with predicates
take_first_while sat s
are the first consecutive sat
satisfying bytes of s
.
take_last_while sat s
are the last consecutive sat
satisfying bytes of s
.
drop_first_while sat s
is s
without the first consecutive sat
satisfying bytes of s
.
drop_last_while sat s
is s
without the last consecutive sat
satisfying bytes of s
.
cut_first_while sat s
is (take_first_while sat s, drop_first_while sat s)
.
cut_last_while sat s
is (drop_last_while sat s, take_last_while sat s)
.
Breaking with separators
split_first ~sep s
is the pair Some (left, right)
made of the two (possibly empty) substrings of s
that are delimited by the first match of the separator sep
or None
if sep
can't be matched in s
. Matching starts at position 0
using find_first
.
The invariant concat sep [left; right] = s
holds.
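To illustrate the invariant, here is a naive sketch of the described behaviour using only the standard library. The names find_first_sketch and split_first_sketch are hypothetical, the search is quadratic, and the handling of an empty separator is left unaddressed.

```ocaml
(* Hypothetical naive search for the first occurrence of sub in s. *)
let find_first_sketch ~sub s =
  let n = String.length sub and m = String.length s in
  let rec loop i =
    if i + n > m then None
    else if String.sub s i n = sub then Some i
    else loop (i + 1)
  in
  loop 0

(* Splits s around the first match of sep, if any. *)
let split_first_sketch ~sep s =
  match find_first_sketch ~sub:sep s with
  | None -> None
  | Some i ->
      let right_start = i + String.length sep in
      let left = String.sub s 0 i in
      let right = String.sub s right_start (String.length s - right_start) in
      Some (left, right)
```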
split_last ~sep s
is like split_first
but matching starts at position length s
using find_last
.
split_all ~sep s
is the list of all substrings of s
that are delimited by non-overlapping matches of the separator sep
. If sep
can't be matched in s
, the list [s]
is returned. Matching starts at position 0
and matches are determined using find_all
.
Substrings sub
for which drop sub
is true
are not included in the result. drop
defaults to Fun.const false
.
The invariant concat sep (split_all ~sep s) = s
holds.
Breaking lines
val fold_ascii_lines :
strip_newlines:bool ->
(int -> 'a -> string -> 'a) ->
'a ->
string ->
'a
fold_ascii_lines ~strip_newlines f acc s
folds over the lines of s
by calling f linenum acc' line
with linenum
the one-based line number count, acc'
the result of accumulating acc
with f
so far and line
the data of the line (without the newline found in the data if strip_newlines
is true
).
Lines are delimited by newline sequences which are either one of "\n"
, "\r\n"
or "\r"
. More precisely the function determines lines and line data as follows:
- If s = "", the function considers there are no lines in s and acc is returned without f being called.
- If s <> "", s is repeatedly split on the first newline sequence "\n", "\r\n" or "\r" into (left, newline, right), left (or left ^ newline when strip_newlines = false) is given to f and the process is repeated with right until a split can no longer be found. At that point this final string is given to f and the process stops.
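The procedure described above can be sketched with a single scan over the string. This is a hypothetical reimplementation written from the description, not the library's code; fold_ascii_lines_sketch is an assumed name. As the description implies, a string ending with a newline yields a final empty line.

```ocaml
(* Hypothetical sketch: folds f over the lines of s as described. *)
let fold_ascii_lines_sketch ~strip_newlines f acc s =
  let len = String.length s in
  (* start is the index where the current line begins, i the scan index. *)
  let rec loop linenum acc start i =
    if i >= len then
      (* No more newline: the final string is given to f. *)
      f linenum acc (String.sub s start (len - start))
    else match s.[i] with
    | '\n' | '\r' as c ->
        (* "\r\n" counts as a single two-byte newline sequence. *)
        let nl_len =
          if c = '\r' && i + 1 < len && s.[i + 1] = '\n' then 2 else 1
        in
        let data_len =
          if strip_newlines then i - start else i - start + nl_len
        in
        let acc = f linenum acc (String.sub s start data_len) in
        loop (linenum + 1) acc (i + nl_len) (i + nl_len)
    | _ -> loop linenum acc start (i + 1)
  in
  if s = "" then acc else loop 1 acc 0 0
```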
detach_ascii_newline s
is (data, endline)
with:
- endline either the suffix "\n", "\r\n" or "\r" of s or "" if s has no such suffix.
- data the bytes before endline such that data ^ endline = s
Tokenize
val next_token :
?is_sep:(char -> bool) ->
?is_token:(char -> bool) ->
string ->
string * string
next_token ~is_sep ~is_token s
skips characters satisfying is_sep
from s
, then gathers zero or more consecutive characters satisfying is_token
into a string which is returned along with the remaining characters after that. is_sep
defaults to Char.Ascii.is_white
and is_token
to Char.Ascii.is_graphic
.
tokens s
are the strings separated by sequences of is_sep
characters (defaults to Char.Ascii.is_white
). The empty list is returned if s
is empty or made only of separators.
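A minimal sketch of this behaviour with the standard library follows. The names tokens_sketch and is_white are hypothetical; the local is_white is an assumption standing in for Char.Ascii.is_white (space and the control characters from tab to carriage return).

```ocaml
(* Assumed stand-in for Char.Ascii.is_white. *)
let is_white = function
  | ' ' | '\t' | '\n' | '\x0B' | '\x0C' | '\r' -> true
  | _ -> false

(* Hypothetical sketch: maximal runs of non-separator characters. *)
let tokens_sketch ?(is_sep = is_white) s =
  let len = String.length s in
  let rec skip i = if i < len && is_sep s.[i] then skip (i + 1) else i in
  let rec scan i = if i < len && not (is_sep s.[i]) then scan (i + 1) else i in
  let rec loop acc i =
    let start = skip i in
    if start >= len then List.rev acc else
    let stop = scan start in
    loop (String.sub s start (stop - start) :: acc) stop
  in
  loop [] 0
```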
Uniqueness
distinct ss
is ss
without duplicates, the list order is preserved.
unique ~limit ~exists n
is n
if exists n
is false
or r = strf "%s~%d" n d
with d
the smallest integer such that exists r
is false
. If no d
in [1
;limit
] satisfies the condition Invalid_argument
is raised, limit
defaults to 1e6
.
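The search for a fresh name can be sketched as a simple counter loop. This is a hypothetical reimplementation written from the description, using Printf.sprintf in place of strf; unique_sketch and the error message are assumptions.

```ocaml
(* Hypothetical sketch: n if it is free, otherwise the first free n~d
   with d in [1;limit]. *)
let unique_sketch ?(limit = 1_000_000) ~exists n =
  if not (exists n) then n else
  let rec loop d =
    if d > limit then invalid_arg "unique_sketch: limit reached" else
    let r = Printf.sprintf "%s~%d" n d in
    if exists r then loop (d + 1) else r
  in
  loop 1
```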
Spellchecking
All additions available in OCaml 5.4
edit_distance s0 s1
is the number of single character edits (understood as insertion, deletion, substitution, transposition) that are needed to change s0
into s1
.
If limit
is provided the function returns with limit
as soon as it was determined that s0
and s1
have distance of at least limit
. This is faster if you have a fixed limit, for example for spellchecking.
The function assumes the strings are UTF-8 encoded and uses Uchar.t
for the notion of character. Decoding errors are replaced by Uchar.rep
. Normalizing the strings to NFC gives better results.
Note. This implements the simpler Optimal String Alignment (OSA) distance, not the Damerau-Levenshtein distance. With this function "ca"
and "abc"
have a distance of 3 not 2.
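For readers unfamiliar with the OSA distance, here is a textbook dynamic programming sketch. It is not the library's implementation: osa_distance is a hypothetical name, it compares bytes rather than Uchar.t values, and it has no limit parameter. It reproduces the distance of 3 between "ca" and "abc" mentioned above.

```ocaml
(* Hypothetical byte-wise OSA distance: d.(i).(j) is the distance
   between the first i bytes of s0 and the first j bytes of s1. *)
let osa_distance s0 s1 =
  let m = String.length s0 and n = String.length s1 in
  let d = Array.make_matrix (m + 1) (n + 1) 0 in
  for i = 0 to m do d.(i).(0) <- i done;
  for j = 0 to n do d.(0).(j) <- j done;
  for i = 1 to m do
    for j = 1 to n do
      let cost = if s0.[i - 1] = s1.[j - 1] then 0 else 1 in
      (* Deletion, insertion, substitution. *)
      let best =
        min (min (d.(i - 1).(j) + 1) (d.(i).(j - 1) + 1))
          (d.(i - 1).(j - 1) + cost)
      in
      (* Transposition of two adjacent bytes. *)
      let best =
        if i > 1 && j > 1 && s0.[i - 1] = s1.[j - 2] && s0.[i - 2] = s1.[j - 1]
        then min best (d.(i - 2).(j - 2) + 1)
        else best
      in
      d.(i).(j) <- best
    done
  done;
  d.(m).(n)
```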
val spellcheck :
?max_dist:(string -> int) ->
((string -> unit) -> unit) ->
string ->
string list
spellcheck iter_dict s
are the strings enumerated by the iterator iter_dict
whose edit distance to s
is the smallest and at most max_dist s
. If multiple corrections are returned their order is as found in iter_dict
. The default max_dist s
is:
- 0 if s has 0 to 2 Unicode characters.
- 1 if s has 3 to 4 Unicode characters.
- 2 otherwise.
If your dictionary is a list l
, a suitable iter_dict
is given by (fun yield -> List.iter yield l)
.
All strings are assumed to be UTF-8 encoded, decoding errors are replaced by Uchar.rep
characters.
(Un)escaping bytes
The following functions can only (un)escape a single byte. See also these functions to convert a string to printable ASCII characters.
byte_escaper char_len set_char
is a byte escaper such that:
- char_len c is the length of the unescaped byte c in the escaped form. If 1 is returned then c is assumed to be unchanged; use byte_replacer if that does not hold.
- set_char b i c sets an unescaped byte c to its escaped form at index i in b and returns the next writable index. set_char is called regardless of whether c needs to be escaped or not; in the latter case you must write c (use byte_replacer if that is not the case). No bounds check needs to be performed on i or the returned value.
For any b
, c
and i
the invariant i + char_len c = set_char b i c
must hold.
Here's a small example that escapes double quotes ('"') in strings by prefixing them with backslashes:
let escape_dquotes s =
  let char_len = function '"' -> 2 | _ -> 1 in
  let set_char b i = function
  | '"' -> Bytes.set b i '\\'; Bytes.set b (i+1) '"'; i + 2
  | c -> Bytes.set b i c; i + 1
  in
  String.byte_escaper char_len set_char s
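To clarify how char_len and set_char cooperate, here is one way such an escaper could work: a first pass sums the escaped lengths, a second pass fills a buffer of exactly that size. This is a hypothetical sketch using only the standard library, not the library's implementation; byte_escaper_sketch and escape_dquotes_sketch are assumed names.

```ocaml
(* Hypothetical two-pass escaper built from char_len and set_char. *)
let byte_escaper_sketch char_len set_char s =
  let escaped_len = ref 0 in
  String.iter (fun c -> escaped_len := !escaped_len + char_len c) s;
  (* Every byte has length 1: by assumption nothing changes. *)
  if !escaped_len = String.length s then s else
  let b = Bytes.create !escaped_len in
  let next = ref 0 in
  (* set_char returns the next writable index. *)
  String.iter (fun c -> next := set_char b !next c) s;
  Bytes.to_string b

(* The double quote example, expressed with the sketch. *)
let escape_dquotes_sketch s =
  let char_len = function '"' -> 2 | _ -> 1 in
  let set_char b i = function
    | '"' -> Bytes.set b i '\\'; Bytes.set b (i + 1) '"'; i + 2
    | c -> Bytes.set b i c; i + 1
  in
  byte_escaper_sketch char_len set_char s
```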
byte_replacer char_len set_char
is like byte_escaper
but a byte can be substituted by another one by set_char
.
See byte_unescaper
.
val byte_unescaper :
(string -> int -> int) ->
(bytes -> int -> string -> int -> int) ->
string ->
(string, int) result
byte_unescaper char_len_at set_char
is a byte unescaper such that:
- char_len_at s i is the length of an escaped byte at index i of s. If 1 is returned then the byte is assumed to be unchanged by the unescape; use byte_unreplacer if that does not hold.
- set_char b k s i sets at index k in b the unescaped byte read at index i in s and returns the next readable index in s. set_char is called regardless of whether the byte at i must be unescaped or not; in the latter case you must write s.[i] only (use byte_unreplacer if that is not the case). No bounds check needs to be performed on k, i or the returned value.
For any b
, s
, k
and i
the invariant i + char_len_at s i = set_char b k s i
must hold.
Both char_len_at
and set_char
may raise Illegal_escape i
if the given index i
has an illegal or truncated escape. The unescaper turns this exception into Error i
if that happens.
val byte_unreplacer :
(string -> int -> int) ->
(bytes -> int -> string -> int -> int) ->
string ->
(string, int) result
byte_unreplacer char_len_at set_char
is like byte_unescaper
except set_char
can set a different byte whenever char_len_at
returns 1
.
ASCII strings
Variable substitution
subst_pct_vars ~buf vars s
substitutes in s
sub-strings of the form %%VAR%%
by the value of vars "VAR"
(if any).
ANSI stripping
strip_ansi_escapes s
removes ANSI escapes from s
.