Legend:
Library
Module
Module type
Parameter
Class
Class type
Very Simple Parser Combinators
These combinators can be used to write very simple parsers, for example to extract data from a line-oriented file, or as a replacement to Scanf.
A few examples
Some more advanced example(s) can be found in the /examples directory.
Parse a tree
open CCParse;;
type tree = L of int | N of tree * tree;;
let mk_leaf x = L x
let mk_node x y = N(x,y)
let ptree = fix @@ fun self ->
skip_space *>
( (char '(' *> (pure mk_node <*> self <*> self) <* char ')')
<|>
(U.int >|= mk_leaf) )
;;
parse_string_exn ptree "(1 (2 3))" ;;
parse_string_exn ptree "((1 2) (3 (4 5)))" ;;
Parse a list of words
open Containers.Parse;;
let p = U.list ~sep:"," U.word;;
parse_string_exn p "[abc , de, hello ,world ]";;
Stress Test
This makes a list of 100_000 integers, prints it and parses it back.
let p = CCParse.(U.list ~sep:"," U.int);;
let l = CCList.(1 -- 100_000);;
let l_printed =
CCFormat.(to_string (within "[" "]" (list ~sep:(return ",@,") int))) l;;
let l' = CCParse.parse_string_exn p l_printed;;
assert (l=l');;
Stability guarantees
Some functions are marked "experimental" and are still subject to change.
type position
A position in the input. Typically it'll point at the beginning of an error location.
parsing s p behaves the same as p, with the information that we are parsing s, if p fails. The message s is added to the error, it does not replace it, not does the location change (the error still points to the same location as in p).
set_error_message msg p behaves like p, but if p fails, set_error_message msg p fails with msg instead and at the current position. The internal error message of p is just discarded.
A slice of the input, as returned by some combinators such as split_1 or split_list or take.
The idea is that one can use some parsers to cut the input into slices, e.g. split into lines, or split a line into fields (think CSV or TSV). Then a variety of parsers can be used on each slice to extract data from it using recurse.
Slices contain enough information to make it possible for recurse slice p to report failures (if p fails) using locations from the original input, not relative to the slice. Therefore, even after splitting the input into lines using, say, each_line, a failure to parse the 500th line will be reported at line 500 and not at line 1.
set_current_slice slice replaces the parser's state with slice.
EXPERIMENTAL
since 3.6
val chars_fold :
f:
('acc->char ->[ `Continue of 'acc| `Consume_and_stop of 'acc| `Stop of 'acc| `Fail of string ])->'acc->('acc * slice)t
chars_fold f acc0 folds over characters of the input. Each char c is passed, along with the current accumulator, to f; f can either:
stop, by returning `Stop acc. In this case the final accumulator acc is returned, and c is not consumed.
consume char and stop, by returning `Consume_and_stop acc.
fail, by returning `Fail msg. In this case the parser fails with the given message.
continue, by returning `Continue acc. The parser continues to the next char with the new accumulator.
This is a generalization of of chars_if that allows one to transform characters on the fly, skip some, handle escape sequences, etc. It can also be useful as a base component for a lexer.
returns
a pair of the final accumular, and the slice matched by the fold.
since 3.6
val chars_fold_transduce :
f:
('acc->char ->[ `Continue of 'acc| `Yield of 'acc * char| `Consume_and_stop| `Stop| `Fail of string ])->'acc->('acc * string)t
Same as char_fold but with the following differences:
returns a string along with the accumulator, rather than the slice of all the characters accepted by `Continue _. The string is built from characters returned by `Yield.
new case `Yield (acc, c) adds c to the returned string and continues parsing with acc.
chars_if f parses a string of chars that satisfy f. Cannot fail.
val chars1_if : ?descr:string ->(char -> bool)->string t
Like chars_if, but accepts only non-empty strings. chars1_if p fails if the string accepted by chars_if p is empty. chars1_if p is equivalent to take1_if p >|= Slice.to_string.
parameterdescr
describes what kind of character was expected, in case of error
optional p tries to parse p, and return () whether it succeeded or failed. Cannot fail itself. It consumes input if p succeeded (as much as p consumed), but consumes not input if p failed.
try_opt p tries to parse using p, and return Some x if p succeeded with x (and consumes what p consumed). Otherwise it returns None and consumes nothing. This cannot fail.
many_until ~until p parses as many p as it can until the until parser successfully returns. If p fails before that then many_until ~until p fails as well. Typically until can be a closing ')' or another termination condition, and what is consumed by until is also consumed by many_until ~until p.
try_or p1 ~f ~else_:p2 attempts to parse x using p1, and then becomes f x. If p1 fails, then it becomes p2. This can be useful if f is expensive but only ever works if p1 matches (e.g. after an opening parenthesis or some sort of prefix).
since 3.6
val try_or_l : ?msg:string ->?else_:'at->(unit t * 'at) list->'at
try_or_l ?else_ l tries each pair (test, p) in order. If the n-th test succeeds, then try_or_l l behaves like n-th p, whether p fails or not. If they all fail, and else_ is defined, then it behaves like else_. If all fail, and else_ is None, then it fails as well.
This is a performance optimization compared to (<|>). We commit to a branch if the test succeeds, without backtracking at all.
See lookahead_ignore for a convenient way of writing the test conditions.
skip p parses zero or more times p and ignores its result. It is eager, meaning it will continue as long as p succeeds. As soon as p fails, skip p stops consuming any input.
lookahead_ignore p tries to parse input with p, and succeeds if p succeeds. However it doesn't consume any input and returns (), so in effect its only use-case is to detect whether p succeeds, e.g. in try_or_l.
line_str is line >|= Slice.to_string. It parses the next line and turns the slice into a string. The state points to the character immediately after the '\n' character.
each_line p runs p on each line of the input. EXPERIMENTAL
since 3.6
val split_1 : on_char:char ->(slice * slice option)t
split_1 ~on_char looks for on_char in the input, and returns a pair sl1, sl2, where:
sl1 is the slice of the input the precedes the first occurrence of on_char, or the whole input if on_char cannot be found. It does not contain on_char.
sl2 is the slice that comes after on_char, or None if on_char couldn't be found. It doesn't contain the first occurrence of on_char (if any).
The parser is now positioned at the end of the input.
split_list ~on_char splits the input on all occurrences of on_char, returning a list of slices.
EXPERIMENTAL
since 3.6
val split_list_at_most : on_char:char ->int ->slice listt
split_list_at_most ~on_char n applies split_1 ~on_char at most n times, to get a list of n+1 elements. The last element might contain on_char. This is useful to limit the amount of work done by split_list.
split_list_map ~on_char p uses split_list ~on_char to split the input, then parses each chunk of the input thus obtained using p.
The difference with sep ~by:(char on_char) p is that sep calls p first, and only tries to find on_char after p returns. While it is more flexible, this technique also means p has to be careful not to consume on_char by error.
A useful specialization of this is each_line, which is basically each_split ~on_char:'\n' p.
Memoize the parser. memo p will behave like p, but when called in a state (read: position in input) it has already processed, memo p returns a result directly. The implementation uses an underlying hashtable. This can be costly in memory, but improve the run time a lot if there is a lot of backtracking involving p.
Do not call memo inside other functions, especially with (>>=), map, etc. being so prevalent. Instead the correct way to use it is in a toplevel definition:
let my_expensive_parser = memo (foo *> bar >>= fun i -> …)
a <|> b tries to parse a, and if a fails without consuming any input, backtracks and tries to parse b, otherwise it fails as a. See try_ to ensure a does not consume anything (but it is best to avoid wrapping large parsers with try_).
a <?> msg behaves like a, but if a fails, a <?> msg fails with msg instead. Useful as the last choice in a series of <|>. For example: a <|> b <|> c <?> "expected one of a, b, c".