Library
Module
Module type
Parameter
Class
Class type
Parsing Library
The parsers of this library implement parsers for Parsing Expression Grammars with parsing combinators. They have the following main features:
parsec
of Haskell) you have the full flexibility of a functional language. There is no preprocessing step where the parser has to be generated from the grammar.These features in combination are to the best of our knowledge unique in Fmlib
.
Parsing expression grammars look are very similar to context free grammers. There are two main differences:
Accepting these restrictions leads to fairly efficient parser which can be implemented directly in a functional language.
A parsing expression in a parsing expression grammar is one of:
e₁ e₂ ...
e₁ / e₂ / ...
e*
e+
You build a parser by first using combinators to express your grammar. A combinator has type 'a t
which parses some part of the input stream and returns an object of type 'a
in case of success.
All parsers have some elementary combinators which parse one token and succeed or fail. E.g. the character parser has a combinator char c
which succeeds, if the next character is c
and fails otherwise. Or there is the combinator charp p
where p
is a function of type char -> bool
. This combinator succeeds if the next character satisfies p
and fails otherwise.
As the name implies, combinators can be combined to form more complex combinators. The let*
combinator allows to form sequences of grammar constructs. E.g. if we have some combinators p
, q
and r
which parse certain grammar constructs, we can combine them sequentially by
let* x = p in
let* y = q in
let* z = r in
return (f x y z)
This combinator first parses p
which returns in case of success some value x
. Then it parses q
which returns a y
. Then it parses r
which returns a z
. At the end it computes the compound result from the individual results x
, y
and z
.
The sequence combinator fails, if one of its components fail and it succeeds, if all components succeed.
Needless to say that the combinators p
, q
and r
can receive arguments and q
and r
can use x
and r
can use x
and y
.
For the biased choice there is the operator </>
. Having three combinators p
, q
and r
of the same type the expression
p </> q </> r
is a new combinator which first parses p
. In case p
fails without consuming token, the next combinator q
is used and so on.
There is the combinator backtrackable p
which let p
fail without consuming input even if it fails after consuming some token.
More detailed combinators can be seen in the APIs of the different parsers.
In order to do the actual parsing we need some combinator c
which has type final t
where final
is the desired result after parsing successfully the complete input stream. From this combinator we can create a complete parser by
make initial_state c
This object is the parser object. If you don't need a user state, then the initial state is just unit ()
. You can push token into the parser object by
put token p
which returns a parser object into which step by step all token can be pushed. At the end of input you call
put_end p
which terminates the parse. There are methods like
needs_more p (* Does the parser need more input? *)
has_ended p (* Has the parser finished parsing? *)
has_succeeded p (* Has the parser finished successfully? *)
final p (* The final result after successful parsing. *)
...
In the following we define a parser which parses and evaluates arithmetic expressions with addition, substraction, multiplication and division. Only division can trigger the semantic failure division by zero.
We want to parse a stream of characters.
module Semantic = struct
type t = Pretty.Print.doc
end
module CP =
Parse.Character.Make
(Unit) (* No state needed. *)
(Int) (* The parser returns a number. *)
(String) (* The possible semantic error. *)
open CP
Some combinators to parse whitespace, arithmetic operators and numbers.
let whitespace: int t =
skip_zero_or_more (char ' ')
type addop = Plus | Minus
type mulop = Times | Divide
let operator (c: char) (op: 'a): 'a t =
map (fun _ -> op) (char c)
let addop: addop t =
(* Parse an addition operator. *)
let* op = operator '+' Plus </> operator '-' Minus in
let* _ = whitespace in (* strip whitespace *)
return op
let mulop: mulop t =
(* Parse a multiplication operator. *)
let* op = operator '*' Times </> operator '/' Divide in
let* _ = whitespace in (* strip whitespace *)
return op
let number: int t =
(* Parse one number. *)
let* v =
one_or_more
(fun d -> d)
(fun v d -> 10 * v + d)
digit
in
let* _ = whitespace in (* strip whitespace *)
return v
The high precedence operations are multiplication and division. For the division operation we have to handle division by zero.
let rec factors (opnd1: int): int t =
(* Parse the factors of a product. *)
(
let* op = mulop in
let* opnd2 = number
in
match op with
| Times ->
factors (opnd1 * opnd2)
| Divide ->
if opnd2 = 0 then
fail "division by zero"
else
factors (opnd1 / opnd2)
)
</>
return opnd1
let product: int t =
(* Parse a product [f1 * f2 / f3 ...]. *)
let* n = number in
factors n
Note that the recursion in factors
is guarded. A recursiv call happens only if there is a multiplication operator and an operand.
In order to parse sums we can directly use the combinator one_or_more_separated
.
let expr: int t =
(* Parse a sum [a + b - c ...]. *)
one_or_more_separated
(fun x -> x)
(fun s op x ->
match op with
| Plus ->
s + x
| Minus ->
s - x)
product
addop
As a last step we convert the combinator expr
to a parser by
let calculator: Parser.t =
make () expr
and run it on a string
let p = Parser.run_on_string "1 + 2 * 6 / 2" calculator
assert (Parser.has_succeeded p);
assert (Parser.final p = 7);
let p = Parser.run_on_string "1 / 0" calculator
assert (Parser.has_failed_semantic p)
It is easy to generate a parser which enters an infinite loop. This has to be avoided. General rule: Each recursive call must be protected by some combinator which consume token before the recursive call happens.
In order to demonstrate valid recursion we use the calculator example of the previous chapter and add parenthesized expressions.
let parenthesized (p: unit -> 'a t): 'a t =
let* _ = char '(' in
let* _ = whitespace in
let* x = p () in
let* _ = char ')' in
let* _ = whitespace in
return x
This combinator parses the combinator p
between parentheses. Instead of using an argument of type 'a t
we use an argument of type unit -> 'a t
because the combinator p
is usually called recursively.
The body of parenthesized
protects a possible recursive call of p ()
by prefixing it by the combinator char '('
which certainly consumes a token in case of success.
As opposed to Haskell, Ocaml is a strict language. In Haskell this particular problem of infinite recursion does not occur, because the Haskell compiler treats internally every expression as a thunk (i.e. of type unit
-> e
instead of e
) and evaluates the thunk only if needed.
Now we can generate a combinator for parenthesized expressions by some mutually recursive functions.
let rec expr (): int t =
... (* as above using "product ()" instead of "product" *)
and atomic (): int t =
number </> parenthesized expr (* recursive call to "expr ()"
not yet done! *)
and product (): int t =
let* n = atomic () in
factors n
and factors (opnd1: int): int t =
let* op = mulop in
let* opnd2 = atomic () in
... (* same as above *)
Note that Ocaml does not allow recursive constants. Therefore all elements of a set of mutually recursive functions must be functions. Therefore we added a unit argument to combinators which do not have other arguments.
Nearly all combinators do not backtrack. I.e. all choices are done by looking only at the first token of a construct. If a construct fails after consuming token, no alternative will be checked.
This make parsing fast. However sometimes it is necessary to backtrack a failed combinator as if it had not consumed any token. After backtracking, any alternative combinator can be tried. This makes parsing more expensive, because the consumed token have to be pushed back to the lookahead and reparsed by the alternative combinators.
There are three basic backtracking combinators.
backtrack p expect
: Parse p
. In case of failure push all consumed token back to the lookahead and continue with possible other choices. Push the expectation expect
to the failed expectations.followed_by p expect
: Parse p
and push all cosumed token back to the lookahead. Succeed, if p
succeeded and fail, if p
failed.not_followed_by p expect
: Parse p
and push all cosumed token back to the lookahead. Succeed, if p
failed and fail, if p
succeeded.With the character parser we can parse indented and aligned structures. Each construct gets a set of allowed indentations. The set has a lower indentation bound and an upper indentation bound. The upper indentation bound can be infinite.
Initially the lower bound of the indentation set is zero and the upper bound is infinite (i.e. the indentation set is {0, 1, ... }
)
No token of a construct is located to the left of the upper indentation bound. If the upper bound is infinite, then token have not yet been encountered. During parsing of a construct token are encountered and its indentation set is finite.
The default case is that any construct inherits its indentation set from its parent.
With the combinator
indent i p
we can indent the construct parsed by p
by at least i
columns relative to its parent.
To parse aligned constructs we need a parent construct which contains the aligned children. The only purpose of the parent construct is to contain its aligned children. Usually the parent construct is indented (maybe by zero) relative to its parent. E.g. if we want to parse the combinators p
and q
aligned, we do this by
let parent =
let* a = align p in
let* b = align q in
return (a,b)
in
indent 1 (align parent)
In that case parent
is the parent construct which is indented relative to its parent and contains the two aligned children p
and q
.
The combinator align
aligns within the allowed indentations and the combinator left_align
aligns leftmost within the allowed indentations.
It is important to exclude whitespace from all alignment and indentation requirements. Each line might start with blanks which can violate the requirements. In order to exclude constructs from the alignment and indentation requirements we can use the combinator
detach p
which parses p
without any indentation and alignment requirements.
module Character : sig ... end
Character Parser: An indentation sensitive parser which parses streams of characters i.e. the token type is char
.
module Generic : sig ... end
A Generic Parser where all parameters are customizable.
module Position : sig ... end
Represent a position in a text file.
module Located : sig ... end
A parsing construct located within a file.
module Indent : sig ... end
The allowed indentations: Helper module for indentation sensitive parsing.
module Interfaces : sig ... end