Library
Module
Module type
Parameter
Class
Class type
Pacomb.Grammar
The main module to build grammarsPacomb.Lex
lexing and Pacomb.Lex.give_up
Pacomb.Blank
blanks and layout recordPacomb.Pos
positions in file and handling of exceptionsPacomb.Input
build input buffer from string, channel, ...Pacomb.Charset
an efficient representation of character setsPaComb implements a representation of grammars with semantic actions (values returned as a result of parsing). Parsing is performed by compiling grammars defined with the Pacomb.Grammar
module (or indirectly though a PPX extension) to combinators. The library offers scanner less parsing, but the Pacomb.Lex
module provide a notion of terminals and blanks that give a simple way to write grammars in two phases, as usual.
The main advantage of PaComb and similar solutions, contrary to ocamlyacc, is that grammars (compiled or not) are first class values. This allows using the full power of OCaml for manipulating grammars. For example, this is very useful when working with syntax extension mechanisms.
Importantly, performances of PaComb are very good: it is only two to three times slower than grammars generated by ocamlyacc.
Defining languages using the Pacomb.Grammar
module directly is cumbersome. For that reason, PaComb provides a BNF-like PPX syntax extension. An example with arithmetic expressions is given below. It can be compiled with command ocamlfind ocamlopt -package pacomb,pacomb.ppx -o calc -linkpkg calc.ml
(if it is written to a file calc.ml
).
type p = Atom | Prod | Sum
let%parser rec expr p =
Atom < Prod < Sum
; (p=Atom) (x::FLOAT) => x
; (p=Atom) '(' (e::expr Sum) ')' => e
; (p=Prod) (x::expr Prod) '*' (y::expr Atom) => x *. y
; (p=Prod) (x::expr Prod) '/' (y::expr Atom) => x /. y
; (p=Sum ) (x::expr Sum ) '+' (y::expr Prod) => x +. y
; (p=Sum ) (x::expr Sum ) '-' (y::expr Prod) => x -. y
This works using the extensions [%%parser ...]
for structures and [%parser
...]
for expression. These are also accessible by suffixing keywords with %parser
as in the above example. These ppx extensions extends ocaml expressions with a syntax for grammars of type 'a Grammar.t
and modifies the behaviour of let-bindings especially recursive ones to use declare_grammar
, set_grammar
and grammar_family
. Recall that due to the limitation of ppx, we use a sub-syntax of OCaml expressions for grammars. It is therefore not a good idea to use "=>" as an infix inside [%parser ...]
.
We give below the BNF grammar for the extension, together with a sketch of its semantics.
grammar ::= rule itself
| grammar ; rule Grammar.alt
rule ::= qitems => expr A rule with its action
| expr < ... < expr priority order see below
| ERROR(m) report an error if parsing fails here
qitems ::= () Grammar.empty
| non_empty_qitems itself
| cond non_empty_qitems conditional rule
cond ::= expr bool_op expr
| not expr
non_empty_qitems ::= qitem itself
| non_empty_qitems qitems Grammar.seq
qitem ::= item itself
| (epat :: item) give a name if used in the action
| ((epat,epat) >: item) as above, but for dseq
item ::= '...' Grammar.term(Lex.char ())
| CHAR Grammar.term(Lex.any ())
| CHAR(c) Grammar.term(Lex.char c)
| "..." Grammar.term(Lex.string ())
| STRING(s) Grammar.term(Lex.string s)
| NAT Grammar.term(Lex.nat ())
| INT Grammar.term(Lex.int ())
| FLOAT Grammar.term(Lex.float ())
| UTF8 Grammar.term(Lex.utf8 ())
| RE(expr) Grammar.term(Lex.regexp (Regexp.from_string expr))
| STRING_LIT Grammar.term(Lex.string_lit ())
| CHAR_LIT Grammar.term(Lex.char_lit ())
| RE(expr) Grammar.term(Lex.regexp (Regexp.from_string expr))
| expr itself
| ~? expr Grammar.option expr
| ~? [expr] expr Grammar.option_default expr expr
| ~* expr Grammar.star expr
| ~* [expr] expr Grammar.star_sep expr expr
| ~+ expr Grammar.plus expr
| ~+ [expr] expr Grammar.plus_sep expr expr
epat ::= lid
| (epat : coretype)
| epat = lid encoding of [pat as lid]
| (epat, ..., pat)
| M.epat
epat
correspond to an encoding of patterns in expressions. Beware that _
is invalid, use __
instead. cond
is an expression using any test operator: "="|"<"|">"|"<="|">="|"<>"|"=="|"!="|"not"|"&&"|"||".
In action code (expression right of =>
), a lid_lpos
or lid_rpos
will denote respectively the left and right position of the item named lid
. a lid_pos
will group both lid_lpos
and lid_rpos
in a record of type Pos.interval
. If the item is matched by a tuple and you want to use its position you must use pat = lid
syntax to give a name to the whole item.
Action code needs parenthesis or begin ... end
if it uses if .. then
, pattern matching or sequence.
Beware that inside the scope of the extension, you can use the syntax for grammars everywhere. This allows for some nesting as in:
type p = Atom | Prod | Sum
let%parser rec
expr p = Atom < Prod < Sum
; (p=Atom) (x::FLOAT) => x
; (p=Atom) '(' (e::expr Sum) ')' => e
; (p=Prod) (x::expr Prod) => ( '*' (y::expr Atom) => x*.y
; '/' (y::expr Atom) => x/.y)
; (p=Sum ) (x::expr Sum ) => ('+' (y::expr Prod) => x+.y
; '-' (y::expr Prod) => x-.y)
But when using this, beware that x
is not available before the final action code, it can not be used for selecting grammar rule. (p::prio) => (x::g p) => x
will report p
as unbounded. To solve this, you can use dependent sequences, using (x,y)>:item
will allow x
(but not y
) to be used both in the action and the rule. The separation with dependent and non dependent part is crucial as dependent grammar are memoised and you don't want "noise". Here is an example of grammar using this to implement an extensible calculator (see tests/calc_ext.ml
):
let%parser rec
expr pmax = ((pe,e1)>:expr pmax)((pop,b)>:op pe pmax)
((__,e2)::expr pop) => (pop, b e1 e2)
; (x::FLOAT) => (0.0,x)
; '(' (e::expr_top) ')' => (0.0,e)
Here are the meaning of let bindings for grammars through the ppx extension:
Grammar.declare_grammar + Grammar.set_grammar
(if no parameter)Grammar.grammar_family + setting the grammar
if parameters are given. multiple parameter and using label are supported through curryfication by the ppx extension.For recursive grammar with exactly one parameter, a rule p_1 < p_2 < ... < p_n
will automatically add rules to include the grammar parametrized by p_i
in the grammar parametrized by p_(i+1)
. This was used by the calculator example above.
let%parser
accepts three attribute:
[@cache]
to cache the grammar (that is call Grammar.cache
on the grammar)[@merge f]
to apply f
if two parsetrees are possible for the same input. This corresponds to Grammar.cache ~merge:f
.[@layout blank]
or [@layout blank ~config:expr]
to apply Grammar.layout and change the blank characters for the grammar.Anything which does not correspond to this grammar will be unchanged in the ocaml code (like the type definition in the example above). A mutually recursive definition can also mix the definition of grammars (parametric of not) with the definition of normal ocaml values. This means you could put the whole file inside %%parse ...
.
Pacomb must eliminate left recursion in grammars in order to use combinators that would loop otherwise. However, left recursion is not supported if it traverses A Pacomb.Grammar.layout
constructor to change blanks (probably possible to solve this, but probably not worth it).
Grammars are not left factorised automatically: (A B) | (A C) may parse A twice and this may result in very poor performance. Two solutions:
Note: left recursion do not need and is not eliminated if the grammar uses a cache. However, this solution to use combinator is too slow for non ambiguous grammars so we do not impose a cache to all left recursive grammars.
The ppx extension is not too bad but still suffers from the fact that it uses a sub-language of OCaml to describe grammar. For instance let%parser g =
((_,x)::g) => x
is not legal because _
cannot be used in an Ocaml expression. Though the following works: let%parser g = ((__,x)::g) => x
. The syntax (((x,y) = z) :: g) => (x,y,z_pos)
is not very nice as we use =
to replace the as
keyword and we also need a lot of parentheses.