~~Simple, efficient, sound and complete combinator parsing for all context-free grammars, using an oracle~~

The P3 combinator parsing library

(Talk at OCaml 2014)

Dr Tom Ridge

2014-09-05

# Main features

Parsing library (also has a parser generator component); interactive development

Can handle all CFGs (including left-recursive CFGs)

Scannerless (parse strings directly without lexing), or can use a lexer

Good theoretical basis, e.g. sound and complete (these terms have precise formal definitions)

Good performance (for a general parser)

Can handle some fancy examples

Anti-feature: currently nowhere near as fast as deterministic parsers, e.g. ocamlyacc

# This talk

- We look at some examples, not at the theory

# Example

- Consider the grammar
`E -> E E E | "1" | eps`

```
(* Grammar: E -> E E E | "1" | eps *)
let rec parse_E = (fun i -> (mkntparser "E" (
    (parse_E ***> parse_E ***> parse_E)
    |||| (a "1")
    |||| (a "")))
  i)
```
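
For readers unfamiliar with the notation, the combinators can be given a minimal self-contained semantics. The sketch below is a naive "list of successes" version, not P3's implementation; P3 is backed by an Earley parser and, unlike this toy version, handles left-recursive rules such as `E -> E E E` without looping.

```ocaml
(* A naive parser is a function from a string to a list of
   (result, remaining input) pairs. Illustrative only; P3's real
   combinators are implemented over an Earley parser. *)
type 'a parser_ = string -> ('a * string) list

(* a s: match the literal string s at the start of the input *)
let a (s : string) : string parser_ = fun input ->
  let n = String.length s in
  if String.length input >= n && String.sub input 0 n = s
  then [ (s, String.sub input n (String.length input - n)) ]
  else []

(* p ***> q: sequence two parsers, pairing their results *)
let ( ***> ) (p : 'a parser_) (q : 'b parser_) : ('a * 'b) parser_ =
  fun input ->
    List.concat_map
      (fun (x, rest) ->
         List.map (fun (y, rest') -> ((x, y), rest')) (q rest))
      (p input)

(* p |||| q: alternatives; collect the results of both *)
let ( |||| ) (p : 'a parser_) (q : 'a parser_) : 'a parser_ =
  fun input -> p input @ q input

(* p >>>> f: apply an action f to each result *)
let ( >>>> ) (p : 'a parser_) (f : 'a -> 'b) : 'b parser_ =
  fun input -> List.map (fun (x, rest) -> (f x, rest)) (p input)
```

With these definitions, `(a "1" ***> a "1") "11"` returns `[(("1", "1"), "")]`, i.e. one successful parse consuming the whole input.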

# What does complete mean?

Completeness naively means: all parse trees are returned

What if there are an infinite number of parse trees?

Given a grammar, define a class of "good" parse trees. The class is finite (but potentially exponentially large).

"Good" parse trees are all you need (why this is so is not covered in this talk).

Completeness means: all good parse trees are returned.

# Example: parse trees

```
let rec parse_E = (fun i -> mkntparser "E" (
    ((parse_E ***> parse_E ***> parse_E)
       >>>> (fun (x,(y,z)) -> `Node(x,y,z)))
    |||| ((a "1") >>>> (fun _ -> `LF("1")))
    |||| ((a "") >>>> (fun _ -> `LF(""))))
  i)
```

```
Input size | Number of results (good parse trees)
         0 | 1
         1 | 1
         2 | 3
         3 | 19
         4 | 150
       ...
        19 | 441152315040444150
```


Fact: computing results for inputs of size 19 or larger is not feasible

Note: an exponential number of results means computing them must take (at least) an exponential amount of time

Note: this has nothing to do with parsing; it is due to the actions

# Conceptual distinction: parsing is not applying actions

Parsing is different from applying the actions

Parsing can be done in polytime (e.g. using Earley). Results can be computed in polytime and stored in polyspace.

Applying actions takes how long? Well, it depends on the actions.

Applying actions over all good parse trees (where each action takes constant time) takes an exponential amount of time

# Example: counting

```
(* Grammar: E -> E E E | "1" | eps *)
let rec parse_E = (fun i -> (mkntparser "E" (
    ((parse_E ***> parse_E ***> parse_E)
       >>>> (fun (x,(y,z)) -> x+y+z))
    |||| ((a "1") >>>> (fun _ -> 1))
    |||| ((a "") >>>> (fun _ -> 0))))
  i)
```

```
Input size | Number of results
         0 | 1
         1 | 1
         2 | 1
         4 | 1
       ...
        19 | 1
```

# Example: memoized counting (YCDTWAOP)

```
let rec parse_E =
  let tbl_E = MyHashtbl.create 100 in
  (fun i -> memo_p3 tbl_E (mkntparser "E" (
      ((parse_E ***> parse_E ***> parse_E)
         >>>> (fun (x,(y,z)) -> x+y+z))
      |||| ((a "1") >>>> (fun _ -> 1))
      |||| ((a "") >>>> (fun _ -> 0))))
    i)
```

This is polytime - inputs of length 100 (!!!) take a few seconds to process and return a single result

- The library has been designed to ensure "optimal" performance at each stage.
- If you want to write highly ambiguous grammars, then **parsing** is still `O(n^3)`.
- The performance when applying **actions** can be optimized using memoization.

*Simple semantics:* compute actions over all (good) parse trees; *there are exponentially many such parse trees, but this doesn't have to take exponential time*

What this means is that, provided your actions don't cause exponentially many results to be returned, performance is often pretty reasonable
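
The memoization machinery itself is ordinary. The helper below is illustrative (it is not P3's `memo_p3`), but it plays the same role: cache results keyed by input, so each distinct input is processed only once.

```ocaml
(* Generic memoization: look up x in tbl; on a miss, compute f x and
   record it. Illustrative sketch of the idea behind memo_p3. *)
let memoize (tbl : ('a, 'b) Hashtbl.t) (f : 'a -> 'b) (x : 'a) : 'b =
  match Hashtbl.find_opt tbl x with
  | Some v -> v                    (* reuse the previously computed result *)
  | None ->
    let v = f x in
    Hashtbl.add tbl x v;
    v

(* Example: an exponentially-branching recursion becomes linear, because
   each distinct argument is computed exactly once. *)
let fib_tbl : (int, int) Hashtbl.t = Hashtbl.create 100

let rec fib n =
  memoize fib_tbl
    (fun n -> if n < 2 then n else fib (n - 1) + fib (n - 2))
    n
```

Here `fib 80` returns instantly, while the unmemoized recursion would branch exponentially; the memoized parser benefits in the same way, since many distinct calls share subparses of the same substring.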

# Example: disambiguation

- Arithmetic expressions are pretty basic, but parsing and disambiguating them is somewhat complicated

`E -> E "+" E | E "-" E | ...`

- One approach: fiddle with the grammar (this uses CFG structure to encode associativity and precedence)

```
<Exp> ::= <Exp> + <Term> | <Exp> - <Term> | <Term>
<Term> ::= <Term> * <Factor> | <Term> / <Factor> | <Factor>
<Factor> ::= x | y | ... | ( <Exp> ) | - <Factor>
```
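
To see how the stratification does the work, here is a hand-written recursive-descent rendering of that grammar over a list of characters. This is illustrative code, not part of P3: the left recursion in `<Exp>` and `<Term>` is replaced by left-folding loops, which is exactly what makes `-` and `/` left-associative.

```ocaml
(* AST for the stratified grammar above (illustrative names) *)
type exp = Var of char | Plus of exp * exp | Minus of exp * exp
         | Times of exp * exp | Div of exp * exp | Neg of exp

exception Parse_error

(* <Factor> ::= x | y | ... | ( <Exp> ) | - <Factor> *)
let rec factor = function
  | '(' :: ts ->
    (match exp ts with
     | e, ')' :: ts' -> e, ts'
     | _ -> raise Parse_error)
  | '-' :: ts -> let e, ts' = factor ts in Neg e, ts'
  | c :: ts when 'a' <= c && c <= 'z' -> Var c, ts
  | _ -> raise Parse_error

(* <Term> ::= <Term> * <Factor> | <Term> / <Factor> | <Factor>
   left recursion becomes a loop that folds to the left *)
and term ts =
  let rec loop acc = function
    | '*' :: ts -> let f, ts' = factor ts in loop (Times (acc, f)) ts'
    | '/' :: ts -> let f, ts' = factor ts in loop (Div (acc, f)) ts'
    | ts -> acc, ts
  in
  let f, ts' = factor ts in loop f ts'

(* <Exp> ::= <Exp> + <Term> | <Exp> - <Term> | <Term> *)
and exp ts =
  let rec loop acc = function
    | '+' :: ts -> let t, ts' = term ts in loop (Plus (acc, t)) ts'
    | '-' :: ts -> let t, ts' = term ts in loop (Minus (acc, t)) ts'
    | ts -> acc, ts
  in
  let t, ts' = term ts in loop t ts'
```

Parsing `x+y*z` yields `Plus (Var 'x', Times (Var 'y', Var 'z'))`: `*` binds tighter purely because it sits one level lower in the grammar, and `x-y-z` groups as `(x-y)-z` because of the left-folding loop.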

- Another approach: add associativity and precedence directly to the parser; the following expresses associativity and precedence (later declarations have higher precedence)

```
%left PLUS MINUS
%left MULTIPLY DIVIDE
%left NEG /* negation -- unary minus */
%right CARET /* exponentiation */
```

- Both these approaches can be used by P3 without any special support in the parser

# Example: disambiguation (YCDTWAOP)

A **general** approach: rewrite parse trees (or, less generally, throw away ones you don't want)

The following is for right-associative `+` and `*`; we return an option (`Some` if the parse is acceptable, `None` if it is not)

Suppose the input is `1*2+3`; this should be parsed as `(1*2)+3`, not `1*(2+3)`

```
E -> E "+" E {{ fun (x,(_,y)) -> (match x,y with
                 | (Some(Plus _),Some _) -> None  (* not (x+y)+z ! *)
                 | (Some x,Some y) -> Some(Plus(x,y))
                 | _ -> None) }}
   | E "*" E {{ fun (x,(_,y)) -> (match x,y with
                 | (Some(Times _),Some _) -> None (* not (x*y)*z ! *)
                 | (Some(Plus _),Some _) -> None  (* not (x+y)*z ! *)
                 | (Some x,Some(Plus _)) -> None  (* not x*(y+z) ! <-- *)
                 | (Some x,Some y) -> Some(Times(x,y))
                 | _ -> None) }}
   | ?num? {{ fun s -> Some(Num(int_of_string (content s))) }}
```
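
The same discipline can be demonstrated without P3. The sketch below (illustrative types and names, not P3 code) enumerates every bracketing of a flat token list and applies the acceptability rules from the actions above; exactly the parses we want survive.

```ocaml
type expr = Num of int | Plus of expr * expr | Times of expr * expr
type tok = TN of int | TPlus | TTimes

(* right-associate +: reject trees whose + has a + as left child *)
let mk_plus x y = match x, y with
  | Some (Plus _), Some _ -> None            (* not (x+y)+z *)
  | Some x, Some y -> Some (Plus (x, y))
  | _ -> None

(* right-associate *, and give * higher precedence than + *)
let mk_times x y = match x, y with
  | Some (Times _), Some _ -> None           (* not (x*y)*z *)
  | Some (Plus _), Some _ -> None            (* not (x+y)*z *)
  | Some _, Some (Plus _) -> None            (* not x*(y+z) *)
  | Some x, Some y -> Some (Times (x, y))
  | _ -> None

(* enumerate all bracketings, applying the option-returning actions *)
let rec parses (ts : tok list) : expr option list =
  match ts with
  | [ TN n ] -> [ Some (Num n) ]
  | _ ->
    (* split at every operator position; pre is the reversed prefix *)
    let rec splits pre post acc =
      match post with
      | [] -> acc
      | (TPlus as op) :: rest | (TTimes as op) :: rest ->
        let here =
          List.concat_map
            (fun l ->
               List.map
                 (fun r -> if op = TPlus then mk_plus l r else mk_times l r)
                 (parses rest))
            (parses (List.rev pre))
        in
        splits (op :: pre) rest (here @ acc)
      | t :: rest -> splits (t :: pre) rest acc
    in
    splits [] ts []

(* keep only the accepted parses *)
let good ts = List.filter_map (fun x -> x) (parses ts)
```

For `1*2+3` this yields the single tree `Plus (Times (Num 1, Num 2), Num 3)`: the other bracketing, `1*(2+3)`, is built but immediately rejected by `mk_times`.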

# Example: modular combination of parsers (YCDTWAOP probably)

Consider grammars that are X (where X is LALR(1) or LR(1) or ...)

In general, combining two X grammars does not result in an X grammar. So you can't modularly specify and combine grammars from such classes without moving to a more general class.

In contrast, two CFGs can be combined to form a CFG. So you can modularly specify and combine such grammars.

Example: modular specification and combination of parsers and evaluators for arithmetic expressions, boolean expressions, and the lambda calculus

```
(* w is (possibly zero length) whitespace *)
let ( ***> ) x y = (x ***> w ***> y) >>>> (fun (x,(_,y)) -> (x,y))

let rec parse_A h = (fun i -> mkntparser "arithexp" (
    (((parse_A h) ***> (a "+") ***> (parse_A h))
       >>>> (fun (e1,(_,e2)) -> `Plus(e1,e2)))
    |||| (((parse_A h) ***> (a "-") ***> (parse_A h))
       >>>> (fun (e1,(_,e2)) -> `Minus(e1,e2)))
    |||| (parse_num
       >>>> (fun s -> `Num(int_of_string (content s))))
    |||| (((a "(") ***> (parse_A h) ***> (a ")"))  (* brackets *)
       >>>> (fun (_,(e,_)) -> e))
    |||| h)  (* helper parser *)
  i)
```

Note the presence of the helper parser `h`; this parses "all the other types of expressions"

Define `parse_B` for booleans, and `parse_L` for lambda expressions

```
let rec parse_L h = (fun i -> mkntparser "lambdaexp" (
    (((a "\\") ***> parse_var ***> (parse_L h))  (* lam *)
       >>>> (fun (_,(x,body)) -> `Lam(x,body)))
    |||| (((parse_L h) ***> (parse_L h))  (* app *)
       >>>> (fun (e1,e2) -> `App(e1,e2)))
    |||| (parse_var >>>> (fun s -> `Var s))  (* var *)
    |||| (((a "(") ***> (parse_L h) ***> (a ")"))  (* brackets *)
       >>>> (fun (_,(e,_)) -> e))
    |||| h)  (* helper parser *)
  i)
```

```
let parse_U = (
  let rec l i = parse_L h i
  and a i = parse_A h i
  and b i = parse_B h i
  and h i = (l |||| a |||| b) i
  in
  l)
```

Arrrrggghhh! What about termination? Relax... the approach is complete, which means that you get all results, which means (inter alia) that the parsers always terminate

Then define an evaluator in the usual way, and finally, combine the parser and the evaluator

```
let parse_and_eval txt = (remove_err (
    p3_run_parser
      ((parse_memo_U ()) >>>> eval empty_env)
      txt))
let y = "(\\ f ((\\ x (f (x x))) (\\ x (f (x x)))))"
(* sigma; let rec sigma x = if x < 2 then 1 else x+(sigma (x-1)) *)
let sigma = "(\\ g (\\ x (if (x < 2) then 1 else (x+(g (x-1))))))"
(* following is a lambda calc version of the sigma function, applied to argument 5 *)
let txt = "("^y^" "^sigma^") 5"
let [r] = parse_and_eval txt
```
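
As a sanity check on the expected answer, here is the sigma function from the comment above written directly in OCaml:

```ocaml
(* plain OCaml version of the sigma function encoded as a lambda term *)
let rec sigma x = if x < 2 then 1 else x + sigma (x - 1)
(* sigma 5 = 5 + 4 + 3 + 2 + 1 = 15, so the parse-and-evaluate pipeline
   should return 15 for the input above *)
```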

# Conclusion

It works and is usable for prototyping and small applications

The back-end Earley parser is highly engineered, and probably of use in other applications

I'm still unhappy with the combinator interface (eta expansion, naming of parsers)

I'm still working to improve real-world performance further

Hopefully the talk has shown you a few things you might not have seen before in any other parser library

Long-term aim: a general parser, with verified correctness and performance, usable in the real world

https://github.com/tomjridge/p3