package earley

  1. Overview
  2. Docs

A BNF grammar is a list of rules. The type parameter 'a corresponds to the type of the semantics of the grammar. For example, parsing using a grammar of type int grammar will produce a value of type int.

type 'a grammar = info * 'a rule list

The type of a grammar, with its information

and _ symbol =
  1. | Term : {
    1. input : 'a input;
    2. info : info;
    3. memo : 'a tref;
    4. name : string;
    } -> 'a symbol
    (*

    terminal symbol just read the input buffer

    *)
  2. | Ter2 : {
    1. input : 'a input2;
    2. info : info;
    3. memo : 'a tref;
    4. name : string;
    } -> 'a symbol
    (*

    Ter2 correspond to a recursive call to the parser. We can change the blank function for instance, or parse input as much as possible. In fact it is only in the combinators in earley.ml that we see the use Ter2 to call the parser back.

    *)
  3. | Test : {
    1. input : 'a test;
    2. info : info;
    3. memo : 'a tref;
    4. name : string;
    } -> 'a symbol
    (*

    Test on the input, can for instance read blanks, usefull for things like ocamldoc (but not yet used by earley-ocaml).

    *)
  4. | NonTerm : {
    1. mutable rules : 'a rule list;
    2. mutable info : info;
    3. memo : 'a ntref;
    4. name : string;
    } -> 'a symbol
    (*

    Non terminals trough a mutable field to define recursive rule lists

    *)

The symbol, a more general concept that terminals

and 'a ntref = 'a rule list cref

Type of a container ref to memoize treatment of non terminal

and _ prerule =
  1. | Empty : 'a pos -> 'a prerule
    (*

    Empty rule.

    *)
  2. | Next : info * 'a symbol * ('a, 'c) next -> 'c prerule
    (*

    Sequence of a symbol and a rule, with a possible name for debugging, the information on the rule, the symbol to read, an action and the rest of the rule

    *)
  3. | Dep : ('a -> 'b rule) -> ('a -> 'b) prerule
    (*

    Dependant rule, gives a lot of power! but costly! use only when no other solution is possible

    *)

BNF rule.

and ('a, 'c) next =
  1. | Arg : ('a -> 'c) rule -> ('a, 'c) next
  2. | Pos : ('a -> 'c) fpos rule -> ('a, 'c) next
  3. | Ign : 'c rule -> ('a, 'c) next
and 'a rule = {
  1. rule : 'a prerule;
  2. ptr : 'a StackContainer.container;
  3. adr : int;
}

Each rule holds a container to associate data to the rule in O(1). see below the description of the type ('a,'b) pre_stack

type _ final =
  1. | D : {
    1. start : pos2;
      (*

      position in buffer, before and after blank

      *)
    2. stack : ('c, 'r) stack;
      (*

      tree of stack representing what should be done after reading the rest of the rule

      *)
    3. acts : 'b -> 'c;
      (*

      action to produce the final 'c.

      *)
    4. rest : 'b rule;
      (*

      remaining to parse, will produce 'b

      *)
    5. full : 'c rule;
      (*

      full rule. rest is a suffix of full.

      *)
    } -> 'r final

Type of an active element of the earley table. In a description of earley, each table element is (start, end, done * rest) meaning we parsed the string from start to end with the rule done and it remains to parse rest. The * therefore denotes the current position. Earley is basically a dynamic algorithm producing all possible elements.

We depart from this representation in two ways:

  • we do not represent done, we keep the whole whole rule: full = done rest
  • we never keep end. It is only used when we finish parsing of a rule and we have an element (start, end, done * Empty) then, we look for other elements of the form (start', end', done' * rest') where end' = start, rest' = nonterm' rest'' and nonterm' is a non terminal containing the done rule. We represent this situation by a stack in the element (start, end, done * Empty), which is maintained to lists all the elements satisfying the above property (no more, no less, each element only once)

The type 'a final represent an element of the earley table where end is the current position in the string being parsed.

and (_, _) element =
  1. | B : ('a -> 'r) pos -> ('a, 'r) element
    (*

    End of the stack, with an action to produce the final parser value

    *)
  2. | C : {
    1. start : pos2;
    2. stack : ('c, 'r) stack;
    3. acts : ('a -> 'b -> 'c) pos;
      (*

      Here we wait x:'a from parents

      *)
    4. rest : 'b rule;
    5. full : 'c rule;
    } -> ('a, 'r) element
    (*

    Cons cell of the stack with a record similar to D's

    *)

Type of the element that appears in stacks. Note: all other elements no reachable from stacks will be collected by the GC, which is what we want to release memory.

The type is similar to the previous: ('a, 'r) element, means that from a value of type 'a, comming from our parent in the stack, we could produce a value of type 'r using rest.

The action needs to be prametrised by the future position which is unknown.

and ('a, 'b) stack = ('a, 'b) element list Stdlib.ref

Stack themselves are an acyclic graph of elements (sharing is important to be preserved). We need a reference for the stack construction.

For the construction of the stack, all elements of the same list ref of type ('a,'b) stack have the same en position end'. And all elements that points to this stack have their begin position start = end'. Moreover, all elements with the same full and start must point to the same stack. Recall that end is not represented in elements.

We call the position end' associated to a stack (as said the start of the element that point to this stack, the "stack position". An important point: new stack elements are constructed when the stack position is the current position.

Moreover, when we add a point from an element (start, end, rest, full) to a stack (which is therefore at position start, we have start = end and rest = full. The rule has not parsed anything! This is the "prediction" phase of earley.

To do this construction, we use the record below with a hook that we run on all elements added to that stack. This record is only used with stack whose position are the current position: all these records will become inaccessible when we advance in parsing.

Morevoer, a direct pointer (thanks to the Container module) is kept from the full rule of all elements that point to these stack and that have the current position as end. This is the role of the functor call below.

type ('a, 'b) pre_stack = {
  1. stack : ('a, 'b) stack;
  2. mutable hooks : (('a, 'b) element -> unit) list;
}

stack in construction ... they have a hook, to run on elements added to the stack later !