earley

Parsing library based on Earley Algorithm
IN THIS PACKAGE
val debug_lvl : int ref

Flags.

val warn_merge : bool ref
val log : ( 'a, out_channel, unit ) format -> 'a
exception Error

Exception raised for parse error. Can be raise in the action of a grammar using give_up

val idt : 'a -> 'b

identity is often used

type blank = Input.buffer -> int -> Input.buffer * int

A blank function is just a function progressing in a buffer

type position = Input.buffer * int

Positions

type 'a fpos = Input.buffer -> int -> Input.buffer -> int -> 'a

type of a function waiting for positions

type _ pos =
| Idt : ( 'a -> 'a ) pos
| Simple : 'a -> 'a pos
| WithPos : 'a fpos -> 'a pos
| Error : 'a pos

Type for action with or without position and its combinators

val apply_pos : 'a. 'a pos -> 'a fpos

Common combinators, easy from their types

type 'a input = Input.buffer -> int -> 'a * Input.buffer * int

For terminals: get the start position and returns a value with the final position

type errpos = (Input.buffer * int) option ref

A reference to record the last error useful when a terminal calls another parsing

type 'a input2 = errpos -> blank -> Input.buffer -> int -> 'a input

Type for Ter2 terminals: get both the position before and after blank

type 'a test = ('a * bool) fpos

Type for tests: get also both position and return a boolean and a value

type pos2 = {
buf : Input.buffer;
col : int;
buf_ab : Input.buffer;
col_ab : int;
}

Position record stored in the elements of the earley table. We store the position before and after the blank

val eq_pos : pos2 -> pos2 -> bool

Some function on pos2

val apply_pos_start : 'a pos -> pos2 -> pos2 -> 'b

Get the position before and after the parsed text annd apply it to the combinator

val apply_start : ( Input.buffer -> int -> Input.buffer -> int -> 'a ) -> pos2 -> pos2 -> 'b
type info = (bool * Charset.t) Utils.Fixpoint.t

Type of the information computed on a rule: the boolean tells if the grammar can parse an empty string and the charset, the first accepted characteres when the rule is used to parse something.

THE MAIN TYPES

type 'a cref = 'a Container.Ref.container

Use the container references, because they can be reset

type 'a tref = ('a * Input.buffer * int) option cref

a reference to the store the result of reading a terminal

module Types : sig ... end

A BNF grammar is a list of rules. The type parameter 'a corresponds to the type of the semantics of the grammar. For example, parsing using a grammar of type int grammar will produce a value of type int.

module StackContainer : Container.Param with type ('b, 'a) elt = ( 'a, 'b ) Types.pre_stack

Use container to store a point to the rule in stack, we use recursive module for that

include module type of struct include Types end

A BNF grammar is a list of rules. The type parameter 'a corresponds to the type of the semantics of the grammar. For example, parsing using a grammar of type int grammar will produce a value of type int.

type 'a grammar = info * 'a rule list

The type of a grammar, with its information

and _ symbol = _ Types.symbol =
| Term : {
input : 'a input;
info : info;
memo : 'a tref;
name : string;
} -> 'a symbol
(*

terminal symbol just read the input buffer

*)
| Ter2 : {
input : 'a input2;
info : info;
memo : 'a tref;
name : string;
} -> 'a symbol
(*

Ter2 correspond to a recursive call to the parser. We can change the blank function for instance, or parse input as much as possible. In fact it is only in the combinators in earley.ml that we see the use Ter2 to call the parser back.

*)
| Test : {
input : 'a test;
info : info;
memo : 'a tref;
name : string;
} -> 'a symbol
(*

Test on the input, can for instance read blanks, usefull for things like ocamldoc (but not yet used by earley-ocaml).

*)
| NonTerm : {
mutable rules : 'a rule list;
mutable info : info;
memo : 'a ntref;
name : string;
} -> 'a symbol
(*

Non terminals trough a mutable field to define recursive rule lists

*)

The symbol, a more general concept that terminals

and 'a ntref = 'a rule list cref

Type of a container ref to memoize treatment of non terminal

and _ prerule = _ Types.prerule =
| Empty : 'a pos -> 'a prerule(*

Empty rule.

*)
| Next : info * 'a symbol * ( 'a, 'c ) next -> 'c prerule(*

Sequence of a symbol and a rule, with a possible name for debugging, the information on the rule, the symbol to read, an action and the rest of the rule

*)
| Dep : ( 'a -> 'b rule ) -> ( 'a -> 'b ) prerule(*

Dependant rule, gives a lot of power! but costly! use only when no other solution is possible

*)

BNF rule.

and ('a, 'c) next = ( 'a, 'c ) Types.next =
| Arg : ( 'a -> 'c ) rule -> ( 'a, 'c ) next
| Pos : ( 'a -> 'c ) fpos rule -> ( 'a, 'c ) next
| Ign : 'c rule -> ( 'a, 'c ) next
and 'a rule = 'a Types.rule = {
rule : 'a prerule;
ptr : 'a StackContainer.container;
adr : int;
}

Each rule holds a container to associate data to the rule in O(1). see below the description of the type ('a,'b) pre_stack

type _ final = _ Types.final =
| D : {
start : pos2;(*

position in buffer, before and after blank

*)
stack : ( 'c, 'r ) stack;(*

tree of stack representing what should be done after reading the rest of the rule

*)
acts : 'b -> 'c;(*

action to produce the final 'c.

*)
rest : 'b rule;(*

remaining to parse, will produce 'b

*)
full : 'c rule;(*

full rule. rest is a suffix of full.

*)
} -> 'r final

Type of an active element of the earley table. In a description of earley, each table element is (start, end, done * rest) meaning we parsed the string from start to end with the rule done and it remains to parse rest. The * therefore denotes the current position. Earley is basically a dynamic algorithm producing all possible elements.

We depart from this representation in two ways:

  • we do not represent done, we keep the whole whole rule: full = done rest
  • we never keep end. It is only used when we finish parsing of a rule and we have an element (start, end, done * Empty) then, we look for other elements of the form (start', end', done' * rest') where end' = start, rest' = nonterm' rest'' and nonterm' is a non terminal containing the done rule. We represent this situation by a stack in the element (start, end, done * Empty), which is maintained to lists all the elements satisfying the above property (no more, no less, each element only once)

The type 'a final represent an element of the earley table where end is the current position in the string being parsed.

and (_, _) element = ( _, _ ) Types.element =
| B : ( 'a -> 'r ) pos -> ( 'a, 'r ) element(*

End of the stack, with an action to produce the final parser value

*)
| C : {
start : pos2;
stack : ( 'c, 'r ) stack;
acts : ( 'a -> 'b -> 'c ) pos;(*

Here we wait x:'a from parents

*)
rest : 'b rule;
full : 'c rule;
} -> ( 'a, 'r ) element
(*

Cons cell of the stack with a record similar to D's

*)

Type of the element that appears in stacks. Note: all other elements no reachable from stacks will be collected by the GC, which is what we want to release memory.

The type is similar to the previous: ('a, 'r) element, means that from a value of type 'a, comming from our parent in the stack, we could produce a value of type 'r using rest.

The action needs to be prametrised by the future position which is unknown.

and ('a, 'b) stack = ( 'a, 'b ) element list ref

Stack themselves are an acyclic graph of elements (sharing is important to be preserved). We need a reference for the stack construction.

For the construction of the stack, all elements of the same list ref of type ('a,'b) stack have the same en position end'. And all elements that points to this stack have their begin position start = end'. Moreover, all elements with the same full and start must point to the same stack. Recall that end is not represented in elements.

We call the position end' associated to a stack (as said the start of the element that point to this stack, the "stack position". An important point: new stack elements are constructed when the stack position is the current position.

Moreover, when we add a point from an element (start, end, rest, full) to a stack (which is therefore at position start, we have start = end and rest = full. The rule has not parsed anything! This is the "prediction" phase of earley.

To do this construction, we use the record below with a hook that we run on all elements added to that stack. This record is only used with stack whose position are the current position: all these records will become inaccessible when we advance in parsing.

Morevoer, a direct pointer (thanks to the Container module) is kept from the full rule of all elements that point to these stack and that have the current position as end. This is the role of the functor call below.

type ('a, 'b) pre_stack = ( 'a, 'b ) Types.pre_stack = {
stack : ( 'a, 'b ) stack;
mutable hooks : ( ( 'a, 'b ) element -> unit ) list;
}

stack in construction ... they have a hook, to run on elements added to the stack later !

Recall the INVARIANTS:

1° Consider two C elements (or two D elements) of a stack. If their have the same start and full is means we can to the same after a complete parsing of the rule.

Then, the two elements MUST HAVE PHYSICALLY EQUAL stack

2° For D nodes, we keep only one for each (start, rest, full) triple with a merge warning when a node only differs by its actions (compared looking inside closure).

So D nodes with the same (start, rest, full) triple always have equal action. This propagates to each stack (C node with the same (start, rest, full) in the same stack have phisically equal actions.

val count_rule : int ref

Counter defined outside mkrule due to value restriction.

val mkrule : 'a. 'a prerule -> 'a rule

A function to build rule from pre_rule

val eq_rule : 'a 'b. 'a rule -> 'b rule -> ( 'a, 'b ) Container.eq

Rule equlity

val eq_C : 'a 'b. ( 'a, 'b ) element -> ( 'a, 'b ) element -> bool

Equality for stack element: as we keep the invariant, we only need physical equality. You may uncomment the code below to check this.

val eq_D : 'a final -> 'b final -> bool

Equality on a final needs to do a comparison as it is used to test if a new element is already present.

val idtEmpty : 'a. unit -> ( 'a -> 'a ) rule

Some rules/grammar contruction that we need already here

val symbol_name : 'a. 'a symbol -> string

Handling of symbol, rule and grammar names

val rule_name : 'a. 'a rule -> string
val grammar_name : 'a. ?delim:bool -> 'a grammar -> string
val grammar_delim_name : 'a. 'a grammar -> string
val keep_all_names : bool ref
val grammar_to_rule : 'a. ?name:string -> 'a grammar -> 'a rule

Converting a grammar to a rule

val force : 'a Utils.Fixpoint.t -> 'a

Basic constants/functions for rule information

val iempty : (bool * Charset.charset) Utils.Fixpoint.t
val rule_info : 'a. 'a rule -> info

managment of info = accept empty + charset accepted as first char

val symbol_info : 'a. 'a symbol -> info
val compose_info : 'a symbol -> 'b rule -> info
val grammar_info : 'a. 'a rule list -> info
val print_pos : out_channel -> pos2 -> unit
val print_final_no_pos : out_channel -> 'a final -> unit
val print_final : out_channel -> 'a final -> unit
val print_element : 'a 'b. out_channel -> ( 'a, 'b ) element -> unit
val print_rule : out_channel -> 'a rule -> unit

Here are the 3 type for tables used by out algorithm

module K : sig ... end

This type is the state of the parsing table for the current position it only hold 'a final elements whose end are the current position, the keys are the buffer uid's, column position and uid of the full and rest rules

module HK : sig ... end
type 'a cur = 'a final HK.t
type 'a reads = 'a final Input.OrdTbl.t ref

Type of a table with pending reading, that is elements resulting from reading the string with some symbols. We need this table, because two terminal symbols could read different length of the input from the same point

type 'a sct = 'a StackContainer.table

Stack construction. The type below, denotes table associate stack to rule. Recall we construct stack for elements whose end are the current position

type tmemo = unit Container.Ref.table

Table to memoize the treatment of terminals and non terminals. This is important to garanty that results of actions are physicaly equals when they comes from the same parsing. To do this, terminals must read the input only once and the result must be memoized.

val add_stack_hook : 'a 'b. 'b sct -> 'a rule -> ( ( 'a, 'b ) element -> unit ) -> unit

add_stack_hook sct rule fn adds in table a hook fn for the given rule. fn will be called each time an element is added to the stack pointed by that rule. The hook is run on the existing elements if the stack. The stack is created if it did not exists yet

val add_stack : 'a 'b. 'b sct -> 'a rule -> ( 'a, 'b ) element -> ( 'a, 'b ) stack

add_stack sct rule element add the given element to the stack of the given rule in the table sct. Runs all hooks if any. Creates the stack if needed

val find_stack : 'a 'b. 'b sct -> 'a rule -> ( 'a, 'b ) stack

find_stack sct rule finds the stack to associate to the given rule, again the stack is created if needed

val size_tables : ( 'a, 'b final ) Hashtbl.t -> 'c final Input.OrdTbl.t -> int

Computes the size of the two tables, forward reading and the current elements. The other tables do not retain stack pointers

val size : 'a final -> int

wrap up the size function

val tail_key : 'a. 'a rule -> int

A fonction to fetch the key of the tail of a rule, needed to get the key of an element representing a complete parsing

val final : 'b 'c 'r. pos2 -> ( 'b -> 'c ) -> ( 'c, 'r ) stack -> 'b rule -> 'c rule -> 'r final

Constructor for the final type which includes its hash ley

val elt_key : 'a final -> int * int * int

Get the key of an element

val final_key : pos2 -> 'a rule -> int * int * int

Get the key of the final element when parsing is finished

val good : char -> 'a rule -> bool

Test is a given char is accepted by the given rule

val max_stack : int ref
val add : string -> pos2 -> char -> 'a final -> 'a cur -> bool

Adds an element in the current table of elements, return true if new

Combinators for actions, these are just the combinators we need, contructed from there types

val cns : 'a 'b 'c. 'a -> ( 'b -> 'c ) -> ( 'a -> 'b ) -> 'c

This one for completion

val cns_pos : 'a 'b 'c. ( 'a -> ( 'b -> 'c ) -> ( 'a -> 'b ) fpos -> 'c ) fpos
val cns_ign : 'a 'b 'c. 'a -> ( 'b -> 'c ) -> 'b -> 'c
val combine : 'a 'c 'd. ( 'c -> 'd ) -> ( 'a -> ( 'a -> 'c ) -> 'd ) pos

This one for prediction

val combine_pos : 'a 'c 'd. ( 'c -> 'd ) -> ( 'a -> ( 'a -> 'c ) fpos -> 'd ) pos
val combine_ign : 'a 'c 'd. ( 'c -> 'd ) -> ( 'a -> 'c -> 'd ) pos
exception Parse_error of Input.buffer * int

Exception for parse error, can also be raise by Ter2 terminals, but no other terminal handles it

val update_errpos : (Input.buffer * int) option ref -> Input.buffer -> int -> unit
val pred_prod_lec : 'a. errpos -> 'a final -> 'a cur -> 'a reads -> 'a sct -> tmemo -> blank -> pos2 -> char -> unit

This is now the main function computing all the consequences of the element given at first argument. It needs

  • the element elt0 being added
  • the errpos to record advaces in the parsing
  • the four tables: elements forward sct tmemo
  • the blank (to pass it to Ter2 terminals)
  • the position cur_pos and current charater c for the action and the good test

It performs prediction/completion/lecture in a recursive way.

val count : int ref
val parse_buffer_aux : 'a. ?errpos:errpos -> bool -> blank -> 'a grammar -> Input.buffer -> int -> 'a * Input.buffer * int

The main parsing loop

val internal_parse_buffer : 'a. ?errpos:errpos -> ?blank_after:bool -> blank -> 'a grammar -> Input.buffer -> int -> 'a * Input.buffer * int

Function to call the parser in a terminal