package pacomb

  1. Overview
  2. Docs

Lexing: grouping characters before parsing

It is traditionnal to do parsing in two phases (scanning/parsing). This is not necessary with combinators in general (scannerless). This is still true with Pacomb. However, this makes the grammar more readable to use a lexing phase.

Moreover, lexing is often done with a longuest match rule that is not semantically equivalent to the semantics of context free grammar.

This modules provide combinator to create terminals that the parser will call.

Types and exception

type buf = Input.buffer

Position in a buffer is a Input.buffer together with an index Input.pos.

type idx = Input.idx
type 'a lexeme = buf -> idx -> 'a * buf * idx

Type of terminal function, similar to blank, but with a returned value

type _ ast =
  1. | Any : char ast
  2. | Any_utf8 : Uchar.t ast
  3. | Any_grapheme : string ast
  4. | Eof : unit ast
  5. | Char : char -> unit ast
  6. | Grapheme : string -> unit ast
  7. | String : string -> unit ast
  8. | Nat : int ast
  9. | Int : int ast
  10. | Float : float ast
  11. | CharLit : char ast
  12. | StringLit : string ast
  13. | Test : (char -> bool) -> char ast
  14. | NotTest : (char -> bool) -> unit ast
  15. | Seq : 'a t * 'b t * ('a -> 'b -> 'c) * 'c Assoc.key -> 'c ast
  16. | Alt : 'a t * 'a t -> 'a ast
  17. | Save : 'a t * (string -> 'a -> 'b) * 'b Assoc.key -> 'b ast
  18. | Option : 'a * 'a t -> 'a ast
  19. | Appl : 'a t * ('a -> 'b) * 'b Assoc.key -> 'b ast
  20. | Star : 'a t * (unit -> 'b) * ('b -> 'a -> 'b) * 'b Assoc.key -> 'b ast
  21. | Plus : 'a t * (unit -> 'b) * ('b -> 'a -> 'b) * 'b Assoc.key -> 'b ast
  22. | Sub : 'a t * ('a -> bool) * 'a Assoc.key -> 'a ast
  23. | Keyword : string * int -> unit ast
  24. | Custom : 'a lexeme * 'a Assoc.key -> 'a ast

ast for terminals, needed for equality

and 'a terminal = {
  1. n : string;
    (*

    name

    *)
  2. f : 'a lexeme;
    (*

    the terminal itself

    *)
  3. a : 'a ast;
  4. c : Charset.t;
    (*

    the set of characters accepted at the beginning of input

    *)
}

The previous types encapsulated in a record

and 'a t = 'a terminal

Abbreviation

exception NoParse

exception when failing,

  • can be raised (but not captured) by terminals
  • can be raised (but not captured) by action code in the grammar, see Combinator.give_up
  • will be raised and captured by Combinator.parse_buffer that will give the most advanced position
exception Give_up of string

from action ony may give an error message when rejecting a rule

val give_up : ?msg:string -> unit -> 'a

give_up () rejects parsing from a corresponding semantic action. An error message can be provided. Can be used both in the semantics of terminals and parsing rules.

Combinators to create terminals

val any : ?name:string -> unit -> char t

accept any character, except eof

val eof : ?name:string -> unit -> unit t

Terminal accepting the end of a buffer only. remark: eof is automatically added at the end of a grammar by Combinator.parse_buffer. name default is "EOF"

val char : ?name:string -> char -> unit t

Terminal accepting a given char, remark: char '\255' is equivalent to eof. name default is the given charater.

val test : ?name:string -> (char -> bool) -> char t

Accept any character for which the test returns true. name default to the result of Charset.show.

val charset : ?name:string -> Charset.t -> char t

Accept a character in the given charset. name default as in test

val not_test : ?name:string -> (char -> bool) -> unit t

Reject the input (raises Noparse) if the first character of the input passed the test. Does not read the character if the test fails. name default to "^" prepended to the result of Charset.show.

val not_charset : ?name:string -> Charset.t -> unit t

Reject the input (raises Noparse) if the first character of the input is in the charset. Does not read the character if not in the charset. name default as in not_test

val sub : ?name:string -> ?charset:Charset.t -> 'a t -> ('a -> bool) -> 'a t

Does a test on the result of a given lexer and reject if it returns false. You may provide a restricted charset for the set of charaters accepted in the initial position.

val seq : ?name:string -> 'a t -> 'b t -> ('a -> 'b -> 'c) -> 'c t

Compose two terminals in sequence. name default is the concatenation of the two names.

val seq1 : ?name:string -> 'a t -> 'b t -> 'a t

variation on the above

val seq2 : ?name:string -> 'a t -> 'b t -> 'b t
val seqs : 'a t list -> ('a -> 'a -> 'a) -> 'a t
val save : ?name:string -> 'a t -> (string -> 'a -> 'b) -> 'b t

save t f save the part of the input parsed by the terminal t and combine it with its semantics using f

val alt : ?name:string -> 'a t -> 'a t -> 'a t

alt t1 t2 parses the input with t1 or t2. Contrary to grammars, terminals does not use continuations, if t1 succeds, no backtrack will be performed to try t2. For instance,

seq1 (alt (char 'a' ())
         (seq1 (char 'a' ()) (char 'b' ())))
    (char 'b' ())

will reject "ab". If both t1 and t2 accept the input, longuest match is selected. name default to sprintf "(%s)|(%s)" t1.n t2.n.

val alts : 'a t list -> 'a t
val option : ?name:string -> 'a -> 'a t -> 'a t

option x t parses the given terminal 0 or 1 time. x is returned if 0. name defaults to sprintf "(%s)?" t.n.

val appl : ?name:string -> ('a -> 'b) -> 'a t -> 'b t

Applies a function to the result of the given terminal. name defaults to the terminal name.

val star : ?name:string -> 'a t -> (unit -> 'b) -> ('b -> 'a -> 'b) -> 'b t

star t a f Repetition of a given terminal 0,1 or more times. The type of function to compose the action allows for 'b = Buffer.t for efficiency. The returned value is f ( ... (f(f (a ()) x_1) x_2) ...) x_n if t returns x_1 ... x_n. name defaults to sprintf "(%s)*" t.n

val plus : ?name:string -> 'a t -> (unit -> 'b) -> ('b -> 'a -> 'b) -> 'b t

Same as above but parses at least once.

val string : ?name:string -> string -> unit t

string s Accepts only the given string. Raises Invalid_argument if s = "". name defaults to sprintf "%S" s.

val nat : ?name:string -> unit -> int t

Parses an natural in base 10. "-42" and "-42" are not accepted. name defaults to "NAT"

val int : ?name:string -> unit -> int t

Parses an integer in base 10. "+42" is accepted. name defaults to "INT"

val float : ?name:string -> unit -> float t

Parses a float in base 10. ".1" is accepted as "0.1" name defaults to "FLOAT"

val char_lit : ?name:string -> unit -> char t

Parses a char litteral 'c' using ocaml escaping convention name defaults to "CHARLIT"

val string_lit : ?name:string -> unit -> string t

Parses a string litteral "cccc" using ocaml escaping convention name defaults to "STRINGLIT"

val any_utf8 : ?name:string -> unit -> Uchar.t t

Parses a unicode UTF8 char name defaults to "UTF8"

val utf8 : ?name:string -> Uchar.t -> unit t

utf8 c parses a specific unicode char and returns (), name defaults to the string representing the char

val any_grapheme : ?name:string -> unit -> string t

Parses any utf8 grapheme. name defaults to "GRAPHEME"

val grapheme : ?name:string -> string -> unit t

grapheme s parses the given utf8 grapheme and return (). The difference with string s x is that if the input starts with a grapheme s' such that s is a strict prefix of s', parsing will fail. name defaults to "GRAPHEME("^s^")"

val accept_empty : 'a t -> bool

Test wether a terminal accept the empty string. Such a terminal are illegal in a grammar, but may be used in combinator below to create terminals

val test_from_lex : bool t -> buf -> idx -> buf -> idx -> bool

Test constructor for the test constructor in Grammar

val blank_test_from_lex : bool t -> buf -> idx -> buf -> idx -> bool
val eq : 'a t -> 'b t -> ('a, 'b) Assoc.eq

equality, incomplete in particular for "alt"

val custom : 'a lexeme -> 'a ast

If you build custom lexeme, you need to use this to fill the a field of the record

val default : 'a -> 'a option -> 'a

where to put it ...

OCaml

Innovation. Community. Security.