package fmlib_parse
Install
    
    dune-project
 Dependency
Authors
Maintainers
Sources
sha256=22d5648fe40ab6c6106ee4c7092db6f3547b721c299092dd12f092a307c8ac59
    
    
  doc/parse_lex.html
Separation of Parsing and Lexing
Overview
In many cases it is appropriate to separate parsing and lexing. A lexer breaks up the input stream into tokens like identifiers, parentheses, numbers, strings etc. Furthermore usually the lexer strips off whitespace. The parser handles the grammar of the language by using the tokens as primitives.
This approach has several advantages:
- For a real language the complexity of parsing a source file is separated into two managable sized parts.
- Handling whitespace in the parser makes the parser unnecessarily complex.
- As soon as a language has identifiers and keywords where the keywords look syntactically like identifiers, a parser handling characters directly requires a lot of backtracking which makes the parser inefficient. A lexer can recognize identifiers and after successful recognition of an identifier it checks by using an efficient lookup table if the identifier is a keyword.
However many combinator libraries do not offer the possibility to split up the parsing task into a lexer and a parser. `Fmlib_parse` supports the splitting up of lexing and parsing with a lot of functionality.
How to write a lexer
A lexer analyzes the input stream consisting of characters in the following way:
WS Token WS Token WS .... WS EOS
where WS is a possibly empty sequence of whitespace like blanks, tabs, newlines, comments etc. Token is a lexically correct token. EOS represents the end of the input stream.
Since the lexer has to succeed immediately after recognizing a syntactically correct token it is not a normal parser which succeeds only after having seen the end of input. Therefore a lexer is a partial parser. After having successfully recognized a token the lexer must be restartable to recognize the next token or to recognize the end of input.
The easiest way to write a lexer with the help of Fmlib_parse is to use Fmlib_parse.Character by doing the following steps:
- Define a module - Tokenand- Token_plusof the following form:- module Token = struct type t = T1 of ... T2 of ... ... End (* end of input *) ... end module Token_plus = struct type t = Position.range * Token end
- Write a module which satisfies the interface - Fmlib_parse.Interfaces.LEXER.- module Lexer = struct module C = struct include Character.Make (Unit) (* Trivial user state *) (Token_plus) (Fmlib_std.Void) (* No semantic error possible *) let ws: _ t = ... (* combinator recognizing optional but arbitrarily long whitespace *) Basic.skip_zero_or_more (...) let tok: Token.t t = ... (* Combinator recognizing tokens. *) let final: Token_plus.t t = C.lexer ws eos tok end (* Public Functions *) include C.Parser let start: t = (* Recognize the first token *) C.make_partial Position.start () C.final let restart (lex: t): t = (* Recognize subsequent tokens *) assert (has_succeeded lex); assert (not (has_consumed_end lex)); C.make_partial (position lex) () C.final |> transfer_lookahead lex end
- Note that the function - Fmlib_parse.Character.Make.lexerhas the following definition- let lexer (ws: _ t) (eos: Token.t) (tok: Token.t t) : Token_plus.t = let* _ = ws in located ( tok </> expect_end eos )- It first strips off whitespace and then it expects either a token or the end of input. The token or the end of input is returned with the corresponding position information. This functionality is usually expected from a lexer. However you can write your own combinator if you want to have a different behaviour. When you write your own function, be careful where to put - Fmlib_parse.Character.Make.expect_end.
Look into https://github.com/hbr/fmlib/blob/master/src/parse/test_json.ml to see an example with a simple json parser on how it works.
How to write a parser
- Write a module StatewhereState.trepresents the state of your parser. If you don't need a state, then useUnit.
- Write a module SemanticwhereSemantic.tit the type of semantic errors. If your parser issues only syntax errors, then useFmlib_std.Void.
- Write a module FinalwhereFinal.trepresents the structure you want to parse.
- Finally write the module representing the parser using - Fmlib_parse.Token_parserwhich uses- Token.tas the primitive tokens. Look into the same example as above.- module Parser = struct module C = struct include Token_parser.Make (State) (Token) (Final) (Semantic) ... let final: Final.t t = ... end (* Public Functions *) include C.Parser let token_parser: t = make State.start final end
How to wire the lexer and the parser
The final parse looks like
    module Parse_lex =
    struct
        include
            Parse_with_lexer
                (State)
                (Token)
                (Final)
                (Semantic)
                (Lexer)
                (Parser)
        let start: t =
            make Lexer.start Parser.token_parser
    endusing Fmlib_parse.Parse_with_lexer to generate the final parser which scans a stream of characters breaks the input up into tokens by using the lexer and analyzes the grammar by using the token parser. See same example as above.