Indentation Sensitivity
Up
Situation
Most programming languages express hierarchical structures by some kind of parentheses. Algol like languages use begin end, C like languages use curly braces {, } to enclose blocks of code. Since blocks can be nested inside blocks, the hierarchical or tree structure is well expressed by the syntax.
For the human reader blocks are usually indented to make the hierarchical structure graphically visible. Programming languages like Haskell and Python ommit the parentheses and express the hierarchical structure by indentation. I.e. the indentation is part of the grammar. This is pleasing to the eye, because many parentheses can be ommitted.
The hierarchical structure in the following schematical source file is immediately visible without the need of parentheses.
    xxxxxxxxxxx
        xxx
        xxx
            xxxxxxx
    xxxxxxxx
        xxxLower level blocks are indented with respect to their parent block and siblings at the same level are vertically aligned.
Because of this good readability configuration languages like yaml have become very popular.
Unfortunately there are not many parsers available which support indentation sensitivity. The library Fmlib_parse has support to parse languages whose grammar uses indentation to structure blocks hierarchically.
Indentation Combinators
In order to support indentation sensitivity, all constructs parsed by combinators have a certain indentation. Usually the identation of a construct is the start column of its leftmost token. For aligned constructs the first token must be the leftmost token.
Indented Block
If p is a combinator parsing a certain construct, then
indent 4 p
is a combinator which parses p indented at least 4 columns with respect to its parent.
Aligned Blocks
In order to align blocks vertically we need at least two combinators p and q whose constructs have to be vertically aligned.
The combinator
    (
        let* a = align p in
        let* b = align q in
        return (a, b)
    )
    |> indent 4         (* Note: Indentation of the whole block is
                                 mandatory!! Never forget!! *)
parses a construct with the structure
    xxxxxx
        pppppp
        qqqqqqwhere xxxxxx belongs to the outer structure.
It is straightforward to parse a list of aligned constructs where the combinator p parses an individual item.
    one_or_more (align p) |> indent 0
Note that some indentation is always necessary. The indentation can be zero. If no indentation is given then the parser tries to align the items recognized by p vertically below the leftmost token of the surrounding construct. This is usually not intended and a common pitfall.
Ignore Indentation
Sometimes it is necessary to ignore the indentation. The combinator
detach p
parses a construct described by the combinator p and ignores all indentation and alignment requirements at that position. This combinator is necessary in order to parse whitespace, because whitespace at the start of a line does not respect any indentation or alignment requirement.
Error Reporting
With layout parsing a syntax error can have two reasons:
- A token is not the expected one.
- A token does not start at an allowed position.
When a parser fails with a syntax error, it reports a list of failed expectations. Each failed expectations has two parts. A message describing the syntactic construct expected an an optional indentation expectation. If the indentation expectation is present, then the syntax error occurred because the token appeared at a not allowed position.
Because of the presence of biased choice and optional constructs in the combinators, multiple syntax expectations can fail. Therefore the parsers return a list of syntax expectations in case of a syntax error. In order make the expectations more informative for the user, it is common to use constructs like
    p </> q </> r <?> "some higher level expectation"
The <?> works as expected with layout parsing when the operator is used inside the indent and alignment combinators. When positioned outside, the parser works correctly, but the error messages might be misleading.
    (* Correct placement of '<?>' *)
    p
    </>
    q
    <?> "p or q"
    |> align
    |> one_or_more
    (* Wrong placement of '<?>' *)
    p
    </>
    q
    |> align
    <?> "p or q"
    |> one_or_more
If p and q fail because they are not properly aligned, their expectations are reported with the correct failed alignment expectation. In the correct sequence the operator <?> still sees the failed alignment and reports the more expressive expectation with the same failed alignment expectation. In the wrong sequence the <?> operator does not see the alignment requirement any more and reports the more expressive expectation without the failed alignment requirement.
Example
In this example we write a parser which parses a very simplified subset of yaml.
Requirements
Here are some examples of yaml structures and the corresponding json structure which the parser should recognize
    Hello Mr. Spock # some comment
        json: "Hello Mr. Spock"
    "Hello: Blabla#"
        json: "Hello: Blabla#"
    - - - hello
        json: [[["hello"]]]
    1: 11: hello
       12: ""
       "###": hash
        json: {"1": {"11": "hello", "12": "", "###": "hash"}}
    k1:
    - 1
    - - 1.1
      - 1.2
    k2: s2
        json: {"k1": ["1", ["1.1", "1.2"]], "k2": "s2"}Scalars and keys in key value pairs come in two flavors. Either a sequence of characters spanning to a colon or a newline or a sequence of characters enclosed in double quotes. We don't treat escape sequences and strings in single quotes here. Here we focus on the implementation of the indentation requirements.
Note that key value pairs have to be indented, if they occur as a substructure. Lists as a value of a key value pair do not need to be indented. The -  is sufficient to indicate the start of a list.
Yaml Structure
The final construct of the parser shall be a yaml value according to the following module:
    module Yaml =
    struct
        type t =
            | Scalar of string
            | List   of t list
            | Record of (string * t) list
        let scalar str = Scalar str
        let list lst = List lst
        let record lst = Record lst
        let rec to_json: t -> string =
            (* recursive function to map a yaml value into a json string
               for testing purposes *)
    end
Basic Combinators
We parse the yaml structure with the lexerless character parser.
    module CP = Character.Make (Unit) (Yaml) (Unit)
    open CP
A yaml comment starts with a hash sign # and spans to the end of the line.
    let comment: char t =
        let* _ = char '#' in
        let* _ =
            (charp (fun c -> c <> '\n') "comment character")
            |> skip_zero_or_more
        in
        return '#'
Whitespace is any sequence of zero or more blanks, newlines and comments.
    let whitespace: int t =
        char ' ' </> char '\n' </> comment
        |> skip_zero_or_more
        |> no_expectations  (* no expected whitespace in syntax errors *)
        |> detach           (* whitespace does not repect indentation *)
    let lexeme (p: 'a t): 'a t =
        (* Remove whitespace after 'p' *)
        let* a = p in
        let* _ = whitespace in
        return a
In order to handle scalars and keys we need raw strings spanning to a colon or a newline and quoted strings.
    let raw_string: string t =
        let expect  = "chars not containing colon and newline" in
        let inner c = c <> '\n' && c <> ':' && c <> '#' in
        let first c = c <> '"' && inner c
        in
        (word first inner expect)
        |> map String.trim
        |> lexeme
    let quoted_string: string t =
        let expect = "chars except newline and dquote" in
        let ok c = c <> '\n' && c <> '"'
        in
        let* _   = char '"' in
        let* str = (word ok ok expect) </> return "" in
        let* _   = char '"' |> lexeme
        in
        return str
    let scalar: Yaml.t t =
        (quoted_string </> raw_string)
        |> map Yaml.scalar
        <?> "scalar"
The start of a list element is indicated by -. We use the combinator dash to recognize the start of a list element.
    let dash: char t =
        char '-' |> lexeme
The key in a key value pair is either a raw string or a quoted string followed by a colon. The combinator recognizing a key has to backtrack, because a string not followed by a colon can still be a yaml scalar.
    let key: string t =
        backtrack
            (
                let* str = raw_string </> quoted_string in
                let* _   = char ':' |> lexeme in
                return str
            )
            "<key>:"
Recursive Yaml Parsing
Now we have to parse recursively according to the recursive definition of a yaml structure. A yaml value is either a scalar, a sequence (i.e. list) of items or a record of key value pairs.
    let rec yaml (): Yaml.t t =
        sequence_block ()
        </>
        record_block ()
        </>
        scalar
    and sequence_block ...
    and record_block ...
The first alternative is the sequence, because a sequence can be easily recognized by the first character -. If something doesn't start with a dash, then it is certainly not a sequence and the remaining alternatives can be tried.
We have to try to find a record before a scalar because a record can start with a scalar which represents the key of the first key value pair. If there is no colon after the key, then a pure scalar can be tried. Therefore we have made the combinator key backtrackable in order to fail without consuming any character.
A sequence block is a list of one or more aligned sequence elements. The implementation is straightforward.
    and sequence_block (): Yaml.t t =
        one_or_more
            (
                sequence_element ()
                <?>
                "list element: \"- <yaml value>\""
                |> align
            )
        |> map (fun (a, lst) -> Yaml.list (a :: lst))
        <?> "sequence of aligned \"- <yaml value>\""
    and sequence_element (): Yaml.t t =
        let* _ = dash in
        yaml () |> indent 1
Note that all sequence elements are vertically aligned and the yaml values after the dash have to be indented at least by one column.
        # legal sequence element            # illegal sequence element
        -                                   -
          k1: 100                           k1: 100
          k2: 200                           k2: 200A record block is a sequence of aligned key value pairs.
    and record_block (): Yaml.t t =
        one_or_more
            (
                record_element ()
                <?> "\"<key>: <yaml value>\""
                |> align
            )
        |> map (fun (a, lst) -> Yaml.record (a :: lst))
        <?> "sequence of aligned \"<key>: <yaml value>\""
    and record_element (): (string * Yaml.t) t =
        let* str = key in
        let* y   =
            sequence_block () |> indent 0
            </>
            (record_block () </> scalar |> indent 1)
        in
        return (str, y)
Note the subtlety of the indentation of sequence blocks and record blocks and scalars.
        # legal record element              # illegal record element
        key:                                key:
        - item1                             k1: value1
        - item2                             k2: value2
Note furthermore that each aligned block is within some indentation (zero or more columns). This is important. Without some indentation the parsers try to align the elements in the block with some outer elements (which usually fails). It is a common pitfall to forget the indentation.
Up