Programming Language Design

Parsing and Recursion


Learning Objectives

  • You understand how a grammar for prefix-style arithmetic language is defined
  • You know how to implement a parser for the grammar using Gleam
  • You know what an abstract syntax tree is and how it’s different from a parse tree
  • You know how to implement a translation function from a parse tree to an AST

As extra readings for this chapter, see Chapter 3 from the book Introduction to the Theory of Programming Languages

Parsing and Parse Trees

A parse tree is a data structure that represents the syntactic structure of source code derived based on the rules of a given programming language grammar. Each node in a parse tree corresponds to a construct or a token in the grammar, and the hierarchy of the tree reflects how the code is broken down: from high-level constructs like functions and statements to individual keywords, symbols, and identifiers. The hierarchical representation is a direct outcome of the parsing process, where a parser reads the code and matches it against the rules of the grammar.

Parse trees serve as an intermediate form that shows how the code has been parsed, showcasing the order and grouping of tokens. This intermediate structure helps in further analysis. Parse trees can also be used to generate the exact code back from the tree, which is useful in code generation and optimization.

As an example, the following Gleam code:

fn fibonacci(n) {
    case n {
    0 -> 0
    1 -> 1
    _ -> fibonacci(n - 1) + fibonacci(n - 2)
  }
}

pub fn main() {
    let x = fibonacci(3)
    // do something with x
}

Could be represented as a parse tree somewhat like the one in Figure 1 below.

Fig 1. — A parse tree representing a fibonacci function written in Gleam.

Loading Exercise...

In traditional parsing setups, input strings are first broken into tokens with a lexer before the parsing phase. This allows simplifying the grammar by handling low-level details (e.g., whitespace, comments) in the lexer. However, there are languages where tokenization rules can become cumbersome.

Scannerless parsing addresses this by combining the tokenization and parsing into one process, making it possible to handle context-sensitive lexical elements and avoid ambiguities that can arise when the grammar is segmented into tokens. Scannerless parsing operates on the raw input, meaning the parser handles what would traditionally be the job of the lexer. While this can simplify some workflows, it can also lead to more complex grammar.

A Language for Arithmetic

Let’s design a language for a calculator which uses a prefix-style Polish notation. The tokens in the language are number literals and operators with fixed arity of two. Number literals are any sequence of number characters separated by whitespace. Operators are +, -, *, etc.

Here are some examples of expressions in this language:

  • 0
  • + 1 2
  • - 3 5

For now, negative number literals are not supported.

The grammar of the language is defined using the following derivation rules:

  • A Digit is a character between 0-9.
  • A Literal is a string of one or more Digits.
  • An Operator is one of +, - or *.
  • A Space is a string of one or more whitespace characters.
  • An Expression is an Op followed by two Exprs separated by Spaces, or it’s a Lit. Notice that operators or spaces are not expressions.

Which can be written in Backus-Naur form:

Digit"0"|"1"|"2"|"3"|"4"|"5"|"6"|"7"|"8"|"9"LitDigit|LitDigitOp"+"|"-"|"*"Space" "|Space" "ExprOpSpaceExprSpaceExpr|Lit

Here, the characters in double quotes denote terminals of the language, i.e. non-reducible lexical elements.

Translating this into algebraic data types is quite easy, we just need to come up with names for the data constructors:

Run the program to see the output

The data constructor naming might be a bit confusing, as Space produces a term of type Space. In Gleam, it’s always unambiguous whether a literal is in the position of a value or in the position of a type. For example, here’s the details of what is a type and what is a value under pub type Expr

pub type Expr {
//       ~~~~ Type
    Expr(Op, Space, Expr, Space, Expr)
//       ^---^------^-----^------^-- Type
//  ^--- Constructor
}

Sketching a Parser

A parser for this grammar converts a String into an Expr, or at least it tries to. It should do it from left to right, as that is standard, and it follows the order of string indexing.

It is possible that some string of characters, like +*-1abcd, doesn’t match any of the grammar’s derivation rules, so it can’t be parsed. To indicate failure, let’s use an Option(Expr) for now.

so let’s start writing a function parse_expr of type StringOption(Expr) and see how far we get:

pub fn parse_expr(s: String) -> Option(Expr) {
    todo
}

First, we need to decide whether to use the Expr or Lit constructors, to do that, we could try parsing an Op, or if that fails we parse a Lit. The parsers for Op and Lit could be parse_op : String -> Option(Op) and parse_lit : String -> Option(Lit) respectively. Let’s add a todo for those also…

pub fn parse_op(s: String) -> Option(Op) {
    todo
}

pub fn parse_lit(s: String) -> Option(Lit) {
    todo
}

pub fn parse_expr(s: String) -> Option(Expr) {
    todo
}

Ok, we’re making progress, next let’s attempt to parse an Op. To do that, we should probably call parse_op and analyze whether it succeeded or not.

pub fn parse_op(s: String) -> Option(Op) {
    todo
}

pub fn parse_lit(s: String) -> Option(Lit) {
    todo
}

pub fn parse_expr(s: String) -> Option(Expr) {
    case parse_op(s) {
        Some(op) -> todo    // `(1)`
        None -> todo        // `(2)`
    }
}

Now, on line (1) we have access to an op: Op, so let’s continue parsing the string for a space, right?

Wait a minute, how do we know where to continue!

The parse_op just “looks” at the string and gives an Op without telling where it starts or ends. We can proceed by returning the rest of the string from our parsers using a tuple.

For example, the parse_op parser should instead return an Option(#(String, Op)) to indicate where we should continue parsing from if the parser was successful.

pub fn parse_op(s: String) -> Option(#(String, Op)) {
    todo
}

pub fn parse_lit(s: String) -> Option(#(String, Lit)) {
    todo
}

pub fn parse_expr(s: String) -> Option(#(String, Expr)) {
    case parse_op(s) {
        Some(#(rest, op)) -> todo
        None -> todo
    }
}

Now we should continue by parsing a Space. Let’s finally implement an actual parser, the parse_spaces parser.

Loading Exercise...

To make the code a bit neater, let’s define a type alias for Option(#(String, Space)) and abstract over the return type of Space.

pub type ParseOption(a) = Option(#(String, a))

Type aliases are really helpful for reducing repetition on the type level, but they may be a bit misleading as the reader has to be familiar with the type alias.

Many typed programming languages support type aliases. Among them are Dart, Gleam, Rust, Haskell, and even C++.

Here is the parse function which parses spaces after parsing an operator:

pub fn parse_expr(s: String) -> ParseOption(Expr) {
    case parse_op(s) {
        Some(#(rest, op)) -> {
            case parse_spaces(rest) {
                Some(#(rest, spaces)) -> {
                    todo
                }
                None -> todo    // `(1)`
            }
        }
        None -> todo            // `(2)`
    }
}

The indentation level is threateningly increasing…

Now, let’s refine the todos in the None match expressions. The todo on line (2) should try to parse a literal before returning None.

Reminder: definition of the Expr type:

pub type Expr {
    Expr(Op, Space, Expr, Space, Expr)
    Lit(Lit)
}
pub fn parse_expr(s: String) -> ParseOption(Expr) {
    case parse_op(s) {
        Some(#(rest, op)) -> {
            case parse_spaces(rest) {
                Some(#(rest, spaces)) -> {
                    todo
                }
                None -> todo    // `(1)`
            }
        }
        None -> case parse_lit(s) {
            Some(#(rest, lit)) -> {
                Some(#(rest, Lit(lit)))
            }
            None -> None
        }
    }
}

The other todo from (1) is more interesting. It could be that parsing an operator succeeds, but there is no space after it, for example if the input is just +. Clearly in this case parsing fails and returns a None.

However, it could be possible that in general (for some other grammar) we should try the parse_lit again instead of returning a None and there the question is whether to parse s or rest

This becomes a concern when we are parsing comparison operators, like > and >=.

Parsing is quite complicated to get exactly right, but for now we are satisfied with just returning None.

pub fn parse_expr(s: String) -> ParseOption(Expr) {
    case parse_op(s) {
        Some(#(rest, op)) -> {
            case parse_spaces(rest) {
                Some(#(rest, spaces)) -> {
                    todo
                }
                None -> None
            }
        }
        None -> case parse_lit(s) {
            Some(#(rest, lit)) -> {
                Some(#(rest, Lit(lit)))
            }
            None -> None
        }
    }
}

Before we keep adding more code to the last todo, let’s try to flatten the structure of the code. To do this, let’s use the use notation along with option.then.

pub fn parse_expr(s: String) -> ParseOption(Expr) {
    case parse_op(s) {
        Some(#(rest, op)) -> {
            use #(rest, spaces) <- option.then(parse_spaces(rest))
            todo
        }
        None -> case parse_lit(s) {
            Some(#(rest, lit)) -> {
                Some(#(rest, Lit(lit)))
            }
            None -> None
        }
    }
}

Finishing the parser is as simple as filling the block in with the following:

Some(#(rest, op)) -> {
    use #(rest, spaces1) <- option.then(parse_spaces(rest))
    use #(rest, expr1) <- option.then(parse_expr(rest))
    use #(rest, spaces2) <- option.then(parse_spaces(rest))
    use #(rest, expr2) <- option.then(parse_expr(rest))
    Some(#(rest, Expr(op, spaces1, expr1, spaces2, expr2)))
}

Finally, before you get to implement the operator, digit and literal parsers, we have to define the top-level parser, which succeeds only if parse_expr consumes the whole input.

pub fn parse(s: String) -> Result(Expr, String) {
  case parse_expr(s) {
    Some(#(rest, expr)) -> {
      case rest == "" {
        True -> Ok(expr)
        False -> Error("unexpected input: '" <> rest <> "'")
      }
    }
    None -> Error("parsing failed")
  }
}
Loading Exercise...

Creating a REPL

Let’s make it a bit easier for us to play around with our parsers. We’ll create a read-eval-print loop program (a REPL), although we’ll skip the evaluation for now, thus making it a read-print loop.

To achieve this, we’ll use the get_line function from the gleam_erlang package. We can add this module to our project by running gleam add gleam_erlang.

The repl function calls itself recursively until reading the next line fails.

import gleam/erlang
import gleam/result
import gleam/io

fn repl() {
  use line <- result.try(erlang.get_line("> "))
  let line = string.trim_end(line)
  case parse(line) {
    Ok(expr) -> {
      io.debug(expr)
      Nil
    }
    Error(e) -> {
      io.println(e)
    }
  }

  repl()
}

pub fn main() {
  io.println("Hello from repl!")
  repl()
}

An example session with the REPL (RPL), could look like this:

$ gleam run
   Compiled in 0.03s
    Running repl.main
Hello from repl!
> + 1 2
Expr(Plus, Space, Lit(Digit(One)), Space, Lit(Digit(Two)))
> + * 2 3 + 2 1
Expr(Plus, Space, Expr(Times, Space, Lit(Digit(Two)), Space, Lit(Digit(Three))), Space, Expr(Plus, Space, Lit(Digit(Two)), Space, Lit(Digit(One))))
> a
parsing failed
> b
parsing failed
> + 1
parsing failed
> + 1 2 3
unexpected input: ' 3'
>

Higher-Level Parsing

Now that we have defined the grammar formally, we can start working with it in a more algorithmic manner. What you probably expect, is for us to define an evaluation function which converts and Expr into an integer representing the outcome of the computation. This, however, is the subject for the next chapter. Instead, we will create a more abstract representation for the grammar.

The representation of the language is really verbose right now, as it represents the concrete syntax.

Also, if we would write an evaluation function for the parse tree, it would have to parse the number literals into actual numbers even though we could do that before evaluation.

So instead of working with concrete syntax, let’s start working with abstract syntax trees.

Abstract Syntax Trees

So far, our parsers have constructed parse trees, sometimes called concrete syntax trees (CST). A CST is, as the name implies, a concrete representation of the parsed string, meaning that no information is lost. An abstract syntax tree, on the other hand, gets rid of irrelevant syntactic information, such as Spaces and brackets.

An abstract syntax tree represents the grammar of the language in a more “pure” and “meaningful” way — it represents computation as closely as possible. It identifies the operators as actual nodes in the tree with argument expressions as subtrees. For example, the AST should not contain things irrelevant for evaluation, such as spaces. However, the AST should contain the structure of the code, like how things are parenthesized.

Loading Exercise...

The conversion from the parse tree into an AST involves, among other things, parsing every integer literal into an actual Gleam integer. This conversion will be left as an exercise, and we will now focus on writing a parser that translates the string directly to an AST.

The following AST represented using an ADT neatly represents every different computation of our language. The reason every operator is laid out as data constructors is because they can have different arities.

We will refer to the abstract syntax as “terms”, because “expression” is a lexical (i.e. source code level) concept.

TermNum(Int)|Add(Term,Term)|Sub(Term,Term)|Mul(Term,Term)

The terms consist of three operations, addition, subtraction and multiplication. We use the short names Add, Sub and Mul to refer to these respectively. In addition, the term Num represents an integer.

We use the words “num”, “add”, “sub” and “mul” in such a manner that reflects the computation rather than the syntax. E.g. we could have used the words “number literal”, “plus”, “minus” and “times” as those are the pieces of syntax that our code consists of. However, as we will see later, terms can “reduce” into other terms during evalution, but source code never changes form during evaluation.

As these operations are of arity two (they take in two terms), we can spell out their types:

Add:(Term,Term)TermSub:(Term,Term)TermMul:(Term,Term)TermNum:Term

We use the mathematical symbol for integers because we try to be language agnostic and are working at a very high level. The programming language details, such as what is the largest integer, are not of any significance to the AST.

Without the non-recursive Num constructor, our Term type would be degenerate as each value would have to be infinitely large.

As an example, all of the following are meaningful terms:

Num(5)Add(Num(5),Num(7))Mul(Num(5),Add(Num(7),Num(1)))

These are all correctly formed, because every data constructor is used according to their type signature. The following, however, are not correct expressions:

Num()Add(Num(5))Mul(Num(5),Num(7,1))
Loading Exercise...

Parsing the AST From Scratch

Instead of parsing strictly according to the grammar, we can directly parse the input into a Term. This approach is quicker to implement and we will use it in the later chapters as the concrete syntax is not as important as the abstract syntax for our purposes.

For the purposes of building a production scale compiler, your mileage may vary.

The next exercise is about writing an AST parser for the same arithmetic language.

Loading Exercise...

Postfix Arithmetic

The grammar in this chapter had the operator in front of the arguments, but a different grammar with operators after the arguments would have the same abstract syntax.

For example, if our language used the reverse Polish notation, 1 2 + would be represented by the term Add(Num(1),Num(2)) and 1 2 3 * + by Add(Num(1),Mul(Num(2),Num(3))).

To make this change, the whole parser layout has to change. Feel free to investigate the following code, but understanding it completely is not required :)

Run the program to see the output

The usual way to write a postfix parser would be much simpler, as it just manipulates a stack while parsing from left to right, whereas the above uses a recursive approach.

Translating the Parse Tree to AST

translate:ExprTerm

The process of translating a parse tree to an AST will lose some information, such as the number of spaces. Because both structures have the same structure (both are also recursive), the process is very simple, and we can convert the tree leaf-by-leaf.

The leaves of the parse tree are literals, so a function from literals to integers is pretty much the only thing required by translate. In addition, translation will be total, because every literal can be converted to an integer and the process can’t end up in an infinite loop as the trees are finite in size.

Here is more discussion about the concept: StackOverflow.

Implement the translation such that it agrees with the parser from the previous exercise.

Loading Exercise...

Summary

In this chapter we implemented parsers for both a BNF grammar and an AST. The parsers take in a string and return an optional pair of string and something, for example an integer parser returns a pair of string and Int. The parsers could be composed easily with the help of option.then and the use syntax. Finally we crafted a function to translate the parse tree into the AST.

In the following chapters we will work focus implementing evaluation and type checking for the ASTs.