A Step Back: Lexers and Parsers
Learning Objectives
- You know what lexing and parsing are, and you know that they are used to transform a string of characters into a structured representation of code.
- You know the common categories of tokens, such as literals, identifiers, operators, and separators.
- You understand the difference between statements and expressions.
We kind-of just started to hack away when creating the interpreter. This is great when building something small and when learning, but as the project grows, it becomes harder to maintain and to add new features. To have a shared understanding of the codebase, let’s start using the vocabulary of the field of programming language theory and compilers.
Lexers, Parsers, Statements, and Expressions
The process of turning a string of characters into interpretable code involves lexing and parsing. A lexer is a program that converts a string of characters into a list of tokens, which represent sequences of characters such as keywords, identifiers, and operators. A parser then converts a list of tokens into a structured representation of the code. This flow is illustrated in Figure 1.
The structured representation of code consists of statements and expressions. Statements are instructions that perform an action, such as printing a value or assigning a value to a variable. Expressions are combinations of literals, variables, operators, and other expressions that can be evaluated to produce a value.
Statements may include expressions — e.g. a conditional statement includes an expression that is evaluated to determine which branch to take. Expressions, on the other hand, do not include statements1 — they are evaluated to produce a value, but do not perform any actions.
As an example, consider the following BASIC code:
10 LET A = 5
20 PRINT A, "Hello, BASIC!"
Above, `LET A = 5` and `PRINT A, "Hello, BASIC!"` are statements, while `5`, `A` (in the print statement), and `"Hello, BASIC!"` are expressions.
Lexing in Action
When interpreting code, a lexer converts the given source code into a list of tokens. Let’s consider this for the following source code. For clarity, we use the symbol `·` for the space character and `↩` for the newline character.
10·LET·A·=·5↩20·PRINT·A,·"Hello,·BASIC!"
For the above, the lexer first scans the string, converting it to lexemes. Lexemes are the individual sequences of characters that will form the tokens, but that have not yet been assigned a category.
Then, the lexer categorizes the lexemes into tokens, such as number and string literals, identifiers, operators, and separators. Each token is assigned the lexeme and the category it belongs to.
As an example, the first lexeme is tokenized as a number literal with the value `"10"`, whereas the second lexeme is categorized as a keyword. The table below showcases the individual sequences of characters from the above code, and the associated token names and values.
| Source code | Token name/category | Lexeme/value |
|---|---|---|
| 10 | Literal | 10 |
| LET | Keyword | LET |
| A | Identifier | A |
| = | Operator | = |
| 5 | Literal | 5 |
| ↩ | Separator | ↩ |
| 20 | Literal | 20 |
| PRINT | Keyword | PRINT |
| A | Identifier | A |
| , | Separator | , |
| "Hello, BASIC!" | Literal | "Hello, BASIC!" |
The center and rightmost columns of Table 1 form the list of tokens (when thought of as `(Token name, Value)` pairs) produced by the lexer.
As the tokens are considered separately, we do not have a token for the space character, but we do need to keep track of the newlines, as we need to distinguish `5↩20` from `5·20`.
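To make the lexing step concrete, here is a minimal lexer sketch in Python (Python is used purely for illustration; the `Token` type, the pattern table, and the exact category names are our own assumptions, not part of any BASIC standard):

```python
# Minimal illustrative lexer: scans the source left to right, matching
# one pattern at a time and emitting (category, lexeme) tokens.
import re
from typing import NamedTuple

class Token(NamedTuple):
    category: str  # e.g. "Literal", "Keyword", "Identifier", ...
    value: str     # the lexeme

KEYWORDS = {"LET", "PRINT"}

# Each pattern recognizes one kind of lexeme; order matters.
TOKEN_PATTERNS = [
    ("Literal", r'"[^"]*"'),   # string literal, quotes included
    ("Literal", r"\d+"),       # number literal
    ("Word", r"[A-Za-z]+"),    # keyword or identifier, decided below
    ("Operator", r"="),
    ("Separator", r",|\n"),
    ("Space", r"[ \t]+"),      # matched, but dropped from the output
]

def tokenize(source: str) -> list[Token]:
    tokens = []
    pos = 0
    while pos < len(source):
        for category, pattern in TOKEN_PATTERNS:
            match = re.match(pattern, source[pos:])
            if match:
                lexeme = match.group()
                pos += len(lexeme)
                if category == "Space":
                    break  # spaces only separate lexemes
                if category == "Word":
                    category = "Keyword" if lexeme in KEYWORDS else "Identifier"
                tokens.append(Token(category, lexeme))
                break
        else:
            raise ValueError(f"Unexpected character: {source[pos]!r}")
    return tokens

for token in tokenize('10 LET A = 5\n20 PRINT A, "Hello, BASIC!"'):
    print(token)
```

Spaces are matched but dropped, while newlines are kept as separator tokens, mirroring the table above.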
For common token categories, see Lexical analysis on Wikipedia.
You might also wonder why the different literals do not specify whether they are numbers or strings. In fact, the lexer must at least be able to identify a string so that it does not confuse `"Hello, BASIC!"` with two lexemes that have the separator `,` in between. The specifics of this behavior are implementation details: one lexer might distinguish string and number literals, while another might not.
Parsing in Action
A parser would receive a list of tokens as input, and then convert it into a structured representation of the code, which consists of statements and expressions.
For the above BASIC code, this would involve two statements: one for each line.
The `LET` keyword corresponds to a LET statement, which contains the identifier and the right-hand-side expression, while the PRINT statement contains the list of expressions to be printed.
A commonly used format for the structured representation of code is an Abstract Syntax Tree (AST). The AST is a tree-like data structure2 that represents the code in a way that is easier to interpret and execute. For example, the AST should already record which literals are numbers and which are strings, because re-analyzing the tokens during execution would be unnecessary work and potentially much slower.
An AST for the above BASIC code is shown in Figure 2. The AST consists of nodes that represent the statements and expressions in the code.
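As a sketch of what such AST nodes could look like, here are hypothetical node types in Python (the class and field names are our own illustration, not the interpreter’s actual types):

```python
# Illustrative AST node types for the two-line BASIC example.
from dataclasses import dataclass

# Expressions evaluate to a value.
@dataclass
class NumberLiteral:
    value: int

@dataclass
class StringLiteral:
    value: str

@dataclass
class Variable:
    name: str

# Statements perform an action and may contain expressions.
@dataclass
class LetStatement:
    line: int
    name: str
    value: object  # an expression node

@dataclass
class PrintStatement:
    line: int
    args: list  # a list of expression nodes

# The AST for the example program: one statement per line.
program = [
    LetStatement(line=10, name="A", value=NumberLiteral(5)),
    PrintStatement(line=20, args=[Variable("A"), StringLiteral("Hello, BASIC!")]),
]
```

Note that the number and string literals are already distinguished at this point, as discussed above.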
To build an AST, the parser would iterate over the tokens, following a set of rules: the grammar of the programming language. The grammar defines how the tokens can be combined to form statements and expressions. For the BASIC code, the grammar would define that a `LET` statement consists of the `LET` keyword, an identifier, an equals sign, and an expression, while a `PRINT` statement consists of the `PRINT` keyword followed by a comma-separated list of expressions.
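These rules can be sketched as a small hand-written parser in Python (again purely illustrative; the `(category, value)` token tuples and dictionary-shaped nodes are simplified assumptions, not how a production parser would represent them):

```python
# Illustrative parser: walks the token list and applies one grammar
# rule per statement keyword.
def parse(tokens):
    """Turn a list of (category, value) tokens into a list of statements."""
    statements = []
    pos = 0

    def expect(category):
        nonlocal pos
        cat, value = tokens[pos]
        if cat != category:
            raise SyntaxError(f"Expected {category}, got {cat} {value!r}")
        pos += 1
        return value

    def parse_expression():
        nonlocal pos
        cat, value = tokens[pos]
        pos += 1
        if cat == "Literal":
            return {"type": "literal", "value": value}
        if cat == "Identifier":
            return {"type": "variable", "name": value}
        raise SyntaxError(f"Expected an expression, got {cat} {value!r}")

    while pos < len(tokens):
        line = expect("Literal")        # every BASIC line starts with a number
        keyword = expect("Keyword")
        if keyword == "LET":            # LET <identifier> = <expression>
            name = expect("Identifier")
            expect("Operator")          # the equals sign
            statements.append({"type": "let", "line": line, "name": name,
                               "value": parse_expression()})
        elif keyword == "PRINT":        # PRINT <expr> [, <expr> ...]
            args = [parse_expression()]
            while pos < len(tokens) and tokens[pos] == ("Separator", ","):
                pos += 1
                args.append(parse_expression())
            statements.append({"type": "print", "line": line, "args": args})
        # skip the newline separator between statements, if present
        if pos < len(tokens) and tokens[pos][0] == "Separator":
            pos += 1
    return statements

tokens = [("Literal", "10"), ("Keyword", "LET"), ("Identifier", "A"),
          ("Operator", "="), ("Literal", "5"), ("Separator", "\n"),
          ("Literal", "20"), ("Keyword", "PRINT"), ("Identifier", "A"),
          ("Separator", ","), ("Literal", '"Hello, BASIC!"')]
print(parse(tokens))
```

The `expect` helper encodes the grammar directly: whenever a rule names the next token's category, the parser demands it and fails with a syntax error otherwise.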
Grammar and Formal Language
The grammar can be defined formally, using a notation such as Backus-Naur Form (BNF), or informally, as a set of programmatic rules. The AST represents the grammar of a programming language while abstracting over the actual terminal symbols used in the language, such as the keywords and separators.
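As an illustration, a BNF-style grammar for the two statement forms above might look like the following (the nonterminal names are our own choice):

```
<program>         ::= <line> | <line> "↩" <program>
<line>            ::= <number> <statement>
<statement>       ::= "LET" <identifier> "=" <expression>
                    | "PRINT" <expression-list>
<expression-list> ::= <expression> | <expression> "," <expression-list>
<expression>      ::= <number> | <string> | <identifier>
```

Quoted strings are the terminal symbols; everything in angle brackets is a nonterminal that the parser expands further.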
Note that lexing and parsing are not specific to any programming language — they are used in the interpretation of all programming languages. The grammar, however, is specific to each language, as it defines the syntax of the language.
Footnotes
-
In functional programming, and in Gleam for instance, there isn’t a clear concept of a statement, as basically the whole program is just a set of expressions bound to top-level names (like `main`). We can even think of `import`, `fn` and `let` as just syntax sugar disguising expressions. The classical way of thinking about statements is very limiting, as the following code wouldn’t make any sense from that point of view:

```
io.println({
  let msg = "message"
  msg
})
```

This code is in fact valid Gleam and prints the message to the standard output. ↩
-
By tree-like, we mean “contains recursive parts”, i.e. expressions, in this context. In an object-oriented programming language, a data structure representing an AST might simply be a list or a key-value dictionary (with line numbers as keys and statements as values). Being a list or a key-value dictionary doesn’t by itself qualify as tree-like. What makes ASTs tree-like is the fact that they may contain a recursive sub-language, which we often call expressions!

There are very few programming languages that don’t have expressions, and most of them are considered esoteric. HTML, CSS, YAML and JSON don’t have an internal concept of expressions, but they are usually not considered programming languages in the first place. Stack-oriented programming languages like Forth don’t have expressions and can be quite hard to read. ↩