Strings and State-Based Parsing
Learning Objectives
- You know the basics of working with strings and characters in Rust.
- You know what state-based parsing is and can tokenize a simple language with it.
As a broader theme of this course has been understanding how programming languages work, let’s look into parsing once more. Along the way, we’ll also take a peek at strings and characters in Rust.
Strings and characters
We’ve previously distinguished between string slices (`&str`) and `String` in Rust. The former, `&str`, is a reference to a string slice; all string literals are represented as string slices. The latter, `String`, is a heap-allocated string, which allows for dynamic resizing and modification.

To transform a string slice into a `String`, you can use the `to_string` method.
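As a brief example, the snippet below turns a string literal into an owned `String` and then modifies it; `push_str` is one of the mutating methods available on `String` but not on `&str`:

```rust
fn main() {
    let slice: &str = "hello"; // string literals are string slices
    let mut owned: String = slice.to_string(); // heap-allocated, owned copy
    owned.push_str(", world"); // a String can be grown and modified
    assert_eq!(owned, "hello, world");
}
```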
Strings have a method `chars()` that returns an iterator over the characters in the string. Characters in Rust are Unicode scalar values, which can be more than one byte long.

Individual characters are of type `char` in Rust. Character literals are written with single quotes, e.g. `'a'`, which you can use when comparing a `char` to another character.

The `char` type has methods like `is_whitespace`, `is_numeric`, and `is_ascii_punctuation` that provide a convenient way to inspect characters; see the documentation of `char` for the full list.
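As a small illustration, the loop below iterates over the characters of an expression with `chars()` and classifies each one with these methods:

```rust
fn main() {
    for c in "(+ 1 2)".chars() {
        if c.is_whitespace() {
            println!("{:?} is whitespace", c);
        } else if c.is_numeric() {
            println!("{:?} is numeric", c);
        } else if c.is_ascii_punctuation() {
            println!("{:?} is punctuation", c); // matches '(', ')', and '+'
        }
    }
}
```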
State-based parsing
We’ve previously looked at lexing and parsing with both Dart and Gleam. It’s about time that we look into them in Rust as well. Let’s build a lexer that transforms the Polish notation expression `(+ 1 2)` into a list of tokens.
We’ll use a state-based parsing approach, where we define a state machine that transitions between states based on the input characters.
Lexing parentheses
Let’s start by lexing parentheses. To do this, we define a token type `ParOpen` for an opening parenthesis `(` and a token type `ParClose` for a closing parenthesis `)`.
```rust
enum Token {
    ParOpen,
    ParClose,
}
```
In addition, we define a state `Start` to represent the starting state of the lexer.
```rust
enum State {
    Start,
}
```
Now we are ready to write the lexer. The lexer walks through the input string character by character and produces a list of tokens. We start by defining the `lexer` function, which takes a string as input and returns a list of parenthesis tokens.
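A minimal sketch of such a function is shown below. Here we assume the lexer collects the input characters into a vector and walks them with an index `i`; as we’ll see when tokenizing numbers, having explicit control over the index is useful. The exact implementation may differ:

```rust
fn lexer(input: &str) -> Vec<Token> {
    let chars: Vec<char> = input.chars().collect();
    let mut tokens = Vec::new();
    let state = State::Start; // only one state so far, so no mutation yet
    let mut i = 0;
    while i < chars.len() {
        let c = chars[i];
        match state {
            State::Start => {
                if c == '(' {
                    tokens.push(Token::ParOpen);
                } else if c == ')' {
                    tokens.push(Token::ParClose);
                }
                // any other character (e.g. whitespace) is skipped
                i += 1;
            }
        }
    }
    tokens
}
```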
Lexing operators
The operators `+`, `-`, `*`, and `/` are also single characters, so we can handle them in the same way as we did with parentheses. We’ll add a new token type `Op` to represent operators.
```rust
enum Token {
    ParOpen,
    ParClose,
    Op(String),
}
```
Then we just modify the lexer to handle operators. We can use the `contains` method of a string to check whether a character is one of the operators; if it is, we push an `Op` token to the list of tokens with the operator as a string.
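In the sketch above, this amounts to one more branch in the `Start` arm. The operator set `"+-*/"` below is an assumption; note that `str::contains` accepts a `char` as its pattern:

```rust
// inside the lexer's while loop:
State::Start => {
    if c == '(' {
        tokens.push(Token::ParOpen);
    } else if c == ')' {
        tokens.push(Token::ParClose);
    } else if "+-*/".contains(c) {
        // c is one of the four operator characters
        tokens.push(Token::Op(c.to_string()));
    }
    i += 1;
}
```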
Tokenizing numbers
Let’s next tokenize numbers. We’ll add a new token type `Num` to represent numbers.
```rust
enum Token {
    ParOpen,
    ParClose,
    Op(String),
    Num(String),
}
```
To tokenize numbers, we need to acknowledge that a number can consist of multiple characters. As we’re working on a state-based lexer, we can create a new state `InNumber` to represent the state of parsing a number.
```rust
enum State {
    Start,
    InNumber(String),
}
```
Now, the lexing logic also changes. When matching the state, if the current character is a digit, we switch to the `InNumber` state and start collecting the number’s characters into a string. Finally, when we encounter a non-numeric character, we push a `Num` token to the list of tokens with the collected number as a string, and switch the state back to `Start`.
When switching the state back to `Start`, we do not move the index `i` forward, so the current character is reprocessed in the `Start` state.
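Putting the pieces together, a sketch of the complete lexer might look as follows. Note how the `InNumber` arm advances `i` only while it sees digits, and how a number that runs to the end of the input is flushed after the loop. The `num.clone()` is needed because the collected digits live inside the state value that we are about to overwrite:

```rust
fn lexer(input: &str) -> Vec<Token> {
    let chars: Vec<char> = input.chars().collect();
    let mut tokens = Vec::new();
    let mut state = State::Start;
    let mut i = 0;
    while i < chars.len() {
        let c = chars[i];
        match &mut state {
            State::Start => {
                if c == '(' {
                    tokens.push(Token::ParOpen);
                } else if c == ')' {
                    tokens.push(Token::ParClose);
                } else if "+-*/".contains(c) {
                    tokens.push(Token::Op(c.to_string()));
                } else if c.is_numeric() {
                    // begin collecting a number, starting from this digit
                    state = State::InNumber(c.to_string());
                }
                // whitespace and other characters are skipped
                i += 1;
            }
            State::InNumber(num) => {
                if c.is_numeric() {
                    num.push(c);
                    i += 1;
                } else {
                    // the number ended: emit it and reprocess the current
                    // character in the Start state (i is not advanced)
                    tokens.push(Token::Num(num.clone()));
                    state = State::Start;
                }
            }
        }
    }
    // if the input ends while we are still collecting a number, emit it
    if let State::InNumber(num) = state {
        tokens.push(Token::Num(num));
    }
    tokens
}
```

With this, `lexer("(+ 1 2)")` would produce the tokens `ParOpen`, `Op("+")`, `Num("1")`, `Num("2")`, and `ParClose`, in that order.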