Strings and State-Based Parsing
Learning Objectives
- You know the basics of working with strings and characters in Rust.
- You know what state-based parsing is and can tokenize a simple language with it.
As a broader theme of this course has been understanding how programming languages work, let’s look into parsing once more. Along the way, we’ll also take a peek at strings and characters in Rust.
Strings and characters
We’ve previously distinguished between string slices (`&str`) and `String` in Rust. The former, `&str`, is a reference to a string slice; all string literals are represented as string slices. The latter, `String`, is a heap-allocated string, which allows for dynamic resizing and modification.

To transform a string slice into a `String`, you can use the `to_string` method.
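As a brief example, the snippet below turns a string literal into an owned `String` and then modifies it; `push_str` is one of the mutating methods available on `String` but not on `&str`:

```rust
fn main() {
    let slice: &str = "hello"; // string literals are string slices
    let mut owned: String = slice.to_string(); // heap-allocated, owned copy
    owned.push_str(", world"); // a String can be grown and modified
    assert_eq!(owned, "hello, world");
}
```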
Strings have a method `chars()` that returns an iterator over the characters in the string. Characters in Rust are Unicode scalar values, which can be more than one byte long.

Individual characters are of type `char` in Rust. Character literals are written with single quotes, e.g. `'a'`, which you can use when comparing a `char` to another character.

The `char` type has methods like `is_whitespace`, `is_numeric`, and `is_ascii_punctuation` that provide a convenient way to inspect characters; see the documentation of `char` for the full list.
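As a small illustration, the loop below iterates over the characters of an expression with `chars()` and classifies each one with these methods:

```rust
fn main() {
    for c in "(+ 1 2)".chars() {
        if c.is_whitespace() {
            println!("{:?} is whitespace", c);
        } else if c.is_numeric() {
            println!("{:?} is numeric", c);
        } else if c.is_ascii_punctuation() {
            println!("{:?} is punctuation", c); // matches '(', ')', and '+'
        }
    }
}
```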
State-based parsing
We’ve previously looked at lexing and parsing with both Dart and Gleam. It’s about time that we look into them in Rust as well. Let’s build a lexer that transforms the Polish notation expression `(+ 1 2)` into a list of tokens.
We’ll use a state-based parsing approach, where we define a state machine that transitions between states based on the input characters.
Lexing parentheses
Let’s start by lexing parentheses. To do this, we define a token type `ParOpen` for an opening parenthesis `(` and a token type `ParClose` for a closing parenthesis `)`.
```rust
enum Token {
    ParOpen,
    ParClose,
}
```
In addition, we define a state `Start` to represent the starting state of the lexer.
```rust
enum State {
    Start,
}
```
Now we are ready to write the lexer. The lexer walks through the input string character by character and produces a list of tokens. We start by defining the `lexer` function, which takes a string as input and returns a list of parenthesis tokens.
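A minimal sketch of such a function is shown below. Here we assume the lexer collects the input characters into a vector and walks them with an index `i`; as we’ll see when tokenizing numbers, having explicit control over the index is useful. The exact implementation may differ:

```rust
fn lexer(input: &str) -> Vec<Token> {
    let chars: Vec<char> = input.chars().collect();
    let mut tokens = Vec::new();
    let state = State::Start; // only one state so far, so no mutation yet
    let mut i = 0;
    while i < chars.len() {
        let c = chars[i];
        match state {
            State::Start => {
                if c == '(' {
                    tokens.push(Token::ParOpen);
                } else if c == ')' {
                    tokens.push(Token::ParClose);
                }
                // any other character (e.g. whitespace) is skipped
                i += 1;
            }
        }
    }
    tokens
}
```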
Lexing operators
The operators `+`, `-`, `*`, and `/` are also single characters, so we can handle them in the same way as we did with parentheses. We’ll add a new token type `Op` to represent operators.
```rust
enum Token {
    ParOpen,
    ParClose,
    Op(String),
}
```
Then we just modify the lexer to handle operators. We can use the `contains` method of a string to check whether a character is one of the operators; if it is, we push an `Op` token to the list of tokens with the operator as a string.
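In the sketch above, this amounts to one more branch in the `Start` arm. The operator set `"+-*/"` below is an assumption; note that `str::contains` accepts a `char` as its pattern:

```rust
// inside the lexer's while loop:
State::Start => {
    if c == '(' {
        tokens.push(Token::ParOpen);
    } else if c == ')' {
        tokens.push(Token::ParClose);
    } else if "+-*/".contains(c) {
        // c is one of the four operator characters
        tokens.push(Token::Op(c.to_string()));
    }
    i += 1;
}
```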
Tokenizing numbers
Let’s next tokenize numbers. We’ll add a new token type `Num` to represent numbers.
```rust
enum Token {
    ParOpen,
    ParClose,
    Op(String),
    Num(String),
}
```
To tokenize numbers, we need to acknowledge that a number can consist of multiple characters. As we’re working on a state-based lexer, we can create a new state `InNumber` to represent the state of parsing a number.
```rust
enum State {
    Start,
    InNumber(String),
}
```
Now, the lexing logic also changes. When matching the state, if the current character is a digit, we switch to the `InNumber` state and start collecting the number’s characters into a string. Finally, when we encounter a non-numeric character, we push a `Num` token to the list of tokens with the collected number as a string, and switch the state back to `Start`.
When switching the state back to `Start`, we do not move the index `i` forward, so the current character is reprocessed in the `Start` state.
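Putting the pieces together, a sketch of the complete lexer might look as follows. Note how the `InNumber` arm advances `i` only while it sees digits, and how a number that runs to the end of the input is flushed after the loop. The `num.clone()` is needed because the collected digits live inside the state value that we are about to overwrite:

```rust
fn lexer(input: &str) -> Vec<Token> {
    let chars: Vec<char> = input.chars().collect();
    let mut tokens = Vec::new();
    let mut state = State::Start;
    let mut i = 0;
    while i < chars.len() {
        let c = chars[i];
        match &mut state {
            State::Start => {
                if c == '(' {
                    tokens.push(Token::ParOpen);
                } else if c == ')' {
                    tokens.push(Token::ParClose);
                } else if "+-*/".contains(c) {
                    tokens.push(Token::Op(c.to_string()));
                } else if c.is_numeric() {
                    // begin collecting a number, starting from this digit
                    state = State::InNumber(c.to_string());
                }
                // whitespace and other characters are skipped
                i += 1;
            }
            State::InNumber(num) => {
                if c.is_numeric() {
                    num.push(c);
                    i += 1;
                } else {
                    // the number ended: emit it and reprocess the current
                    // character in the Start state (i is not advanced)
                    tokens.push(Token::Num(num.clone()));
                    state = State::Start;
                }
            }
        }
    }
    // if the input ends while we are still collecting a number, emit it
    if let State::InNumber(num) = state {
        tokens.push(Token::Num(num));
    }
    tokens
}
```

With this, `lexer("(+ 1 2)")` would produce the tokens `ParOpen`, `Op("+")`, `Num("1")`, `Num("2")`, and `ParClose`, in that order.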