Programming Language Design 2

Parsing with Nom


Learning Objectives

  • You know how to use libraries in Rust.
  • You know of nom and can read nom code.
  • You know some basic parser combinators for nom.

Make sure you have read the chapter on Parsing and Recursion from part 4 before proceeding.

Parsing in Rust

The Rust ecosystem has a parser combinator library called nom that allows building monadic1 parsers that comes with many parser combinators. To use nom, you need to add it to your Cargo.toml:

[package]
name = "rust-project"
version = "0.1.0"
edition = "2021"

[dependencies]
nom = "8.0.0"

After this, the library is available for use in your project.

In version 8 of Nom, many things changed in a backwards-incompatible manner. When looking up information about nom, make sure it is compatible with version 8 of the library.

Parsing a single character

A parser that would parse a string consisting of either of the character a or the character b would be implemented as follows.

use nom::{
    branch::alt,
    character::complete::char,
    IResult, Parser,
};

fn ab_parser (input: &str) -> IResult<&str, char> {
    let mut parser = alt((char('a'), char('b')));
    parser.parse(input)
}

fn main() {
    let input = "a";
    match ab_parser(input) {
        Ok((_, parsed)) => {
            println!("{}", parsed);
        }
        Err(e) => println!("Error parsing: {:?}", e),
    }
}

Parsers written with nom are composed out of smaller parsers, which are combined together. The above program uses the alt combinator to combine two parsers, char('a') and char('b'), into a single parser. The char parser is used to parse a single character from the input string, while the alt combinator will try to parse the input string with the parsers it receives as arguments, one by one, and return the result of the first successful parse.

The parse method of a parser returns a Result that contains either the parsed value or an error. The Ok variant contains an IResult from nom, which is a tuple with the remaining input and the parsed value, while the Err variant contains an error.

Parsing a list of characters

List of characters can be parsed with the separated_list combinator, which takes a separator parser and a value parser as arguments. The following example shows how to parse a list of digits separated by whitespaces.

use nom::{
    character::complete::digit1,
    character::complete::multispace0,
    multi::separated_list1,
    IResult, Parser,
};

fn parse_digit_list(input: &str) -> IResult<&str, Vec<&str>> {
    separated_list1(multispace0, digit1).parse(input)
}

fn main() {
    let input = "123 456 789";
    match parse_digit_list(input) {
        Ok((_, parsed)) => {
            dbg!("{}", parsed);
        }
        Err(e) => println!("Error parsing: {:?}", e),
    }
}

The parser digit1 reads one or more digits from the input string, while multispace0 reads zero or more whitespace characters. The separated_list1 combinator is used to parse a list of digits separated by whitespaces. The separated_list1 combinator will first parse a value with the value parser, then parse a separator with the separator parser, and repeat this process until the separator parser fails.

Lisp-like parsing

To parse a Lisp-like expression, such as (+ 1 2), you would need to define a parser that reads in the opening parentheses, the operator, the operands, and the closing parenthesis.

Nom allows us to write a sequence of parser using a tuple (), as can be seen in the following example:

fn lispish_parser (input: &str) -> IResult<&str, Vec<String>> {
    let mut parser = (
        char('('),
        char('+'),
        multispace0,
        separated_list1(multispace0, digit1),
        char(')')
    );
    let (input, (par_open, op, spaces, digits, par_close)) = parser.parse(input)?;

    let mut result = vec![par_open.to_string(), op.to_string()];
    digits.iter().for_each(|s| result.push(s.to_string()));
    result.push(par_close.to_string());

    Ok((input, result))
}

Above, once the parsing is complete, the result is transformed into a Vec<String>, which is then returned as the result of the lispish_parser. It is also possible to transform the result into a different type, such as tokens.

Decoding nom’s function signatures

Open nom’s documentation.

First, the type of parsers is I -> IResult<I, O>, where I is the type of the input (most commonly &str) and O is the type of the output produced by the parser. The IResult is a simple wrapper around a result and a pair.

Let’s next take a look at a couple parser combinators in nom.

And Parser Combinator

fn and<G, O2>(self, g: G) -> And<Self, G>
where
    G: Parser<Input, Output = O2, Error = Self::Error>,
    Self: Sized,

The and parser combinator method defined for all Parsers, takes in a second parser g and returns the parser And<Self, G>. This is the actual implementation and is not usually used directly — And is a Parser too, so it’s not so important how it actually works.

If you check the documentation for And<F, G>, you will notice that the parser’s associated type Output = (<F as Parser<I>>::Output, <G as Parser<I>>::Output) is a pair consisting of the outputs of the “inner” parsers.

The and-parser can be used by calling .and directly, like this:

let parser = char('(').and(char('+'));
let (rest, (par_open, plus)) = parser.parse("(+abc").unwrap();
assert_eq!(rest, "abc");

but more conveniently, a tuple can be used to achieve the same thing:

let parser = (char('('), char('+'));
let (rest, (par_open, plus)) = parser.parse("(+abc").unwrap();
assert_eq!(rest, "abc");

Or Parser Combinator

fn or<G>(self, g: G) -> Or<Self, G>
where
    G: Parser<Input, Output = Self::Output, Error = Self::Error>,
    Self: Sized,

The or parser combinator method takes in another parser with the same output type and returns a parser of the same output type. It attempts to parse using self first, which is allowed to fail. If the self parser fails, g is used to parse the same input insteade.

many0 Parser Combinator

pub fn many0<I, F>(
    f: F,
) -> impl Parser<I, Output = Vec<<F as Parser<I>>::Output>, Error = <F as Parser<I>>::Error>
where
    I: Clone + Input,
    F: Parser<I>,

The many0 parser combinator repeats the inner parser until it fails and returns the parsed items in a vector.

The function signature says that the output of many0 is a Parser over the same input I as f, and the Output is a Vector of <F as Parser<I>>::Output, i.e. the output of the f parser.

Map Parser Combinator

fn map<G, O2>(self, g: G) -> Map<Self, G>
where
    G: FnMut(Self::Output) -> O2,
    Self: Sized,

The map combinator takes in a function, which is used to map over the result of the parser, instead of another parser.

Looking at the Map struct’s implementation, we see that it is a parser over the output of the function g.

impl<I, O2, E: ParseError<I>, F: Parser<I, Error = E>, G: FnMut(<F as Parser<I>>::Output) -> O2> Parser<I> for Map<F, G> {
    type Output = O2;
    // ...
}

The trait bound G: FnMut(<F as Parser<I>>::Output) -> O2 means that g takes as input the output of the self parser, and produces output of any type O2.

Loading Exercise...

Parsing Left-Associative Infix Notation

Loading Exercise...

Summary

Here as a nice summary of the parser combinators in nom: choosing_a_combinator.md.

Footnotes

Footnotes

  1. See also Parsec library