Parsing with Nom
Learning Objectives
- You know how to use libraries in Rust.
- You know of nom and can read nom code.
- You know some basic parser combinators for nom.
Make sure you have read the chapter on Parsing and Recursion from part 4 before proceeding.
Parsing in Rust
The Rust ecosystem has a parser combinator library called nom that allows building monadic1 parsers that comes with many parser combinators.
To use nom, you need to add it to your Cargo.toml
:
[package]
name = "rust-project"
version = "0.1.0"
edition = "2021"
[dependencies]
nom = "8.0.0"
After this, the library is available for use in your project.
In version 8 of Nom, many things changed in a backwards-incompatible manner. When looking up information about nom, make sure it is compatible with version 8 of the library.
Parsing a single character
A parser that would parse a string consisting of either of the character a
or the character b
would be implemented as follows.
use nom::{
branch::alt,
character::complete::char,
IResult, Parser,
};
fn ab_parser (input: &str) -> IResult<&str, char> {
let mut parser = alt((char('a'), char('b')));
parser.parse(input)
}
fn main() {
let input = "a";
match ab_parser(input) {
Ok((_, parsed)) => {
println!("{}", parsed);
}
Err(e) => println!("Error parsing: {:?}", e),
}
}
Parsers written with nom are composed out of smaller parsers, which are combined together. The above program uses the alt
combinator to combine two parsers, char('a')
and char('b')
, into a single parser. The char
parser is used to parse a single character from the input string, while the alt
combinator will try to parse the input string with the parsers it receives as arguments, one by one, and return the result of the first successful parse.
The parse
method of a parser returns a Result
that contains either the parsed value or an error. The Ok
variant contains an IResult
from nom, which is a tuple with the remaining input and the parsed value, while the Err
variant contains an error.
Parsing a list of characters
List of characters can be parsed with the separated_list
combinator, which takes a separator parser and a value parser as arguments. The following example shows how to parse a list of digits separated by whitespaces.
use nom::{
character::complete::digit1,
character::complete::multispace0,
multi::separated_list1,
IResult, Parser,
};
fn parse_digit_list(input: &str) -> IResult<&str, Vec<&str>> {
separated_list1(multispace0, digit1).parse(input)
}
fn main() {
let input = "123 456 789";
match parse_digit_list(input) {
Ok((_, parsed)) => {
dbg!("{}", parsed);
}
Err(e) => println!("Error parsing: {:?}", e),
}
}
The parser digit1
reads one or more digits from the input string, while multispace0
reads zero or more whitespace characters. The separated_list1
combinator is used to parse a list of digits separated by whitespaces. The separated_list1
combinator will first parse a value with the value parser, then parse a separator with the separator parser, and repeat this process until the separator parser fails.
Lisp-like parsing
To parse a Lisp-like expression, such as (+ 1 2)
, you would need to define a parser that reads in the opening parentheses, the operator, the operands, and the closing parenthesis.
Nom allows us to write a sequence of parser using a tuple ()
, as can be seen in the following example:
fn lispish_parser (input: &str) -> IResult<&str, Vec<String>> {
let mut parser = (
char('('),
char('+'),
multispace0,
separated_list1(multispace0, digit1),
char(')')
);
let (input, (par_open, op, spaces, digits, par_close)) = parser.parse(input)?;
let mut result = vec![par_open.to_string(), op.to_string()];
digits.iter().for_each(|s| result.push(s.to_string()));
result.push(par_close.to_string());
Ok((input, result))
}
Above, once the parsing is complete, the result is transformed into a Vec<String>
, which is then returned as the result of the lispish_parser
. It is also possible to transform the result into a different type, such as tokens.
Decoding nom
’s function signatures
Open nom’s documentation.
First, the type of parsers is I -> IResult<I, O>
, where I
is the type of the input (most commonly &str
) and O
is the type of the output produced by the parser.
The IResult
is a simple wrapper around a result and a pair.
Let’s next take a look at a couple parser combinators in nom
.
And Parser Combinator
fn and<G, O2>(self, g: G) -> And<Self, G>
where
G: Parser<Input, Output = O2, Error = Self::Error>,
Self: Sized,
The and
parser combinator method defined for all Parser
s, takes in a second parser g
and returns the parser And<Self, G>
.
This is the actual implementation and is not usually used directly — And
is a Parser
too, so it’s not so important how it actually works.
If you check the documentation for And<F, G>
, you will notice that the parser’s associated type Output = (<F as Parser<I>>::Output, <G as Parser<I>>::Output)
is a pair consisting of the outputs of the “inner” parsers.
The and-parser can be used by calling .and
directly, like this:
let parser = char('(').and(char('+'));
let (rest, (par_open, plus)) = parser.parse("(+abc").unwrap();
assert_eq!(rest, "abc");
but more conveniently, a tuple can be used to achieve the same thing:
let parser = (char('('), char('+'));
let (rest, (par_open, plus)) = parser.parse("(+abc").unwrap();
assert_eq!(rest, "abc");
Or Parser Combinator
fn or<G>(self, g: G) -> Or<Self, G>
where
G: Parser<Input, Output = Self::Output, Error = Self::Error>,
Self: Sized,
The or
parser combinator method takes in another parser with the same output type and returns a parser of the same output type.
It attempts to parse using self
first, which is allowed to fail.
If the self
parser fails, g
is used to parse the same input insteade.
many0
Parser Combinator
pub fn many0<I, F>(
f: F,
) -> impl Parser<I, Output = Vec<<F as Parser<I>>::Output>, Error = <F as Parser<I>>::Error>
where
I: Clone + Input,
F: Parser<I>,
The many0
parser combinator repeats the inner parser until it fails and returns the parsed items in a vector.
The function signature says that the output of many0
is a Parser
over the same input I
as f
, and the Output
is a Vec
tor of <F as Parser<I>>::Output
, i.e. the output of the f
parser.
Map Parser Combinator
fn map<G, O2>(self, g: G) -> Map<Self, G>
where
G: FnMut(Self::Output) -> O2,
Self: Sized,
The map
combinator takes in a function, which is used to map over the result of the parser, instead of another parser.
Looking at the Map
struct’s implementation, we see that it is a parser over the output of the function g
.
impl<I, O2, E: ParseError<I>, F: Parser<I, Error = E>, G: FnMut(<F as Parser<I>>::Output) -> O2> Parser<I> for Map<F, G> {
type Output = O2;
// ...
}
The trait bound
G: FnMut(<F as Parser<I>>::Output) -> O2
means thatg
takes as input the output of theself
parser, and produces output of any typeO2
.
Parsing Left-Associative Infix Notation
Summary
Here as a nice summary of the parser combinators in nom: choosing_a_combinator.md.
Footnotes
Footnotes
-
See also
Parsec
library ↩