Back to BASICs

From Source to a Token Sequence with a Lexer


Learning Objectives

  • You know how to define tokens for a lexer.
  • You can build a lexer that tokenizes a simple BASIC program.

Here, we look concretely at defining tokens and building a lexer. Let's start with the basics and create the functionality for tokenizing code that looks as follows.

10 LET A = 5
20 PRINT A, "Hello, BASIC!"

Tokens

To identify the tokens in the code, we start by identifying the categories of tokens it contains. This involves going over the code part by part and noting the category of each piece.

As an example, for the first line 10 LET A = 5, the token categories would be number literal, let, identifier, equals, number literal, and end of line.

Similarly, for the second line 20 PRINT A, "Hello, BASIC!", the token categories would be number literal, print, identifier, comma, string literal, and end of line.

Combining what we identified so far, we get a list of token categories: number literal, let, identifier, equals, print, comma, string literal, and end of line. To represent these categories programmatically, we use an enum.

Create a file called tokens.dart in the project, and add the following code to it.

enum Category {
  comma,
  endOfLine,
  equals,
  identifier,
  let,
  numberLiteral,
  print,
  stringLiteral,
}

Then, in the same file, create a class Token that represents individual tokens. The class has a category and a value associated with it. The method bool operator ==(Object other) is added to allow comparing tokens for equality; since == is overridden, hashCode is overridden as well, keeping the two consistent as Dart expects.

class Token {
  final Category category;
  final String value;

  Token(this.category, this.value);

  @override
  String toString() {
    return "$category($value)";
  }

  @override
  bool operator ==(Object other) {
    return other is Token && other.category == category && other.value == value;
  }

  @override
  int get hashCode => Object.hash(category, value);
}
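
As a quick sanity check, the snippet below (a throwaway sketch, not part of the interpreter) shows what a token's toString output looks like and that two tokens with the same category and value compare as equal.

import 'package:basic_interpreter/tokens.dart';

void main() {
  final token = Token(Category.numberLiteral, '10');
  print(token); // Category.numberLiteral(10)
  print(token == Token(Category.numberLiteral, '10')); // true
}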

Lexer

With the tokens identified, we can create a Lexer class that takes a string of code and converts it into a list of tokens. Create a file called lexer.dart and add the following class Lexer into it.

import "tokens.dart";

class Lexer {
  final String source;
  int position = 0;

  Lexer(this.source);
}

Then, for the tokenization, we add a method to the Lexer class that reads the tokens from the source code and returns them as a list. The method goes over the source code character by character and forms tokens based on the characters it encounters. As a starting point, we create a method tokenize that goes over the code and returns an empty list of tokens.

List<Token> tokenize() {
  List<Token> tokens = [];

  while (position < source.length) {
    // The current character; the following steps form tokens based on it.
    String char = source[position];
    position++;
  }

  return tokens;
}
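
At this point, the method does not produce any tokens yet. A throwaway check (not part of the project files) confirms that the skeleton compiles and returns an empty list.

import 'package:basic_interpreter/lexer.dart';

void main() {
  print(Lexer('10 LET A = 5').tokenize()); // []
}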

We can now start adding the logic for tokenizing the code. First, create a test file lexer_test.dart in the test/ directory, and add the following code to it.

import 'package:basic_interpreter/lexer.dart';
import 'package:basic_interpreter/tokens.dart';
import 'package:test/test.dart';

void main() {
}

We’ll again work in a test-driven manner, starting with number literals.

Number literals

Number literals are sequences of digits, optionally followed by a decimal part, that represent numbers. We can identify them by looking for sequences of digits in the code, and the tokenizer should extract them as number literal tokens. Add the following test to the main method of the lexer_test.dart file.

  test('Number literals: 10 4.2', () {
    var lexer = Lexer('10 4.2');
    expect(lexer.tokenize(), [
      Token(Category.numberLiteral, '10'),
      Token(Category.numberLiteral, '4.2')
    ]);
  });

We can run the tests in a specific file by adding the file name as an argument to the test command. In our case, to run the tests in the lexer_test.dart file, we run the command dart test test/lexer_test.dart in the terminal. The output shows that some tests failed.

// ...
00:00 +0 -1: Some tests failed.

To identify number literals, we can use a regular expression to identify digits as we are scanning the source code. If a digit is encountered, we extract the sequence of digits that starts from its position and form a number literal token, adding it to the list of tokens. The relevant regular expressions are as follows.

final numberRegex = RegExp(r'\d');
final numberLiteralRegex = RegExp(r'(\d+(\.\d+)?)');

The method matchAsPrefix allows using a regular expression to match a prefix of a string, i.e., to match the string starting from a given position, succeeding only if the pattern matches right at that position. The snippet below (a standalone sketch with a made-up input, not part of the lexer) illustrates the behavior.
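
void main() {
  final numberLiteralRegex = RegExp(r'(\d+(\.\d+)?)');
  // Made-up example input; position 8 points at the digit '4', so the match succeeds.
  print(numberLiteralRegex.matchAsPrefix('LET A = 4.2', 8)?.group(1)); // 4.2
  // Position 4 points at 'A', which is not a digit, so the result is null.
  print(numberLiteralRegex.matchAsPrefix('LET A = 4.2', 4)); // null
}

Using these regular expressions and matchAsPrefix, the concrete logic for extracting the number literals is as follows.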

List<Token> tokenize() {
  final numberRegex = RegExp(r'\d');
  final numberLiteralRegex = RegExp(r'(\d+(\.\d+)?)');

  List<Token> tokens = [];

  while (position < source.length) {
    String char = source[position];
    if (numberRegex.hasMatch(char)) {
      // A digit marks the start of a number literal; extract the full literal.
      final match = numberLiteralRegex.matchAsPrefix(source, position);
      if (match != null) {
        tokens.add(Token(Category.numberLiteral, match.group(1)!));
        position = match.end;
        continue;
      }
    }

    // Characters that do not start a token (e.g. spaces) are skipped.
    position++;
  }

  return tokens;
}

Now, when we run the tests, we see that the tests pass.

00:00 +1: All tests passed!

String literals

String literals are sequences of characters that are enclosed in double quotes. Add the following test to the lexer_test.dart file.

  test('String literals: "Hello, BASIC!"', () {
    var lexer = Lexer('"Hello, BASIC!"');
    expect(lexer.tokenize(), [
      Token(Category.stringLiteral, 'Hello, BASIC!')
    ]);
  });

When we run the test, we see that it fails.

00:00 +1 -1: Some tests failed.

In the lexer, we can identify the start of a string literal by looking for a double quote character ("). When we find one, we use a regular expression to capture the sequence of characters up to the closing double quote. The relevant regular expression is as follows.

final stringLiteralRegex = RegExp(r'"(.*?)"');
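
The .*? in the pattern is non-greedy, so the match stops at the first closing quote instead of the last one. A quick standalone check (not part of the lexer, with a made-up input) illustrates this.

void main() {
  final stringLiteralRegex = RegExp(r'"(.*?)"');
  // Only the characters up to the first closing quote are captured.
  print(stringLiteralRegex.matchAsPrefix('"Hello", "BASIC"')?.group(1)); // Hello
}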

The check for a double quote character is added as an else if block to the tokenize method. With the change, the method looks as follows.

List<Token> tokenize() {
  // ...
  final stringLiteralRegex = RegExp(r'"(.*?)"');

  List<Token> tokens = [];

  while (position < source.length) {
    String char = source[position];
    if (numberRegex.hasMatch(char)) {
      // ...
    } else if (char == '"') {
      final match = stringLiteralRegex.matchAsPrefix(source, position);
      if (match != null) {
        tokens.add(Token(Category.stringLiteral, match.group(1)!));
        position = match.end;
        continue;
      }
    }

    position++;
  }

  return tokens;
}

Now, when we run the tests, we see that the tests pass.

// ...
00:00 +2: All tests passed!

Let’s add another test to check that the lexer can handle two string literals separated by a comma.

  test('String literals: "Hello", "BASIC"', () {
    var lexer = Lexer('"Hello", "BASIC"');
    expect(lexer.tokenize(), [
      Token(Category.stringLiteral, 'Hello'),
      Token(Category.stringLiteral, 'BASIC')
    ]);
  });

When we run the tests, we see that they pass. Time to move forward.

Keywords and identifiers

For keywords and identifiers, we can use a regular expression to extract a character sequence from the source code. Once we have extracted the sequence, we decide whether it is a keyword or an identifier based on its value. Add the following test to the lexer_test.dart file.

  test('Keywords and identifiers: LET PRINT A', () {
    var lexer = Lexer('LET PRINT A');
    expect(lexer.tokenize(), [
      Token(Category.let, 'LET'),
      Token(Category.print, 'PRINT'),
      Token(Category.identifier, 'A')
    ]);
  });

When we run the tests, we see that the tests fail.

00:00 +3 -1: Some tests failed.

Implementation-wise, a single regular expression suffices both for checking whether the current character is an uppercase letter and for extracting the whole sequence of letters that starts from it. Since the keywords and variable names in our BASIC programs are written in uppercase, matching uppercase letters is enough. The regular expression is as follows.

final keywordOrIdentifierRegex = RegExp(r'[A-Z]+');
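
As a quick standalone check (not part of the lexer, with made-up input), the sketch below shows that the pattern extracts whole uppercase words and does not match lowercase input.

void main() {
  final keywordOrIdentifierRegex = RegExp(r'[A-Z]+');
  print(keywordOrIdentifierRegex.matchAsPrefix('PRINT A')?.group(0)); // PRINT
  print(keywordOrIdentifierRegex.matchAsPrefix('print a')); // null
}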

The check using this regular expression is added as an else if block to the tokenize method, which then looks as follows.

List<Token> tokenize() {
  final keywordOrIdentifierRegex = RegExp(r'[A-Z]+');
  // ...

  while (position < source.length) {
    String char = source[position];
    if (numberRegex.hasMatch(char)) {
      // ...
    } else if (char == '"') {
      // ...
    } else if (keywordOrIdentifierRegex.hasMatch(char)) {
      final match = keywordOrIdentifierRegex.matchAsPrefix(source, position);
      if (match != null) {
        // distinguish between keywords and identifiers

        position = match.end;
        continue;
      }
    }

    position++;
  }

  return tokens;
}

To distinguish between keywords and identifiers, we can use a switch expression on the matched character sequence. The switch expression checks the matched value and produces a token of the corresponding category, falling back to an identifier when the value is not a known keyword. With this functionality, the contents of the else if branch for keywords and identifiers look as follows.

final match = keywordOrIdentifierRegex.matchAsPrefix(source, position);
if (match != null) {
  final keyword = match.group(0)!;
  final token = switch (keyword) {
    "LET" => Token(Category.let, keyword),
    "PRINT" => Token(Category.print, keyword),
    _ => Token(Category.identifier, keyword),
  };

  tokens.add(token);
  position = match.end;
  continue;
}

Now, when we run the tests, we again see that they pass.

00:00 +4: All tests passed!

Comma, equals, and new line

Let’s next add the tokens for the comma, equals, and new line characters. Add the following test to the lexer_test.dart file.

  test('Comma, equals, and new line: , = \n', () {
    var lexer = Lexer(', = \n');
    expect(lexer.tokenize(), [
      Token(Category.comma, ','),
      Token(Category.equals, '='),
      Token(Category.endOfLine, '\n')
    ]);
  });

When we run the tests, we see that they fail.

00:00 +4 -1: Some tests failed.

Adding the comma, equals, and new line tokens is quite straightforward. For all three, we look at the character currently being scanned; if it matches the character we are looking for, we add the corresponding token to the list of tokens. This leads to three new else if blocks in the tokenize method, as shown below.


List<Token> tokenize() {
  // ...

  while (position < source.length) {
    String char = source[position];
    if (numberRegex.hasMatch(char)) {
      // ...
    } else if (char == '"') {
      // ...
    } else if (keywordOrIdentifierRegex.hasMatch(char)) {
      // ...
    } else if (char == ',') {
      tokens.add(Token(Category.comma, char));
      position++;
      continue;
    } else if (char == '=') {
      tokens.add(Token(Category.equals, char));
      position++;
      continue;
    } else if (char == '\n') {
      tokens.add(Token(Category.endOfLine, char));
      position++;
      continue;
    }

    position++;
  }

  return tokens;
}

When we run the tests, we notice that the newly added test passes, but the test named String literals: "Hello", "BASIC" now fails. This is because the test input contains a comma, but the expected token list did not include a comma token. To fix this, we update the test to expect the comma token.

  test('String literals: "Hello", "BASIC"', () {
    var lexer = Lexer('"Hello", "BASIC"');
    expect(lexer.tokenize(), [
      Token(Category.stringLiteral, 'Hello'),
      Token(Category.comma, ','),
      Token(Category.stringLiteral, 'BASIC')
    ]);
  });

Now, when we run the tests, we see that they all pass.

00:00 +5: All tests passed!

Trust but verify!

Now we have tests for the individual tokens, but we have not yet tested them together. To do so, add a new test that runs the lexer on a full program.

  test('Full program', () {
    var lexer = Lexer('''10 LET A = 5
20 PRINT A, "Hello, BASIC!"''');
    expect(lexer.tokenize(), [
      Token(Category.numberLiteral, '10'),
      Token(Category.let, 'LET'),
      Token(Category.identifier, 'A'),
      Token(Category.equals, '='),
      Token(Category.numberLiteral, '5'),
      Token(Category.endOfLine, '\n'),
      Token(Category.numberLiteral, '20'),
      Token(Category.print, 'PRINT'),
      Token(Category.identifier, 'A'),
      Token(Category.comma, ','),
      Token(Category.stringLiteral, 'Hello, BASIC!'),
    ]);
  });

When we run the tests, we see that they pass.

00:00 +6: All tests passed!

Yay! We have a working lexer that can tokenize a simple BASIC program. The next step is building a parser that can convert these tokens into expressions and statements.
