String Processing
Learning Objectives
- You know how to work with strings in Dart and know how to iterate over characters in a string.
- You can look for occurrences in a string and know how to use regular expressions to find patterns in strings.
- You can extract data from strings using regular expressions.
Strings are a basic data type in many programming languages and are used to represent text. As source code is text, understanding how to work with strings is essential when designing and implementing programming languages.
Strings in Dart
Strings in Dart are represented by the String class, which provides methods and properties for working with strings. A string literal is written within double quotes (") or single quotes ('), while multi-line strings are written within triple quotes (''' or """).
Not all programming languages have a built-in string data type. For example, C has no dedicated string type: strings are represented as arrays of characters, and string operations are provided by the standard library rather than by the language itself.
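The String class has a property length that returns the length of the string, and individual characters can be accessed using the index operator [], which takes an integer index (starting from 0) and returns the character at that index as a string. Both are defined in terms of UTF-16 code units, which coincide with characters for most common text; the exceptions are discussed in the String encoding section.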
The following example demonstrates how to create a string, print its length, and show the character at index 1.
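A minimal sketch of such an example could look as follows; the sample string is chosen only for illustration.

```dart
void main() {
  // A sample string, assumed here for illustration.
  const text = 'Hello';

  print(text);        // Hello
  print(text.length); // 5
  print(text[1]);     // e (indexing starts from 0)
}
```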
Like in many other programming languages, strings in Dart are immutable, which means that they cannot be changed after they have been created. This is done for performance reasons, as it allows the compiler to optimize string operations. Operations that appear to modify a string, such as converting it to upper case, actually create a new string with the modified content.
String encoding
Strings in Dart are encoded using UTF-16, which is a variable-length encoding that uses one or two 16-bit code units to represent characters. This allows Dart to represent characters from different languages and scripts, as well as special characters, such as emojis. As an example, a string like "🎯" is represented using two 16-bit code units.
This means that iterating over a string using the index operator [] may not work as expected, as some characters are represented using multiple code units. This is shown below.
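The following sketch illustrates the issue. It prints the code units as hexadecimal numbers instead of printing the indexed strings directly, since half of a surrogate pair is not a printable character on its own.

```dart
void main() {
  const target = '🎯';

  // One visible character, but two UTF-16 code units.
  print(target.length); // 2

  // Each index refers to one half of a surrogate pair.
  for (var i = 0; i < target.length; i++) {
    print(target.codeUnitAt(i).toRadixString(16)); // d83c, then dfaf
  }
}
```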
Dart provides a property runes that returns an iterable of the Unicode code points in the string. The code points are represented as integers, and you can convert them to strings using the String.fromCharCode constructor.
In the following, we iterate over the runes, printing each using String.fromCharCode.
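A minimal sketch of iterating over runes, with a sample string assumed for illustration:

```dart
void main() {
  const text = 'Hi 🎯';

  // Each rune is one Unicode code point, given as an integer.
  for (final rune in text.runes) {
    print(String.fromCharCode(rune)); // H, i, (space), 🎯
  }

  print(text.length);       // 5 code units
  print(text.runes.length); // 4 code points
}
```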
There is also a convenience package called characters, which provides an extension that allows iterating over the characters of a string as a user perceives them. The use of the characters package is shown below.
In these materials, for simplicity, we assume that the strings do not contain characters that are represented using multiple code units. If you are working with strings that contain such characters, you should use the runes property or the characters package when working with strings.
Parsing strings to numbers
Dart provides methods for parsing strings to numbers. The int.parse method is used to parse a string to an integer, while the double.parse method is used to parse a string to a double. If the string cannot be parsed, a FormatException is thrown.
The following example shows how to parse a string to an integer and a double.
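A short sketch of both methods follows. The input strings are made up for illustration, and the last part shows the exception thrown on invalid input.

```dart
void main() {
  final count = int.parse('42');
  final price = double.parse('3.14');

  print(count + 1); // 43
  print(price * 2); // 6.28

  // Invalid input throws a FormatException.
  try {
    int.parse('forty-two');
  } on FormatException {
    print('Could not parse the string as an integer.');
  }
}
```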
If you are unsure whether the string can be parsed, you can use the tryParse method, which returns null if the string cannot be parsed. The following example shows how to use the tryParse method to parse a string to an integer and a double.
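A sketch with made-up inputs. The last lines show one possible way to fall back to a default value using the ?? operator.

```dart
void main() {
  print(int.tryParse('42'));     // 42
  print(int.tryParse('4.2'));    // null (not an integer)
  print(double.tryParse('4.2')); // 4.2
  print(double.tryParse('abc')); // null

  // Fall back to a default value when parsing fails.
  final value = int.tryParse('not a number') ?? 0;
  print(value); // 0
}
```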
Integers and doubles inherit from the num class, which provides methods for working with numbers. If it is unclear what the type of the number is, you can use the num class to work with the number. The num class also provides a method for parsing strings to numbers, num.parse. The following example shows how to use the num.parse method to parse a string to a number.
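A minimal sketch with made-up inputs. The type checks show that the parsed value is an int or a double depending on the input.

```dart
void main() {
  final num whole = num.parse('42');
  final num fraction = num.parse('4.2');

  print(whole);    // 42
  print(fraction); // 4.2

  // The concrete type depends on the parsed text.
  print(whole is int);       // true
  print(fraction is double); // true
}
```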
Looking for occurrences in a string
Checking whether a string contains another string can be done with the contains method, which returns a boolean. If we wanted to, for example, count the number of vowels in a string, we could use the following code.
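One possible way to write this is sketched below. The function name countVowels and the sample text are assumptions made for illustration.

```dart
// Counts lowercase vowels by checking each character against 'aeiou'.
// countVowels is a hypothetical helper name.
int countVowels(String text) {
  const vowels = 'aeiou';
  var count = 0;
  for (var i = 0; i < text.length; i++) {
    if (vowels.contains(text[i])) {
      count++;
    }
  }
  return count;
}

void main() {
  print(countVowels('Hello World')); // 3
}
```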
The contains method is case-sensitive, so if we wanted to count, for example, both upper and lower case vowels, we would need to convert the string to lower or upper case before checking. Converting to upper case can be done with the method toUpperCase, while converting to lower case is done with toLowerCase.
In the following, we count the number of vowels in a string, regardless of the case of the characters.
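A sketch of the case-insensitive version; the function name and sample text are again assumptions for illustration.

```dart
// Counts vowels regardless of case by lowercasing the text first.
// countVowelsIgnoringCase is a hypothetical helper name.
int countVowelsIgnoringCase(String text) {
  const vowels = 'aeiou';
  final lower = text.toLowerCase();
  var count = 0;
  for (var i = 0; i < lower.length; i++) {
    if (vowels.contains(lower[i])) {
      count++;
    }
  }
  return count;
}

void main() {
  // A, A, and e are counted.
  print(countVowelsIgnoringCase('An Apple')); // 3
}
```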
While the above shows some examples of looking for occurrences in a string, Dart provides a handful of other methods for working with strings. Some of the most commonly used methods are:
- startsWith and endsWith to check if a string starts or ends with another string.
- indexOf and lastIndexOf to find the index of a substring in a string.
- split to split a string into a list of substrings.
- trim to remove leading and trailing whitespace.
- replaceAll to replace all occurrences of a substring with another string.
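The methods listed above can be tried out with a sketch like the following, where the input line is made up for illustration.

```dart
void main() {
  const line = '  name=Dart;type=language  ';
  final trimmed = line.trim(); // 'name=Dart;type=language'

  print(trimmed.startsWith('name'));     // true
  print(trimmed.endsWith('language'));   // true
  print(trimmed.indexOf('='));           // 4
  print(trimmed.lastIndexOf('='));       // 14
  print(trimmed.split(';'));             // [name=Dart, type=language]
  print(trimmed.replaceAll('=', ': '));  // name: Dart;type: language
}
```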
In addition to the above, when going beyond string manipulation into looking for patterns, regular expressions are the way to go.
Regular expressions
Regular expressions are a pattern-matching language that allows you to search for patterns in strings. They are used in a wide range of applications, both within programming and outside of it: in addition to programming, regular expressions are used in command line tools, text editors, and spreadsheets.
Finding literals
Dart provides a RegExp class that allows defining and working with regular expressions. The most basic regular expression would be a literal, such as a, which matches the character a. The following example shows how to use a regular expression to check whether a string contains the character a.
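A minimal sketch of matching the literal a, with sample strings chosen for illustration:

```dart
void main() {
  // A regular expression consisting of the single literal 'a'.
  final pattern = RegExp(r'a');

  print(pattern.hasMatch('banana')); // true
  print(pattern.hasMatch('hello'));  // false
}
```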
Above, we use the hasMatch method to check if the string contains the character "a". The method returns a boolean that indicates whether the regular expression matches the string.
Note that the regular expression is defined as a raw string, which is done by prefixing the string with r. This is done to avoid having to escape backslashes in the regular expression.
The string literals can be of any length, and they can contain any characters. The following example shows how to use a regular expression to find the string “ello” in a string.
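A sketch of a longer literal, with made-up sample strings:

```dart
void main() {
  final pattern = RegExp(r'ello');

  print(pattern.hasMatch('Hello, world!')); // true
  print(pattern.hasMatch('Help'));          // false
}
```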
Regular expressions come with modifiers, which allow you to specify options for the regular expression. The most commonly used modifier is the i modifier, which makes the regular expression case-insensitive.
In Dart, the modifier is specified as an optional parameter caseSensitive in the RegExp constructor, as shown below.
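A minimal sketch comparing the default behavior with caseSensitive set to false; the sample string is made up.

```dart
void main() {
  final strict = RegExp(r'hello');
  final relaxed = RegExp(r'hello', caseSensitive: false);

  print(strict.hasMatch('HELLO there'));  // false
  print(relaxed.hasMatch('HELLO there')); // true
}
```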
Character classes and alternation
If we would like to look for all vowels in a string, we can use character classes, which are defined using square brackets []. The following example shows how to use a regular expression to check if a string contains a vowel.
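A sketch using the character class [aeiou], with sample strings assumed for illustration:

```dart
void main() {
  // Matches any single lowercase vowel.
  final vowel = RegExp(r'[aeiou]');

  print(vowel.hasMatch('hello'));  // true
  print(vowel.hasMatch('rhythm')); // false
}
```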
The above could have alternatively been written with alternation, which is used to match one of several alternatives. Alternation is defined with a vertical bar |. The following example shows how to use alternation to check if a string contains a vowel.
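The same check written with alternation could look like this sketch:

```dart
void main() {
  // Matches 'a' or 'e' or 'i' or 'o' or 'u'.
  final vowel = RegExp(r'a|e|i|o|u');

  print(vowel.hasMatch('hello'));  // true
  print(vowel.hasMatch('rhythm')); // false
}
```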
Special character sequences
In addition to literals and character classes, regular expressions allow using escaped character sequences that match specific characters. Some of the most commonly used special character sequences are \d that matches digits, \w that matches word characters, and \s that matches whitespace.
The following example shows a program that checks whether a string contains a digit.
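A minimal sketch of such a program, with made-up sample strings:

```dart
void main() {
  // \d matches any digit; note the raw string prefix r.
  final digit = RegExp(r'\d');

  print(digit.hasMatch('Room 101'));   // true
  print(digit.hasMatch('No numbers')); // false
}
```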
Try the above program out without the r prefix to see what happens. As mentioned earlier, the r prefix is used to create a raw string, which means that the backslashes are not interpreted as escape characters.
The special character sequences also have negated versions, which are written with an uppercase letter. For example, \D matches any character that is not a digit, \W matches any character that is not a word character, and \S matches any character that is not whitespace.
The following example shows how to use a regular expression to check if a string contains a non-digit character.
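A sketch using \D, with sample strings assumed for illustration:

```dart
void main() {
  // \D matches any character that is not a digit.
  final nonDigit = RegExp(r'\D');

  print(nonDigit.hasMatch('12345')); // false
  print(nonDigit.hasMatch('123a5')); // true
}
```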
In regular expressions, the dot character (.) is a special character that matches any character except a newline. If we wished to just look for the occurrence of a literal dot in a string, we would need to escape it with a backslash. The following example shows how to use a regular expression to check if a string contains a literal dot.
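The difference between an escaped and an unescaped dot can be seen in the following sketch, where the sample strings are made up.

```dart
void main() {
  final literalDot = RegExp(r'\.');
  final anyCharacter = RegExp(r'.');

  print(literalDot.hasMatch('example.com')); // true
  print(literalDot.hasMatch('example'));     // false

  // Without the backslash, the dot matches any character.
  print(anyCharacter.hasMatch('example'));   // true
}
```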
Quantifiers
Quantifiers specify how many times a character or group of characters can appear. The most commonly used quantifiers are * that matches zero or more occurrences, + that matches one or more occurrences, and ? that matches zero or one occurrence.
The following example shows how to use regular expression quantifiers to check whether a string contains "hooray!" with one or more "o" characters and zero or more exclamation marks.
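One way to write such a pattern is sketched below; the sample strings are made up.

```dart
void main() {
  // 'h', one or more 'o', 'ray', then zero or more '!'.
  final hooray = RegExp(r'ho+ray!*');

  print(hooray.hasMatch('hooooray!!!')); // true
  print(hooray.hasMatch('horay'));       // true (one 'o', no '!')
  print(hooray.hasMatch('hray'));        // false (no 'o')
}
```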
The following example shows how to use regular expression quantifiers to check if a string contains a number with decimal places. The regular expression checks whether there are one or more digits followed by a dot and one or more digits.
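A sketch of the pattern described above, with sample strings assumed for illustration:

```dart
void main() {
  // One or more digits, a literal dot, one or more digits.
  final decimal = RegExp(r'\d+\.\d+');

  print(decimal.hasMatch('Version 10.5 released')); // true
  print(decimal.hasMatch('3.14'));                  // true
  print(decimal.hasMatch('42'));                    // false
}
```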
Quantifiers can also be used to define the exact number of times that a character or character group should appear. This is done with curly brackets {}. For example, the regular expression a{2} matches the character a repeated 2 times, while the regular expression \d{6} matches a digit repeated 6 times.
In the following, we use a regular expression to check if a string contains six digits followed by a dash.
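A minimal sketch; the sample strings are invented for illustration and do not follow any particular identifier format.

```dart
void main() {
  // Exactly six digits followed by a dash, anywhere in the string.
  final pattern = RegExp(r'\d{6}-');

  print(pattern.hasMatch('010190-123A'));  // true
  print(pattern.hasMatch('01019-123A'));   // false (only five digits)
  print(pattern.hasMatch('0101901-123A')); // true (last six digits match)
}
```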
When you modify the above example to have seven digits, the regular expression still matches. This is because the method hasMatch checks if the regular expression matches any part of the string.
By default, quantifiers are greedy, which means that they match as many characters as possible. If you want to match as few characters as possible, you can use the non-greedy quantifiers *?, +?, and ??.
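The difference between greedy and non-greedy matching can be seen in the following sketch. The helper name firstMatchText and the sample text are assumptions for illustration; the helper uses the firstMatch method of RegExp to return the text of the first match.

```dart
// Returns the text of the first match, or null if there is none.
// firstMatchText is a hypothetical helper name.
String? firstMatchText(String pattern, String text) {
  return RegExp(pattern).firstMatch(text)?.group(0);
}

void main() {
  const text = '<b>bold</b>';

  // Greedy: .+ consumes as much as possible.
  print(firstMatchText(r'<.+>', text));  // <b>bold</b>

  // Non-greedy: .+? consumes as little as possible.
  print(firstMatchText(r'<.+?>', text)); // <b>
}
```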
Anchors
To match a position in a string, rather than the characters in the string, anchors are used. The most commonly used anchors are ^ that matches the beginning of a string, and $ that matches the end of a string.
If we wanted to check that a given string contains exactly six digits followed by a dash, and that the sequence appears at the beginning of the string, we could use the following regular expression.
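A sketch of such a regular expression, using the ^ anchor; the sample strings are invented for illustration.

```dart
void main() {
  // ^ anchors the six digits and the dash to the start of the string.
  final pattern = RegExp(r'^\d{6}-');

  print(pattern.hasMatch('010190-123A'));    // true
  print(pattern.hasMatch('0101901-123A'));   // false (seven digits)
  print(pattern.hasMatch('ID 010190-123A')); // false (not at the start)
}
```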
Groups and extracting matches
Groups are used to group characters together, which allows applying quantifiers to the group. Groups are defined using parentheses (). The following example shows how to use a regular expression to check if a string contains a word followed by a space and a digit.
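A minimal sketch with two groups, one for the word and one for the digit; the sample strings are made up.

```dart
void main() {
  // Group 1: one or more word characters. Group 2: a single digit.
  final pattern = RegExp(r'(\w+) (\d)');

  print(pattern.hasMatch('Chapter 7'));     // true
  print(pattern.hasMatch('Chapter seven')); // false
}
```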
If we wish to extract the matches from the string, we use the allMatches method, which returns an iterable of Match objects. A Match object contains the start and end indices of the match, as well as the matched string.
The following example shows how to use the allMatches method to extract a pair consisting of a word and a digit, with a whitespace in between. The matches are then printed.
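A sketch of extracting such pairs follows. The helper name extractPairs and the sample text are assumptions made for illustration.

```dart
// Collects 'word:digit' strings for every match in the text.
// extractPairs is a hypothetical helper name.
List<String> extractPairs(String text) {
  final pattern = RegExp(r'(\w+) (\d)');
  final pairs = <String>[];
  for (final match in pattern.allMatches(text)) {
    // group(1) is the word, group(2) is the digit.
    pairs.add('${match.group(1)}:${match.group(2)}');
  }
  return pairs;
}

void main() {
  const text = 'Room 4 and Floor 2';

  for (final match in RegExp(r'(\w+) (\d)').allMatches(text)) {
    // Prints "Room 4 at 0-6" and then "Floor 2 at 11-18".
    print('${match.group(0)} at ${match.start}-${match.end}');
  }

  print(extractPairs(text)); // [Room:4, Floor:2]
}
```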
The group method of the match takes an integer index and returns the matched string for the group at that index. The index 0 returns the entire match, while the index 1 returns the first group, index 2 the second group, and so on.
The group method returns a nullable value, which means that it returns null if the group did not take part in the match. If you are sure that the group has a value, you can add the ! operator to the end of the group method call to assert that the value is not null.
If we would wish to, for example, extract all decimal numbers from the text, we could modify the regular expression to match one or more digits followed by a dot and zero or more digits. The following example shows how to extract all decimal numbers from a string.
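A sketch of this extraction is shown below. The helper name extractDecimals and the sample text are assumptions for illustration.

```dart
// Returns every substring that consists of one or more digits,
// a dot, and zero or more digits.
// extractDecimals is a hypothetical helper name.
List<String> extractDecimals(String text) {
  final pattern = RegExp(r'\d+\.\d*');
  return [for (final match in pattern.allMatches(text)) match.group(0)!];
}

void main() {
  // 100 has no dot, so it is not matched.
  print(extractDecimals('Prices: 12.50, 3.99 and 100')); // [12.50, 3.99]
}
```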
Alternatively, if we do not know whether the decimal number has a decimal point, we could modify the regular expression to match one or more digits followed by a dot and zero or more digits, or just one or more digits. The following example shows how to extract all decimal numbers from a string, regardless of whether they have a decimal point.
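One way to express this is with alternation, as in the following sketch. The helper name extractNumbers and the sample text are assumptions for illustration.

```dart
// Matches digits with a dot and optional decimals, or plain digits.
// extractNumbers is a hypothetical helper name.
List<String> extractNumbers(String text) {
  final pattern = RegExp(r'\d+\.\d*|\d+');
  return [for (final match in pattern.allMatches(text)) match.group(0)!];
}

void main() {
  print(extractNumbers('Prices: 12.50, 3.99 and 100')); // [12.50, 3.99, 100]
}
```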
The RegExp class also has a method firstMatch that returns the first match in the string if one exists, and otherwise null (i.e., the method returns a nullable value). The method matchAsPrefix is given a string and, optionally, a start index; it returns a match only if the regular expression matches starting exactly at that index, and otherwise null.
The following example shows how to use the firstMatch method to extract the number found from a string, and then print the number if it exists.
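A minimal sketch of firstMatch; the sample text is invented for illustration.

```dart
void main() {
  const text = 'Price is 123.45, 678.90 Stuff';
  final pattern = RegExp(r'\d+\.\d+');

  // firstMatch returns null when nothing matches.
  final match = pattern.firstMatch(text);
  if (match != null) {
    print(match.group(0)); // 123.45
  }

  print(pattern.firstMatch('no numbers here')); // null
}
```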
The following shows the same, but using the matchAsPrefix method to extract the number found from a string, starting from index 17. The index 17 is the start of the string "678.90 Stuff".
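A sketch of matchAsPrefix follows. The full sample text is an assumption, constructed so that "678.90 Stuff" starts at index 17.

```dart
void main() {
  // Sample text built so that '678.90 Stuff' begins at index 17.
  const text = 'Price is 123.45, 678.90 Stuff';
  final pattern = RegExp(r'\d+\.\d+');

  // The match must begin exactly at the given index.
  final match = pattern.matchAsPrefix(text, 17);
  if (match != null) {
    print(match.group(0)); // 678.90
  }

  // Index 16 is a space, so there is no match there.
  print(pattern.matchAsPrefix(text, 16)); // null
}
```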
Try modifying the index to 16, 18, and 19 to see how the method behaves when the index is before, at, and after the start of the match.
If you are unfamiliar with the regular expression syntax, it may seem a bit daunting at first. Complex regular expressions can seem daunting even to experienced developers.
Generative AI tools are good at transforming strings from one form to another, and they can be used to generate regular expressions. Regardless, it is good to have a basic understanding of regular expressions, even if you use generative AI tools to generate them.