Pattern matching is more than just searching for some set of characters in your data; it’s a way of looking at data and processing that data in a manner that can be incredibly efficient and amazingly easy to program.
Pattern matching is the technique of searching a string containing text or binary data for some set of characters based on a specific search pattern. When you search for a string of characters in a file using the Find command in your word processor, or when you use a search engine to look for something on the Web, you're using a simple version of pattern matching: your criterion is "find these characters." In those environments, you can often customize your criteria in particular ways, for example, to search for this or that, to search for this or that but not the other thing, to search for whole words only, or to search only for those words that are 12 points and underlined. Pattern matching in Perl, however, can be even more complicated than that. Using Perl, you can define an incredibly specific set of search criteria, and do it in an incredibly small amount of space using a pattern-definition mini-language called regular expressions.
Perl's regular expressions, often called just regexes or REs, borrow from the regular expressions used in many Unix tools, such as grep(1) and sed(1). As with many other features Perl has borrowed from other places, however, Perl includes slight changes and lots of added capabilities. If you're used to using regular expressions, you'll be able to pick up Perl's regular expressions fairly easily, since most of the same rules apply (although there are some gotchas to be aware of, particularly if you've used sophisticated regular expressions in the past).
Note: The term regular expressions may seem sort of nonsensical. They don't really seem to be expressions, nor is it easy to figure out what's regular about them. Don't get hung up on the term itself; regular expression is a term borrowed from mathematics that refers to the actual language with which you write patterns for pattern matching in Perl.
I used the example of the search engine and the Find command earlier to describe the sorts of things that pattern matching can do. It’s important for you not to get hung up on thinking that pattern matching is only good for plain old searching. The sorts of things regular expressions can do in Perl include:
Making sure your user has entered the data you're looking for (input validation)
Verifying that input is in the right specific format, for example, that email addresses have the right components
Extracting parts of a file that match specific criteria (for example, you could extract the headings from a file to build a table of contents, or extract all the links in an HTML file)
Splitting a string into elements based on different separator fields (and often, complex nested separator fields)
Finding irregularities in a set of data: multiple spaces that don't belong there, duplicated words, errors in formatting
Counting the number of occurrences of a pattern in a string
Searching and replacing: finding a string that matches a pattern and replacing it with some other string
This is only a partial list, of course; you can apply Perl's regular expressions to all kinds of tasks. Generally, if there's a task for which you'd want to iterate over a string or over your data in another language, that task is probably better solved in Perl using regular expressions. Many of the operations you learned about yesterday for finding bits of strings can be better done with patterns.

Pattern Matching Operators and Expressions
To use pattern matching in Perl, you figure out what you want to find, you write a regular expression to find it, and then you stick that pattern in a situation where the result of finding (or not finding) that pattern makes sense. As with other aspects of Perl, where you put a pattern and what context you use it in determines how that pattern is used.
We'll start with a fairly simple case—patterns in a boolean scalar context, where if a string contains the pattern, the expression returns true.
To construct patterns in this way, you use two operators: the regular expression operator m// and the pattern-match operator =~, like this:
if ($string =~ m/foo/) {
    # do something...
}
What that test inside the if says is: if the string contained in $string contains the pattern foo, return true. Note that the =~ operator is not an assignment operator, even though it looks like one. =~ is used exclusively for pattern matching, and means, effectively, "find the pattern on the right somewhere in the string on the left." You'll sometimes find =~ called the binding operator.
The pattern itself is contained between the slashes in m//. This particular pattern is one of the simplest patterns you can create—it’s just three specific characters in sequence (you'll learn more about what constitutes a match and what doesn't later on). The pattern could just as easily be m/.*\d+/ or m/^[+-]?\d+\.?\d*$/ or some other seemingly incomprehensible set of characters (don't panic yet; you'll learn how to decipher those patterns soon).
For these sorts of patterns, the m is optional and can be left off the pattern itself (and usually is). In addition, you can leave off the variable and the =~ if you want to search the contents of the default variable $_. Commonly in Perl, you'll see shorthand pattern matching like this one:
if (/^\d+/) { # ...
Which is equivalent to
if ($_ =~ m/^\d+/) { # ...
You've already learned a simple case of this yesterday with the grep function, which can use patterns to find a bit of a string inside the $_ list element:
@foothings = grep /foo/, @strings;
That line, in turn, is equivalent to this long form:
@foothings = grep { $_ =~ /foo/ } @strings;
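As a small sketch of the equivalences above, the following compares the short and long forms side by side. The sample strings and variable names are made up for illustration and are not part of the lesson's scripts.

```perl
use strict;
use warnings;

# Hypothetical sample data, chosen only for this illustration.
my @strings = ('food', 'bar', 'tofoo', 'baz');

# Short form: grep sets $_ to each element and the bare pattern tests it.
my @short = grep /foo/, @strings;

# Long form: the same test with $_, =~, and the m written out.
my @long = grep { $_ =~ m/foo/ } @strings;

print "short: @short\n";
print "long:  @long\n";

# The same shorthand works in any loop that sets $_.
my $digit_lines = 0;
for ('42 apples', 'no number', '7up') {
    $digit_lines++ if /^\d+/;    # same as: $_ =~ m/^\d+/
}
print "lines starting with digits: $digit_lines\n";
```

Both lists should come out the same, which is the point: the short form is only a typing convenience.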
As we work through today's lesson, you'll learn different ways of using patterns in different contexts and for different reasons. Much of the work of learning pattern matching, however, involves actually learning the regular expression syntax to build patterns, so let's stick with this one situation for now.

Simple Patterns

We'll start with some of the simplest, most basic patterns you can create: patterns that match specific sequences of characters, patterns that match only at specific places in a string, and patterns combined using what's called alternation.

Character Sequences
One of the simplest patterns is just a sequence of characters you want to match, like this:
/foo/
/this or that/
/ /
/Laura/
/patterns that match specific sequences/
All of these patterns will match if the data contains those characters in that order. All the characters must match, including spaces. The word or in the second pattern doesn't have any special significance (it’s not a logical or); that pattern will only match if the data contains the string this or that somewhere inside it.
Note that characters in patterns can be matched anywhere in a string. Word boundaries are not relevant for these patterns—the pattern /if/ will match in the string "if wishes were horses" and in the string "there is no difference." The pattern /if /, however, because it contains a space, will only match in the first string where the characters i, f, and the one space occur in that order.
Upper- and lowercase are relevant for characters: /kazoo/ will only match kazoo and not Kazoo or KAZOO. To make a particular search case-insensitive, you can use the i option after the pattern itself (the i indicates ignore case), like this:
/kazoo/i # search for any upper and lowercase versions
Alternatively, you can create patterns that will search for either upper- or lowercase letters, as you'll learn about in the next section.
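To see these rules in action, here is a minimal sketch. The matches helper is hypothetical, and the qr// quoting operator (not covered in this section) is used only to hand a pattern to that helper; the strings come from the examples above.

```perl
use strict;
use warnings;

# Illustrative helper: returns 1 if the pattern matches the string, 0 otherwise.
sub matches {
    my ($string, $pattern) = @_;
    return $string =~ $pattern ? 1 : 0;
}

# /if/ matches anywhere, even in the middle of another word.
print matches('if wishes were horses',  qr/if/),  "\n";
print matches('there is no difference', qr/if/),  "\n";

# /if / also needs the trailing space, so "difference" no longer qualifies.
print matches('there is no difference', qr/if /), "\n";

# Case matters unless the i option is given.
print matches('KAZOO', qr/kazoo/),  "\n";
print matches('KAZOO', qr/kazoo/i), "\n";
```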
You can include most alphanumeric characters in patterns, including string escapes for binary data (octal and hex escapes). There are a number of characters that you cannot match without escaping them. These characters are called metacharacters and refer to bits of the pattern language and not to the literal character. These are the metacharacters to watch out for in patterns:
^
$
.
+
?
*
{
(
)
\
/
|
[
If you want to actually match a metacharacter in a string—for example, search for an actual question mark—you can escape it using a backslash, just as you would in a regular string:
/\?/ # matches question mark

Matching at Word or Line Boundaries
When you create a pattern to match a sequence of characters, those characters can appear anywhere inside the string and the pattern will still match. But sometimes you want a pattern to match those characters only if they occur at a specific place—for example, match /if/ only when it’s a whole word, or /kazoo/ only if it occurs at the start of the line (that is, the beginning of the string).
Note: I'm making an assumption here that the data you're searching is a line of input, where the line is a single string with no embedded newline characters. Given that assumption, the terms string, line, and data are effectively interchangeable. Tomorrow, we'll talk about how patterns deal with newlines.
To match a pattern at a specific position, you use pattern anchors. To anchor a pattern at the start of the string, use ^:
/^Kazoo/ # match only if Kazoo occurs at the start of the line
To match at the end of the string, use $:
/end$/ # match only if end occurs at the end of the line
Once again, think of the pattern as a sequence of things in which each part of the pattern must match the data you're applying it to. The pattern matching routines in Perl actually begin searching at a position just before the first character, which will match ^. They then move to each character in turn until the end of the line, where $ matches. If there's a newline at the end of the string, $ can also match at the position just before that newline character.
So, for example, let's see what happens when you try to match the pattern /^foo/ to the string "to be or not to be" (which, obviously, won't match, but let's try it anyhow). Perl starts at the beginning of the line, which matches the ^ character. That part of the pattern is true. It then tests the first character. The pattern wants to see an f there, but it got a t instead, so the pattern stops and returns false.
What happens if you try to apply the pattern to the string "fob"? The match will get farther—it'll match the start of the line, the f and the o, but then fail at the b. And keep in mind that /^foo/ will not match in the string " foo"—the foo is not at the very start of the line where the pattern expects it to be. It will only match when all four parts of the pattern match the string.
Some interesting but potentially tricky uses of ^ and $—can you guess what these patterns will match?
/^/
/^1$/
/^$/
The first pattern matches any string that has a start of the line. It would be a very weird string indeed that didn't have the start of a line, so this pattern will match any string data whatsoever, even the empty string.
The second one wants to find the start of the line, the numeral 1, and then the end of the line. So it'll only match if the string contains 1 and only 1—it won't match "123" or "foo 1" or even " 1 ".
The third pattern will match only if the start of the line is immediately followed by the end of the line—that is, if there is no actual data. This pattern will only match an empty line. Keep in mind that because $ occurs just before the newline character, this last pattern will match both "" and "\n".
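If you'd like to check those three answers yourself, a sketch along these lines should do it. The matching helper and the sample list are assumptions made for the illustration, and qr// is used only to pass each pattern to the helper.

```perl
use strict;
use warnings;

# Illustrative helper: return the candidate strings that a pattern matches.
sub matching {
    my ($pattern, @candidates) = @_;
    return grep { $_ =~ $pattern } @candidates;
}

# Sample strings, including the empty string and a lone newline.
my @samples = ('', "\n", '1', "1\n", '123', 'foo 1', ' 1 ');

my @any   = matching(qr/^/,   @samples);   # every string has a start
my @one   = matching(qr/^1$/, @samples);   # just 1, with or without a final newline
my @empty = matching(qr/^$/,  @samples);   # nothing between start and end

printf "start-only pattern matched %d of %d strings\n", scalar @any, scalar @samples;
printf "exactly-1 pattern matched %d strings\n", scalar @one;
printf "empty-line pattern matched %d strings\n", scalar @empty;
```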
Another boundary to match is a word boundary—where a word boundary is considered the position between a word character (a letter, number, or underscore) and some other character such as whitespace or punctuation. A word boundary is indicated using a \b escape. So /\bif\b/ will match only when the whole word "if" exists in the string—but not when the characters i and f appear in the middle of a word (as in "difference."). You can use \b to refer to both the start and end of a word; /\bif/, for example, will match in both "if I were king" and "that result is iffy," and even in "As if!", but not in "bomb the aquifer" or "the serif is obtuse."
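The following sketch runs the \b patterns against the sample phrases from the paragraph above. The array names are just for illustration.

```perl
use strict;
use warnings;

# Sample phrases taken from the text above.
my @phrases = (
    'if I were king',
    'that result is iffy',
    'As if!',
    'bomb the aquifer',
    'the serif is obtuse',
);

# \bif matches "if" only where it begins a word.
my @starts_word = grep { /\bif/ } @phrases;

# \bif\b matches "if" only as a whole word.
my @whole_word = grep { /\bif\b/ } @phrases;

print "start of word: $_\n" for @starts_word;
print "whole word:    $_\n" for @whole_word;
```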
You can also search for a pattern that is not at a word boundary using the \B escape. With this, /\Bif/ will match only when the characters i and f occur inside a word and not at the start of a word.

Matching Alternatives
Sometimes, when you're building a pattern, you may want to search for more than one pattern in the same string and then test based on whether all the patterns were found, or perhaps any of the set of patterns was found. You could, of course, do this with the regular Perl logical expressions for boolean AND (&& or and) and OR (|| or or) with multiple pattern-matching expressions, something like this:
if (($in =~ /this/) || ($in =~ /that/)) { ...
Then, if the string contains /this/ or if it contains /that/, the whole test will return true.
In the case of an OR search (match this pattern or that pattern—either one will work), however, there is a regular expression metacharacter you can use: the pipe character (|). So, for example, the long if test in that example could just be written as:
if ($in =~ /this|that/) { ...
Using the | character inside a pattern is officially known as alternation because it allows you to match alternate patterns. A true value for the pattern occurs if any of the alternatives match.
Any anchoring characters you use with an alternation character apply only to the pattern on the same side of the pipe. So, for example, the pattern /^this|that/ means "this at the start of the line" or "that anywhere," and not "either this or that at the start of a line." If you wanted the latter form you could use /^this|^that/, but a better way is to group your patterns using parentheses:
/^(this|that)/
For this pattern, Perl first matches the start of the line, and then tries to match all the characters in "this." If it can't match "this", it'll then back up to the start of the line and try to match "that." For a pattern like /^this|that/, it'll first try to match everything on the left side of the pipe (start of line, followed by this), and if it can't do that, it'll back up and search the entire string for "that".
An even better version would be to group only the things that are different between the two patterns, not just the ^ to match the beginning of the line, but also the th characters, like this:
/^th(is|at)/
This last version means that Perl won't even try the alternation unless th has already been matched at the start of the line, and then there will be a minimum of backing up to match the pattern. With regular expressions, the less work Perl has to do to match something, the better.
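Here is a minimal sketch of how the anchor interacts with alternation. The input line is invented for the example: it contains "that", but not at the start.

```perl
use strict;
use warnings;

my $line = 'not this, but that';   # illustrative input

# ^ applies only to "this", so "that" anywhere in the line is enough.
my $loose = ($line =~ /^this|that/) ? 1 : 0;

# Grouping makes ^ apply to both alternatives, so nothing matches here.
my $grouped = ($line =~ /^(this|that)/) ? 1 : 0;

# Factoring out the shared "th" gives the same answer as the grouped form.
my $factored = ($line =~ /^th(is|at)/) ? 1 : 0;

print "loose=$loose grouped=$grouped factored=$factored\n";
```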
You can use grouping for any kinds of alternation within a pattern. For example, /(1st|2nd|3rd|4th) time/ will match "1st time", "2nd time", and so on, as long as the data contains one of the alternations inside the parentheses and the string " time" (note the space).

Matching Groups of Characters

So far, so good? The regular expressions we've been building so far shouldn't strike you as being that complex, particularly if you look at each pattern in the way that Perl looks at it, character by character and alternate by alternate, taking grouping into effect. Now we're going to start looking at some of the shortcuts that regular expressions provide for describing and grouping various kinds of characters.

Character Classes
Say you had a string, and you wanted to match one of five words in that string: pet, get, met, set, and bet. You could do this:
/pet|get|met|set|bet/
That would work. At each position in the string, Perl would try pet, then get, then met, and so on, until one of them matched. A shorter way, both in the number of characters for you to type and in the work for Perl, would be to group characters so that we don't duplicate the et part each time:
/(p|g|m|s|b)et/
In this case, Perl searches through the entire string for p, g, m, s, or b, and if it finds one of those, it'll try to match et just after it. Much more efficient!
This sort of pattern, where you have lots of alternates of single characters, is such a common case that there's regular expression syntax for it. The set of alternating characters is called a character class, and you enclose it inside brackets. So, for example, that same pet/get/met pattern would look like this using a character class:
/[pgmsb]et/
That's a savings of at least a couple of characters, and it’s even slightly easier to read. Perl will do the same thing as the alternation character, in this case: it'll look for any of the characters inside the character class before testing any of the characters outside it.
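As a quick check that the three spellings really select the same words, here is a small sketch. The word list is invented for the example; it includes two words (net and jetty) that none of the patterns should accept.

```perl
use strict;
use warnings;

my @words = qw(pet get met set bet net jetty);   # illustrative word list

my @by_alternation = grep { /pet|get|met|set|bet/ } @words;
my @by_grouping    = grep { /(p|g|m|s|b)et/ }       @words;
my @by_class       = grep { /[pgmsb]et/ }           @words;

print "alternation: @by_alternation\n";
print "grouping:    @by_grouping\n";
print "class:       @by_class\n";
```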
The rules for the characters that can appear inside a character class are different from those that can appear outside of one. Most of the metacharacters become plain ordinary characters inside a character class. The exceptions are a right bracket, which needs to be escaped for obvious reasons; a caret (^), which can't appear first; a hyphen, which has a special meaning inside a character class; and the backslash itself. So, for example, a pattern to match on punctuation at the end of a sentence (punctuation after a word boundary and before two spaces) might look like this:
/\b[.!?]  /
Whereas . and ? have special meanings outside the character class, here they're plain old characters.

Ranges
What if you wanted to match, say, all the lowercase characters a through f (as you might in a hexadecimal number, for example). You could do:
/[abcdef]/
Looks like a job for a range, doesn't it? You can do ranges inside character classes, but you don't use the range operator .. that you learned about on Day 4. Regular expressions use a hyphen for ranges instead (which is why you have to backslash it if you actually want to match a hyphen). So, for example, lowercase a through f looks like this:
/[a-f]/
You can use any range of numbers or characters, as in /[0-9]/, /[a-z]/ or /[A-Z]/. You can even combine them: /[0-9a-z]/ will match the same thing as /[0123456789abcdefghijklmnopqrstuvwxyz]/.

Negated Character Classes
Brackets define a class of characters to match in a pattern. You can also define a set of characters not to match using negated character classes—just make sure the first character in your character class is a caret (^). So, for example, to match anything that isn't an A or a B, use:
/[^AB]/
Note that the caret inside a character class is not the same as the caret outside one. The former is used to create a negated character class, and the latter is used to mean the beginning of a line.
If you want to actually search for the caret character inside a character class, you're welcome to—just make sure it’s not the first character or escape it (it might be best just to escape it either way to cut down on the rules you have to keep track of):
/[\^?.%]/ # search for ^, ?, ., %
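One thing worth trying for yourself is the difference between a negated character class and a negated match. The sketch below uses made-up sample strings and assumes the ! (logical not) operator in front of a pattern, which asks a different question from [^...].

```perl
use strict;
use warnings;

my @samples = ('', '12', '1221', '123', 'abc');   # illustrative strings

# /[^12]/ needs at least one character that is neither 1 nor 2.
my @has_other = grep { /[^12]/ } @samples;

# Negating the whole match asks instead: "does this contain no 1 or 2 at all?"
my @lacks_12 = grep { !/[12]/ } @samples;

# An escaped caret inside a class is a literal caret.
my $has_caret = ('2^8' =~ /[\^?.%]/) ? 1 : 0;

print 'has some other character: ', join(', ', map { "'$_'" } @has_other), "\n";
print 'contains no 1 or 2:       ', join(', ', map { "'$_'" } @lacks_12),  "\n";
print "caret found: $has_caret\n";
```

Note that the empty string shows up only in the second list: a negated class still has to match an actual character.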
You'll most likely end up using a lot of negated character classes in your regular expressions, so keep this syntax in mind. Note one subtlety: negated character classes don't negate the entire value of the pattern. If /[12]/ means "return true if the data contains 1 or 2", /[^12]/ does not mean "return true if the data doesn't contain 1 or 2." If that were the case, you'd get a match even if the string in question was empty. What negated character classes really mean is "match any character that's not these characters." There must be at least one actual character to match for a negated character class to work.

Special Classes
If character class ranges are still too much for you to type, there are also special character classes (and negated character classes) that have their own escape codes. You'll see these a lot in regular expressions, particularly those that match numbers in specific formats. Note that these special codes don't need to be enclosed between brackets; you can use them all by themselves to refer to that class of characters.
Code   Equivalent character class   What it means
\d     [0-9]                        Any digit
\D     [^0-9]                       Any character not a digit
\w     [0-9a-zA-Z_]                 Any "word character"
\W     [^0-9a-zA-Z_]                Any character not a word character
Word characters (\w and \W) are a bit mystifying: why is an underscore considered a word character, but punctuation isn't? In reality, word characters have little to do with words, but are the valid characters you can use in variable names: numbers, letters, and underscores. Any other characters are not considered word characters.
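To make the word-character rule concrete, here is a minimal sketch that counts the \w characters in a string. The count_word_chars helper is hypothetical, and the inputs are plain ASCII strings invented for the example.

```perl
use strict;
use warnings;

# Illustrative helper: count how many characters of a string match \w.
sub count_word_chars {
    my ($string) = @_;
    my $count = 0;
    for my $char (split //, $string) {   # look at one character at a time
        $count++ if $char =~ /\w/;
    }
    return $count;
}

# Letters, digits, and the underscore all count as word characters.
print count_word_chars('my_var2'), "\n";

# The hyphen, the space, and the exclamation point do not.
print count_word_chars('a-b c!'), "\n";
```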
You can use these character codes anywhere you need a specific type of character. For example, the \d code refers to any digit. With \d, you could create patterns that match any three digits, /\d\d\d/, or, perhaps, any three digits, a dash, and any four digits, to represent a phone number such as 555-1212: /\d\d\d-\d\d\d\d/. All this repetition isn't necessarily the best way to go, however, as you'll learn in a bit when we cover quantifiers.

Matching Any Character with . (dot)
The broadest possible character class you can get is to match based on any character whatsoever. For that, you'd use the dot character (.). So, for example, the following pattern will match lines that contain one character and one character only:
/^.$/
You'll use the dot more often in patterns with quantifiers (which you'll learn about next), but the dot can be used to indicate fields of a certain width, for example:
/^..:/
This pattern will match only if the line starts with two characters and a colon.
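A short sketch of the two dot patterns above, run against a handful of made-up lines:

```perl
use strict;
use warnings;

my @lines = ('x', 'xy', 'ab:cd', 'abc:d', ':x', '');   # illustrative lines

# Exactly one character between the start and the end of the line.
my @single = grep { /^.$/ } @lines;

# Any two characters at the start of the line, then a colon.
my @two_then_colon = grep { /^..:/ } @lines;

print "one character:        @single\n";
print "two chars and colon:  @two_then_colon\n";
```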
More about the dot operator after we pause for an example.

An Example: Optimizing Numspeller
Remember the numspeller script from yesterday? This was the script that took a single-digit number and converted it into a word. You may remember that when I described the numspeller script, I mentioned it was easier to write using regular expressions. So, now that you know something of regular expressions, let's rewrite the script to use regular expressions instead of all those if statements.
And, while we're at it, why don't we revise the part of numspeller that verifies the input? We can do a lot more in terms of input validation with regular expressions, to the point of absurdity. In fact, we'll approach absurdity with the input validation in this script. This version tests for a number of things that could be entered, and replies with various comments (many of them sarcastic):
code:
% numspeller2.pl
Enter the number you want to spell(0-9): foo
You can't fool me. There are letters in there.
Enter the number you want to spell(0-9): 45foo
You can't fool me. There are letters in there.
Enter the number you want to spell(0-9): ###
huh? That *really* doesn't look like a number
Enter the number you want to spell(0-9): -45
That's a negative number. Positive only, please!
Enter the number you want to spell(0-9): 789
Too big! 0 through 9, please.
Enter the number you want to spell(0-9): 4
Thanks!
Number 4 is four
Try another number (y/n)?: x
y or n, please
Try another number (y/n)?: n
%
Instead of showing you this script and then working through it line by line, let's go in the reverse direction: I'm going to show you sections from both the old and new versions of numspeller, explain them, and then at the end, I'll list the whole thing so you can get the big picture.
Let's start with the loop that accepts a number as input. This is what the loop looked like in the old version of numspeller:
code:
while () {
    print 'Enter the number you want to spell: ';
    chomp($num = <STDIN>);
    if ($num gt "9" ) { # test for strings
        print "No strings. 0 through 9 please..\n";
        next;
    }
    if ($num > 9) { # numbers w/more than 1 digit
        print "Too big. 0 through 9 please.\n";
        next;
    }
    if ($num < 0) { # negative numbers
        print "No negative numbers. 0 through 9 please.\n";
        next;
    }
    last;
}
We can easily replace the three tests in this loop with regular expressions that make more sense—and we can also test for more sophisticated kinds of things. Our new loop will test for three major groups of things:
Whether the input contains a single digit and only a single digit (in which case we're done)
Whether the input contains any characters other than numbers
Whether the number is larger than 9
That second test can then be broken into sub-tests for things like alphabetic characters, negative numbers (starting with -), floating-point numbers (with a decimal point), or totally bizarre characters. Here's the new version of our loop, which also makes use of the $_ variable to save us some typing in the pattern matching tests:
code:
1: while () {
2:     print 'Enter the number you want to spell(0-9): ';
3:     chomp($_ = <STDIN>);
4:     if (/^\d$/) { # correct input
5:         print "Thanks!\n";
6:         last;
7:     } elsif (/^$/) {
8:         print "You didn't enter anything.\n";
9:     } elsif (/\D/) { # nonnumbers
10:         if (/[a-zA-Z]/) { # letters
11:             print "You can't fool me. There are letters in there.\n";
12:         } elsif (/^-\d/) { # negative numbers
13:             print "That's a negative number. Positive only, please!\n";
14:         } elsif (/\./) { # decimals
15:             print "That looks like it could be a floating-point number.\n";
16:             print "I can't spell a floating-point number. Try again.\n";
17:         } elsif (/[\W_]/) { # other chars
18:             print "huh? That *really* doesn't look like a number\n";
19:         }
20:     } elsif ($_ > 9) {
21:         print "Too big! 0 through 9, please.\n";
22:     }
23: }
Let's look at those regular expressions, line by line, so you know what's getting matched here:
Line 4: /^\d$/ This pattern matches input with a single digit, and only a single digit; that is, it matches exactly the input we want to match. I stuck it up here at the top because if the user does enter the right value, we don't want to spend a lot of time cycling through all the options to figure out that they were right. This way, given this very specific match, if we get the correct input we can exit right out of the loop with last.
Line 7: /^$/ As you learned in the section on matching at boundaries, this pattern matches an empty line, which is just what you get here if you hit return at the prompt without entering anything.
Line 9: /\D/ This character code means "any character other than a digit." If you type anything at the prompt that isn't a number (a mixture of numbers and letters, all letters, or characters like -, ., or $), this pattern will match. This branches into a number of sub-tests for the specific non-number characters that got entered.
Line 10: /[a-zA-Z]/ These character class ranges look for actual characters from the alphabet. I didn't use the \w code here because that would have included the underscore, and I want to group the underscore into the any-other-character test instead.
Line 12: /^-\d/ Here we're testing for a dash at the start of the line, immediately followed by a digit. This is the test for negative numbers.
Line 14: /\./ Input containing a decimal point is probably a floating-point number. Note here that because . is a metacharacter for the pattern, we have to escape it to match an actual dot.
Line 17: /[\W_]/ Here we use a character class of two things: any character that's not a word character (0-9, a-z, A-Z, or the underscore), plus the underscore itself. This is the catch-all for all other characters that might have been entered.
Line 20: No pattern here; this line catches input that is a number (so it won't get caught by most of the previous tests), but is a number with more than one digit. Here we'll just test the value to see if it's bigger than 9 to catch those cases. There is actually a pattern that will match this, but you haven't learned it yet. This test works just as well.
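If you want to experiment with these tests without typing at a prompt, one option is to wrap them in a subroutine that returns a label instead of printing a message. This is only a sketch: classify_input and its labels are hypothetical names, not part of numspeller, and the input is assumed to be already chomped.

```perl
use strict;
use warnings;

# Hypothetical helper mirroring the order of the tests in the loop above.
sub classify_input {
    my ($input) = @_;
    for ($input) {                        # alias $_ so the bare patterns work
        return 'ok'    if /^\d$/;         # a single digit and nothing else
        return 'empty' if /^$/;           # nothing was entered
        if (/\D/) {                       # at least one non-digit character
            return 'letters'  if /[a-zA-Z]/;
            return 'negative' if /^-\d/;
            return 'decimal'  if /\./;
            return 'other';               # any remaining non-digit is \W or _
        }
        # Only digits are left at this point, so a numeric test is safe.
        return $_ > 9 ? 'too big' : 'unhandled';
    }
}

# Inputs taken from the sample session shown earlier.
for my $try ('foo', '45foo', '###', '-45', '789', '4') {
    printf "%-6s => %s\n", $try, classify_input($try);
}
```

Returning a label keeps the pattern logic separate from the prompting, which makes it easy to feed the subroutine a list of awkward inputs.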
The next part of the old numspeller script was a set of if...elsif statements that compared the input value to each number in turn. Using regular expressions, the default variable $_, and logical expressions used as conditionals, we can reduce the chained ifs that looked like this:
code:
if ($num == 1) { print 'one'; }
elsif ($num == 2) { print 'two'; }
elsif ($num == 3) { print 'three'; }
elsif ($num == 4) { print 'four'; }
# ... other numbers removed for space
}
Into a set of logicals that look like this:
code:
/1/ && print 'one';
/2/ && print 'two';
/3/ && print 'three';
/4/ && print 'four';
# ... and so on
Cool, eh? It’s almost switch-like, and, arguably, easier to read.
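Here is a self-contained sketch of that idiom. The spell_digit subroutine is a hypothetical wrapper that stores the word rather than printing it, and it assumes its argument has already been validated as a single digit (otherwise a pattern like /1/ would also match 10 or 21).

```perl
use strict;
use warnings;

# Hypothetical wrapper around the "pattern && action" idiom.
sub spell_digit {
    my ($digit) = @_;
    my $word = '';
    for ($digit) {                  # alias $_ so the bare patterns test $digit
        /0/ && ($word = 'zero');
        /1/ && ($word = 'one');
        /2/ && ($word = 'two');
        /3/ && ($word = 'three');
        /4/ && ($word = 'four');
        /5/ && ($word = 'five');
        /6/ && ($word = 'six');
        /7/ && ($word = 'seven');
        /8/ && ($word = 'eight');
        /9/ && ($word = 'nine');
    }
    return $word;
}

print 'Number 4 is ', spell_digit('4'), "\n";
```

The right side of each && runs only when the pattern on the left matches. Unlike an if...elsif chain, every pattern is still tested even after one has matched.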
Finally, we'll rewrite our little yes-or-no loop to repeat the entire script. The old version looked like this:
code:
while () {
    print 'Try another number (y/n)?: ';
    chomp ($exit = <STDIN>);
    $exit = lc $exit;
    if ($exit ne 'y' && $exit ne 'n') {
        print "y or n, please\n";
    }
    else { last; }
}
There's actually nothing terribly wrong with this version, but since this is the pattern matching lesson, let's use pattern matching here, too:
code:
while () {
    print 'Try another number (y/n)?: ';
    chomp ($exit = <STDIN>);
    $exit = lc $exit;
    if ($exit =~ /^[yn]/) {
        last;
    }
    else {
        print "y or n, please\n";
    }
}
Note the differences between this loop and the input loop. In the input loop, we stored the input in the $_ variable, so we could just put the pattern into the test itself. Here we're matching against the string in the $exit variable, so we have to use the =~ operator instead. In the pattern itself, we test to see if what was typed starts with either y or n (Y and N will get converted to lowercase with the lc function), and if so, exit the loop and return to the outer loop, which repeats the script if necessary.
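Because the pattern is anchored only at the start, it accepts longer answers as well. The sketch below compares it with a stricter variant that also anchors the end; both subroutine names are hypothetical, and the stricter form is an assumption about what you might prefer, not something the script uses.

```perl
use strict;
use warnings;

# Mirrors the test in the loop above: lowercase first, then check the start.
sub accepts_loose {
    my ($answer) = @_;
    return lc($answer) =~ /^[yn]/ ? 1 : 0;
}

# Stricter variant (an assumption, not the script's test): exactly y or n.
sub accepts_strict {
    my ($answer) = @_;
    return lc($answer) =~ /^[yn]$/ ? 1 : 0;
}

for my $answer ('y', 'N', 'x', 'yes') {
    printf "%-3s loose=%d strict=%d\n",
        $answer, accepts_loose($answer), accepts_strict($answer);
}
```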
Note: In this example, I've used quite a few regular expressions, many of them gratuitous. It's worth mentioning at this point that you shouldn't necessarily use regular expressions everywhere simply because they're cool. The Perl regular expression engine is really powerful for really powerful things, but there is some overhead in terms of efficiency if you use it for simple things. Simple tests and if statements will often execute faster than regular expressions. If you're concerned about the efficiency of your code, keep that in mind.
[code]#!/usr/bin/perl
By GentleGiant