Geeks and Bloggers Portal

The Whys and Wherefores of Pattern Matching

Published on 01/23/12 at 11:46:50 EST by GentleGiant

Pattern matching is more than just searching for some set of characters in your data; it’s a way of looking at data and processing that data in a manner that can be incredibly efficient and amazingly easy to program.

Pattern matching is the technique of searching a string containing text or binary data for some set of characters based on a specific search pattern. When you search for a string of characters in a file using the Find command in your word processor, or when you use a search engine to look for something on the Web, you're using a simple version of pattern matching: your criterion is "find these characters." In those environments, you can often customize your criteria in particular ways, for example, to search for this or that, to search for this or that but not the other thing, to search for whole words only, or to search only for those words that are 12 points and underlined. Pattern matching in Perl, however, can be even more sophisticated than that. Using Perl, you can define a very specific set of search criteria, and do it in a very small amount of space, using a pattern-definition mini-language called regular expressions.

Perl's regular expressions, often called just regexes or REs, borrow from the regular expressions used in many Unix tools, such as grep(1) and sed(1). As with many other features Perl has borrowed from other places, however, Perl includes slight changes and lots of added capabilities. If you're used to using regular expressions, you'll be able to pick up Perl's regular expressions fairly easily, since most of the same rules apply (although there are some gotchas to be aware of, particularly if you've used sophisticated regular expressions in the past).

Note: The term regular expressions may seem sort of nonsensical. They don't really seem to be expressions, nor is it easy to figure out what's regular about them. Don't get hung up on the term itself; regular expression is a term borrowed from mathematics that refers to the actual language with which you write patterns for pattern matching in Perl.

I used the example of the search engine and the Find command earlier to describe the sorts of things that pattern matching can do. It’s important for you not to get hung up on thinking that pattern matching is only good for plain old searching. The sorts of things regular expressions can do in Perl include:

Making sure your user has entered the data you're looking for—input validation
Verifying that input is in the right specific format, for example, that email addresses have the right components
Extracting parts of a file that match specific criteria (for example, you could extract the headings from a file to build a table of contents, or extract all the links in an HTML file)
Splitting a string into elements based on different separator fields (and often, complex nested separator fields)
Finding irregularities in a set of data—multiple spaces that don't belong there, duplicated words, errors in formatting
Counting the number of occurrences of a pattern in a string
Searching and replacing—find a string that matches a pattern and replace it with some other string

This is only a partial list, of course—you can apply Perl's regular expressions to all kinds of tasks. Generally, if there's a task for which you'd want to iterate over a string or over your data in another language, that task is probably better solved in Perl using regular expressions. Many of the operations you learned about yesterday for finding bits of strings can be better done with patterns.
Pattern Matching Operators and Expressions

To use pattern matching in Perl, you figure out what you want to find, you write a regular expression to find it, and then you stick that pattern in a situation where the result of finding (or not finding) that pattern makes sense. As with other aspects of Perl, where you put a pattern and what context you use it in determines how that pattern is used.

We'll start with a fairly simple case—patterns in a boolean scalar context, where if a string contains the pattern, the expression returns true.

To construct patterns in this way, you use two operators: the regular expression operator m// and the pattern-match operator =~, like this:

if ($string =~ m/foo/) {
# do something...
}

What that test inside the if says is: if the string contained in $string contains the pattern foo, return true. Note that the =~ operator is not an assignment operator, even though it looks like one. =~ is used exclusively for pattern matching, and means, effectively, "find the pattern on the right somewhere in the string on the left." You'll sometimes find =~ called the binding operator.

The pattern itself is contained between the slashes in m//. This particular pattern is one of the simplest patterns you can create—it’s just three specific characters in sequence (you'll learn more about what constitutes a match and what doesn't later on). The pattern could just as easily be m/.*\d+/ or m/^[+-]?\d+\.?\d*$/ or some other seemingly incomprehensible set of characters (don't panic yet; you'll learn how to decipher those patterns soon).

For these sorts of patterns, the m is optional and can be left off the pattern itself (and usually is). In addition, you can leave off the variable and the =~ if you want to search the contents of the default variable $_. Commonly in Perl, you'll see shorthand pattern matching like this one:

if (/^\d+/) { # ...

Which is equivalent to

if ($_ =~ m/^\d+/) { # ...

You've already learned a simple case of this yesterday with the grep function, which can use patterns to find a bit of a string inside the $_ list element:

@foothings = grep /foo/, @strings;

That line, in turn, is equivalent to this long form:

@foothings = grep { $_ =~ /foo/ } @strings;
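
As a minimal sketch of the forms shown so far (the explicit =~ binding, the $_ shorthand, and grep), the following script can be run as-is. The helper names has_foo and foo_things and the sample strings are made up for illustration; they are not part of any script in this lesson.

```perl
use strict;
use warnings;

# Hypothetical helper: returns 1 if the argument contains "foo" anywhere.
sub has_foo {
    my ($string) = @_;
    return $string =~ m/foo/ ? 1 : 0;
}

# Hypothetical helper: the grep shorthand wrapped in a sub.
sub foo_things {
    my @strings = @_;
    return grep /foo/, @strings;
}

my @strings = ('food court', 'bar', '42 foos', 'FOO');

print "has_foo('$_'): ", has_foo($_), "\n" for @strings;
print "grep found: ", join(', ', foo_things(@strings)), "\n";

# The $_ shorthand also works in an ordinary loop over a list.
for (@strings) {
    print "'$_' starts with a digit\n" if /^\d+/;
}
```

Note that 'FOO' is not found by either form, because matching is case-sensitive unless you ask otherwise.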

As we work through today's lesson, you'll learn different ways of using patterns in different contexts and for different reasons. Much of the work of learning pattern matching, however, involves actually learning the regular expression syntax to build patterns, so let's stick with this one situation for now.
Simple Patterns

We'll start with some of the most simple and basic patterns you can create: patterns that match specific sequences of characters, patterns that match only at specific places in a string, or combining patterns using what's called alternation.
Character Sequences

One of the simplest patterns is just a sequence of characters you want to match, like this:

/foo/

/this or that/

/ /

/Laura/

/patterns that match specific sequences/

All of these patterns will match if the data contains those characters in that order. All the characters must match, including spaces. The word or in the second pattern doesn't have any special significance (it’s not a logical or); that pattern will only match if the data contains the string this or that somewhere inside it.

Note that characters in patterns can be matched anywhere in a string. Word boundaries are not relevant for these patterns—the pattern /if/ will match in the string "if wishes were horses" and in the string "there is no difference." The pattern /if /, however, because it contains a space, will only match in the first string where the characters i, f, and the one space occur in that order.

Upper- and lowercase are relevant for characters: /kazoo/ will only match kazoo and not Kazoo or KAZOO. To make a particular search case-insensitive, you can use the i option after the pattern itself (the i indicates ignore case), like this:

/kazoo/i # search for any upper and lowercase versions

Alternatively, you can create patterns that will search for either upper- or lowercase letters, as you'll learn about in the next section.
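
The behavior described above (characters match anywhere, spaces count, and case matters unless you add the i option) can be checked with a small sketch like this one. The helper name matches is hypothetical, and the third sample string is invented for the example.

```perl
use strict;
use warnings;

# Hypothetical helper: report which of several simple patterns match a string.
sub matches {
    my ($string) = @_;
    my @hits;
    push @hits, '/if/'     if $string =~ /if/;
    push @hits, '/if /'    if $string =~ /if /;     # note the trailing space
    push @hits, '/kazoo/'  if $string =~ /kazoo/;
    push @hits, '/kazoo/i' if $string =~ /kazoo/i;  # ignore case
    return @hits;
}

for my $s ('if wishes were horses', 'there is no difference', 'My KAZOO') {
    print "$s => ", join(' ', matches($s)), "\n";
}
```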

You can include most alphanumeric characters in patterns, including string escapes for binary data (octal and hex escapes). There are a number of characters that you cannot match without escaping them. These characters are called metacharacters and refer to bits of the pattern language and not to the literal character. These are the metacharacters to watch out for in patterns:

^  $  .  +  ?  *  {  (  )  \  /  |  [


If you want to actually match a metacharacter in a string—for example, search for an actual question mark—you can escape it using a backslash, just as you would in a regular string:

/\?/ # matches question mark
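
Here is a short sketch of escaping in practice. Besides the backslash described above, it uses two standard Perl facilities that this lesson has not covered: the quotemeta function and the \Q...\E escape, both of which backslash the metacharacters in a string for you. The helper name contains_literal is hypothetical.

```perl
use strict;
use warnings;

# Hypothetical helper: returns 1 if $text contains $literal as plain
# characters, with any metacharacters in $literal escaped by \Q...\E.
sub contains_literal {
    my ($text, $literal) = @_;
    return $text =~ /\Q$literal\E/ ? 1 : 0;
}

print "escaped ?:   ", ('really?' =~ /\?/   ? 'match' : 'no match'), "\n";
print "unescaped .: ", ('abc'     =~ /a.c/  ? 'match' : 'no match'), "\n";
print "escaped .:   ", ('abc'     =~ /a\.c/ ? 'match' : 'no match'), "\n";
print "quotemeta:   ", quotemeta('1+1=2?'), "\n";
print "literal:     ", contains_literal('is 1+1=2? yes', '1+1=2?'), "\n";
```

The unescaped dot matches the b in "abc" because it stands for any character; the escaped dot does not, because it now means a literal period.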
Matching at Word or Line Boundaries

When you create a pattern to match a sequence of characters, those characters can appear anywhere inside the string and the pattern will still match. But sometimes you want a pattern to match those characters only if they occur at a specific place—for example, match /if/ only when it’s a whole word, or /kazoo/ only if it occurs at the start of the line (that is, the beginning of the string).

Note: I'm making an assumption here that the data you're searching is a line of input, where the line is a single string with no embedded newline characters. Given that assumption, the terms string, line, and data are effectively interchangeable. Tomorrow, we'll talk about how patterns deal with newlines.

To match a pattern at a specific position, you use pattern anchors. To anchor a pattern at the start of the string, use ^:

/^Kazoo/ # match only if Kazoo occurs at the start of the line

To match at the end of the string, use $:

/end$/ # match only if end occurs at the end of the line

Once again, think of the pattern as a sequence of things in which each part of the pattern must match the data you're applying it to. The pattern matching routines in Perl actually begin searching at a position just before the first character, which will match ^. Then it moves to each character in turn until the end of the line, where $ matches. If there's a newline at the end of the string, the position marked by $ is just before that newline character.

So, for example, let's see what happens when you try to match the pattern /^foo/ to the string "to be or not to be" (which, obviously, won't match, but let's try it anyhow). Perl starts at the beginning of the line, which matches the ^ character. That part of the pattern is true. It then tests the first character. The pattern wants to see an f there, but it got a t instead, so the pattern stops and returns false.

What happens if you try to apply the pattern to the string "fob"? The match will get farther—it'll match the start of the line, the f and the o, but then fail at the b. And keep in mind that /^foo/ will not match in the string " foo"—the foo is not at the very start of the line where the pattern expects it to be. It will only match when all four parts of the pattern match the string.

Some interesting but potentially tricky uses of ^ and $—can you guess what these patterns will match?

/^/

/^1$/

/^$/

The first pattern matches any strings that have a start of the line. It would be very weird strings indeed that didn't have the start of a line, so this pattern will match any string data whatsoever, even the empty string.

The second one wants to find the start of the line, the numeral 1, and then the end of the line. So it'll only match if the string contains 1 and only 1—it won't match "123" or "foo 1" or even " 1 ".

The third pattern will match only if the start of the line is immediately followed by the end of the line—that is, if there is no actual data. This pattern will only match an empty line. Keep in mind that because $ occurs just before the newline character, this last pattern will match both "" and "\n".
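
To see the anchors at work, here is a minimal sketch that tries the three puzzle patterns, plus /^foo/, against a handful of strings. The helper name anchor_hits and the sample strings are assumptions made for the example.

```perl
use strict;
use warnings;

# Hypothetical helper: list which of the anchored patterns match a string.
sub anchor_hits {
    my ($string) = @_;
    my @hits;
    push @hits, '/^/'    if $string =~ /^/;
    push @hits, '/^1$/'  if $string =~ /^1$/;
    push @hits, '/^$/'   if $string =~ /^$/;
    push @hits, '/^foo/' if $string =~ /^foo/;
    return @hits;
}

for my $s ('', "\n", '1', "1\n", '123', ' 1 ', 'foo bar', ' foo') {
    (my $shown = $s) =~ s/\n/\\n/g;   # make newlines visible when printing
    print "'$shown' => ", join(' ', anchor_hits($s)), "\n";
}
```

Both "" and "\n" are reported as matching /^$/, and "1\n" matches /^1$/, because $ matches just before a trailing newline.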

Another boundary to match is a word boundary—where a word boundary is considered the position between a word character (a letter, number, or underscore) and some other character such as whitespace or punctuation. A word boundary is indicated using a \b escape. So /\bif\b/ will match only when the whole word "if" exists in the string—but not when the characters i and f appear in the middle of a word (as in "difference."). You can use \b to refer to both the start and end of a word; /\bif/, for example, will match in both "if I were king" and "that result is iffy," and even in "As if!", but not in "bomb the aquifer" or "the serif is obtuse."

You can also search for a pattern not in a word boundary using the \B escape. With this, /\Bif/ will match only when the characters i and f occur inside a word and not at the start of a word.
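
The word-boundary examples above can be verified with a sketch along these lines. The helper name boundary_hits is hypothetical; the sample strings are the ones used in the text.

```perl
use strict;
use warnings;

# Hypothetical helper: list which boundary patterns match a string.
sub boundary_hits {
    my ($string) = @_;
    my @hits;
    push @hits, '/\bif\b/' if $string =~ /\bif\b/;  # "if" as a whole word
    push @hits, '/\bif/'   if $string =~ /\bif/;    # "if" at the start of a word
    push @hits, '/\Bif/'   if $string =~ /\Bif/;    # "if" not at the start of a word
    return @hits;
}

for my $s ('if I were king', 'that result is iffy', 'As if!',
           'bomb the aquifer', 'the serif is obtuse') {
    print "$s => ", join(' ', boundary_hits($s)), "\n";
}
```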
Matching Alternatives

Sometimes, when you're building a pattern, you may want to search for more than one pattern in the same string and then test based on whether all the patterns were found, or perhaps any of the set of patterns was found. You could, of course, do this with the regular Perl logical expressions for boolean AND (&& or and) and OR (|| or or) with multiple pattern-matching expressions, something like this:

if (($in =~ /this/) || ($in =~ /that/)) { ...

Then, if the string contains /this/ or if it contains /that/, the whole test will return true.

In the case of an OR search (match this pattern or that pattern—either one will work), however, there is a regular expression metacharacter you can use: the pipe character (|). So, for example, the long if test in that example could just be written as:

if ($in =~ /this|that/) { ...

Using the | character inside a pattern is officially known as alternation because it allows you to match alternate patterns. A true value for the pattern occurs if any of the alternatives match.

Any anchoring characters you use with an alternation character apply only to the pattern on the same side of the pipe. So, for example, the pattern /^this|that/ means "this at the start of the line" or "that anywhere," and not "either this or that at the start of a line." If you wanted the latter form you could use /^this|^that/, but a better way is to group your patterns using parentheses:

/^(this|that)/

For this pattern, Perl first matches the start of the line, and then tries to match all the characters in "this." If it can't match "this", it'll then back up to the start of the line and try to match "that." For a pattern like /^this|that/, it'll first try to match everything on the left side of the pipe (start of line, followed by this), and if it can't do that, it'll back up and search the entire string for "that".

An even better version would be to group only the things that are different between the two patterns, not just the ^ to match the beginning of the line, but also the th characters, like this:

/^th(is|at)/

This last version means that Perl won't even try the alternation unless th has already been matched at the start of the line, and then there will be a minimum of backing up to match the pattern. With regular expressions, the less work Perl has to do to match something, the better.

You can use grouping for any kinds of alternation within a pattern. For example, /(1st|2nd|3rd|4th) time/ will match "1st time", "2nd time", and so on—as long as the data contains one of the alternations inside the parentheses and the string " time" (note the space).
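
The difference between anchoring one alternative and anchoring the whole group is easy to miss, so here is a small sketch that compares the three forms discussed above. The helper name alt_hits and the sample strings are invented for the example.

```perl
use strict;
use warnings;

# Hypothetical helper: list which alternation patterns match a string.
sub alt_hits {
    my ($string) = @_;
    my @hits;
    push @hits, '/^this|that/'   if $string =~ /^this|that/;    # ^ applies to "this" only
    push @hits, '/^(this|that)/' if $string =~ /^(this|that)/;  # ^ applies to both
    push @hits, '/^th(is|at)/'   if $string =~ /^th(is|at)/;    # same matches, less backing up
    return @hits;
}

for my $s ('this is it', 'that is it', 'not that one', 'is this it') {
    print "'$s' => ", join(' ', alt_hits($s)), "\n";
}

print "ordinal matched\n" if 'the 3rd time' =~ /(1st|2nd|3rd|4th) time/;
```

Only the ungrouped pattern matches "not that one", because its "that" alternative is not tied to the start of the line.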
Matching Groups of Characters

So far, so good? The regular expressions we've been building so far shouldn't strike you as being that complex, particularly if you look at each pattern in the way that Perl looks at it, character by character and alternate by alternate, taking grouping into effect. Now we're going to start looking at some of the shortcuts that regular expressions provide for describing and grouping various kinds of characters.
Character Classes

Say you had a string, and you wanted to match one of five words in that string: pet, get, met, set, and bet. You could do this:

/pet|get|met|set|bet/

That would work. In effect, at each position in the string Perl would try pet, then get, then met, and so on, before moving on to the next position. A shorter way (both in the number of characters you type and in the work Perl does) would be to group characters so that we don't duplicate the et part each time:

/(p|g|m|s|b)et/

In this case, Perl searches through the entire string for p, g, m, s, or b, and if it finds one of those, it'll try to match et just after it. Much more efficient!

This sort of pattern, where you have lots of alternates of single characters, is such a common case that there's regular expression syntax for it. The set of alternating characters is called a character class, and you enclose it inside brackets. So, for example, that same pet/get/met pattern would look like this using a character class:

/[pgmsb]et/

That's a savings of at least a couple of characters, and it’s even slightly easier to read. Perl will do the same thing as the alternation character, in this case: it'll look for any of the characters inside the character class before testing any of the characters outside it.

The rules for the characters that can appear inside a character class are different from those that apply outside of one: most of the metacharacters become plain ordinary characters inside a character class (the exceptions being a right bracket, which needs to be escaped for obvious reasons; a caret (^), which can't appear first; and a hyphen, which has a special meaning inside a character class). So, for example, a pattern to match on punctuation at the end of a sentence (punctuation after a word boundary and before two spaces) might look like this:

/\b[.!?]  /

Whereas . and ? have special meanings outside the character class, here they're plain old characters.
Ranges

What if you wanted to match, say, all the lowercase characters a through f (as you might in a hexadecimal number, for example)? You could do:

/[abcdef]/

Looks like a job for a range, doesn't it? You can do ranges inside character classes, but you don't use the range operator .. that you learned about on Day 4. Regular expressions use a hyphen for ranges instead (which is why you have to backslash it if you actually want to match a hyphen). So, for example, lowercase a through f looks like this:

/[a-f]/

You can use any range of numbers or characters, as in /[0-9]/, /[a-z]/ or /[A-Z]/. You can even combine them: /[0-9a-z]/ will match the same thing as /[0123456789abcdefghijklmnopqrstuvwxyz]/.
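
As a rough sketch of character classes and ranges in use, the helpers below (all hypothetical names) wrap three patterns: the pet/get/met class, a two-character lowercase hex check built from ranges, and a class containing an escaped hyphen.

```perl
use strict;
use warnings;

# Hypothetical helpers; each returns 1 on a match and 0 otherwise.
sub rhymes_with_et    { return $_[0] =~ /[pgmsb]et/              ? 1 : 0 }
sub is_hex_byte       { return $_[0] =~ /^[0-9a-f][0-9a-f]$/     ? 1 : 0 }
sub has_a_hyphen_or_f { return $_[0] =~ /[a\-f]/                 ? 1 : 0 }

print "reset: ", rhymes_with_et('reset'), "\n";   # contains "set"
print "jet:   ", rhymes_with_et('jet'),   "\n";   # j is not in the class
print "ff:    ", is_hex_byte('ff'),       "\n";
print "7g:    ", is_hex_byte('7g'),       "\n";   # g is outside a-f
print "x-y:   ", has_a_hyphen_or_f('x-y'), "\n";  # literal hyphen
print "bcd:   ", has_a_hyphen_or_f('bcd'), "\n";  # not a range: b, c, d don't match
```

The last helper shows why the backslash matters: /[a\-f]/ matches only a, a hyphen, or f, whereas /[a-f]/ would match b, c, and d as well.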
Negated Character Classes

Brackets define a class of characters to match in a pattern. You can also define a set of characters not to match using negated character classes—just make sure the first character in your character class is a caret (^). So, for example, to match anything that isn't an A or a B, use:

/[^AB]/

Note that the caret inside a character class is not the same as the caret outside one. The former is used to create a negated character class, and the latter is used to mean the beginning of a line.

If you want to actually search for the caret character inside a character class, you're welcome to—just make sure it’s not the first character or escape it (it might be best just to escape it either way to cut down on the rules you have to keep track of):

/[\^?.%]/ # search for ^, ?, ., %

You'll most likely end up using a lot of negated character classes in your regular expressions, so keep this syntax in mind. Note one subtlety: negated character classes don't negate the entire value of the pattern. If /[12]/ means "return true if the data contains 1 or 2", /[^12]/ does not mean "return true if the data doesn't contain 1 or 2." If that were the case, you'd get a match even if the string in question was empty. What negated character classes really mean is "match any character that's not these characters." There must be at least one actual character to match for a negated character class to work.
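
The subtlety is easier to see side by side. This sketch contrasts the negated class with Perl's !~ operator, which is the standard negated form of =~ (it is not covered in this lesson, and is shown here only for comparison). Both helper names are hypothetical.

```perl
use strict;
use warnings;

# Hypothetical helper: 1 if the string contains some character other than 1 or 2.
sub has_other_than_12 { return $_[0] =~ /[^12]/ ? 1 : 0 }

# Hypothetical helper: 1 if the string contains neither 1 nor 2.
# !~ negates the whole match, which is a different question.
sub lacks_1_and_2     { return $_[0] !~ /[12]/  ? 1 : 0 }

for my $s ('', '1212', '12a', 'abc') {
    printf "%-6s  has_other_than_12=%d  lacks_1_and_2=%d\n",
        "'$s'", has_other_than_12($s), lacks_1_and_2($s);
}
```

The empty string shows the difference most clearly: it has no character for /[^12]/ to match, yet it certainly lacks both 1 and 2.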
Special Classes

If character class ranges are still too much for you to type, there are also special character classes (and negated character classes) that have their own escape codes. You'll see these a lot in regular expressions, particularly those that match numbers in specific formats. Note that these special codes don't need to be enclosed between brackets; you can use them all by themselves to refer to that class of characters.

Code    Equivalent character class    What it means

\d      [0-9]                         Any digit
\D      [^0-9]                        Any character not a digit
\w      [0-9a-zA-Z_]                  Any "word character"
\W      [^0-9a-zA-Z_]                 Any character not a word character
\s      [ \t\n\r\f]                   Whitespace (space, tab, newline, carriage return, form feed)
\S      [^ \t\n\r\f]                  Any non-whitespace character

The word character classes (\w and \W) are a bit mystifying: why is an underscore considered a word character, but punctuation isn't? In reality, word characters have little to do with words, but are the valid characters you can use in variable names: numbers, letters, and underscores. Any other characters are not considered word characters.

You can use these character codes anywhere you need a specific type of character. For example, the \d code refers to any digit. With \d, you could create patterns that match any three digits /\d\d\d/, or, perhaps, any three digits, a dash, and any four digits, to represent a phone number such as 555-1212: /\d\d\d-\d\d\d\d/. All this repetition isn't necessarily the best way to go, however, as you'll learn in a bit when we cover quantifiers.
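
Here is a minimal sketch of the phone-number pattern above. It also adds an anchored variant, which is an assumption on my part about how you might use the pattern for validation; the helper names contains_phone and is_phone are hypothetical.

```perl
use strict;
use warnings;

# Hypothetical helper: 1 if a 555-1212 style number appears anywhere.
sub contains_phone { return $_[0] =~ /\d\d\d-\d\d\d\d/   ? 1 : 0 }

# Hypothetical helper: 1 only if the whole string is such a number.
sub is_phone       { return $_[0] =~ /^\d\d\d-\d\d\d\d$/ ? 1 : 0 }

for my $s ('555-1212', 'call 555-1212 now', '55-1212', '5555-12123') {
    printf "%-20s contains=%d  is=%d\n", "'$s'", contains_phone($s), is_phone($s);
}
```

The unanchored pattern finds a phone number inside "5555-12123", which is a reminder that without ^ and $ a pattern only has to match somewhere in the string.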
Matching Any Character with . (dot)

The broadest possible character class you can get is to match based on any character whatsoever. For that, you'd use the dot character (.). So, for example, the following pattern will match lines that contain one character and one character only:

/^.$/

You'll use the dot more often in patterns with quantifiers (which you'll learn about next), but the dot can be used to indicate fields of a certain width, for example:

/^..:/

This pattern will match only if the line starts with two characters and a colon.
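
The two dot patterns above can be tried out with a sketch like this; the helper names and sample strings are made up for the example.

```perl
use strict;
use warnings;

# Hypothetical helper: 1 if the line holds exactly one character.
sub is_single_char      { return $_[0] =~ /^.$/  ? 1 : 0 }

# Hypothetical helper: 1 if the line starts with any two characters and a colon.
sub has_two_char_prefix { return $_[0] =~ /^..:/ ? 1 : 0 }

for my $s ('a', '?', 'ab', '') {
    printf "%-6s single=%d\n", "'$s'", is_single_char($s);
}
for my $s ('ab: text', 'a: text', 'abc: text') {
    printf "%-12s prefix=%d\n", "'$s'", has_two_char_prefix($s);
}
```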

More about the dot operator after we pause for an example.
An Example: Optimizing Numspeller

Remember the numspeller script from yesterday? This was the script that took a single-digit number and converted it into a word. You may remember that when I described the numspeller script, I mentioned it would be easier to write using regular expressions. So, now that you know something of regular expressions, let's rewrite the script to use regular expressions instead of all those if statements.

And, while we're at it, why don't we revise the part of numspeller that verifies the input? We can do a lot more in terms of input validation with regular expressions, to the point of absurdity. In fact, we'll approach absurdity with the input validation in this script. This version tests for a number of things that could be entered, and replies with various comments (many of them sarcastic):

code:

% numspeller2.pl 
Enter the number you want to spell(0-9): foo 
You can't fool me. There are letters in there. 
Enter the number you want to spell(0-9): 45foo 
You can't fool me. There are letters in there. 
Enter the number you want to spell(0-9): ### 
huh? That *really* doesn't look like a number 
Enter the number you want to spell(0-9): -45 
That's a negative number. Positive only, please! 
Enter the number you want to spell(0-9): 789 
Too big! 0 through 9, please. 
Enter the number you want to spell(0-9): 4 
Thanks! 
Number 4 is four 
Try another number (y/n)?: x 
y or n, please 
Try another number (y/n)?: n 
%

Instead of showing you this script and then working through it line by line, let's go in the reverse direction: I'm going to show you sections from both the old and new versions of numspeller, explain them, and then at the end, I'll list the whole thing so you can get the big picture.

Let's start with the loop that accepts a number as input. This is what the loop looked like in the old version of numspeller:

code:

while () { 
    print 'Enter the number you want to spell: '; 
    chomp($num = <STDIN>); 
    if ($num gt "9" ) { # test for strings 
        print "No strings.  0 through 9 please..\n"; 
        next; 
    } 
    if ($num > 9) { # numbers w/more than 1 digit 
        print "Too big. 0 through 9 please.\n"; 
        next; 
    } 
    if ($num < 0) { # negative numbers 
        print "No negative numbers.  0 through 9 please.\n"; 
        next; 
    } 
    last; 
}

We can easily replace the three tests in this loop with regular expressions that make more sense—and we can also test for more sophisticated kinds of things. Our new loop will test for three major groups of things:

Whether the input contains a single digit and only a single digit (in which case we're done).
Whether the input contains any characters other than numbers
Whether the number is larger than 9

That second test can then be broken into sub-tests for things like alphabetic characters, negative numbers (starting with -), floating-point numbers (with a decimal point), or totally bizarre characters. Here's the new version of our loop, which also makes use of the $_ variable to save us some typing in the pattern matching tests:

code:

1: while () { 
2:     print 'Enter the number you want to spell(0-9): '; 
3:     chomp($_ = <STDIN>); 
4:     if (/^\d$/) {  # correct input 
5:         print "Thanks!\n"; 
6:         last; 
7:     } elsif (/^$/) { 
8:         print "You didn't enter anything.\n"; 
9:     } elsif (/\D/) { # nonnumbers
10:        if (/[a-zA-Z]/) { # letters
11:            print "You can't fool me.  There are letters in there.\n"; 
12:        } elsif (/^-\d/) { # negative numbers 
13:            print "That's a negative number.  Positive only, please!\n"; 
14:        } elsif (/\./) { # decimals 
15:            print "That looks like it could be a floating-point number.\n"; 
16:            print "I can't spell a floating-point number.  Try again.\n"; 
17:        } elsif (/[\W_]/) {  # other chars 
18:            print "huh?  That *really* doesn't look like a number\n"; 
19:        } 
20:    } elsif ($_ > 9) { 
21:        print "Too big!  0 through 9, please.\n"; 
22:    } 
23:  }

Let's look at those regular expressions, line by line, so you know what's getting matched here:

Line 4: /^\d$/ This pattern matches input with a single digit, and only a single digit—that is, it matches exactly the input we want to match. I stuck it up here at the top because if the user does enter the right value, we don't want to spend a lot of time cycling through all the options to figure out that they were right. This way, given this very specific match, if we get the correct input we can exit right out of the loop with last.
Line 7: /^$/ As you learned in the section on matching at boundaries, this pattern matches an empty line—which is just what you get here if you hit return at the prompt without entering anything.
Line 9: /\D/ This character code means "any character other than a digit." If you type anything at the prompt that isn't a number (a mixture of numbers and letters, all letters, or characters like -, ., or $), this pattern will match. This branches into a number of sub-tests for the specific non-number characters that got entered.
Line 10: /[a-zA-Z]/ These character class ranges look for actual characters from the alphabet. I didn't use the \w code here because that would have included the underscore, and I want to group the underscore into the any-other-character test instead.
Line 12: /^-\d/ Here we're testing for a dash at the start of the line, immediately followed by a digit. This is the test for negative numbers.
Line 14: /\./ Input containing a decimal point is probably a floating-point number. Note here that because . is a metacharacter for the pattern, we have to escape it to match an actual dot.
Line 17: /[\W_]/ Here we use a character class of two things: any character that's not a word character (that is, not 0-9, a-z, A-Z, or the underscore), or the underscore itself. This is the catch-all for all other characters that might have been entered.
Line 20: No pattern here; this line catches input that is a number (so it won't get caught by most of the previous tests), but is a number with more than one digit. Here we'll just test the value to see if it’s bigger than 9 to catch those cases. There is actually a pattern that will match this, but you haven't learned it yet. This test works just as well.
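
If you'd like to check each branch without typing at a prompt, one option is to wrap the same tests in a sub that returns a label instead of printing. This is only a sketch: the sub name classify and its labels are hypothetical, and the letter test is written as [a-zA-Z].

```perl
use strict;
use warnings;

# Hypothetical helper: same tests, in the same order, as the input loop,
# but returning a short label rather than printing a message.
sub classify {
    local $_ = shift;
    return 'ok'    if /^\d$/;              # a single digit
    return 'empty' if /^$/;                # nothing entered
    if (/\D/) {                            # something other than digits
        return 'letters'  if /[a-zA-Z]/;
        return 'negative' if /^-\d/;
        return 'decimal'  if /\./;
        return 'other'    if /[\W_]/;
    }
    return 'too big' if $_ > 9;            # all digits, but more than one
    return 'unknown';
}

for my $input ('4', '', 'foo', '45foo', '-45', '3.5', '###', '_', '789') {
    printf "%-8s => %s\n", "'$input'", classify($input);
}
```

Because the tests run in order, "45foo" is reported as letters rather than as too big, and a lone underscore falls through to the catch-all.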

The next part of the old numspeller script was a set of if...elsif statements that compared the input value to each number in turn. Using regular expressions, the default variable $_, and logical expressions used as conditionals, we can reduce the chained ifs that looked like this:

code:

if ($num == 1) { print 'one'; }
elsif ($num == 2) { print 'two'; }
elsif ($num == 3) { print 'three'; }
elsif ($num == 4) { print 'four'; }
# ... other numbers removed for space

Into a set of logicals that look like this:

code:

/1/ && print 'one'; 
/2/ && print 'two'; 
/3/ && print 'three'; 
/4/ && print 'four'; 
# ... and so on

Cool, eh? It’s almost switch-like, and, arguably, easier to read.
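
One caveat is worth seeing in action: unlike the if...elsif chain, each of these lines is tested independently, so the trick depends on the input already being a single digit. The sketch below builds a string instead of printing so the result is easy to inspect; the sub name spell is hypothetical and only four digits are included.

```perl
use strict;
use warnings;

# Hypothetical helper: the same && shorthand, appending to a string.
sub spell {
    local $_ = shift;
    my $word = '';
    /1/ && ($word .= 'one');     # right side runs only if the match succeeds
    /2/ && ($word .= 'two');
    /3/ && ($word .= 'three');
    /4/ && ($word .= 'four');
    return $word;
}

print "4  => ", spell('4'),  "\n";
print "12 => ", spell('12'), "\n";   # both /1/ and /2/ match
```

With validated single-digit input only one line can fire, which is why the input loop's /^\d$/ test has to come first.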

Finally, we'll rewrite our little yes-or-no loop to repeat the entire script. The old version looked like this:

code:

while () { 
    print 'Try another number (y/n)?: '; 
    chomp ($exit = <STDIN>); 
    $exit = lc $exit; 
    if ($exit ne 'y' && $exit ne 'n') { 
        print "y or n, please\n"; 
    } 
    else { last; } 
}

There's actually nothing terribly wrong with this version, but since this is the pattern matching lesson, let's use pattern matching here, too:

code:

while () { 
        print 'Try another number (y/n)?: '; 
        chomp ($exit = <STDIN>); 
        $exit = lc $exit; 
        if ($exit =~ /^[yn]$/) { # exactly y or n
            last; 
        } 
        else { 
            print "y or n, please\n"; 
        } 
    }

Note the differences between this loop and the input loop. In the input loop, we stored the input in the $_ variable, so we could just put the pattern into the test itself. Here we're matching against the string in the $exit variable, so we have to use the =~ operator instead. In the pattern itself, we test to see if what was typed was either y or n (Y and N will get converted to lowercase with the lc function), and if so, exit the loop and return to the outer loop, which repeats the script if necessary.

Note: In this example, I've used quite a few regular expressions, many of them gratuitous. It’s worth mentioning at this point that you shouldn't necessarily use regular expressions everywhere simply because they're cool. The Perl regular expression engine is really powerful for really powerful things, but there is some overhead in terms of efficiency if you use it for simple things. Simple tests and if statements will often execute faster than regular expressions. If you're concerned about the efficiency of your code, keep that in mind.
[code]#!/usr/bin/perl
# numberspeller: prints out word approximations of numbers
# simple version, only does single-digits

$exit = ""; # whether or not to exit the script.

while ($exit ne "n") {

    while () {
        print 'Enter the number you want to spell(0-9): ';
        chomp($_ = <STDIN>);
        if (/^\d$/) {
            print "Thanks!\n";
            last;
        } elsif (/^$/) {
            print "You didn't enter anything.\n";
        } elsif (/\D/) { # nonnumbers
            if (/[a-zA-Z]/) { # letters
                print "You can't fool me. There are letters in there.\n";
            } elsif (/^-\d/) { # negative numbers
                print "That's a negative number. Positive only, please!\n";
            } elsif (/\./) { # decimals
                print "That looks like it could be a floating-point number.\n";
                print "I can't spell a floating-point number. Try again.\n";
            } elsif (/[\W_]/) { # other chars
                print "huh? That *really* doesn't look like a number\n";
            }
        } elsif ($_ > 9) {
            print "Too big! 0 through 9, please.\n";
        }
    }

    print "Number $_ is ";
    /1/ && print 'one';
    /2/ && print 'two';
    /3/ && print 'three';
    /4/ && print 'four';
    /5/ && print 'five';
    /6/ && print 'six';
    /7/ && print 'seven';
    /8/ && print 'eight';
    /9/ && print 'nine';
    /0/ && print 'zero';
    print "\n";

    while () {
        print 'Try another number (y/n)?: ';
        chomp ($exit = <STDIN>);
        $exit = lc $exit;
        if ($exit =~ /^[yn]$/) { # exactly y or n
            last;
        }
        else {
            print "y or n, please\n";
        }
    }
}[/code]

Matching Multiple Instances of Characters

Ready for more? The second group of regular expression syntax to explore is that of quantifiers. Whereas the patterns you've seen up to now refer to individual things or groups of individual things, quantifiers allow you to indicate multiple instances of things, or potentially no things. These regular expression metacharacters are called quantifiers because they indicate some quantity of characters or groups of characters in the pattern you're looking for.

Perl's regular expressions include three quantifier metacharacters: ?, *, and +. Each refers to some multiple of the character or group that appears just before it in the pattern.

Optional Characters with ?

Let's start with ?, which matches a sequence that may or may not have the character immediately preceding it (that is, it matches zero or one instance of that character). So, for example, take this pattern:

/be?ar/

The question mark in that pattern refers to the character preceding it (e). This pattern would match with the string "step up to the bar" and with the string "grin and bear it"—because both "bar" and "bear" will match this pattern. The string you're searching must have the b, the a, and the r, but the e is optional.

Once again, think in terms of how the string is processed. The b is matched first. Then the next character is tested. If it's an e, no problem, we move on to the next character both in the string and in the pattern (the a). If it's not an e, that's still no problem; we move on to the next character in the pattern to see if it matches instead.

You can create groups of optional characters with parentheses:

/bamboo(zle)?/

The parentheses make that whole group of characters (zle) optional: this pattern will match both bamboo and bamboozle. The thing just before the ? is the optional thing, be it a single character or a group.

Note: Why bother creating a pattern like this? It would seem that the (zle) part of this pattern is irrelevant, and that just plain /bamboo/ would work just as well, with fewer characters. In these easy cases, where we're just trying to find out whether something matches, yes or no, it doesn't matter. Tomorrow, when you learn how to extract the thing that matched and create more complex patterns, the distinction will be more important.
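
To see both patterns at work, here's a minimal sketch you can run. The helper names has_bar_or_bear and has_bamboo and the sample strings are made up for this illustration; they don't come from any library.

[code]#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helpers: each returns 1 on a match, 0 otherwise.
sub has_bar_or_bear { return $_[0] =~ /be?ar/ ? 1 : 0; }          # the e is optional
sub has_bamboo      { return $_[0] =~ /bamboo(zle)?/ ? 1 : 0; }   # the group zle is optional

foreach my $string ("step up to the bar", "grin and bear it", "a cold beer") {
    my $result = has_bar_or_bear($string) ? "match" : "no match";
    print "/be?ar/ $result: $string\n";
}

foreach my $string ("bamboo", "bamboozle", "bambo") {
    my $result = has_bamboo($string) ? "match" : "no match";
    print "/bamboo(zle)?/ $result: $string\n";
}[/code]

The first loop reports a match for the bar and bear strings but not for beer, where the optional e is followed by another e rather than an a. The second loop matches bamboo and bamboozle but not bambo.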

You can also use character classes with ?:

/thing \d?/

This pattern will match the strings "thing 1", "thing 9," and so on, but will also match "thing " (note the space). Any character in the character class can appear either zero or one times for the pattern to match.

Multiple Characters with *

A second quantifier is the *, which works similarly to the ? except that * allows zero or any number of the preceding character to appear, not just zero or one instance as ? does. Take this pattern:

/xy*z/

In this pattern, the x and the z are required, but the y can appear any number of times including not at all. This pattern will match xyz, xyyz, xyyyyyyyyyyyyyyyyz, or just plain old xz without the y.

As with ?, you can use groups or character classes before the *. One use of * is to use it with the dot character—which means that any number of any characters could appear at that position:

/this.*/

This pattern matches the strings "thisthat", "this is not my sweater. The blue one with the flowers is mine," or even just "this"—remember, the character at the end doesn't have to exist for there to be a match.

A common mistake is to forget that * stands for "zero or more instances," and to use it like this:

if (/^[0-9]*$/) {
# contains numbers
}

The intent here is to create a pattern that matches only if the input contains numbers and only numbers. And this pattern will indeed match "7," "1540," "15443" and so on. But it'll also match the empty string—because the * means that no numbers whatsoever will also produce a match. Usually, when you want to require something to appear at least once, you want to use + instead of *.
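
Here's a small sketch of that difference. The helper subs digits_star and digits_plus are hypothetical names that simply wrap the two patterns so you can compare them on the same input.

[code]#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helpers: each returns 1 on a match, 0 otherwise.
sub digits_star { return $_[0] =~ /^[0-9]*$/ ? 1 : 0; }   # zero or more digits
sub digits_plus { return $_[0] =~ /^[0-9]+$/ ? 1 : 0; }   # one or more digits

foreach my $input ("7", "1540", "", "12a") {
    printf "%-8s star: %d  plus: %d\n",
        "'$input'", digits_star($input), digits_plus($input);
}[/code]

Both versions accept "7" and "1540" and reject "12a". Only the * version accepts the empty string.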

Note also that "match zero or more numbers," as that example would imply, does not mean that it will match any string that happens to have zero numbers; it won't match the string "lederhosen", for example. Matching zero or more numbers does not imply any other matches; if you want it to match characters other than numbers, you'll need to include those characters in the pattern. With regular expressions, you have to be very specific about what you want to match.

Requiring at Least One Instance with +

The + metacharacter works identically to *, with one significant difference: instead of allowing zero or more instances of the given character or group, it requires that character or group to appear at least once ("one or more instances"). So given a pattern like the one we used for *:

/xy+z/

This pattern will match "xyz", "xyyz", and "xyyyyyyyyyyz", but it will not match "xz". The y must appear at least once.

As with * and ?, you can use groups and character classes with +.

Restricting the Number of Instances

For both * and + the given character or group can appear any number of times; there is no upper limit (characters with ? can appear at most once). But what if you want to match a specific number of instances? What if the pattern you're looking for does require a lower or upper limit, and any more or less than that won't match? You can use the optional curly bracket metacharacters to set limits on the quantity, like this:

/\d{1,4} /

This pattern matches if the data includes one digit, two digits, three digits, or four digits, any of them followed by a space; it won't match if there aren't any digits whatsoever, and the \d{1,4} part never takes in more than four digits. (Because the pattern isn't anchored, a longer run such as "12345 " still contains a match: the last four digits and the space.) The first number inside the brackets is the minimum number of instances to match; the second is the maximum. Or you can match an exact number by just including the number itself:

/a{5}b/

This pattern will only match if it can find five a's in a row followed by one b. It's exactly equivalent to /aaaaab/. A less specific use of {} for an exact number of instances might be something like this:

/\$\d+\.\d{2}/

Can you work through this pattern and figure out what it matches? It uses a number of escaped characters, so it might be confusing. First, it matches a dollar sign (\$), then one or more digits (\d+), then a decimal point (\.), and finally two digits (\d{2}). Put it all together and this pattern matches monetary input: $45.23 would match just fine, as would $0.45 or $15.00, but $.45 and $34.2 would not. This pattern requires at least one digit on the left side of the decimal point and two digits on the right. Because nothing anchors the end of the pattern, a third digit after the decimal point won't prevent a match; add $ to the end of the pattern if you want to rule that out.
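
If you'd like to try the monetary pattern on a few inputs, here's a minimal sketch. The helper names looks_like_money and is_money are invented for this example, and the anchored variant (with ^ and $ added) is an assumption about what you'd usually want when validating a whole input line, not something taken from the pattern above.

[code]#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: does the string contain something shaped like $D.DD?
sub looks_like_money { return $_[0] =~ /\$\d+\.\d{2}/ ? 1 : 0; }

# Assumed stricter variant: the whole string must be the amount.
sub is_money { return $_[0] =~ /^\$\d+\.\d{2}$/ ? 1 : 0; }

# Single quotes keep the dollar signs in the test data literal.
foreach my $amount ('$45.23', '$0.45', '$15.00', '$.45', '$34.2', '$34.234') {
    print "$amount unanchored: ", looks_like_money($amount),
          "  anchored: ", is_money($amount), "\n";
}[/code]

The two versions agree on every input except $34.234, which the unanchored pattern accepts because $34.23 appears inside it.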

Back to the curly brackets. You can set a lower bound on the match, but not an upper bound, by leaving off the maximum number but keeping the comma:

/ba{4,}t/

This pattern matches b, four or more instances of the letter a, and then t. Three instances of a in a row won't match, but twenty a's will.
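
A quick way to convince yourself of how the bounds behave is to run the patterns against strings of different lengths. This is only a sketch; the helper names has_long_bat and has_five_a_b are hypothetical.

[code]#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helpers: each returns 1 on a match, 0 otherwise.
sub has_long_bat { return $_[0] =~ /ba{4,}t/ ? 1 : 0; }   # b, four or more a's, t
sub has_five_a_b { return $_[0] =~ /a{5}b/   ? 1 : 0; }   # five a's, then b

# The x operator repeats a string: "a" x 20 is twenty a's.
foreach my $string ("baaat", "baaaat", "b" . "a" x 20 . "t") {
    print "/ba{4,}t/ on $string: ", has_long_bat($string), "\n";
}

foreach my $string ("aaaab", "aaaaab", "aaaaaab") {
    print "/a{5}b/ on $string: ", has_five_a_b($string), "\n";
}[/code]

Note the last case: six a's followed by b still matches /a{5}b/, because the final five a's and the b satisfy the pattern. If you need to reject extra characters, anchor the pattern with ^ and $.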

Note that you could represent +, * and ? in curly bracket format:

/x{0,1}/ # same as /x?/

/x{0,}/ # same as /x*/

/x{1,}/ # same as /x+/

More About Building Patterns

We started this lesson with a basic overview of how to use patterns in your Perl scripts using an if test and the =~ operator (or, if you're searching in $_, you can leave off the =~ part altogether). Now that you know something of constructing patterns with regular expression syntax, let's return to Perl, and look at some different ways of using patterns in your Perl scripts, including interpolating variables into patterns and using patterns in loops.

Patterns and Variables

In all the examples so far, we've used patterns as hard-coded sets of characters in the test of a Perl script. But what if you want to match different things based on some sort of input? How do you change the search pattern on the fly?

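Easy. Patterns, like double-quoted strings, can contain variables, and the value of the variable is substituted into the pattern. Put the pattern text in single quotes when you assign it, so that Perl leaves the backslashes and dollar signs alone:

$pattern = '^\d{3}$';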

if (/$pattern/) { ...
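
Here's a minimal sketch of a pattern stored in a variable. It keeps the pattern in single quotes, since double quotes would let Perl interpret the backslash and the dollar sign before the regex engine ever sees them. The \Q...\E escapes in the second helper are an addition for this sketch (the text hasn't covered them): they make Perl treat the variable's contents as literal characters rather than as a pattern. All helper and variable names here are hypothetical.

[code]#!/usr/bin/perl
use strict;
use warnings;

my $pattern = '^\d{3}$';    # single quotes: \d and $ reach the regex engine intact
my $literal = 'aol.com';    # text we want to find exactly as written

# Hypothetical helpers: each returns 1 on a match, 0 otherwise.
sub is_three_digits  { return $_[0] =~ /$pattern/     ? 1 : 0; }
sub mentions_pattern { return $_[0] =~ /$literal/     ? 1 : 0; }   # the . matches any character
sub mentions_literal { return $_[0] =~ /\Q$literal\E/ ? 1 : 0; }   # the . matches only a dot

foreach my $input ("123", "1234", "12a") {
    print "$input three digits? ", is_three_digits($input), "\n";
}

foreach my $host ("user.aol.com", "aolxcom.net") {
    print "$host as pattern: ", mentions_pattern($host),
          "  as literal: ", mentions_literal($host), "\n";
}[/code]

Only "123" passes the three-digit test. Both host names match when aol.com is used as a pattern, but only user.aol.com matches when it is treated literally.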

The variable in question can contain a string with any kind of pattern, including metacharacters. You can use this technique to combine patterns in different ways, or to search for patterns based on input. For example, here's a simple script that prompts you for both a pattern and some data to search, and then prints true or false depending on whether there's a match:
[code]#!/usr/bin/perl

print 'Enter the pattern: ';
chomp($pat = <STDIN>);

print 'Enter the string: ';
chomp($in = <STDIN>);

if ($in =~ /$pat/) { print "true\n"; }
else { print "false\n"; }[/code]

You may find this script (or one like it) useful yourself, as you learn more about regular expressions.

Patterns and Loops

One way of using patterns in Perl scripts is to use them as tests, as we have up to this point. In this context (a scalar boolean context), they evaluate to true or false based on whether the pattern matches the data. Another way to use a pattern is as the test in a loop, with the /g option at the end of the pattern, like this:

[code]while (/pattern/g) {
# loop
}[/code]

The /g option is used to match all the occurrences of the pattern in the given string (here, $_, but you can use the =~ operator to match somewhere else). In an if test, the /g option won't matter, because the test will return true at the first match it finds. In the case of while (or a for loop), however, the /g will cause the test to return true each time the pattern occurs in the string, and the statements in the block will execute that number of times as well.
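
As a minimal sketch of that behavior, here's a loop that counts matches in a single string. The sub name count_the and the sample sentence are made up for the illustration.

[code]#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: counts how many times "the" occurs in a string.
sub count_the {
    my ($text) = @_;
    my $count = 0;
    while ($text =~ /the/g) {    # true once per occurrence, then false
        $count++;
    }
    return $count;
}

print count_the("the cat sat on the mat by the door"), "\n";    # prints 3[/code]

Keep in mind that the pattern matches characters, not words: count_the("the other one") returns 2, because "other" contains "the" as well.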

Note: We're still talking about using patterns in a scalar context, here; the /g just causes interesting things to happen in loops. We'll get to using patterns in list context tomorrow.

Another Example: Counting

Here's an example of a script that makes use of that patterns-in-loops feature I just mentioned to work through a file (or any number of files) and count the occurrences of some pattern in that file. With this script you could, for example, count the number of times your name occurs in a file, or find out how many hits to your Web site came from America Online (aol.com).

[code]1: #!/usr/bin/perl
2:
3: $pat = ""; # thing to search for
4: $count = 0; # number of times it occurs
5:
6: print 'Search for what? ';
7: chomp($pat = <STDIN>);
8: while (<>) {
9:     while (/$pat/g) {
10:         $count++;
11:     }
12: }
13:
14: print "Found /$pat/ $count times.\n";[/code]

As with all the scripts we've built that cycle through files using <>, you'll have to call this one on the command line with the name of a file:
[code] % count.pl logfile
Search for what? aol.com
Found /aol.com/ 3456 times.
%[/code]

Nothing should look overly surprising, although there are a few points to note. Remember that using while with the file input characters (<>) sets each line of input to the default variable $_. Since patterns will also match with that value by default, we don't need a temporary variable to hold each line of input. The first while loop (line 8), then, reads each line from the input files. The second while loop searches that single line of input repeatedly and increments $count each time it finds the pattern in each line. This way, we can get the total number of instances of the given pattern, both inside each line and for all the lines in the input.

One other important thing to note about this script: if you have it search for a phrase instead of a single word (for example, find all instances of both a first and last name), then there is a possibility that that phrase could fall across multiple lines. This script will miss those instances, since neither line will completely match the pattern.
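
One possible way around that limitation, sketched here under a couple of assumptions, is to search the whole text as a single string and let any run of whitespace (including a newline) stand between the words of the phrase. This is not how the script above works; the sub name count_phrase and the sample text are hypothetical, and the sketch assumes you've already read the entire input into one string.

[code]#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: counts a phrase in $text even when a line break
# falls between its words.
sub count_phrase {
    my ($text, $phrase) = @_;
    # Escape each word so it's matched literally, then allow any
    # whitespace (spaces, tabs, newlines) between the words.
    my $pattern = join '\s+', map { quotemeta($_) } split ' ', $phrase;
    return 0 if $pattern eq '';    # nothing to search for
    my $count = 0;
    while ($text =~ /$pattern/g) {
        $count++;
    }
    return $count;
}

my $text = "Call Jane\nDoe today.\nJane Doe wrote this.\n";
print count_phrase($text, "Jane Doe"), "\n";    # prints 2[/code]

The trade-off is memory: the line-by-line version never holds more than one line at a time, while this approach needs the whole text in a single string.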
