Practical mod_perl: Chapter 6: Coding with mod_perl in Mind


Compiled Regular Expressions

When using a regular expression containing an interpolated Perl variable that you are confident will not change during the execution of the program, a standard speed-optimization technique is to add the /o modifier to the regex pattern. This compiles the regular expression once, for the entire lifetime of the script, rather than every time the pattern is executed. Consider:

my $pattern = '^\d+$'; # likely to be input from an HTML form field
foreach (@list) {
    print if /$pattern/o;
}

This is usually a big win in loops over lists, or when using the grep( ) or map( ) operators.

In long-lived mod_perl scripts and handlers, however, the variable may change with each invocation. In that case, this caching of the compiled pattern can pose a problem. The first request processed by a fresh mod_perl child process will compile the regex and perform the search correctly. However, all subsequent requests running the same code in the same process will use the cached pattern and not the fresh one supplied by users. The code will appear to be broken.

Imagine that you run a search engine service, and one person enters a search keyword of her choice and finds what she's looking for. Then another person who happens to be served by the same process searches for a different keyword, but unexpectedly receives the same search results as the previous person.
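The stale-pattern effect can be reproduced outside mod_perl. The following is a minimal sketch in which a named subroutine, search( ) (a hypothetical stand-in for a handler that is called once per request in the same process), is called twice with different patterns:

```perl
use strict;
use warnings;

# Hypothetical stand-in for a handler that runs once per request
# inside the same long-lived process.
sub search {
    my ($pattern, @list) = @_;
    return grep { /$pattern/o } @list;   # /o freezes the first pattern seen
}

my @data = qw(12 ab 34);
my @first  = search('^\d+$',    @data);  # first "request"
my @second = search('^[a-z]+$', @data);  # second "request", different pattern

print "first:  @first\n";    # first:  12 34
print "second: @second\n";   # second: 12 34 (stale pattern; "ab" was expected)
```

The second call still matches digits, because the pattern compiled during the first call is reused for the lifetime of the process.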

There are two solutions to this problem.

The first solution is to use the eval q// construct to force the code to be recompiled each time it's run. It's important that the evaluated string covers the entire processing loop, not just the pattern match itself.

The original code fragment would be rewritten as:

my $pattern = '^\d+$';
eval q{
    foreach (@list) {
        print if /$pattern/o;
    }
};
die $@ if $@; # a string eval traps errors, so report them

If we were to write this:

foreach (@list) {
    eval q{ print if /$pattern/o; };
}

the regex would be compiled for every element in the list, instead of just once for the entire loop over the list (and the /o modifier would essentially be useless).

However, be careful when using strings from an untrusted origin inside eval: they might contain Perl code dangerous to your system, so make sure to sanity-check them first.
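What counts as a sanity check depends on the application. As one possible sketch, the hypothetical helper pattern_is_acceptable( ) below rejects empty or overly long input, refuses the embedded-code constructs (?{ }) and (??{ }), and confirms that the pattern compiles. The 200-character limit is an arbitrary assumption, not a recommendation from the text:

```perl
use strict;
use warnings;

# Hypothetical helper: returns 1 if a user-supplied pattern looks safe
# enough to use, 0 otherwise.
sub pattern_is_acceptable {
    my ($pattern) = @_;
    return 0 unless defined $pattern && length $pattern;
    return 0 if length($pattern) > 200;      # arbitrary limit (assumption)
    return 0 if $pattern =~ /\(\?\??\{/;     # (?{ ... }) or (??{ ... })
    # Compile inside a block eval so a malformed pattern cannot kill us.
    my $re = eval { qr/$pattern/ };
    return defined $re ? 1 : 0;
}

for my $candidate ('^\d+$', '(', '(?{ print "hi" })') {
    printf "%-20s %s\n", $candidate,
        pattern_is_acceptable($candidate) ? "accepted" : "rejected";
}
```

A check like this only limits what reaches the regex engine. It does not make it safe to interpolate user input directly into a string that is passed to eval.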

This approach can be used if there is more than one pattern-match operator in a given section of code. If the section contains only one regex operator (be it m// or s///), you can rely on the property of the null pattern, which reuses the last pattern seen. This leads to the second solution, which also eliminates the use of eval.

The above code fragment becomes:

my $pattern = '^\d+$';
"0" =~ /$pattern/; # dummy match that must not fail!
foreach (@list) {
    print if //;
}

The only caveat is that the dummy match that primes the regular expression engine must succeed. Otherwise the pattern will not be remembered, and if no earlier match has succeeded, // acts as a genuine empty pattern and matches everything. If you can't count on fixed text to ensure the match succeeds, you have two options.

If you can guarantee that the pattern variable contains no metacharacters (such as *, +, ^, $, \d, etc.), you can use the dummy match of the pattern itself:

$pattern =~ /\Q$pattern\E/; # guaranteed if no metacharacters present

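The \Q escape ensures that any special regex characters are quoted, so the dummy match always succeeds. Note that the pattern remembered for // is then the literal text of $pattern, which is why this form is suitable only when no metacharacters are present.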

If there is a possibility that the pattern contains metacharacters, you should match the pattern itself, or the nonsearchable \377 character, as follows:

"\377" =~ /$pattern|^\377$/; # guaranteed if metacharacters present
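Putting the pieces together, the null-pattern technique might be wrapped as follows. This is a sketch only; match_with_null_pattern( ) is a hypothetical helper name, and it uses the \377 priming match so that the dummy match succeeds whatever the pattern contains:

```perl
use strict;
use warnings;

# Hypothetical helper: returns the elements of @list that match $pattern,
# using the null pattern so the regex is compiled once per call.
sub match_with_null_pattern {
    my ($pattern, @list) = @_;
    my @hits;
    # Priming match: succeeds via $pattern or via the \377 fallback.
    "\377" =~ /$pattern|^\377$/;
    foreach (@list) {
        push @hits, $_ if //;   # reuses the last successfully matched pattern
    }
    return @hits;
}

my @data    = qw(12 ab 34);
my @digits  = match_with_null_pattern('^\d+$',    @data);
my @letters = match_with_null_pattern('^[a-z]+$', @data);

print "digits:  @digits\n";
print "letters: @letters\n";
```

Unlike the /o version, each call picks up the pattern it was given, because the priming match is executed again on every call.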

Matching patterns repeatedly

Another technique may also be used, depending on the complexity of the regex to which it is applied. One common situation in which a compiled regex is usually more efficient is when you are matching any one of a group of patterns over and over again.

To make this approach easier to use, we'll use a slightly modified helper routine from Jeffrey Friedl's book Mastering Regular Expressions (O'Reilly):

sub build_match_many_function {
    my @list = @_;
    my $expr = join '||', 
        map { "\$_[0] =~ m/\$list[$_]/o" } (0..$#list);
    my $matchsub = eval "sub { $expr }";
    die "Failed in building regex @list: $@" if $@;
    return $matchsub;
}

This function accepts a list of patterns as an argument, builds a match regex for each item in the list against $_[0], and uses the logical || (OR) operator to stop the matching when the first match succeeds. The chain of pattern matches is then placed into a string and compiled within an anonymous subroutine using eval. If eval fails, the code aborts with die( ); otherwise, a reference to this subroutine is returned to the caller.
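To see what is handed to eval, the expression string can be built on its own and printed. The sketch below repeats the join and map step for a two-element list:

```perl
use strict;
use warnings;

my @list = qw(Mozilla Lynx);

# Same expression-building step as in build_match_many_function().
my $expr = join '||',
    map { "\$_[0] =~ m/\$list[$_]/o" } (0..$#list);

print "$expr\n";
# prints: $_[0] =~ m/$list[0]/o||$_[0] =~ m/$list[1]/o
```

Each pattern gets its own match operator, so the /o modifier compiles every element of the list exactly once inside the generated subroutine.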

Here is how it can be used:

my @agents = qw(Mozilla Lynx MSIE AmigaVoyager lwp libwww);
my $known_agent_sub = build_match_many_function(@agents);
 
while (<ACCESS_LOG>) {
    my $agent = get_agent_field($_);
    warn "Unknown Agent: $agent\n"
        unless $known_agent_sub->($agent);
}

This code takes lines of log entries from the access_log file already opened on the ACCESS_LOG file handle, extracts the agent field from each entry in the log file, and tries to match it against the list of known agents. Every time the match fails, it prints a warning with the name of the unknown agent.
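The get_agent_field( ) routine is not shown in the text. A minimal sketch, assuming the log uses Apache's combined log format, in which the user agent is the last double-quoted field on the line, might look like this:

```perl
use strict;
use warnings;

# Hypothetical implementation: returns the last double-quoted field of a
# log line (the user agent in the combined log format), or '' if none.
sub get_agent_field {
    my ($line) = @_;
    return $line =~ /"([^"]*)"\s*$/ ? $1 : '';
}

# Sample line made up for illustration.
my $line = '127.0.0.1 - - [10/Oct/2000:13:55:36 -0700] '
         . '"GET / HTTP/1.0" 200 2326 "http://example.com/" '
         . '"Mozilla/4.08 [en] (Win98; I ;Nav)"' . "\n";

print get_agent_field($line), "\n";
```

A log written in a different format would need a different extraction rule.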

An alternative approach is to use the qr// operator, which is used to compile a regex. The previous example can be rewritten as:

my @agents = qw(Mozilla Lynx MSIE AmigaVoyager lwp libwww);
my @compiled_re = map qr/$_/, @agents;
 
while (<ACCESS_LOG>) {
    my $agent = get_agent_field($_);
    my $ok = 0;
    for my $re (@compiled_re) {
        $ok = 1, last if $agent =~ $re; # match the agent field, not the whole line in $_
    }
    warn "Unknown Agent: $agent\n"
        unless $ok;
}

In this code, we compile the patterns once before we use them, similar to build_match_many_function( ) from the previous example, but now we save an extra call to a subroutine. A simple benchmark shows that this example is about 2.5 times faster than the previous one.
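The benchmark itself is not shown in the text. A comparison along these lines could be set up with the core Benchmark module. In this sketch the sample agent strings and the iteration count are made-up assumptions, and the ratio you measure will depend on your Perl version and data:

```perl
use strict;
use warnings;
use Benchmark qw(cmpthese);

# Helper from the text.
sub build_match_many_function {
    my @list = @_;
    my $expr = join '||',
        map { "\$_[0] =~ m/\$list[$_]/o" } (0..$#list);
    my $matchsub = eval "sub { $expr }";
    die "Failed in building regex @list: $@" if $@;
    return $matchsub;
}

my @agents          = qw(Mozilla Lynx MSIE AmigaVoyager lwp libwww);
my $known_agent_sub = build_match_many_function(@agents);
my @compiled_re     = map { qr/$_/ } @agents;

# Hypothetical sample data standing in for agent fields from a log.
my @samples = (
    'Mozilla/4.0 (compatible; MSIE 6.0)',
    'lwp-trivial/1.35',
    'UnknownBot/0.1',
);

# Count known agents with the eval-built subroutine.
sub count_known_sub {
    my $known = 0;
    for my $agent (@samples) {
        $known++ if $known_agent_sub->($agent);
    }
    return $known;
}

# Count known agents with the list of qr// objects.
sub count_known_qr {
    my $known = 0;
    for my $agent (@samples) {
        for my $re (@compiled_re) {
            $known++, last if $agent =~ $re;
        }
    }
    return $known;
}

cmpthese(200_000, {
    eval_sub => \&count_known_sub,
    qr_loop  => \&count_known_qr,
});
```

Both routines should report the same number of known agents. Only the timing differs.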


Created: March 27, 2003
Revised: July 23, 2003

URL: http://webreference.com/programming/perl/mod_perl/chap6/2