When using a regular expression containing an interpolated Perl
variable that you are confident will not change during the execution of the
program, a standard speed-optimization technique is to add the /o
modifier to the regex pattern. This compiles the regular expression once, for
the entire lifetime of the script, rather than every time the pattern is executed.
Consider:
    my $pattern = '^\d+$'; # likely to be input from an HTML form field
    foreach (@list) {
        print if /$pattern/o;
    }
This is usually a big win in loops over lists, or when using the
grep( ) or map( )
operators.
In long-lived mod_perl scripts and handlers, however, the variable may change with each invocation. In that case, this caching can pose a problem. The first request processed by a fresh mod_perl child process will compile the regex and perform the search correctly. However, all subsequent requests running the same code in the same process will use the cached pattern rather than the fresh one supplied by users. The code will appear to be broken.
Imagine that you run a search engine service, and one person enters a search keyword of her choice and finds what she's looking for. Then another person who happens to be served by the same process searches for a different keyword, but unexpectedly receives the same search results as the previous person.
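The effect is easy to reproduce outside mod_perl. The following is a minimal sketch in which a hypothetical search( ) subroutine stands in for a handler that is called once per request within the same process; the subroutine name and the sample data are assumptions made for illustration only.

```perl
use strict;
use warnings;

# Hypothetical stand-in for a handler: called once per "request",
# always inside the same long-lived process.
sub search {
    my ($pattern, @list) = @_;
    # /o compiles the regex the first time this match operator runs
    # and keeps reusing it, even if $pattern changes later.
    return grep { /$pattern/o } @list;
}

my @words = qw(apple banana cherry);

my @first  = search('an',  @words);   # compiles and caches /an/
my @second = search('err', @words);   # still matches against /an/

print "first:  @first\n";    # first:  banana
print "second: @second\n";   # second: banana
```

The second caller asked for err but receives the results for an, which is the behavior described above.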
There are two solutions to this problem.
The first solution is to use the eval q//
construct to force the code to be evaluated each time it's run. It's important
that the eval block covers the entire processing
loop, not just the pattern match itself.
The original code fragment would be rewritten as:
    my $pattern = '^\d+$';
    eval q{
        foreach (@list) {
            print if /$pattern/o;
        }
    };
If we were to write this:
    foreach (@list) {
        eval q{ print if /$pattern/o; };
    }
the regex would be compiled for every element in the list, instead
of just once for the entire loop over the list (and the /o
modifier would essentially be useless).
However, watch out for using strings coming from an untrusted
origin inside eval: they might contain Perl code
dangerous to your system, so make sure to sanity-check them first.
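To see the first solution at work in a repeatedly called routine, here is a minimal sketch. The search( ) subroutine and the sample data are hypothetical; the only part taken from the text is the eval q{} wrapped around the whole loop.

```perl
use strict;
use warnings;

# Hypothetical per-request routine using the eval q{} workaround.
sub search {
    my ($pattern, @list) = @_;
    my @hits;
    # The string is recompiled on every call, so the /o inside it
    # caches the regex only for the duration of this one loop.
    eval q{
        foreach (@list) {
            push @hits, $_ if /$pattern/o;
        }
    };
    die $@ if $@;
    return @hits;
}

my @words = qw(apple banana cherry);

my @first  = search('an',  @words);
my @second = search('err', @words);

print "first:  @first\n";    # first:  banana
print "second: @second\n";   # second: cherry
```

Because q{} does not interpolate, $pattern in this sketch is used only as a regex at match time and never becomes part of the evaluated source text.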
This approach can be used if there is more than one pattern-match
operator in a given section of code. If the section contains only one regex
operator (be it m// or s///),
you can rely on the property of the null pattern,
which reuses the last pattern seen. This leads to the second solution, which
also eliminates the use of eval.
The above code fragment becomes:
    my $pattern = '^\d+$';
    "0" =~ /$pattern/; # dummy match that must not fail!
    foreach (@list) {
        print if //;
    }
The only caveat is that the dummy match that boots the regular
expression engine must succeed; otherwise the pattern
will not be cached, and the // will match everything.
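The difference between a successful and a failed dummy match can be sketched as follows. The filter_numeric( ) helper and its $probe argument are hypothetical names introduced for this illustration, and the sketch assumes that no other pattern has matched successfully in the calling scope.

```perl
use strict;
use warnings;

# Hypothetical helper: $probe is the string used for the dummy match.
sub filter_numeric {
    my ($probe, @list) = @_;
    my $pattern = '^\d+$';
    my @hits;
    $probe =~ /$pattern/;        # dummy match, may succeed or fail
    foreach (@list) {
        push @hits, $_ if //;    # null pattern
    }
    return @hits;
}

my @list = qw(12 abc 7);

my @primed   = filter_numeric("0", @list);   # dummy match succeeds
my @unprimed = filter_numeric("x", @list);   # dummy match fails

print "primed:   @primed\n";     # primed:   12 7
print "unprimed: @unprimed\n";
```

With the failed dummy match, the second call would be expected to return every element, since // then behaves as a genuinely empty pattern.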
If you can't count on fixed text to ensure the match succeeds, you have two
options.
If you can guarantee that the pattern variable contains no metacharacters
(such as *, +, ^,
$, \d, etc.),
you can use the dummy match of the pattern itself:
$pattern =~ /\Q$pattern\E/; # guaranteed if no metacharacters present
The \Q modifier ensures that
any special regex characters will be escaped.
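The requirement that the pattern contain no metacharacters matters because the null pattern reuses the quoted form of the pattern. A minimal sketch, using a hypothetical search_literal( ) subroutine and made-up sample data:

```perl
use strict;
use warnings;

# Hypothetical routine that primes the null pattern with the
# \Q...\E form of the pattern itself.
sub search_literal {
    my ($pattern, @list) = @_;
    my @hits;
    # A non-empty string always matches its own quoted form,
    # so this dummy match succeeds.
    $pattern =~ /\Q$pattern\E/;
    foreach (@list) {
        push @hits, $_ if //;    # reuses the quoted pattern
    }
    return @hits;
}

my @plain = search_literal('an',  qw(apple banana cherry));
my @meta  = search_literal('a.c', qw(abc a.c));

print "plain: @plain\n";   # plain: banana
print "meta:  @meta\n";    # meta:  a.c
```

In the second call the dot is matched literally, so abc is not found. That is harmless when the pattern has no metacharacters, and it changes the meaning of the search when it does.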
If there is a possibility that the pattern contains metacharacters,
you should match the pattern itself, or the nonsearchable \377
character, as follows:
"\377" =~ /$pattern|^\377$/; # guaranteed if metacharacters present
Another technique may also be used, depending on the complexity of the regex to which it is applied. One common situation in which a compiled regex is usually more efficient is when you are matching any one of a group of patterns over and over again.
To make this approach easier to use, we'll use a slightly modified helper routine from Jeffrey Friedl's book Mastering Regular Expressions (O'Reilly):
    sub build_match_many_function {
        my @list = @_;
        my $expr = join '||',
            map { "\$_[0] =~ m/\$list[$_]/o" } (0..$#list);
        my $matchsub = eval "sub { $expr }";
        die "Failed in building regex @list: $@" if $@;
        return $matchsub;
    }
This function accepts a list of patterns as an argument, builds
a match regex for each item in the list against $_[0],
and uses the logical || (OR) operator to stop the
matching when the first match succeeds. The chain of pattern matches is then
placed into a string and compiled within an anonymous subroutine using eval.
If eval fails, the code aborts with die( );
otherwise, a reference to this subroutine is returned to the caller.
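It can help to look at the string that is actually handed to eval. The sketch below repeats the expression-building step outside the subroutine for a two-element list; the sample agent strings are made up for illustration.

```perl
use strict;
use warnings;

my @list = qw(Mozilla Lynx);

# Same expression-building step as in build_match_many_function()
my $expr = join '||',
    map { "\$_[0] =~ m/\$list[$_]/o" } (0..$#list);

print "$expr\n";
# prints: $_[0] =~ m/$list[0]/o||$_[0] =~ m/$list[1]/o

# Compile the chain into an anonymous subroutine that closes over @list
my $matchsub = eval "sub { $expr }";
die "Failed in building regex @list: $@" if $@;

my $hit  = $matchsub->('Lynx/2.8.5') ? 1 : 0;
my $miss = $matchsub->('curl/7.0')   ? 1 : 0;

print "Lynx: $hit, curl: $miss\n";   # Lynx: 1, curl: 0
```

Each match in the generated chain has its own /o, and the string is evaluated once per call to the builder, so every list of patterns gets its own set of cached regexes.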
Here is how it can be used:
    my @agents = qw(Mozilla Lynx MSIE AmigaVoyager lwp libwww);
    my $known_agent_sub = build_match_many_function(@agents);
    while (<ACCESS_LOG>) {
        my $agent = get_agent_field($_);
        warn "Unknown Agent: $agent\n"
            unless $known_agent_sub->($agent);
    }
This code takes lines of log entries from the access_log
file already opened on the ACCESS_LOG file handle,
extracts the agent field from each entry in the log file, and tries to match
it against the list of known agents. Every time the match fails, it prints a
warning with the name of the unknown agent.
An alternative approach is to use the qr//
operator, which is used to compile a regex. The previous example can be rewritten
as:
    my @agents = qw(Mozilla Lynx MSIE AmigaVoyager lwp libwww);
    my @compiled_re = map qr/$_/, @agents;
    while (<ACCESS_LOG>) {
        my $agent = get_agent_field($_);
        my $ok = 0;
        for my $re (@compiled_re) {
            # match the agent field, as in the previous example
            $ok = 1, last if $agent =~ /$re/;
        }
        warn "Unknown Agent: $agent\n"
            unless $ok;
    }
In this code, we compile the patterns once before we use them,
similar to build_match_many_function( ) from the
previous example, but now we save an extra call to a subroutine. A simple benchmark
shows that this example is about 2.5 times faster than the previous one.
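A comparison along these lines can be sketched with the standard Benchmark module. The sample agent strings are made up, and this is only one plausible way to set up such a measurement, not necessarily the benchmark referred to above; the ratio will vary with the Perl version, the patterns, and the data.

```perl
use strict;
use warnings;
use Benchmark qw(cmpthese);

my @agents = qw(Mozilla Lynx MSIE AmigaVoyager lwp libwww);

# Approach 1: generated closure, as in build_match_many_function()
sub build_match_many_function {
    my @list = @_;
    my $expr = join '||',
        map { "\$_[0] =~ m/\$list[$_]/o" } (0..$#list);
    my $matchsub = eval "sub { $expr }";
    die "Failed in building regex @list: $@" if $@;
    return $matchsub;
}
my $known_agent_sub = build_match_many_function(@agents);

# Approach 2: precompiled qr// objects tried in a loop
my @compiled_re = map qr/$_/, @agents;

# Made-up sample data standing in for agent fields from a log
my @sample = (
    'Mozilla/4.0 (compatible; MSIE 6.0)',
    'Lynx/2.8.4',
    'libwww-perl/5.65',
    'SomeUnknownBot/1.0',
);

# Sanity check: both approaches should agree before being timed
my $mismatches = 0;
for my $agent (@sample) {
    my $by_sub = $known_agent_sub->($agent) ? 1 : 0;
    my $by_qr  = 0;
    for my $re (@compiled_re) {
        $by_qr = 1, last if $agent =~ $re;
    }
    $mismatches++ if $by_sub != $by_qr;
}
print "mismatches: $mismatches\n";   # mismatches: 0

# Run each variant for about one CPU second and print a comparison table
cmpthese(-1, {
    closure => sub { $known_agent_sub->($_) for @sample },
    qr_loop => sub {
        for my $agent (@sample) {
            my $ok = 0;
            for my $re (@compiled_re) {
                $ok = 1, last if $agent =~ $re;
            }
        }
    },
});
```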
Created: March 27, 2003
Revised: July 23, 2003
URL: http://webreference.com/programming/perl/mod_perl/chap6/2