Regular Expressions: Matching Rules - Doc JavaScript | WebReference

Regular Expressions: Matching Rules - Doc JavaScript

Unix Regular Expressions

Matching Rules

The Perl Engine uses a nondeterministic finite-state automation (NFA) to find a match. That is, it keeps track of what it has tried and what it hasn't. When a pattern doesn't match, the engine backs up and tries something else. This is called backtracking.

The order of the rules specifies by which order the Engine tries matching. The phrase "left-most, longest match" means that overall Perl prefers the left-most over longest.

As the rules are defined recursively, it may take some time getting used to them. These rules are based on the popular "Programming Perl" book.

Rule 1

The Engine tries to match as far left in the string as it can, so that the entire regular expression (not necessarily the entire string) matches according to Rule 2. In order to do this, its first choice is to start just before the first character, and then try to match the entire regular expression. The regular expression matches if and only if the Engine reaches the end of the regular expression before it runs off the end of the string. If it matches, it quits immediately. Otherwise, the process continues. The match doesn't have to reach the end of the string, unless there's an assertion in the regular expression that instructs it to do so. Before advancing to the second position (between the first and second characters), the Engine tries all of the possibilities, from right to left. If it exhausts all possibilities at the first position, it realizes that its very first choice was wrong, and proceeds to its second choice. It goes to the second position in the string and tries all the possibilities again. The pattern match as a whole doesn't fail until it has tried to match the entire regular expression at every position in the string.

The positions the Engine tries to match are between the characters. Consider the pattern /t*/, which can match zero or more t's. If you try this pattern on the string "Bart" it matches the null string before the first character of the string, rather than matching the "t" later in the string. Any regular expression that can match the null string (zero-width) is guarenteed to match at the leftmost position in the string. The following sequence of strings shows how the engine matches the pattern /ar/ to the string "Bart". The red code highlights the substring that the Engine is testing against the pattern:

Bart // Does "Bart" match? ... NO
Bart // Does "t" match? ... NO
Bart // Does "rt" match? ... NO
Bart // Does "art" match? ... NO
Bart // Does "" (the null string) match? ... NO
Bart // Does "art" match? ... NO
Bart // Does "ar" match? ... YES

Rule 2

For this rule, the whole regular expression is regarded as a set of alternatives, separated by vertical bars (|). A regular expression can obviously consist of only one alternative, in which case it does not have any vertical bar. There's no such thing as zero alternatives because a null string can always match something of zero width.

A set of alternatives matches a string if any of the alternatives match under Rule 3. It tries the alternatives from left-to-right, and stops on the first match that allows successful completion of the entire regular expression. If none of the alternatives match, the Engine backtracks to the Rule that invoked this Rule, which is usually Rule 1, but could be Rule 4 or 6. That Rule then looks for a new position at which to apply Rule 2.

Rule 3

Any specific alternative matches if every item in the alternatives matches sequentially according to Rules 4 and 5. An item can consist of an assertion or a quantified atom. If the item cannot be matched in order (the literal order), the Engine backtracks to the next alternative according to Rule 2 (if no alternatives are left, there is no match).

Sequential items aren't separated by any sort of punctuation. They are simply specified in the order they must match. Take a look at the following pattern:


It consists of five items to be matched in order. The first one is a zero-width assertion (matches the beginning of a string), and the other four are simple characters that match themselves sequentially.

Items that have multiple choices (such as "match one or more...") are given "pecking-order" from left to right. This means that in a pattern like:


x picks one way to match, and then y tries all its ways to match. If that fails, x picks another way to match, and then y tries all its ways again. This process continues until a match is found or until no options remain. Think of this as scanning a multi-dimensional array, where y reflects the counter of the inner loop, and x reflects the counter of the outer loop. Since y varies faster, one can borrow a phrase from multi-dimensional arrays: "The items to the right vary faster."

Rule 4

If an assertion does not match according to the following table, the Engine backtracks to Rule 3 and tries a higher-pecking-order item with different choices.

^Matches at the beginning of the string.
$Matches at the end of the string.
\bMatches a word boundary (between \w and \W), when not inside [].
\BMatches a non-word boundary.

Note that Perl supports several other assertions, such as \A and \Z, which match the beginning and end of a string, respectively. However, these assertions are not supported by Navigator 4.0x or Internet Explorer 4.0.

The $ assertion can match at the end of the string, or one character earlier than that, if the last character of the string is a new-line.

Rule 5

A quantified atom matches only if the atom itself matches some number of times allowed by the following quantifier. Multiple matches must be adjacent within the string. If no match can be found at a current position for any allowed quantity of the atom, the Engine backtracks to Rule 3 and tries higher-pecking-order items with different choices.

In JavaScript's current implementation of regular expressions, when using a flexible quantifier where the number of matches is not exact, you cannot instruct the Engine to start with the minimal number of matches and work up. The Engine always starts with the maximal number of matches and works down. The quantifiers try the biggest quantity first ("greedy matching"). The Engine first counts up to find out how many atoms it's possible to match in a row in the current string. When a choice fails, the Engine backtracks to the previous choice (the shorter one). Take a look at the following regexp:


In this case the Engine tries to match the maximum number of any character (. represents "any character") until the end of the line, before it tries looking for "abc". When "abc" doesn't match there (and it can't, because there's not enough room for it at the end of the string), the Engine backs off one character at a time until it finds "abc". Once it finds "abc", it stops looking for a possible shorter match. Here's the complete table of quantifiers:

{m,n}Must occur at least m times, but not more than n times.
{n,}Must occur at least n times.
{n}Must occur exactly n times.
*Must occur 0 or more times (same as {0,}).
+Must occur 1 or more times (same as {1,}).
?Must occur 0 or 1 time (same as {0,1}).

Rule 6

Each atom matches according to its type. If the atom doesn't match, or doesn't allow a match of the rest of the regular expression, the Engine backtracks to Rule 5 and tries the next choice for the atom's quantity.

Attoms match according to the following types:

  • A regexp in parentheses, (...), matches whatever the regular expression, ..., matches under Rule 2.
  • A "." matches any character, so .*, for example, matches any number of don't-care characters.
  • A list of characters in square brackets (called a character class) matches any one of the characters in the list. A caret (^) at the beginning of the list causes it to match only characters that are not in the list. The order of characters in the list doesn't matter. You may use a hyphen as a range delimiter, as in [a-z0-9], which is equivalent to \w (keep reading to find out more about this character). However, if you want to match a caret, don't put it first. If you want to match a right bracket (]), don't put it last. Another option is to backslash these characters. Most characters loose their meta-ness inside square brackets. Since each character in the brackets is interpreted individually, it is meaningless to specify alternation with the vertical bar. A vertical bar is simply a vertical bar in a character class. So [fee|fie|foe] means the same as [feio|]. The following regular expression matches a string consisting of at least one character that is not a space (and no spaces):

    /$[^ ]^/
    In this example the first caret matches the beginning of the string, whereas the second one indicates that the character class matches any character but those inside it (a space in this case).
  • A backslashed character matches a special character or a character class (more than one character). Here's a list of special characters:

    \rCarriage return
    \vVertical tab
    \dA digit (same as [0-9])
    \DA non-digit (same as [^0-9])
    \wA word (alphanumeric) character (same as [a-zA-Z_0-9])
    \WA non-word character (same as [^a-zA-Z_0-9])
    \sA whitespace character (same as [ \t\v\n\r\f])
    \SA non-whitespace character (same as [^ \t\v\n\r\f])
  • \#, where # is an integer, is a backreference to the last substring matching the # parenthetical in the regular expression (counting left parentheses). The following regular expression matches "Bart, Lisa," in the string "Bart, Lisa, Maggie":

    Note that paired parentheses are numbered by counting left parentheses from the left.
  • A backslashed two- or three-digit octal number such as \033 matches the character with the specified value, unless it can be interpreted as a backreference. Priority is given to backreferences in this case.
  • A backslashed x followed by one or two hexadecimal digits ([0-9a-fA-F]{1,2}), such as \x7f, matches the character with that hex value.
  • Any other backslashed character matches itself, and so does any character not mentioned above.

Created: October 23, 1997, 1997
Revised: December 4, 1997