The XML Schema Companion | WebReference

The XML Schema Companion. Chapter 15: Patterns

Reproduced from Neil Bradley's The XML Schema Companion by permission of Addison-Wesley. ISBN 0321136179, copyright 2004. All rights reserved. See http://www.awprofessional.com/titles/0321136179 for more information.

The ‘pattern’ facet requires more explanation than the brief description given in Section 14.6 provides. This XML Schema feature is based on the regular expression capabilities of the Perl programming language. It is therefore very powerful, but this strength comes at the cost of some complexity.

15.1 Introduction

Although the XML Schema language has a large number of built-in data types that can be used, restricted, and extended, some requirements demand much finer control over the exact structure of a value. For example, a simple code might need to consist of three lowercase letters:

<Code>abc</Code>    <!-- OK -->
<Code>ABC</Code>    <!-- ERROR -->
<Code>abcd</Code>   <!-- ERROR -->

Similarly, when an element or attribute contains an ISBN (International Standard Book Number), it should be possible to apply constraints that reflect the nature of ISBN codes. All ISBN codes are composed of three identifiers (location, publisher, and book) and a check digit, separated by hyphens (or spaces). Valid values would include ‘0-201-41999-8’ and ‘963-9131-21-0’. The schema processor should detect any error in an ISBN attribute:

<Book ISBN="0-201-77059-8" ...>   <!-- OK -->
<Book ISBN="X-999999-" ...>       <!-- ERRORS -->

Some programming languages, such as Perl, include a regular expression language, which defines a pattern against which a series of characters can be compared. Typically, this feature is used to search for fragments of a text document, but the XML Schema language has co-opted it for sophisticated validation of element content and attribute values.

15.2 Simple Templates

The pattern facet element holds a pattern in its value attribute. The simplest possible form of pattern is a series of characters that must be present, in the order specified, in the value of each element or attribute that uses the data type constrained by the pattern facet.

The pattern ‘abc’ might be specified as the fixed value of a Code element:

<Code>abc</Code>

The pattern ‘0-201-41999-8’ might be specified as the fixed value of an ISBN attribute:

<Book ISBN="0-201-41999-8" ... >
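The two fixed-value patterns above could be declared along the following lines. This is only a minimal sketch: it assumes the XML Schema namespace is the default namespace (so the schema elements need no prefix, as in the fragments in this chapter), it uses anonymous types, and the Book declaration is simplified to allow nothing but the ISBN attribute.

```xml
<schema xmlns="http://www.w3.org/2001/XMLSchema">

  <!-- Code element: the content must be exactly 'abc' -->
  <element name="Code">
    <simpleType>
      <restriction base="string">
        <pattern value="abc" />
      </restriction>
    </simpleType>
  </element>

  <!-- Book element (simplified for illustration): the ISBN attribute
       must be exactly '0-201-41999-8' -->
  <element name="Book">
    <complexType>
      <attribute name="ISBN">
        <simpleType>
          <restriction base="string">
            <pattern value="0-201-41999-8" />
          </restriction>
        </simpleType>
      </attribute>
    </complexType>
  </element>

</schema>
```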

In this simple form, a pattern is similar to an enumeration, except that in the case of patterns the match must be exact, character for character, regardless of the data type used (recall that Section 14.6 explains how patterns differ from enumerations in this respect).

Although specifying an exact sequence of characters is among the simplest things that can be achieved with the pattern language, specifying a sequence of characters that must not appear in a value is much harder.

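It is often a good idea to use the ‘token’ data type as the base data type for the restriction when the presence of surrounding whitespace should not be allowed to trigger an error. The ‘normalizedString’ data type is a weaker alternative: it converts tabs and line breaks into spaces, but leaves leading and trailing spaces in place.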
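The effect of the base data type can be sketched by applying the same pattern to two different bases. The element names StrictCode and LooseCode are hypothetical, chosen only for this illustration, and the schema namespace is again assumed to be the default namespace.

```xml
<schema xmlns="http://www.w3.org/2001/XMLSchema">

  <!-- Hypothetical element: base 'string' preserves all whitespace,
       so the value must be exactly the three characters 'abc' -->
  <element name="StrictCode">
    <simpleType>
      <restriction base="string">
        <pattern value="abc" />
      </restriction>
    </simpleType>
  </element>

  <!-- Hypothetical element: base 'token' strips surrounding whitespace
       before the pattern is tested -->
  <element name="LooseCode">
    <simpleType>
      <restriction base="token">
        <pattern value="abc" />
      </restriction>
    </simpleType>
  </element>

</schema>
```

With these declarations, a StrictCode element containing ‘ abc ’ (with a space on either side) should be rejected, whereas a LooseCode element with the same content should be accepted.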

Just as a restriction element can contain multiple enumeration elements, it can also contain multiple pattern elements. The element content or attribute value is valid if it matches any of the patterns:

<restriction base="token">
<pattern value="abc" />
<pattern value="xyz" />
</restriction>

<Code>abc</Code>     <!-- OK -->
<Code>xyz</Code>     <!-- OK -->
<Code> abc </Code>   <!-- OK -->
<Code>acb</Code>     <!-- ERROR -->
<Code>xzy</Code>     <!-- ERROR -->
<Code>abcc</Code>    <!-- ERROR -->

Alternatively, a single pattern can contain multiple ‘branches’. Each branch is a distinct, alternative expression, separated from the previous or following branch by the ‘|’ symbol. Again, the pattern test succeeds if the value matches any one of the branches (the ‘|’ symbol is therefore performing a function similar to its use in DTD content models). The following example is equivalent to the multipattern example above:

<restriction base="token">
<pattern value="abc|xyz" />
</restriction>
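To make the comparison concrete, the two forms can be placed side by side in one schema. This is a sketch only: the element names CodeA and CodeB are hypothetical, both restrictions are given the same ‘token’ base so that whitespace is handled identically, and the schema namespace is assumed to be the default namespace.

```xml
<schema xmlns="http://www.w3.org/2001/XMLSchema">

  <!-- Hypothetical element: two pattern facets in one restriction;
       the value is valid if it matches either of them -->
  <element name="CodeA">
    <simpleType>
      <restriction base="token">
        <pattern value="abc" />
        <pattern value="xyz" />
      </restriction>
    </simpleType>
  </element>

  <!-- Hypothetical element: one pattern facet with two branches -->
  <element name="CodeB">
    <simpleType>
      <restriction base="token">
        <pattern value="abc|xyz" />
      </restriction>
    </simpleType>
  </element>

</schema>
```

Both declarations should accept ‘abc’ and ‘xyz’ and reject values such as ‘acb’ or ‘abcc’.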

Note that, although branches are never essential at this level, because multiple pattern elements can be used instead, they are the only technique available in another circumstance discussed later (involving subexpressions).


Created: March 27, 2003
Revised: January 1, 2004

URL: http://webreference.com/programming/awxml1