WebReference.com - Part 1 of Chapter 1: Professional XML Schemas, from Wrox Press Ltd (1/5)
Professional XML Schemas
Why do we Need Schemas?
XML is intended to be a self-describing data format, allowing authors to define a set of element and attribute names that describe the content of a document. Because XML gives the author such flexibility, we need to be able to define which element and attribute names are allowed to appear in a conforming document in order to make that document useful. Furthermore, we need to be able to indicate what sort of content each of these elements and attributes is allowed to contain. Only then can people share the meaning of the markup used in an XML document, be it for human or application consumption.
Sometimes authors require flexibility in what markup they can use to describe a document's content, while at other times they may be forced to adopt a very specific structure. For example, if we were working on an application for a publishing company, we might define a set of elements such as Book, Chapter, Heading1, Heading2, Heading3, Paragraph, Table, CrossReference, and Diagram. Each Book element would be allowed to contain any number of Chapter elements, which in turn would contain heading and Paragraph elements. The Paragraph elements may then contain text, tables, cross references, and diagrams. In such a case, the people marking up the book's content need a flexible way of indicating what information is held within each element, as no two books are going to have exactly the same content. By contrast, if we were writing an e-commerce system, it would be the job of an application, rather than a human, to create and process the XML documents. Each part of the process would require a different type of document: one structure for catalogs, one for purchase orders, one for receipts, and so on. In such situations, rather than there being a requirement for flexibility, the application would expect a predictable, rigid structure; it would need certain pieces of information in order to fulfill any given task.
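To make the contrast concrete, here are two small instance documents. They are only illustrative sketches: the element names in the first come from the publishing example above, while the target and source attributes, and every name in the purchase order, are assumptions invented for this illustration. The first document mixes text and markup freely inside a Paragraph, as a human author might:

```xml
<?xml version="1.0"?>
<!-- Document-style markup: text, cross references, and diagrams may
     appear in any order and any number inside a Paragraph. -->
<Book>
  <Chapter>
    <Heading1>Why Do We Need Schemas?</Heading1>
    <Paragraph>XML lets authors choose their own element names, as
      <CrossReference target="ch02">Chapter 2</CrossReference> explains.
      <Diagram source="schema-overview.png"/>
    </Paragraph>
  </Chapter>
</Book>
```

The second is the kind of document an application might generate, where every item carries the same fields in the same order:

```xml
<?xml version="1.0"?>
<!-- Data-style markup: a fixed, predictable structure. -->
<PurchaseOrder>
  <OrderDate>2001-10-18</OrderDate>
  <Item>
    <ProductCode>BK-1001</ProductCode>
    <Quantity>2</Quantity>
    <UnitPrice>49.99</UnitPrice>
  </Item>
</PurchaseOrder>
```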
As XML becomes more widely used in applications, there is an increasing demand for support of primitive datatypes found in languages like SQL, Java, Visual Basic or C++ (the concepts of strings, dates, integers, and so on). XML Schema introduces a powerful type mechanism that not only allows us to specify primitive datatypes, but also types of structures, allowing us to integrate principles of object-oriented development such as inheritance into our schemas.
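As a rough sketch of what this type mechanism looks like in practice, the following schema describes a hypothetical purchase order. The element names (PurchaseOrder, OrderDate, Item, and so on) are assumptions made up for this illustration; the datatypes xs:date, xs:string, xs:positiveInteger, and xs:decimal are built into XML Schema.

```xml
<?xml version="1.0"?>
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <xs:element name="PurchaseOrder">
    <!-- A type of structure: an ordered sequence of child elements -->
    <xs:complexType>
      <xs:sequence>
        <!-- Built-in primitive datatype: a calendar date -->
        <xs:element name="OrderDate" type="xs:date"/>
        <!-- One or more Item elements -->
        <xs:element name="Item" maxOccurs="unbounded">
          <xs:complexType>
            <xs:sequence>
              <xs:element name="ProductCode" type="xs:string"/>
              <xs:element name="Quantity" type="xs:positiveInteger"/>
              <xs:element name="UnitPrice" type="xs:decimal"/>
            </xs:sequence>
          </xs:complexType>
        </xs:element>
      </xs:sequence>
    </xs:complexType>
  </xs:element>
</xs:schema>
```

With a schema like this, a validating processor would reject an OrderDate of "yesterday" or a Quantity of "two", checks that a DTD cannot express.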
A schema defines the allowable contents of a class of XML documents. A class of documents refers to all possible permutations of structure in documents that will still conform to the rules of the schema.
Background to XML Schemas
When XML was created, it was written as a simplified form of an existing markup language, called SGML, which was used for document markup. SGML, however, was so complex that it was not widely adopted, and browser manufacturers made it clear that they were not going to support it in their products. The simpler relative, XML, became a popular alternative, and was soon adopted by all kinds of programmers, not just those involved in document markup. When XML 1.0 became a W3C recommendation, it contained a mechanism for constraining the allowable content of a class of XML documents, which you are probably familiar with, in the form of Document Type Definitions or DTDs. The syntax of DTDs, however, fell short of the requirements of those who were putting XML to new uses, in particular data transfer, and as a result the W3C set out to create an alternative schema language, namely XML Schema.
The W3C XML Schema Working Group has had the incredibly tough task of creating a schema specification that would satisfy a wide range of users, from programmers to content architects, many of whom have been waiting for XML Schema with much anticipation because they see it as a much more powerful way to define document structures. Indeed, it has been a long time in coming, and there was a gap of over two years between the working group releasing a set of requirements they aimed to achieve with the new schema language, back in February 1999, and the recommendation's release in May 2001.
In the time the W3C have taken to release the XML Schema Recommendation, a number of alternative schema technologies have been released. While XML Schema is likely to achieve wide support because of its endorsement by the W3C, the competing technologies offer alternative approaches to constraining the allowable contents of an XML document. This book mainly focuses on the W3C XML Schema Recommendation, although we do look at some of the other schema efforts in Chapter 14.
The aims of the W3C XML Schema Working Group were to create a schema language that would be more expressive than DTDs and written in XML syntax. It would also allow authors to place restrictions on the allowable element content and attribute values in terms of primitive datatypes found in languages such as SQL and Java.
In terms of defining structure of documents, the aims included:
- Providing mechanisms for constraining document structures and content
- Allowing tighter or looser constraints upon classes of documents than those offered by DTDs
- The ability to validate documents composed from markup belonging to multiple namespaces
- Mechanisms to enable inheritance for element, attribute, and datatype definitions, so that they can formally represent kind-of relations (for example, a car is a kind-of vehicle)
- A mechanism for embedded documentation
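Two of these aims, inheritance and embedded documentation, can be sketched in a few lines of schema. The example below is a minimal illustration of the "a car is a kind-of vehicle" relation; the type and element names (VehicleType, CarType, Manufacturer, Wheels, Doors) are hypothetical and chosen only for this sketch.

```xml
<?xml version="1.0"?>
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">

  <!-- Base type, with documentation embedded in the schema itself -->
  <xs:complexType name="VehicleType">
    <xs:annotation>
      <xs:documentation>Properties shared by every kind of vehicle.</xs:documentation>
    </xs:annotation>
    <xs:sequence>
      <xs:element name="Manufacturer" type="xs:string"/>
      <xs:element name="Wheels" type="xs:positiveInteger"/>
    </xs:sequence>
  </xs:complexType>

  <!-- Derived type: a car is a kind-of vehicle. It inherits the
       Manufacturer and Wheels elements and appends Doors after them. -->
  <xs:complexType name="CarType">
    <xs:complexContent>
      <xs:extension base="VehicleType">
        <xs:sequence>
          <xs:element name="Doors" type="xs:positiveInteger"/>
        </xs:sequence>
      </xs:extension>
    </xs:complexContent>
  </xs:complexType>

  <xs:element name="Vehicle" type="VehicleType"/>
  <xs:element name="Car" type="CarType"/>
</xs:schema>
```

A valid Car element under this sketch contains Manufacturer, Wheels, and Doors, in that order.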
In terms of offering primitive data typing, the aims included:
- Support for primitive datatypes such as byte, date, and integer, as found in languages like SQL and Java
- Definition of a type system that would support import and export of data as XML to and from relational, object and OLAP database systems
- The ability to allow users to define their own datatypes that derive from existing datatypes by constraining certain of their properties, such as range and length
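The last aim, deriving new datatypes by constraining range and length, might look like the following sketch. The type names QuantityType and ProductCodeType, and the particular limits, are assumptions chosen for illustration; xs:restriction and the minInclusive, maxInclusive, and length facets are part of XML Schema.

```xml
<?xml version="1.0"?>
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">

  <!-- Range constraint: an integer from 1 to 100 inclusive -->
  <xs:simpleType name="QuantityType">
    <xs:restriction base="xs:integer">
      <xs:minInclusive value="1"/>
      <xs:maxInclusive value="100"/>
    </xs:restriction>
  </xs:simpleType>

  <!-- Length constraint: a string of exactly seven characters -->
  <xs:simpleType name="ProductCodeType">
    <xs:restriction base="xs:string">
      <xs:length value="7"/>
    </xs:restriction>
  </xs:simpleType>

  <xs:element name="Quantity" type="QuantityType"/>
  <xs:element name="ProductCode" type="ProductCodeType"/>
</xs:schema>
```

Under these definitions, a Quantity of 2 and a ProductCode of BK-1001 would be valid, while a Quantity of 0 or a ProductCode of BK-1 would not.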
The full requirements can be seen at: http://www.w3.org/TR/NOTE-xml-schema-req
The result is a powerful and flexible language for expressing permissible content of a class of XML documents. The added capabilities, however, come at a cost: the resulting language is complicated, especially when we begin to experiment with its more advanced aspects.
Created: October 18, 2001
Revised: October 18, 2001