XQuery from the Experts: Influences on the design of XQuery
The need for an XML Query language
Early in its history, the XML Query Working Group confronted the question of
whether XML is sufficiently different from other data formats to require a query
language of its own. The SQL language is a very well established standard for
retrieving information from relational databases and has recently been enhanced
with new facilities called "structured types" that support nested structures
similar to the nesting of elements in XML. If SQL could be further extended
to meet XML query requirements, developers could leverage their considerable
investment in SQL implementations, and users could apply the features of these
robust and mature systems to their XML databases without learning a completely
new language.
Given these incentives, the working
group conducted a study of the differences between XML data and relational data
from the point of view of a query language. Some of the significant differences
between the two data models are summarized below.
Relational data is "flat" --
that is, organized in the form of a two-dimensional array of rows and columns.
In contrast, XML data is "nested", and its depth of nesting can be irregular
and unpredictable. Relational databases can represent nested data structures
by using structured types or tables with foreign keys, but it is difficult
to search these structures for objects at an unknown depth of nesting. In
XML, on the other hand, it is very natural to search for objects whose position
in a document hierarchy is unknown. An example of such a query might be "Find
all the red things", represented in the XPath language by the expression
//*[@color = "Red"]. This query would be much more difficult to represent in a
relational query language.
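As a rough illustration of searching at an unknown depth, the following Python sketch runs the equivalent of the "red things" query with the standard library's ElementTree module, which supports only a subset of XPath. The sample document, its element names, and the variable names are hypothetical and chosen only for this example.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document: the element names and nesting are invented
# for illustration; only the color attribute matters to the query.
DOC = """
<warehouse>
  <shelf>
    <fruit color="Red">cherry</fruit>
    <box>
      <flag color="Red"/>
      <sign color="Yellow">yield</sign>
    </box>
  </shelf>
  <sign color="Red">stop</sign>
</warehouse>
"""

root = ET.fromstring(DOC)

# ".//*[@color='Red']" is the ElementTree spelling of //*[@color = "Red"]:
# any element, at any depth below the root, whose color attribute is "Red".
red_things = root.findall(".//*[@color='Red']")

# Matches come back in document order, regardless of how deeply they nest.
red_tags = [elem.tag for elem in red_things]
print(red_tags)
```

The query itself never mentions shelf or box, so it keeps working if the nesting of the document changes.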
Relational data is regular and
homogeneous. Every row of a table has the same columns, with the same names
and types. This allows metadata -- information that describes the
structure of the data -- to be removed from the data itself and stored in
a separate catalog. XML data, on the other hand, is irregular and heterogeneous.
Each instance of a Web page or a book chapter can have a different structure
and must therefore describe its own structure. As a result, the ratio of metadata
to data is much higher in XML than in a relational database, and in XML the
metadata is distributed throughout the data in the form of tags rather than
being separated from the data. In XML, it is natural to ask queries that span
both data and metadata, such as “What kinds of things in the 2002 inventory
have color attributes?”, represented in XPath by the expression
/inventory[@year = "2002"]/*[@color]. In a relational language, such a query would require
a join that might span several data tables and system catalog tables.
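The inventory query above can be approximated in Python with ElementTree, as a sketch under stated assumptions: the sample inventory and its element names are hypothetical, and because ElementTree paths are evaluated relative to the root element, the test on the year attribute is written as an ordinary attribute check rather than as a path predicate.

```python
import xml.etree.ElementTree as ET

# Hypothetical inventory: element names are invented for illustration.
DOC = """
<inventory year="2002">
  <cherry color="Red" weight="5"/>
  <ladder height="3"/>
  <flag color="Blue"/>
  <cherry color="Black"/>
</inventory>
"""

root = ET.fromstring(DOC)

# Equivalent of /inventory[@year = "2002"]/*[@color]: check the root
# element first, then select its children that carry a color attribute.
if root.tag == "inventory" and root.get("year") == "2002":
    colored = root.findall("./*[@color]")
else:
    colored = []

# The answer to "what kinds of things" is metadata: the element names.
kinds = sorted({elem.tag for elem in colored})
print(kinds)
```

Here the element names, which play the role of metadata, are read from the same tree as the attribute values, with no separate catalog involved.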
Like a stored table, the result
of a relational query is flat, regular, and homogeneous. The result of an
XML query, on the other hand, has none of these properties. For example, the
result of the query “Find all the red things” may contain a cherry, a flag,
and a stop sign, each with a different internal structure. In general, the
result of an expression in an XML query may consist of a heterogeneous sequence
of elements, attributes, and primitive values, all of mixed type. This sequence
of items might then serve as an intermediate result used in the processing
of a higher-level expression. The heterogeneous nature of XML data conflicts
with the SQL assumption that every expression inside a query returns an array
of rows and columns. It also requires a query language to provide constructors
that are capable of creating complex nested structures on the fly -- a facility
that is not needed in a relational language.
Because of its regular structure,
relational data is “dense” -- that is, every row has a value in every
column. This gave rise to the need for a “null value” to represent unknown
or inapplicable values in relational databases. XML data, on the other hand,
may be “sparse.” Since all the elements of a given type need not have
the same structure, information that is unknown or inapplicable can simply
not appear. This gives an XML query language additional degrees of freedom
for dealing with missing data.
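A small Python sketch may make the contrast with null values concrete. It assumes a hypothetical list of people in which one person has no phone element; the document and all names in it are invented for the example.

```python
import xml.etree.ElementTree as ET

# Hypothetical data: Bob simply has no <phone> element.
DOC = """
<people>
  <person><name>Ann</name><phone>555-0100</phone></person>
  <person><name>Bob</name></person>
</people>
"""

root = ET.fromstring(DOC)

# findall returns a list of matches. For Bob the list is empty, so the
# missing information needs no placeholder value in the data.
phones = {
    person.findtext("name"): [ph.text for ph in person.findall("phone")]
    for person in root.findall("person")
}
print(phones)

# Asking "who has no phone?" is a test for an empty result.
without_phone = [name for name, numbers in phones.items() if not numbers]
print(without_phone)
```

The empty list here plays a role similar to the empty sequence in an XML query language: the absence of data is represented by the absence of items.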
In a relational database, the
rows of a table are not considered to have an ordering other than the orderings
that can be derived from their values. XML documents, on the other hand, have
an intrinsic order that can be important to their meaning and cannot be derived
from data values. This has several implications for the design of a query
language. It means that queries must at least provide an option in which the
original order of elements is preserved in the query result. It means that
facilities are needed to search for objects on the basis of their order, as
in “Find the fifth red object” or “Find objects that occur after this
one and before that one.” It also means that we need facilities to impose
an order on sequences of objects, possibly at several levels of a hierarchy.
The importance of order in XML contrasts sharply with the absence of intrinsic
order in the relational data model.
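The order-based searches mentioned above can be sketched in Python as follows. The sample document and the id attribute are hypothetical, and list indexing and slicing stand in for the positional facilities of a query language (in XPath, the first search could be written (//*[@color = "Red"])[5]).

```python
import xml.etree.ElementTree as ET

# Hypothetical document: the id attributes only label the items so that
# the results are easy to read.
DOC = """
<items>
  <item id="1" color="Red"/>
  <item id="2" color="Blue"/>
  <item id="3" color="Red"/>
  <item id="4" color="Red"/>
  <item id="5" color="Green"/>
  <item id="6" color="Red"/>
  <item id="7" color="Red"/>
  <item id="8" color="Red"/>
</items>
"""

root = ET.fromstring(DOC)
children = list(root)  # children in document order

# "Find the fifth red object": document order decides which one is fifth.
reds = root.findall("./*[@color='Red']")
fifth_red = reds[4]
print(fifth_red.get("id"))

# "Find objects that occur after this one and before that one."
start = children.index(reds[1])   # the item with id 3
end = children.index(fifth_red)   # the item with id 7
between_ids = [elem.get("id") for elem in children[start + 1:end]]
print(between_ids)
```

Neither answer can be derived from the attribute values alone; both depend on where the items sit in the document.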
The significant data model differences
summarized above led the working group to decide that the objectives of XML
queries could best be served by designing a new query language rather than by
extending a relational language. Designing a query language for XML, however,
is not a small task, precisely because of the complexity of XML data. An XML
“value,” computed by a query expression, may consist of zero, one, or many
items, each of which may be an element, an attribute, or a primitive value.
Therefore, each operator in an XML query language must be well defined for all
these possible inputs. The result is likely to be a language with a more complex
semantic definition than that of a relational language such as SQL.
Basic principles
The XML Query Working Group did not draw up a formal list of the principles
that guided the design of XQuery. Nevertheless, throughout the design process,
a reasonably stable consensus existed in the working group about at least some
of the principles that should underlie the design of an XML query language.
Some of these principles were mandated by the charter of the working group,
and others arose from strongly held convictions of its members. The following
list is my own attempt to enumerate the basic ideas and principles that were
most influential in shaping the XQuery language. Tension exists among some of
these principles, and several design decisions were the result of an attempt
to find a reasonable compromise among conflicting principles.
Compositionality:
Perhaps the longest-standing principle in the design of XQuery is that XQuery
should be a functional language incorporating the principle of compositionality.
This means that XQuery consists of several kinds of expressions, such as path
expressions, conditional expressions, and element constructors, that can be
composed with full generality. The result of any expression can be used as
the operand of another expression. No syntactic constraints are imposed on
the ways in which expressions can be composed (though the language does have
some semantic constraints). Each expression returns a value that depends
only on the operands of the expression, and no expression has any side effects.
The value returned by the outermost expression in a query is the result of
the query.
Closure: XQuery
is defined as a transformation on a data model called the Query data model.
The input and output of every query or subexpression within a query each form
an instance of the Query data model. This is what is meant by the statement
that XQuery is closed under the Query data model. The working group
spent considerable time on the definition of the Query data model and on how
instances of this model can be constructed from input XML documents and/or
serialized in the form of output XML documents.
Schema conformance:
Since XML Schema has recently been adopted as a W3C Recommendation, the working
group considered it highly desirable for XQuery to be based on the type system
of XML Schema. This constraint strongly influenced the design of XQuery by
providing a set of primitive types, a type-definition facility, and
an inheritance mechanism. The validation process defined by XML Schema
also strongly influenced the XQuery facilities for constructing new elements
and assigning their types. Nevertheless, members of the working group attempted
to modularize the parts of the language that are related to type definition
and validation, so that XQuery could potentially be used with an alternative
schema language at some future time.
XPath compatibility:
Because of the widespread usage of XPath in the XML community, a strong effort
was made to maintain compatibility between XQuery and XPath Version 1.0. Despite
the importance of this goal, it was necessary in a few areas to compromise
compatibility in order to conform to the type system of XML Schema, because
the design of XPath Version 1.0 was based on a much simpler type system.
Simplicity: Many
members of the working group considered simplicity of expression and ease
of understanding to be primary goals of our language design. These goals were
often in conflict with other goals, resulting in some painful compromises.
Completeness: The
working group attempted to design a language that would be complete enough
to express a broad range of queries. The existence of a well-motivated use
case was considered a strong argument for inclusion of a language feature.
The expressive power of XQuery is comparable to the criterion of “relational
completeness” defined for database query languages, though no such formal
standard has been defined for an XML data model. Informally, XQuery is designed
to be able to construct any XML document that can be computed from input XML
documents using the power of the first-order predicate calculus. In addition,
recursive functions add significant expressive power to the language.
Generality: XQuery
is intended for use in many different environments and with many kinds of
input documents. The language should be applicable to documents that are described
by a schema, or by a Document Type Definition, or by neither. It should be
usable in strongly typed environments where input and output types are well
known and rigorously enforced, as well as in more dynamic environments where
input and output types may be discovered at execution time and some data may
be untyped. It should accommodate input documents from a variety of sources,
including XML files discovered on the Web, repositories of pre-validated XML
documents, streaming data sources such as stock tickers, and XML data synthesized
from databases.
Conciseness: In
the interest of conciseness, the semantics of the XQuery operators were defined
to include certain implicit operations. For example, arithmetic operators
such as +, when applied to an element, automatically extract
the numeric value of the element. Similarly, comparison operators such as
=, when applied to sequences of values, automatically iterate
over the sequences, looking for a pair of values that satisfies the comparison
(this process is called existential quantification). These implicit
operations are consistent with XPath Version 1.0 and were preferred over a
design that would require each operation to be explicitly specified by the
user.
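The two implicit operations just described can be modeled in a few lines of Python. This is a deliberately simplified sketch, not the XQuery definition: the helper names atomize, add, and general_eq are hypothetical, values are treated as plain strings or numbers, and the type-promotion and casting rules of the real language are ignored.

```python
import xml.etree.ElementTree as ET


def atomize(item):
    """Return the text content of an element, or the item itself."""
    return item.text if isinstance(item, ET.Element) else item


def add(left, right):
    """Model of '+': extract the value of each operand, then add as numbers."""
    return float(atomize(left)) + float(atomize(right))


def general_eq(left, right):
    """Model of '=' on sequences: true if ANY pair of values is equal."""
    return any(atomize(a) == atomize(b) for a in left for b in right)


# Arithmetic applied directly to elements (hypothetical sample data).
price = ET.fromstring("<price>10.50</price>")
tax = ET.fromstring("<tax>0.75</tax>")
total = add(price, tax)
print(total)

# Existential comparison: one matching pair is enough.
colors = [ET.fromstring("<color>Red</color>"), ET.fromstring("<color>Blue</color>")]
has_blue = general_eq(colors, ["Green", "Blue"])
has_pink = general_eq(colors, ["Pink"])
nothing = general_eq([], ["Red"])
print(has_blue, has_pink, nothing)
```

One consequence of the existential reading is that a comparison against an empty sequence is always false, because there is no pair of values to satisfy it.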
Static analysis:
From the beginning, the processing of a query was assumed to consist of two
phases, called query analysis and query evaluation (roughly
corresponding to compilation and execution of a program). The analysis phase
was viewed as an opportunity to perform optimization and to detect
certain kinds of errors. A great deal of effort went into defining the kinds
of checks that could be performed during the analysis phase and in deciding
which of these checks should be required and which should be permitted.
About the author
Don Chamberlin is one of IBM's representatives in the W3C XML Query Working
Group. He is also a coauthor of the Quilt language proposal, which formed the
basis for the XQuery design. Don is best known as co-inventor of the SQL database
language and as author of two books on the DB2 database system. He holds a Ph.D.
from Stanford University and is a staff member at IBM's Almaden Research Center.
He is an ACM Fellow and a member of the National Academy of Engineering. Don
is an editor of the working drafts of XML Use Cases and XQuery 1.0
and XPath 2.0 Data Model.