HTML Unleashed: The Emergence of XML
XML DTDs and Valid XML Documents
Although
in many cases well-formed XML documents are sufficient for
practical purposes, designing a DTD for your document has a number of
advantages:
- First and foremost, a DTD allows an XML parser to
validate your document (that is why such documents are
called valid). When validating, the parser checks for
misspelled tags or attributes, for errors in types of attribute
values and in elements' content models, and so on. For HTML, similar
validation services exist that will check your file against one of
the existing HTML DTDs.
- For a human reader, a DTD is a convenient way to quickly learn
the structure of a particular type of document. Compared to
SGML, the simplified DTD syntax of XML is very straightforward
and unambiguous.
- With a DTD, you can define not only elements and their
attributes, but also entities. (See "Entity
Declarations," later in this chapter.) Like macros in
word processors or #define preprocessor directives
in C, entities can be used to abbreviate text strings and
markup instructions in an obvious and easy-to-modify manner.
Also, you can use external entities to refer to other
XML documents, DTDs, or binary data located in separate files.
Let's examine an example of a valid XML document, namely a play by
Shakespeare (The Tempest) marked up by Jon Bosak, one of the
authors of XML. The package
includes, besides the XML document and its DTD, a DSSSL style sheet
that contains formatting instructions for each element and the
PostScript output of a DSSSL processor that formatted the play.
Here's the very beginning of the XML document play.xml:
<?xml version="1.0"?>
<!DOCTYPE play PUBLIC
"-//Free Text Project//DTD Play//EN">
The first line here is an XML declaration, a special
instruction that is XML-specific and would be ignored by an SGML
parser. Here, the XML declaration states which version of the
XML standard the document conforms to.
Next comes the DOCTYPE declaration that, like its namesake in
SGML, identifies the DTD against which the document is to be parsed. In XML,
a DTD may come in two parts: the internal part is contained in the
document file itself, while the external part is referenced by its URL
or public identifier. If the two parts conflict, the internal part
takes precedence over the external one.
In our example, only the external part of the DTD is present, and it is
referred to by the public identifier that follows the keyword
PUBLIC. An XML parser is supposed to be able to retrieve
the text of the DTD using its public identifier (that is, to translate
the identifier into a URL or some other sort of physical address). If
the DTD you're using has not been assigned a well-known public identifier,
you should provide a URL instead, introduced by the SYSTEM
keyword rather than PUBLIC. For instance:
<!DOCTYPE HTML SYSTEM
"http://www.foo.com/html3x.dtd">
Finally, to provide an internal part for a DTD, you must enclose it in
square brackets within the DOCTYPE declaration. Such a declaration
may also contain a SYSTEM or PUBLIC external
reference, for example:
<!DOCTYPE HTML SYSTEM
"http://www.foo.com/html3x.dtd"
[
<!-- your DTD goes here -->
]>
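To see how a parser reads such a declaration, here is a minimal Python sketch that uses the standard library's expat bindings to report the root element name, the external identifiers, and whether an internal part is present. The helper name describe_doctype and the empty HTML root element are assumptions added for illustration; the sketch neither fetches the external DTD nor validates against it.

```python
import xml.parsers.expat

# The DOCTYPE from the text, followed by a hypothetical empty root element
# so that the snippet is a complete, well-formed document.
DOC = """<!DOCTYPE HTML SYSTEM
"http://www.foo.com/html3x.dtd"
[
<!-- your DTD goes here -->
]>
<HTML/>"""


def describe_doctype(document):
    """Return what the parser learns from the DOCTYPE declaration."""
    info = {}

    def on_doctype(name, system_id, public_id, has_internal_subset):
        # expat reports the root name, both external identifiers (None when
        # absent), and a flag telling whether a bracketed part was present.
        info["root"] = name
        info["system_id"] = system_id
        info["public_id"] = public_id
        info["internal_subset"] = bool(has_internal_subset)

    parser = xml.parsers.expat.ParserCreate()
    parser.StartDoctypeDeclHandler = on_doctype
    parser.Parse(document, True)
    return info


print(describe_doctype(DOC))
```

By default expat does not retrieve the external part, so the sketch runs without network access; it only shows which pieces of the declaration are handed to the application.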
The name right after the DOCTYPE keyword in the preceding
statements is the name of the root element of your document
type, the top-level element that encloses all other elements. In HTML,
this element is named HTML, and in our Shakespearean
example it is named PLAY. Here's how the PLAY
element is defined in play.dtd:
<!ELEMENT PLAY (title, fm, personae,
scndescr, PLAYsubt, induct?,
prologue?, act+, epilogue?)>
You can see that the content model for this
element is quite simple and translates directly into plain language:
"A PLAY is formed by its TITLE, followed by the
front matter (FM), followed by the list of dramatis
PERSONAE, and so on." The question mark marks optional
elements, and the plus sign marks elements that may occur one or more times.
Note that the XML spec drops the SGML minimization
parameters, which are useless in XML because it doesn't permit tag omission
anyway.
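One way to picture what a validating parser does with this content model is to treat it as a regular expression over the sequence of child element names. The Python sketch below only illustrates that idea and is not how any particular parser is implemented; the function name matches_play_model and the sample child lists are assumptions.

```python
import re

# The PLAY content model from the excerpt, rewritten as a regular expression
# over space-terminated child names: "," becomes plain sequence, "?" stays
# optional, and "+" stays one-or-more.
PLAY_MODEL = re.compile(
    r"title fm personae scndescr PLAYsubt "
    r"(induct )?(prologue )?(act )+(epilogue )?"
)


def matches_play_model(children):
    """Return True if the list of child element names fits the PLAY model."""
    sequence = "".join(name + " " for name in children)
    return PLAY_MODEL.fullmatch(sequence) is not None


required = ["title", "fm", "personae", "scndescr", "PLAYsubt"]
print(matches_play_model(required + ["act", "act", "epilogue"]))  # True
print(matches_play_model(required))  # False: at least one act is required
```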
One more excerpt from play.dtd shows a hierarchical set of
related elements for marking up a character's speech:
<!ELEMENT speech (speaker+,
(line | stagedir | subhead)+)>
<!ELEMENT speaker (#PCDATA)>
<!ELEMENT line (stagedir | #PCDATA)+>
<!ELEMENT stagedir (#PCDATA)>
<!ELEMENT subhead (#PCDATA)>
Thus a SPEECH consists of one or more
SPEAKER elements followed by at least one
LINE, STAGEDIR (stage direction), or
SUBHEAD element, in any order (the
"|" sign means that any one of the connected particles may
occur). The #PCDATA keyword means "any
character data without tags"; thus, the SPEAKER,
STAGEDIR, and SUBHEAD elements may
contain only text, while a LINE may have
STAGEDIRs intermingled with text.
Note that nothing in the definition of LINE (except the
name) suggests that the element really contains a line of
verse. This is merely implied by the person who did the markup, and the element
may be formatted as a line if an appropriate style sheet is used.
XML serves only as an intermediary between the author and
the formatter; it is not intended to describe the nature of the data
elements that are marked up with it.
Here's a SPEECH element exemplifying these DTD provisions:
<SPEECH>
<SPEAKER>PROSPERO</SPEAKER>
<LINE><STAGEDIR>Aside</STAGEDIR> The Duke of Milan</LINE>
<LINE>And his more braver daughter could control thee,</LINE>
<LINE>If now 'twere fit to do't. At the first sight</LINE>
<LINE>They have changed eyes. Delicate Ariel,</LINE>
<LINE>I'll set thee free for this.</LINE>
<STAGEDIR>To FERDINAND</STAGEDIR>
<LINE>A word, good sir;</LINE>
<LINE>I fear you have done yourself some wrong: a word.</LINE>
</SPEECH>
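As a rough check that this element follows the declarations quoted earlier, the Python sketch below parses a shortened copy of the speech with the standard library and compares its children with the SPEECH content model. It is a hand-written illustration, not a DTD validator: the function names are hypothetical, and element names are lowercased before comparison because the DTD excerpt uses lowercase names while the document uses uppercase ones.

```python
import re
import xml.etree.ElementTree as ET

# A shortened copy of the speech shown above.
SPEECH_XML = """<SPEECH>
<SPEAKER>PROSPERO</SPEAKER>
<LINE><STAGEDIR>Aside</STAGEDIR> The Duke of Milan</LINE>
<LINE>I'll set thee free for this.</LINE>
<STAGEDIR>To FERDINAND</STAGEDIR>
<LINE>A word, good sir;</LINE>
</SPEECH>"""

# (speaker+, (line | stagedir | subhead)+) over space-terminated names.
SPEECH_MODEL = re.compile(r"(speaker )+((line|stagedir|subhead) )+")
TEXT_ONLY = {"speaker", "stagedir", "subhead"}  # declared as (#PCDATA)


def child_names(element):
    """Join the lowercased names of the direct children, each plus a space."""
    return "".join(child.tag.lower() + " " for child in element)


def speech_is_valid(speech):
    """Check one SPEECH element against the content models in the excerpt."""
    if speech.tag.lower() != "speech":
        return False
    if SPEECH_MODEL.fullmatch(child_names(speech)) is None:
        return False
    for child in speech:
        name = child.tag.lower()
        # SPEAKER, STAGEDIR and SUBHEAD may hold text only.
        if name in TEXT_ONLY and len(child) > 0:
            return False
        # A LINE may mix text with STAGEDIR elements, and nothing else.
        if name == "line":
            for inner in child:
                if inner.tag.lower() != "stagedir" or len(inner) > 0:
                    return False
    return True


print(speech_is_valid(ET.fromstring(SPEECH_XML)))
```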
Entities can be declared in a DTD as follows:
<!ENTITY me "Dmitry Kirsanov,
St.Petersburg, Russia">
In the document, such an entity can be used similarly to mnemonic
character entities of HTML:
This document was created by &me;
on Apr 21, 1997
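The substitution can be observed with Python's standard library parser, which expands entities declared in the internal part of the DTD while reading the document. This is a minimal sketch: the root element name note is a hypothetical choice, and the entity declaration is placed in an internal DTD part so that the example is self-contained.

```python
import xml.etree.ElementTree as ET

# The entity declaration from the text, placed in an internal DTD part,
# followed by a hypothetical root element that references the entity.
DOC = """<!DOCTYPE note [
<!ENTITY me "Dmitry Kirsanov, St.Petersburg, Russia">
]>
<note>This document was created by &me; on Apr 21, 1997</note>"""

# The parser replaces &me; with the declared text before the application
# ever sees the element content.
text = ET.fromstring(DOC).text
print(text)
```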
Another syntax is used to define entities that refer to external files
or documents. For example:
<!ENTITY mypage SYSTEM
"http://www.symbol.ru/dk/index.xml">
<!ENTITY xml-logo SYSTEM
"http://www.ucc.ie/xml/xml.gif" NDATA gif>
In the second declaration, gif is the name of a
notation (similar to a data type), which must be declared
somewhere in the DTD along with information on where an XML processor
can find helper software capable of handling data in this
notation.
Now the &mypage; and &xml-logo; entities
can be used in documents that use this DTD. However, the XML specification
does not prescribe the exact behavior of an XML application on
encountering such an entity. For example, the application may incorporate the entity into
the text of the current document, or it may present it as a link that
the user can activate.
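What a parser does report to the application is the declaration itself. The Python sketch below uses the standard library's expat bindings to collect the external entities declared in a DTD, leaving the decision about what to do with them to the calling code. The root element doc, the function name list_external_entities, and the notation declaration with its "gifviewer" identifier are assumptions added to make the example complete.

```python
import xml.parsers.expat

# The two declarations from the text inside a hypothetical document. The
# NOTATION line and its "gifviewer" identifier are illustrative assumptions.
DOC = """<!DOCTYPE doc [
<!NOTATION gif SYSTEM "gifviewer">
<!ENTITY mypage SYSTEM "http://www.symbol.ru/dk/index.xml">
<!ENTITY xml-logo SYSTEM "http://www.ucc.ie/xml/xml.gif" NDATA gif>
]>
<doc/>"""


def list_external_entities(document):
    """Map each external entity name to (system identifier, notation name)."""
    entities = {}

    def on_entity(name, is_parameter, value, base, system_id, public_id,
                  notation):
        # Internal entities carry their replacement text in `value`;
        # external ones have no value, only identifiers.
        if value is None:
            entities[name] = (system_id, notation)

    parser = xml.parsers.expat.ParserCreate()
    parser.EntityDeclHandler = on_entity
    parser.Parse(document, True)
    return entities


for name, (url, notation) in list_external_entities(DOC).items():
    print(name, url, notation)
```

The notation name is None for mypage, which refers to another XML document, and gif for xml-logo, which refers to binary data; an application could use that difference to choose between including the content and presenting a link.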