HTML Unleashed: SGML and the HTML DTD
Document Type Definition for HTML 4.0
Now that we've examined the SGML declaration and found answers to a
number of general questions about HTML formation, it's time to get to
the details of its tags, entities, and the related document structure.
All of this is defined in the document
type definition (DTD) for HTML 4.0.
The HTML DTD analyzed in this chapter is too long to be listed in its
entirety. Instead of going through the DTD from top to bottom, I
discuss the major concepts and syntax features of an SGML DTD in their
logical order, illustrating them with excerpts from the HTML 4.0 DTD.
This approach should enable you to understand any given part of the DTD
without overloading the chapter.
Before we start investigating elements that form an HTML document and
the tags that delimit these elements, let's discuss another SGML
concept named entities. If tags can be likened to named styles
in word processors, then entities are a direct analog of macros
that may expand to text strings or markup instructions.
In HTML documents, entities are used to invoke characters that either
are absent on a computer keyboard (such as é) or
have special meaning and thus cannot be typed directly (such as
<). In the DTD itself, as you'll see later, entities
play a more important role, helping to make all sorts of declarations
more concise and readable. The entities used in the DTD are called
parameter entities, as opposed to general entities,
which are intended for use in HTML documents and not in the DTD. These two types of
entities are declared in slightly different ways, as shown in the
next three sections.
The very first declaration in the HTML 4.0 DTD is an entity that
expands into a formal reference (in this case, a URL) of the DTD:
<!ENTITY % HTML.Version
"http://www.w3.org/pub/WWW/MarkUp/Cougar/Cougar.dtd"
-- Typical usage:
<!DOCTYPE HTML SYSTEM
"http://www.w3.org/pub/WWW/MarkUp/Cougar/Cougar.dtd">
<html>
...
</html>
--
>
Let's consider, in this example, the syntax of an entity declaration.
It uses the ENTITY statement that, like all other SGML
statements, requires a ! after the start delimiter
<. After the ENTITY keyword comes the
% character indicating that the entity in question is a
parameter entity rather than a general entity.
Separated from % by one or more spaces is the entity name
that is later used to invoke the entity. Note that the name contains a
period, thus making use of the NAMING section settings in
the SGML declaration. Also recall that entity names differ
from element names in that they are case sensitive.
The last obligatory component of an entity declaration is the string
enclosed in quotation marks (data string) that shows what this
entity stands for and what it will expand to when invoked. Here's how
the entity we have defined can be used later in the DTD:
%HTML.Version;
Note that this time, there is no space between the % and
the entity name. The trailing semicolon may be omitted when the
reference is followed by a character that cannot be part of a name,
such as a space or a line end.
Unfortunately, this part of SGML syntax clashes with one of Netscape's
HTML extensions, namely the use of the % character for
specifying sizes of images and other elements as percentages of window
dimensions. This is why HTML validators that check an HTML document
against a DTD sometimes have trouble with this feature.
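To make the substitution mechanism concrete, here is a minimal Python sketch of how parameter entities could be collected and expanded. It is only an illustration, not how a real SGML parser works: the function names are my own, comments inside declarations and PUBLIC identifiers are not handled, and names are assumed to consist of letters, digits, periods, and hyphens.

```python
import re

# Matches declarations such as: <!ENTITY % font "TT | I | B">
# (comments inside the declaration and PUBLIC identifiers are not handled)
ENTITY_DECL = re.compile(r'<!ENTITY\s+%\s+([A-Za-z][\w.-]*)\s+"([^"]*)"\s*>')
# Matches references such as %font; or %font (the semicolon is optional)
ENTITY_REF = re.compile(r'%([A-Za-z][\w.-]*);?')


def collect_parameter_entities(dtd_text):
    """Return a dict mapping parameter entity names to their data strings."""
    entities = {}
    for name, value in ENTITY_DECL.findall(dtd_text):
        # In SGML the first declaration of an entity wins; later ones are ignored.
        entities.setdefault(name, value)
    return entities


def expand(text, entities, depth=0):
    """Recursively replace parameter entity references with their data strings."""
    if depth > 20:
        raise ValueError("entity references nested too deeply")

    def substitute(match):
        name = match.group(1)
        if name not in entities:
            return match.group(0)  # leave unknown references untouched
        return expand(entities[name], entities, depth + 1)

    return ENTITY_REF.sub(substitute, text)
```

Given declarations for hypothetical entities font, phrase, and text, calling expand("(%text)*", entities) would return the content model with every reference replaced by its data string.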
The last part of the %HTML.Version; entity declaration is
a comment reminding us of the requirement (stated unambiguously
in the HTML specification) to start any HTML document that is intended to
be a valid SGML document with a DOCTYPE declaration. This
allows an SGML parser to know at once that the structure and tags of
the document it's about to process are described in the DTD identified
(in our case) by its URL. Of
course, HTML (not SGML) browsers could also make use of this
information to select the level of HTML support needed for the
document (although only a few of them really do).
The use of a URL as a DTD identifier is rather unusual (it is probably
explained by the fact that, at the time of this writing, the HTML 4.0 DTD
was still evolving). More often, to refer to external information
sources, SGML documents use public identifiers of a special
form. For example, in the HTML 3.2 DTD the %HTML.Version;
entity expands into the string -//W3C//DTD HTML 3.2
Final//EN, which is the public identifier of this version's DTD.
Another example is the identifier string of a character set standard
used for the BASESET parameter in the SGML declaration (see
"CHARSET Section," earlier
in this chapter). Any DTD or related standard has a unique
public identifier assigned so that it can be referred to
from other SGML documents. Such references are
usually made via parameter entities.
If the data string in an entity declaration is preceded by the
additional keyword PUBLIC, this means that the string is
not the entity value but a public identifier pointing to an external
information source. For example, the HTML 4.0 DTD is accompanied by a
set of general entities for accessing characters of ISO Latin-1,
kept in a separate
document with its own unique public identifier. Here's how
this document is incorporated into the HTML DTD:
<!ENTITY % HTMLlat1 PUBLIC
"-//W3C//ENTITIES Latin1//EN//HTML">
Here, the string in quotes contains the public identifier of the
external resource whose contents will be substituted for each
occurrence of the entity %HTMLlat1;. To make the mentioned
document part of the DTD, it is now enough to invoke the defined
entity (actually this is done right after its declaration):
%HTMLlat1;
Formal rules for constructing public identifiers need not be detailed
here. A fairly complete catalog of public identifiers exists.
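General entities are declared in the DTD similarly to parameter
entities, but they have a number of differences:
- The % character is not used in the declaration.
- A general entity is invoked with the & character rather than %.
- The data string may be preceded by a keyword, such as CDATA, that tells the parser how to treat the string.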
For an example, consider the entity declarations provided in the DTD
for accessing four special characters:
<!ENTITY amp CDATA "&#38;" -- ampersand -->
<!ENTITY gt CDATA "&#62;" -- greater than -->
<!ENTITY lt CDATA "&#60;" -- less than -->
<!ENTITY quot CDATA "&#34;" -- double quote -->
This example also shows us one special kind of entity, called a
character reference, that does not require any declaration. If
the entity opening delimiter & is immediately followed
by the # character and a number, this number is interpreted
as a character code (from the document character set as defined in the
SGML declaration; see "CHARSET
Section") and the whole entity is replaced by the character
having this code. This is one of two methods of accessing
characters that are beyond the reach of a computer keyboard; the
other method uses the mnemonic character entities defined in
the DTD, such as &amp; or &eacute;.
You might wonder how the entities in the preceding example could
expand to special characters if the CDATA keyword prohibits
any SGML instructions, character references included, from having
effect in the data string. The answer is that this string is in fact
read twice: the first time when the entity declaration is interpreted,
and the second time when the entity is used in the document and its
data string is substituted. The CDATA keyword affects only
the first reading. As a result, the DTD is protected from the special
characters, while in the document the references are expanded to the
characters intended.
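The double reading described above can be imitated with a short Python sketch. It is a simplification rather than a faithful model of SGML entity handling: the table holds only the four special-character entities with their data strings written as character references, the function names are hypothetical, and character codes are assumed to coincide with Unicode code points (true for the Latin-1 range).

```python
import re

# Data strings of the four special-character entities, written as
# numeric character references.
GENERAL_ENTITIES = {"amp": "&#38;", "gt": "&#62;", "lt": "&#60;", "quot": "&#34;"}

CHAR_REF = re.compile(r"&#(\d+);")
ENTITY_REF = re.compile(r"&([A-Za-z][A-Za-z0-9.-]*);")


def expand_char_refs(text):
    """Replace numeric character references such as &#60; with the character."""
    return CHAR_REF.sub(lambda m: chr(int(m.group(1))), text)


def expand_document_text(text):
    """Substitute named entities first, then resolve character references."""
    named = ENTITY_REF.sub(
        lambda m: GENERAL_ENTITIES.get(m.group(1), m.group(0)), text
    )
    return expand_char_refs(named)
```

The first substitution leaves the data string untouched, and only the second step turns the character reference into the character itself; unknown entity names are passed through unchanged.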
As I've already mentioned in "How to Define an SGML
Application," a document marked up with an
SGML application is thought of as consisting of a hierarchy of
nested elements. A marked-up element is usually enclosed in a
pair of start and end tags. The ELEMENT statement
in SGML defines both start and end tags (but not their attributes)
and prescribes what the content of this element may be by defining
its content model.
Here's an example of an element declaration:
<!ELEMENT P - O (%text)*>
Here, P is the element name (short for Paragraph). The two
characters following the element name are minimization
indicators specifying whether it is possible to omit start and/or
end tags for this element. The first indicator refers to the start
tag, and the second, to the end tag.
In place of a minimization indicator, you can put either a hyphen
(-), meaning that the tag is obligatory, or the letter
O, meaning that the tag is omissible. Thus, the preceding
statement declares that a P element (a paragraph) must be
preceded by the <P> start tag, while the
</P> end tag can be omitted.
It is possible to have both start and end tags omissible. For example,
the declaration
<!ELEMENT HTML O O (%html.content)>
indicates that both <HTML> and
</HTML> tags around the content of an HTML document
can be dropped.
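As an informal illustration of how the parts of an ELEMENT declaration line up, the following Python sketch splits a simple declaration into its name, minimization indicators, and content model. The function name and the returned keys are my own invention, and comments inside the declaration are not stripped.

```python
import re

ELEMENT_DECL = re.compile(
    r"<!ELEMENT\s+(?P<name>\S+)\s+(?P<start>[-Oo])\s+(?P<end>[-Oo])\s+(?P<model>.+?)\s*>",
    re.DOTALL,
)


def parse_element_decl(decl):
    """Split a simple ELEMENT declaration into name, tag omissibility, and content model."""
    match = ELEMENT_DECL.fullmatch(decl.strip())
    if match is None:
        raise ValueError("not a simple ELEMENT declaration: %r" % decl)
    return {
        "name": match.group("name"),
        # The first indicator refers to the start tag, the second to the end tag.
        "start_tag_omissible": match.group("start").upper() == "O",
        "end_tag_omissible": match.group("end").upper() == "O",
        "content_model": match.group("model"),
    }
```

For the P declaration, the sketch reports an obligatory start tag and an omissible end tag; for the HTML declaration, both tags come out as omissible.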
The last component in the element declarations above is
the content model specification. (Here, it is done via
parameter entities, and to see what they expand to we should find the
corresponding ENTITY statements in the DTD.) A content model
declares what can, what must, and what must not go inside the element.
The simplest type of content model is specified by a single keyword
from the following list:
- CDATA
- Stands for Character DATA. This
keyword means that the SGML parser suspends its processing for the
content of the element. Whatever other tags or entities are
contained in the element, they won't have any effect and will be
treated as ordinary data characters. The only tag that an SGML
parser reacts to when skipping over CDATA content is the
end tag of the element that switched to CDATA mode.
The HTML DTD uses the CDATA content model for the obsolete
elements XMP, LISTING, and PLAINTEXT,
which were intended for inserting preformatted text into an HTML
document without the need to escape any special characters.
The CDATA mode is also used for the STYLE and
SCRIPT elements, whose content is to be processed by
external programs rather than by the SGML parser.
- RCDATA
- Stands for Replaceable Character
DATA. This keyword introduces a content model that differs
from CDATA only in that general
entities and character references are expanded, while markup
statements are still ignored. RCDATA is not used in the HTML DTD.
- EMPTY
- Means that the content of the element is
empty. Naturally, this is always accompanied by the permission
to omit the end tag. For example:
<!ELEMENT IMG - O EMPTY -- Embedded image -->
- ANY
- Allows any markup and data characters
within the element. ANY is not used in
the HTML DTD.
Sometimes, however, it is necessary to be more specific in defining
the content model of an element. This is done via content model
groups, whose syntax deserves a more thorough examination.
The simplest model group is one element name enclosed in parentheses,
which means that the element being defined must contain one occurrence
of the element specified in the content model and nothing else. This is a
rather artificial situation; more often, a model group contains two
or more element names, for example:
<!ELEMENT HTML O O (HEAD, BODY)>
Here, the comma between HEAD and BODY is a
connector used to indicate the relations between the elements
listed. Possible connectors include the following:
- A comma (,) indicates that all the elements listed in
the content model must be present within the element,
exactly in the order specified.
- A vertical bar (|) is the "exclusive or" connector.
It indicates that one and only one of the elements can occur.
However, it is often more practical to use the "simple or"
relation allowing any one, or both, or even none of the
elements to be present. This is why | is often
combined with the occurrence indicator *, for
example:
<!ELEMENT APPLET - - (PARAM | %text)*>
Here the content model specification says that within the
APPLET element, any number of PARAM
elements mixed with any number of text fragments (this is what
the %text; entity effectively expands to) may occur.
- An ampersand (&) is the "and" connector. It
indicates that all of the elements listed must occur, but in
any order. It is often combined with the ?
occurrence indicator. Here's how the DTD defines the
%head.content; parameter entity that is later used
in content model specification for the HEAD element:
<!ENTITY % head.content "TITLE & ISINDEX? & BASE?">
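Here's the list of occurrence indicators used to show how many times
the elements can occur in a content model:
- A question mark (?) indicates that the element is optional: it may occur once or not at all.
- A plus sign (+) indicates that the element must occur one or more times.
- An asterisk (*) indicates that the element may occur any number of times, including zero.
An element without an occurrence indicator must occur exactly once.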
Model groups can be nested, and the occurrence indicators may apply to
an entire group rather than a single element:
<!ELEMENT DL - - (DT|DD)+>
This means that within a DL (Definition List) element, at
least one (but possibly more) DT or DD elements
must be present.
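Connectors and occurrence indicators map quite closely onto regular-expression operators, which suggests a simple way to experiment with model groups. The Python sketch below is an assumption-laden toy rather than a validator: it handles only the comma and vertical bar connectors, rejects the ampersand connector, expects parameter entities to be expanded beforehand, and represents the children of an element as a plain list of names. Both function names are hypothetical.

```python
import re

TOKEN = re.compile(r"\s*(#PCDATA|[A-Za-z][A-Za-z0-9.-]*|[(),|&?*+])")


def model_to_regex(model):
    """Translate a content model group into a compiled regular expression.

    Children are represented as a string of element names, each followed
    by a single space, for example "DT DD DD ".
    """
    parts = []
    pos = 0
    model = model.strip()
    while pos < len(model):
        match = TOKEN.match(model, pos)
        if match is None:
            raise ValueError("unexpected text at position %d" % pos)
        token = match.group(1)
        pos = match.end()
        if token == "(":
            parts.append("(?:")
        elif token == ",":
            continue  # a sequence is plain concatenation in a regex
        elif token == "&":
            raise NotImplementedError("the & connector is not handled in this sketch")
        elif token in ")|?*+":
            parts.append(token)
        else:
            parts.append("(?:%s )" % re.escape(token.upper()))
    return re.compile("".join(parts))


def is_valid_content(model, children):
    """Check a list of child element names against a content model group."""
    sequence = "".join(name.upper() + " " for name in children)
    return model_to_regex(model).fullmatch(sequence) is not None
```

With this sketch, a DL containing DT, DD, DD satisfies (DT|DD)+, while an empty DL does not, because the plus sign demands at least one occurrence.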
Besides element names, you can use the #PCDATA (Parsed
Character DATA) keyword in model groups. It refers to "usual"
characters of the document without any markup tags and can be used to
explicitly allow or disallow plain text within an element.
It is different, however, from the CDATA keyword discussed
earlier. First, #PCDATA can be used only within a model
group and not on its own as CDATA (that is,
#PCDATA should be enclosed in parentheses even when it
stands alone). And second, #PCDATA does not imply ignoring
markup; if a tag is encountered in the context where only
#PCDATA is allowed, a compliant SGML parser should report an
error rather than ignore this tag.
Together with the connectors and occurrence indicators listed,
#PCDATA can limit the set of elements allowed inside
another element without prohibiting plain text from appearing there.
For example, here's how the %text; entity is defined via a
number of subordinate classifying entities:
<!ENTITY % font "TT | I | B | U | S | BIG | SMALL | SUB | SUP">
<!ENTITY % phrase "EM | STRONG | DFN | CODE | SAMP | KBD | VAR | CITE">
<!ENTITY % special
"A | IMG | APPLET | OBJECT | FONT | BASEFONT | BR | SCRIPT |
MAP | Q | SPAN | INS | DEL | BDO | IFRAME">
<!ENTITY % formctrl "INPUT | SELECT | TEXTAREA | LABEL | BUTTON">
<!ENTITY % text "#PCDATA | %font | %phrase | %special | %formctrl">
Thus the %text; entity stands for, in plain English,
"either a chunk of text or one of all these listed elements."
Obviously, it'll most often be used with the * occurrence
indicator. For an example, see how the preceding declarations are used
once more to define quite a number of elements in one go:
<!ELEMENT (%font|%phrase) - - (%text)*>
As you see here, both parameter entities and groups can be used for
specifying element names in declarations, not only in their content
models.
SGML syntax also allows elements to be added to or subtracted from
a content model (SGML calls these inclusions and exclusions), which
is very convenient if the model is specified
via entity references. For instance, the FORM element is
allowed to contain anything that can occur within a block-level
element (that is, an element that starts a new paragraph) except for
the FORM element itself (that is, FORMs cannot
be nested). Rather than define the new content group from scratch, we
can make use of the already defined %block.content; entity
by subtracting the single FORM element from it:
<!ELEMENT FORM - - %block.content
-(FORM)>
Analogously, we can sum up two model groups:
<!ELEMENT HEAD O O (%head.content)
+(%head.misc)>
An element is not fully described by its name and content model. Many
elements have associated attributes that serve to provide
additional information for rendering the element. Attributes for each
element should be declared in the DTD via ATTLIST
statements.
Here's a typical attribute declaration for an element:
<!ATTLIST AREA
shape %SHAPE rect -- controls interpretation of coords --
coords %COORDS #IMPLIED -- comma separated list of values --
href %URL #IMPLIED -- this region acts as hypertext link --
target CDATA #IMPLIED -- where to render linked resource --
nohref (nohref) #IMPLIED -- this region has no action --
alt CDATA #REQUIRED -- description for text only browsers --
tabindex NUMBER #IMPLIED -- position in tabbing order --
onClick %script #IMPLIED -- intrinsic event --
onMouseOver %script #IMPLIED -- intrinsic event --
onMouseOut %script #IMPLIED -- intrinsic event --
>
Right after the ATTLIST keyword, the name of the element
for which we're defining attributes is specified. Next comes a number
of three-component groups, each defining one attribute. The first
identifier in each group is the attribute name. The other two specify
the type of value for the attribute and its default value, as detailed
in the next sections.
After the name of each attribute in the ATTLIST declaration
comes a keyword describing its type. This keyword is usually taken
from the following list:
- CDATA
- Here again, CDATA means that the value of this attribute may be any string of characters (including an empty string) and is not checked by the parser against any stricter syntax. CDATA is
used in situations where it is impossible to impose
stricter limitations on the attribute value with
one of the following keywords.
- NAME
- This keyword indicates that the value of the attribute is a name conforming to SGML naming rules as defined by the SGML declaration.
(See "Naming Rules Declaration," earlier in this chapter.)
The following fragment of an ATTLIST
declaration is an example:
<!ATTLIST META
...
http-equiv NAME #IMPLIED -- HTTP response header name --
name NAME #IMPLIED -- metainformation name --
...
>
- NMTOKEN
- This keyword is similar to NAME with the exception that there's no requirement to start the name with a name start character.
(See "Naming Rules Declaration," earlier in this
chapter.) This keyword is not used in the HTML 4.0 DTD.
- NUMBER
- This keyword restricts the attribute to numeric values. The following ATTLIST fragment is an example:
<!ATTLIST OL -- ordered lists --
...
compact (compact) #IMPLIED -- reduced interitem spacing --
start NUMBER #IMPLIED -- starting sequence number --
...
>
- ID
- This keyword indicates that the attribute
value is an identifier satisfying two
requirements: first, it is a valid SGML name (as in
the case of NAME), and second, it is
unique across the document (that is, the same value cannot be
assigned to an ID attribute of any other element within the same
document). This value type is specified for the
ID attribute of the style sheets mechanism,
which is applicable to the majority of HTML elements.
Besides these keywords, you can specify the list of possible values
directly using the group notation that you've already seen applied for
model groups in this chapter. Thus, in the preceding
ATTLIST declaration for the OL element, the
COMPACT attribute may only take as value the character
string "compact" or have no value at all, as in the example
<OL START=1 COMPACT>
which is equivalent to
<OL START=1 COMPACT=COMPACT>
Here's an example from the DTD with an attribute taking one of three
possible values:
<!ATTLIST table
...
align (left|center|right) #IMPLIED
...
>
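The value types discussed in this section can be approximated with a few patterns. The Python sketch below rests on stated assumptions: names start with a letter and continue with letters, digits, periods, or hyphens; length limits from the SGML declaration are ignored; enumerated values are compared without regard to case; and the uniqueness check required for ID is left out. The function name is hypothetical.

```python
import re

# Assumed naming rules: a name starts with a letter and continues with
# letters, digits, "." or "-"; a name token may start with any of these.
NAME = re.compile(r"[A-Za-z][A-Za-z0-9.-]*")
NMTOKEN = re.compile(r"[A-Za-z0-9.-]+")
NUMBER = re.compile(r"[0-9]+")


def check_attribute_value(declared_type, value):
    """Return True if value is acceptable for the declared attribute type.

    declared_type is a keyword ("CDATA", "NAME", "NMTOKEN", "NUMBER") or a
    tuple of allowed tokens for an enumerated type such as (left|center|right).
    """
    if isinstance(declared_type, tuple):
        return value.lower() in {token.lower() for token in declared_type}
    if declared_type == "CDATA":
        return True  # any string, including the empty one
    patterns = {"NAME": NAME, "NMTOKEN": NMTOKEN, "NUMBER": NUMBER}
    if declared_type not in patterns:
        raise ValueError("unsupported attribute type: %s" % declared_type)
    return patterns[declared_type].fullmatch(value) is not None
```

Under these assumptions, a START value of 1 passes the NUMBER check, and an ALIGN value is accepted only if it is one of the three listed tokens.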
Default Value Specification
Finally, for each attribute in an ATTLIST declaration,
either a default value is provided or a keyword is specified
indicating whether this attribute is changeable and/or required. In
this position, character strings need not be enclosed in parentheses
(although they should be put in quotes if they contain spaces or
delimiters), but the keywords require using a # escape
character as in the #PCDATA keyword mentioned earlier.
Here's a part of ATTLIST for TH and
TD elements showing default values for ROWSPAN
and COLSPAN attributes:
<!ATTLIST (th|td) -- header or data cell --
...
rowspan NUMBER 1 -- number of rows spanned by cell --
colspan NUMBER 1 -- number of cols spanned by cell --
...
>
More often, however, you'll see in place of the default value a
keyword from the following list:
- #FIXED
- This keyword must precede the actual default value and is used to specify that the value cannot be changed by the user. The DTD uses it only once, in the declaration of the
VERSION attribute of the HTML element:
<!ATTLIST HTML
VERSION CDATA #FIXED "%HTML.Version;"
...
>
This means that the only possible value of the
VERSION attribute is the string
substituted for the %HTML.Version;
parameter entity. (See "Parameter
Entities," earlier in this chapter).
- #IMPLIED
- This keyword indicates that the
attribute is optional.
- #REQUIRED
- This keyword indicates that the
attribute is obligatory. For example:
<!ATTLIST PARAM
name CDATA #REQUIRED -- property name --
value CDATA #IMPLIED -- property value --
...
>
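The default value rules can be summarized in a small Python sketch. This is only an illustration of the logic described above, under the assumption that each attribute declaration has already been reduced to a (keyword, value) pair; the function name and data layout are my own.

```python
def resolve_attributes(declarations, supplied):
    """Apply ATTLIST default rules to the attributes supplied on a start tag.

    declarations maps an attribute name to a (keyword, value) pair:
    ("#REQUIRED", None), ("#IMPLIED", None), ("#FIXED", "v"), or
    (None, "v") for a plain default value.
    """
    resolved = {}
    for name, (keyword, default) in declarations.items():
        if name in supplied:
            if keyword == "#FIXED" and supplied[name] != default:
                raise ValueError("%s is fixed to %r" % (name, default))
            resolved[name] = supplied[name]
        elif keyword == "#REQUIRED":
            raise ValueError("required attribute %s is missing" % name)
        elif keyword == "#IMPLIED":
            continue  # no value: the application decides what to do
        else:
            resolved[name] = default  # plain or #FIXED default
    unknown = set(supplied) - set(declarations)
    if unknown:
        raise ValueError("undeclared attributes: %s" % ", ".join(sorted(unknown)))
    return resolved
```

For a TD cell written with only COLSPAN=2, the sketch fills in ROWSPAN with its default of 1, while a PARAM element without NAME is rejected.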
Sometimes, a part of the DTD must be processed in a different way than
the rest of it. For this, SGML offers the generic mechanism of
marked sections that make it possible to isolate any markup
statements and declarations in order to control their processing. HTML
DTD uses this mechanism to mark its deprecated features that
should be avoided in documents but are kept in the DTD for backwards
compatibility. Here's what a marked section looks like:
<![ %HTML.Deprecated [
<!ENTITY % preformatted "PRE | XMP | LISTING">
]]>
The %HTML.Deprecated; entity expands into a special
keyword that tells the parser what to do with the contents of the
section. The two keywords used in various HTML DTDs are
IGNORE and INCLUDE. The IGNORE
keyword makes the parser skip the marked section completely, and the
INCLUDE keyword makes it process the contents on equal
terms with the rest of the DTD. So, to get a "strict" version of a DTD,
all you need to do is change the declaration
<!ENTITY % HTML.Deprecated "INCLUDE">
to
<!ENTITY % HTML.Deprecated "IGNORE">
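The effect of switching between INCLUDE and IGNORE can be mimicked with a few lines of Python. This sketch assumes non-nested marked sections whose status keyword is always given through a parameter entity reference, and it takes the entity values from a plain dictionary; the function name is hypothetical.

```python
import re

# Matches a non-nested marked section whose status keyword is given
# through a parameter entity reference: <![ %HTML.Deprecated [ ... ]]>
MARKED_SECTION = re.compile(r"<!\[\s*%([A-Za-z][\w.-]*);?\s*\[(.*?)\]\]>", re.DOTALL)


def apply_marked_sections(dtd_text, entities):
    """Keep or drop each marked section according to its status keyword."""

    def substitute(match):
        status = entities.get(match.group(1), "").strip().upper()
        if status == "INCLUDE":
            return match.group(2)  # keep the contents, drop the wrapper
        if status == "IGNORE":
            return ""  # drop the whole section
        raise ValueError("unknown status for %%%s;" % match.group(1))

    return MARKED_SECTION.sub(substitute, dtd_text)
```

Passing {"HTML.Deprecated": "IGNORE"} removes the deprecated declarations from the text, while "INCLUDE" leaves them in place without the surrounding marked-section delimiters.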