HTML Unleashed: SGML and the HTML DTD
SGML Declaration for HTML 4.0
The SGML
declaration is a formal construct used to specify general
information about an SGML application and its associated document
type. The following sections list and analyze the SGML
declaration for HTML 4.0 provided by W3C.
The SGML declaration is contained in the SGML statement, which has the
following syntax:
<!SGML "ISO 8879:1986" ... >
The ellipsis here represents the body of the SGML declaration, and the
string ISO 8879:1986 identifies the version of the
SGML standard that this declaration conforms to. In our case, this is
the original ISO specification published in 1986.
In the body of the declaration, first comes the comment part:
--
SGML Declaration for HyperText Markup Language
version 4.0
With support for Unicode UCS-2 and increased limits
for tag and literal lengths etc.
--
The rest of the declaration body is divided into sections that are
described next.
CHARSET Section
The CHARSET section of the SGML declaration is used to specify
the character set to be used by documents conforming to
this document type. So what is a character set?
You probably know that the characters that appear on your display are
coded inside the computer by some bit combinations, usually
bytes consisting of eight bits. Unfortunately, different
computers and operating systems sometimes use the same bytes to
represent different characters on the screen. The most frequent reason
for this is that localized versions of programs and operating systems
need to represent non-Latin characters of a particular language's
alphabet (such as the Cyrillic alphabet for Russian).
Thus, to make the SGML document as unambiguous as possible, the SGML
declaration defines exactly what character set it uses, that
is, what bit combinations (codes) are allowed within a conforming
document and what characters they are intended to mean. To define a
character set, you need to specify three things: first, the set of
codes used; second, the set of characters represented; and third, the
mapping between these two sets.
The set of codes is easy to specify by simply listing these codes in
decimal or hexadecimal form. The set of characters, or character
repertoire, is trickier. You cannot simply "draw" a character
in the specification because the SGML declaration itself is
represented by a plain text file where every character is coded by a
bit combination not guaranteed to mean the same on all systems. One
possible way to overcome this difficulty is to give a textual
description for every character in the repertoire (for example,
"CYRILLIC CAPITAL LETTER A").
However, the creators of SGML chose a less complicated way of dealing
with the problem. The SGML declaration makes use of other character set
standards that have already been adopted by standard-setting bodies
(mostly ISO) and that can provide us with a full specification of
nearly any character in the world. Having made a reference to such a
standard, you can then use character numbers in that standard to
clearly identify what character you need for your document's character
set. Here's how this is done.
First comes the CHARSET keyword that marks the beginning of
the corresponding section. It is followed by the BASESET
keyword, which introduces the name of the character set standard being
referenced:
CHARSET
BASESET "ISO 646:1983//CHARSET
International Reference Version
(IRV)//ESC 2/5 4/0"
The standard specified here, commonly referred to as "ISO 646," is
practically indistinguishable from what is called "7-bit ASCII." Its
128 characters cover all Latin alphabet characters, digits,
punctuation, and some special characters. It is the greatest common
subset for nearly all character sets in use now, and you're unlikely
to find a computer or a program (even a localized version) that uses
something other than ISO 646 for its first 128 byte codes.
However, the SGML declaration for HTML 4.0 does not use all of this
character set, but only a certain part of it. The selection is done
using the DESCSET keyword:
DESCSET
0 9 UNUSED
9 2 9
11 2 UNUSED
13 1 13
14 18 UNUSED
32 95 32
127 1 UNUSED
Here, the target HTML character set that we need to define is divided
into subranges, with a clear identification of where the characters in
each subrange come from. The first number in each line specifies the
starting code of the subrange; the second, its length; and the third
position is occupied either by a number identifying the code in the
reference standard from which to start copying characters, or by the
UNUSED keyword, which means that the codes in this
subrange are not allowed.
Thus, the first line in the preceding code means that the codes in the
range 0-8 inclusive (decimal) cannot be used within documents
conforming to the HTML 4.0 specification. The second line says that,
starting from code 9 onward, we borrow 2 characters that are coded 9
and 10 in the ISO 646 standard (in other words, within this two-
character range our character set is identical to ISO 646). The next
two characters are again unused, then we take one character with code
13, skip 18 more characters, and so on.
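The reading of the DESCSET lines described above can be sketched in a few lines of Python. This is only an illustration, not part of any SGML tool: the names DESCSET_ISO646 and lookup are hypothetical, and the triples are copied from the declaration with None standing in for UNUSED.

```python
# DESCSET triples from the declaration:
# (start code, number of codes, base-set code or None for UNUSED).
DESCSET_ISO646 = [
    (0, 9, None),
    (9, 2, 9),
    (11, 2, None),
    (13, 1, 13),
    (14, 18, None),
    (32, 95, 32),
    (127, 1, None),
]


def lookup(code, descset):
    """Return the base-set code that `code` maps to, or None if it is
    UNUSED or not described at all (hypothetical helper)."""
    for start, count, base in descset:
        if start <= code < start + count:
            return None if base is None else base + (code - start)
    return None


# Codes below 128 that a conforming document may contain.
allowed = [c for c in range(128) if lookup(c, DESCSET_ISO646) is not None]
print(allowed[:4], len(allowed))
```

Under these assumptions, the only codes below 32 that survive are 9, 10, and 13 (tab, line feed, and carriage return); everything from 32 through 126 is taken over unchanged from ISO 646.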
So, we have defined the first 128 characters of the HTML 4.0 character
set. However, to specify the remainder of the code table, we have to
refer to another standard. The syntax of the CHARSET
section allows the specification of as many external standards as
needed and the borrowing of characters from each of them (that is, to
have as many BASESET/DESCSET pairs as necessary).
BASESET "ISO Registration Number 176//CHARSET
ISO/IEC 10646-1:1993 UCS-2 with
implementation level 3//ESC 2/5 2/15 4/5"
DESCSET 128 32 UNUSED
160 65375 160
The previous version of HTML, 3.2, referred to the standard named
"ISO 8859-1" or "ISO Latin-1" to define the characters beyond 7-bit
ASCII. ISO Latin-1 uses 8-bit codes and therefore accommodates a
total of 256 characters (coded 0-255 inclusive), with the range
128-255 containing letters with diacritical marks used in different
European languages as well as some special symbols (registered
trademark, copyright, fractions, and so on). The first 128 characters
of Latin-1 are identical to those of 7-bit ASCII.
However, the need for better support of languages other than English
and Western European languages led to the development of a set of
provisions commonly referred to as "HTML Internationalization,"
initially described in RFC 2070 and then incorporated into HTML 4.0.
One of the key features of the internationalized HTML is the extended
character set that makes use of the Unicode coding standard. Unicode
uses 16-bit (two-byte) codes and therefore covers as many as 65536
characters, including nearly all national alphabets of the world and
hordes of special symbols.
More precisely, HTML 4.0 refers to the ISO standard named "ISO/IEC
10646-1:1993" or simply "ISO 10646" which is a superset of Unicode and
generally uses four-byte codes. However, the UCS-2 in the
BASESET statement above identifies a special mode of ISO
10646 which uses two-byte codes and is in effect indistinguishable
from Unicode. All of these coding standards and related issues are
covered in much more detail in Chapter 39, "Internationalizing HTML."
One question that you may have by now, however, needs to be answered
immediately. Does the SGML declaration imply that with HTML 4.0, you
have to use Unicode for your documents? No, because the document
character set we're defining is different from the external
character encoding that the document is in when created, stored,
and served over the network. For the external character encoding, you
may use any character set standard that is best suited for the
document's content. In practice, the only area affected by the
document character set as per the SGML declaration is numeric character
references such as &#160; (the no-break space), which in HTML 4.0 must
point to Unicode code positions. Again, for more details on these issues
refer to Chapter 39.
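The difference between the document character set and the external encoding can be illustrated with a short Python sketch. It assumes Python's standard html module and its built-in koi8_r, cp1251, and utf-8 codecs; the choice of the Cyrillic capital letter A (code position 1040) is only an example.

```python
import html

# The numeric reference names a code position in the document character
# set (Unicode in HTML 4.0): 1040 decimal is U+0410, CYRILLIC CAPITAL
# LETTER A.
char = html.unescape("&#1040;")

# The same character has different byte values in different external
# encodings.
encodings = ["koi8_r", "cp1251", "utf-8"]
encoded = {name: char.encode(name) for name in encodings}

# Decoding each byte sequence with its own encoding yields the same
# character again.
decoded = {name: data.decode(name) for name, data in encoded.items()}

for name in encodings:
    print(name, encoded[name], decoded[name] == char)
```

The reference is resolved against Unicode no matter which of the three encodings the file itself happens to be stored in.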
Unicode itself is a superset of Latin-1, as the first 256 characters
of Unicode are identical to those of Latin-1. Also, the latter is
likely to remain for a long time the most popular choice for the
external character encoding of HTML documents. A separate
table lists the first 256 characters of the HTML
document character set as specified by the SGML declaration for HTML 4.0.
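The claim that the first 256 Unicode code positions coincide with Latin-1 can be checked with a small sketch, assuming Python's built-in latin-1 codec as a stand-in for ISO 8859-1:

```python
# Decode every possible byte value as ISO 8859-1 (Latin-1).
latin1_text = bytes(range(256)).decode("latin-1")

# Each resulting character's Unicode code point equals the original
# byte value.
matches = all(ord(ch) == byte for byte, ch in enumerate(latin1_text))
print(matches)
```

Note that Python's codec also maps bytes 128-159 to control code positions, which the HTML 4.0 declaration marks as UNUSED.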
The CAPACITY section is meant to provide a rough estimate of the
system resources (more specifically, different types of memory) that
an SGML parser will need to allocate in order to process the
DTD. This is not very reliable information, however, because
memory usage largely depends on the internal architecture
of the parsing application. Most SGML parsers do not take
these values into account, and the creators of HTML simply assigned
large enough numbers to these parameters to ensure that processing the
DTD won't be aborted because one of the CAPACITY
values has been exceeded. The CAPACITY parameters are not discussed
individually here; you can refer to the SGML specification for
details.
CAPACITY SGMLREF
TOTALCAP 150000
GRPCAP 150000
ENTCAP 150000
The SGMLREF keyword means that all CAPACITY
types that are not indicated here should take their default values
from the SGML reference concrete syntax. (See the next section for
more on this.)
The next major section of the SGML declaration is introduced by the
SYNTAX keyword. It is provided to define various syntax
features of the SGML application, such as naming rules, delimiter and
control characters, reserved names, and limits used by the DTD and
conforming SGML documents. This syntax is called the "application
concrete syntax" as opposed to the "reference concrete syntax" of SGML
itself, which is used in the SGML declaration (but not the DTD, as
specified by the SCOPE parameter). As you'll see shortly, in the case
of HTML, the differences between these syntaxes are minimal.
Immediately before the SYNTAX section comes the SCOPE
DOCUMENT declaration:
SCOPE DOCUMENT
Its sole purpose is to specify that the application concrete syntax to
be declared will be used not only by the conforming SGML documents but
also by the DTD of this SGML application.
Shunned Characters Declaration
The SYNTAX section starts with the list of shunned
characters' codes preceded by the SHUNCHAR keyword:
SHUNCHAR CONTROLS 0 1 2 3 4 5 6 7 8 9 10 11
12 13 14 15 16 17 18 19 20 21 22 23
24 25 26 27 28 29 30 31 127
"Shunned" doesn't mean "prohibited," and the list of shunned character
codes doesn't fully coincide with the UNUSED codes in the
character set declaration. In fact, some of the shunned characters
(for example, the carriage return and line feed characters) are
outright necessary in any text file, an SGML document being no
exception. However, these characters should be used with care because
their meaning and usage may depend on the computer environment in which
the text is processed (for example, a text line in MS-DOS and Windows is
terminated by a carriage return and line feed pair, whereas UNIX
systems use a single line feed). The keyword CONTROLS
means that if a particular computer system uses some other characters
as control codes (and not displayable characters), these should be
added to the SHUNCHAR list.
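To see exactly where the shunned codes and the UNUSED codes differ, the two lists can be compared directly. The following Python sketch copies both from the declaration; the variable names are hypothetical.

```python
# Codes listed after SHUNCHAR CONTROLS in the declaration.
shunned = set(range(32)) | {127}

# Ranges marked UNUSED in the first DESCSET of the CHARSET section,
# given as (start code, number of codes).
unused_ranges = [(0, 9), (11, 2), (14, 18), (127, 1)]
unused = {
    code
    for start, count in unused_ranges
    for code in range(start, start + count)
}

# Shunned codes that a conforming document may nevertheless contain.
shunned_but_allowed = sorted(shunned - unused)
print(shunned_but_allowed)
```

The difference consists of codes 9, 10, and 13: the tab, line feed, and carriage return characters.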
Syntax Character Set Declaration
Next comes what may be considered a duplicate of the
CHARSET section: a BASESET/DESCSET pair
defining a character set (see "CHARSET Section" above):
BASESET "ISO 646:1983//CHARSET
International Reference Version
(IRV)//ESC 2/5 4/0"
DESCSET 0 128 0
What is the purpose of this additional definition?
The character set defined in the SYNTAX section is used
only within that section and nowhere else. This reminds us once again
of the fact that any text document, SGML declaration included, is
actually nothing but a sequence of codes, and to get to the meaning we
need to know which character corresponds to each code. Having provided
a separate character set declaration within the SYNTAX
section, we can ensure that the syntax definition is completely
independent of the document character set (defined in the
CHARSET section). In other words, we won't have to rewrite
the SYNTAX section when the content of CHARSET
section is changed.
Function Characters Declaration
The FUNCTION keyword is used to identify the character
codes for so-called function characters:
FUNCTION
RE 13
RS 10
SPACE 32
TAB SEPCHAR 9
Function characters are special characters that may affect
syntax. All function characters defined here are separators
whose role is identical to that of white space. The RE
and RS identifiers simply denote the carriage return and line
feed characters; they are short for Record End and Record Start,
respectively. (In SGML, a line in a text file is sometimes termed a
record, similar in a way to a database record.) TAB
(the tabulation character) is not recognized as a separator by the SGML
standard, which is why it is accompanied by the additional classifier
SEPCHAR.
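The role of these separators can be sketched as a simple tokenizer. This is a simplified illustration only: FUNCTION_CHARS repeats the codes from the declaration, while split_tokens is a hypothetical helper that treats all four characters alike, as a parser would inside markup.

```python
# Function character codes from the FUNCTION declaration.
FUNCTION_CHARS = {"RE": 13, "RS": 10, "SPACE": 32, "TAB": 9}
SEPARATORS = {chr(code) for code in FUNCTION_CHARS.values()}


def split_tokens(text):
    """Split text into tokens at any run of separator characters."""
    tokens, current = [], []
    for ch in text:
        if ch in SEPARATORS:
            if current:
                tokens.append("".join(current))
                current = []
        else:
            current.append(ch)
    if current:
        tokens.append("".join(current))
    return tokens


print(split_tokens("ALIGN\t=\r\n  left"))
```

A tab, a carriage return and line feed pair, and a run of spaces all separate tokens in the same way.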
Next comes the NAMING declaration, which regulates which characters
may appear in element and entity names and which may start a name:
NAMING LCNMSTRT ""
UCNMSTRT ""
LCNMCHAR ".-" -- ?include "~/_" for URLs? --
UCNMCHAR ".-"
To facilitate recognition of a name by the parser, the repertoire of
characters allowed in the first position of a name is limited as
compared to the rest of the name. The SGML standard itself allows Latin
letters only as name start characters and Latin letters plus digits as
ordinary name characters, so here we only need to specify additions to
these sets. The characters are specified by using strings in quotes
(called literals), and separate parameters are provided for
indicating uppercase and lowercase character versions in each class.
Thus, the preceding lines tell us that in HTML, only Latin letters are
allowed as name start characters (the corresponding parameter strings
are empty) while the repertoire of ordinary name characters is
extended by the period and the hyphen. These characters are caseless
and thus are shown the same in both the LCNMCHAR (LowerCase NaMe
CHARacters) and UCNMCHAR (UpperCase NaMe CHARacters) parameters.
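The naming rule just described can be approximated by a regular expression. This is a sketch under the stated rules only: NAME_RE and is_valid_name are hypothetical names, and the default length limit is borrowed from the NAMELEN value given in the QUANTITY declaration.

```python
import re

# A Latin letter first, then any number of Latin letters, digits,
# periods, or hyphens.
NAME_RE = re.compile(r"[A-Za-z][A-Za-z0-9.\-]*")


def is_valid_name(name, namelen=65536):
    """Check a candidate name against the naming rules (hypothetical)."""
    return len(name) <= namelen and NAME_RE.fullmatch(name) is not None


for candidate in ["H1", "http-equiv", "1st", "_top"]:
    print(candidate, is_valid_name(candidate))
```

Names such as H1 and http-equiv pass, while 1st (starts with a digit) and _top (the underscore is not a name character) do not.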
NAMECASE GENERAL YES
ENTITY NO
The NAMECASE declaration governs case sensitivity of the
SGML application concrete syntax. It is further subdivided into
ENTITY, which applies to entity names only (for more on
entities, see "Entities" below), and
GENERAL, which covers all the rest, including element
names. Here's the answer to the question of why <img>
and <IMG> are treated the same in HTML while
&eacute; and &Eacute; aren't.
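The effect of the two NAMECASE settings can be sketched as a normalization step applied before names are compared. The function normalize and its kind argument are hypothetical; folding to uppercase reflects how SGML treats names when case substitution is switched on.

```python
def normalize(name, kind):
    """Fold general names to uppercase (GENERAL YES); leave entity
    names untouched (ENTITY NO). Hypothetical helper."""
    return name if kind == "entity" else name.upper()


# Element names compare equal regardless of case.
same_element = normalize("img", "general") == normalize("IMG", "general")

# Entity names stay distinct when only their case differs.
same_entity = normalize("eacute", "entity") == normalize("Eacute", "entity")

print(same_element, same_entity)
```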
The DELIM declaration allows you to change the character
sequences used as tag delimiters in the SGML application.
DELIM GENERAL SGMLREF
SHORTREF SGMLREF
The SGMLREF values indicate that in this respect HTML
syntax is no different from SGML syntax; you use < as the
opening delimiter of a start tag, </ as the opening
delimiter of an end tag, > as the closing delimiter of either tag, and
so on. As this part of the SGML declaration adds very little information
on HTML syntax, it need not be discussed in detail.
Reserved Names Declaration
The NAMES keyword may be used to change some of the
reserved SGML names that will be used in DTD declarations.
NAMES SGMLREF
Again, the SGMLREF value indicates that the list of these
reserved names is exactly that provided by the SGML specification. Many
of these reserved names are discussed later in the section on the DTD.
Quantity Limits Declaration
The last in the SYNTAX section is the QUANTITY
declaration:
QUANTITY SGMLREF
ATTSPLEN 65536 -- Implementors are recommended --
LITLEN 65536 -- to avoid fixed limits but --
NAMELEN 65536 -- this is the best we can say here --
PILEN 65536
TAGLVL 100
TAGLEN 65536
ATTCNT 100
GRPGTCNT 150
GRPCNT 64
This declaration sets limits for some lengths and counters used by the
parser in processing the DTD and conforming documents. Just as in
the CAPACITY section, many of these parameters are assigned
arbitrarily large values that effectively mean "no limit at all"; it is
difficult to imagine that one might need, for example, an element name
(governed by the NAMELEN parameter) that is 65,536
characters long. Most HTML browsers disregard these limitations (or
have their own instead), so the individual QUANTITY
parameters aren't discussed here.
The section of the SGML declaration introduced by the FEATURES
keyword contains parameters that turn on or off some of the features
of SGML syntax; that is, they allow or disallow using these features
in the SGML application being defined. These features are divided into
three classes: MINIMIZE, LINK, and
OTHER. Following the HTML-oriented approach used throughout
the chapter, only those features that are turned on in the SGML
declaration for HTML 4.0 are considered here.
The MINIMIZE class contains the markup minimization
features that are intended to facilitate using SGML markup and to make
it more readable for humans. Minimization features allow you to omit
tags and other markup instructions in certain situations where context
is sufficient to resolve the resulting ambiguity.
- OMITTAG YES
- The OMITTAG feature allows the DTD to specify that
for certain elements, start and/or end tags may be omitted.
Such an element will be opened or closed based on matching the
context against the corresponding content model. (See the
upcoming "Elements" section.) The most
common example in HTML is the P element, whose
end tag </P> can always be safely omitted.
- SHORTTAG YES
- This feature is very interesting. In fact, it bundles a whole
set of different features that could save a lot of typing
when marking up a document. With SHORTTAG YES, you
can use the empty start tag <>, the empty end tag
</>, type pairs of tags in the form
<TAGNAME/.../, omit attribute names, and so on,
with all missing information inferred by the parser through
simple and effective rules. Unfortunately, common browsers do
not support these features, so they are mostly of theoretical
interest for HTML users.
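How an omitted </P> end tag can be inferred from context may be sketched with Python's standard html.parser module. This is a deliberately simplified illustration, not how a real SGML parser works: the class ImpliedParagraphEnds is hypothetical and handles only one case, a new paragraph starting while another is still open, instead of consulting content models from the DTD.

```python
from html.parser import HTMLParser


class ImpliedParagraphEnds(HTMLParser):
    """Record tag events, adding an implied </p> where the end tag was
    omitted before the next <p> (simplified sketch)."""

    def __init__(self):
        super().__init__()
        self.events = []
        self.p_open = False

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            if self.p_open:
                # A paragraph cannot contain another paragraph, so the
                # open one must end here.
                self.events.append("</p> (implied)")
            self.p_open = True
        self.events.append(f"<{tag}>")

    def handle_endtag(self, tag):
        if tag == "p":
            self.p_open = False
        self.events.append(f"</{tag}>")


parser = ImpliedParagraphEnds()
parser.feed("<P>First paragraph<P>Second paragraph</P>")
parser.close()
print(parser.events)
```

The html.parser module reports tag names in lowercase, so the recorded events use <p> even though the input uses <P>.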
The LINK class contains features that affect processing
attributes of elements. None of these are allowed in HTML.
The OTHER class contains miscellaneous features that didn't
fit into MINIMIZE or LINK classes.
- FORMAL YES
- This feature indicates that PUBLIC entity
declarations (see the section on public
identifiers) should use the formal syntax of
public identifiers to enable automatic substitution of external
sources by the parser.