Any document can be thought of as consisting of content and
markup. The content is relatively straightforward; it
comprises all the characters of text, images, and the rest of
the meat that delivers the message of your document. However,
if you simply pile all this stuff together, you won't get a readable
document. You need to mark it up---that is, to introduce some
new information into it that couldn't be automatically deduced from
the content.
In fact, even a plain ASCII text, often cited as an example of pure
text without any formatting, contains a fair amount of markup. For
instance, it is markup (here, line breaks and blank lines) that allows
you to determine the width of the text column in a plain text file, or
where the boundaries between paragraphs lie. Of course, the inventory of markup instructions
(called tags) that you can apply is determined by the format of
your document and the tools you use to work with it.
So what is the purpose of markup? In other words, what information
can, and should, be conveyed by a document beyond its content? This
extra information can be divided into two groups: structure and
formatting. Structure tells us how the document logically
breaks down into parts (paragraphs, sections, chapters) and how these
parts are hierarchically organized, from the document itself at the
very top to the atoms of content at the very bottom. Formatting (in a
broad sense) governs presentation of the document: which fonts to use
to display it, in what tone of voice to read it aloud, how to break it
into pages for printing, and so on.
Here enters the Great Markup Controversy that was one of the main
driving forces behind the creation of SGML and whose backwash is
still disturbing the HTML community. Its essence can, without much
oversimplification, be reduced to the question of which of the two
markup types is more important and should be given priority.
Many contemporary text processing systems are heavily
biased towards marking up the formatting aspects of a document
first (this sort of markup is often called procedural or
presentational). The reason is obvious: what an average user
needs most often is a formatted document, not a structural diagram of
its parts.
However, the bitter truth is that procedural markup may actually
impede using and reusing the document if not accompanied, and even
preceded, by proper structural (often called descriptive or
generic) markup. For instance, if a document file contains
instructions to set a line of text in Times Roman, size 12 pt, and
left-aligned, but never hints that this line is a heading or a figure
caption, then this markup is very restrictive.
As a result, you won't be able to change the formatting of all
headings in a document at once. You'll have difficulty
exporting the document into another format or another medium (such
as a voice synthesis system), which may be using different means of
formatting headings. You cannot even automatically generate a
table of contents. To put it simply, unless you know what a given
part of the text is supposed to be, information on how to
render it is of very limited value.
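To make the table of contents point concrete, here is a minimal sketch in Python. The document model (a list of blocks, each with a `role` and, for headings, a `level`) is purely an assumption made for illustration; it is not the format of any particular system. Because every block states what it is, collecting the headings takes only a few lines.

```python
# Hypothetical document model: each block records its logical role,
# not how it should look on the page.
document = [
    {"role": "heading", "level": 1, "text": "Markup Basics"},
    {"role": "paragraph", "text": "A document consists of content and markup."},
    {"role": "heading", "level": 2, "text": "Structure"},
    {"role": "paragraph", "text": "Structure describes the logical parts."},
    {"role": "heading", "level": 2, "text": "Formatting"},
]


def table_of_contents(blocks):
    """Return indented table-of-contents lines for blocks marked as headings."""
    lines = []
    for block in blocks:
        if block["role"] == "heading":
            # Indent two spaces per nesting level below the top one.
            indent = "  " * (block["level"] - 1)
            lines.append(indent + block["text"])
    return lines


toc = table_of_contents(document)
print("\n".join(toc))
```

Had the same blocks recorded only a font name and a point size, the function would have nothing reliable to select on.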
On the other hand, after you have added information about the logical
role of every element in the text, the natural next step is to attach
the presentational markup to these logical tags rather than to actual
parts of the text. Now you don't have to specify the font size for a
heading in your text any more; it is enough to mark it up as a
heading, and the rest is taken care of automatically (provided that
someone has associated certain formatting parameters with the
structural heading tag).
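The idea of attaching formatting to logical tags can be sketched as follows. The role names, the `STYLE_SHEET` dictionary, and the `render` function are all assumptions invented for this example; real style sheet mechanisms are considerably richer.

```python
# Hypothetical style sheet: formatting parameters are attached to logical
# roles, never to individual pieces of text.
STYLE_SHEET = {
    "heading": {"font": "Times Roman", "size_pt": 12, "align": "left"},
    "caption": {"font": "Times Roman", "size_pt": 10, "align": "center"},
}


def render(blocks, style_sheet):
    """Pair each block's text with the formatting defined for its role."""
    return [(block["text"], style_sheet[block["role"]]) for block in blocks]


blocks = [
    {"role": "heading", "text": "Introduction"},
    {"role": "caption", "text": "Figure 1. Markup layers"},
    {"role": "heading", "text": "Conclusion"},
]

# One change in the style sheet restyles every heading in the document.
STYLE_SHEET["heading"]["size_pt"] = 14
rendered = render(blocks, STYLE_SHEET)
for text, style in rendered:
    print(text, style)
```

The text itself is never touched when the heading size changes, which is exactly what lets content and formatting be maintained separately.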
This concept, called "separation of presentation from content," is the
major advantage of all systems that put descriptive markup first. When
the two are separated, content and formatting can be developed by different
people and modified much more easily. Thus, one of the roles of
descriptive markup is to serve as an intermediate layer separating
content and formatting of a document.
It should be admitted that text processing systems that are in common
use now (such as office word processors) do not completely ignore the
benefits of descriptive markup. The named styles that you apply to
paragraphs in, say, Microsoft Word, represent some sort of descriptive
markup units with certain formatting tags associated with each of
them. Moreover, users can create new styles as needed for their
documents. This provides for a certain level of separation of
presentation from content.
However, such a solution is only partial because style tags do not
impose any restrictions on the structure of your document. For
example, it's no problem to assign a heading style to a paragraph
inside a figure caption or a footnote---which is pretty much
senseless. Also, you are in no way discouraged from making direct
changes to text attributes, such as font face or size, thereby
overriding their values in a style. Styles in word processors are mere
containers for presentation attributes and not a means to impose some
prescribed structure on a document being created or processed.
You might wonder, do we really need to impose any structure on the
document contents? Yes, and here's why: You can't predict what uses
will be made of your document tomorrow or in a year, what formats
it'll need to be converted to, or what media it'll be put onto. By
using a strictly defined set of hierarchical descriptive tags, you
ensure that the text can be processed automatically without any need
to manually disambiguate cases such as a heading inside a footnote. I
could say that descriptive markup reveals the immaterial soul of a
document so that any program or person can then conveniently incarnate
it into a body of choice.
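What a "strictly defined set of hierarchical descriptive tags" buys you can be illustrated with a small checker. The element names and the `ALLOWED_CHILDREN` table below are hypothetical, chosen only to mirror the heading-inside-a-footnote case; this is a sketch of the general idea, not of how any real system expresses its rules.

```python
# Hypothetical content model: which child elements each element may contain.
ALLOWED_CHILDREN = {
    "document": {"section"},
    "section": {"heading", "paragraph", "figure", "section"},
    "paragraph": {"footnote"},
    "figure": {"caption"},
    "heading": set(),
    "caption": set(),
    "footnote": set(),
}


def find_violations(element, path=""):
    """Return a message for every child that its parent is not allowed to contain.

    An element is a (name, children) tuple; children is a list of elements.
    """
    name, children = element
    here = f"{path}/{name}"
    violations = []
    for child in children:
        if child[0] not in ALLOWED_CHILDREN.get(name, set()):
            violations.append(f"{child[0]} not allowed in {here}")
        violations.extend(find_violations(child, here))
    return violations


good = ("document", [("section", [("heading", []),
                                  ("paragraph", [("footnote", [])])])])
bad = ("document", [("section", [("paragraph",
                                  [("footnote", [("heading", [])])])])])

print(find_violations(good))
print(find_violations(bad))
```

A document that passes such a check can be handed to any downstream program without a person first having to decide what a misplaced heading was meant to be.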
The provision for automatic processing is the advantage that outshines
all others. It is difficult to imagine how many resources humankind
spends annually on preparation, processing, and interchange of
documents. Office computing and desktop publishing software made this
work easier, but, in many cases, proprietary and presentation-oriented
tools put more obstacles in the flow of documents than they deliver
benefits. An open and extensible system of descriptive markup would
thus be invaluable in many situations.
To summarize, what we need is a markup system focusing on the structure of
a document rather than its formatting. It should allow us to build a
hierarchy of descriptive tags so that they can serve not only to
separate and describe different parts of a document but also to
formally prescribe its structure.
An equally important requirement is that the system should provide for
easy extension and modification. Ideally, a user should be able to
define a completely new set of tags if such a need arises. Finally,
this system should not be proprietary; it is important that anyone be
free to create and use markup tools based on this system and to
produce software implementing these tools.
SGML is the system designed to satisfy all of these requirements, as
well as many others. SGML is strictly descriptive and contains no
means to mark up presentational aspects of documents. However, SGML
can be easily interfaced to external procedural markup systems and
style sheets.
It is in customizability that SGML reveals its real power. In
fact, SGML is not a markup system by itself; it is, rather, a
metasystem enabling users to create such systems for particular
types of documents. Its flexible syntax makes it possible to build
markup languages (HTML being one of the examples) to match any
imaginable demand. Moreover, any single SGML document can be provided
with its own "local" markup definitions fine-tuned for the particular
purpose.
Just like HTML, SGML is a text-based computer language rather than a binary data format.
This means that you can create SGML files manually in a text editor,
although there exist software tools that facilitate the task. A piece
of software that reads and analyzes an SGML document (for example, for
transformation or validation) is called an SGML parser. A parser by
itself, however, is not very useful because of the purely descriptive
nature of SGML, so most often a parser is a part of a bigger document
processing or browsing application.
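The relationship between a parser and the application built around it can be sketched with Python's standard `html.parser` module. Note the assumptions: `HTMLParser` handles HTML only and is not a full SGML parser, and the `OutlineBuilder` class is a hypothetical application written for this example. The parser merely reports tags and text; the useful behavior (collecting an outline) lives in the code layered on top of it.

```python
from html.parser import HTMLParser


class OutlineBuilder(HTMLParser):
    """A tiny application on top of a parser: collects the text of headings."""

    HEADING_TAGS = ("h1", "h2", "h3")

    def __init__(self):
        super().__init__()
        self.outline = []     # list of [tag, text] pairs, in document order
        self._current = None  # heading tag currently being read, if any

    def handle_starttag(self, tag, attrs):
        if tag in self.HEADING_TAGS:
            self._current = tag
            self.outline.append([tag, ""])

    def handle_data(self, data):
        # Text is kept only while we are inside a heading element.
        if self._current is not None:
            self.outline[-1][1] += data

    def handle_endtag(self, tag):
        if tag == self._current:
            self._current = None


builder = OutlineBuilder()
builder.feed("<h1>Markup</h1><p>Text.</p><h2>Structure</h2><p>More.</p>")
builder.close()
outline = [(tag, text) for tag, text in builder.outline]
print(outline)
```

Swapping `OutlineBuilder` for a class that converts or checks the document would reuse the same parsing machinery, which is why a parser usually ships as one component of a larger tool.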