HTML Unleashed. SGML and the HTML DTD: Procedural and Descriptive Markup | WebReference


HTML Unleashed: SGML and the HTML DTD

Procedural and Descriptive Markup


Any document can be thought of as consisting of content and markup.  The content is relatively straightforward: it comprises the characters of text, the images, and everything else that delivers the message of your document.  However, simply adding all this material together won't give you a readable document.  You need to mark it up, that is, to introduce new information that cannot be automatically deduced from the content.

In fact, even plain ASCII text, often cited as an example of pure text without any formatting, contains a fair amount of markup.  For instance, line breaks and blank lines are markup: they determine the width of the text column in a plain text file and where the boundaries between paragraphs lie.  Of course, the inventory of markup instructions (called tags) that you can apply is determined by the format of your document and the tools you use to work with it.

So what is the purpose of markup?  In other words, what information can, and should, be conveyed by a document beyond its content?  This extra information can be divided into two groups: structure and formatting.  Structure tells us how the document logically breaks down into parts (paragraphs, sections, chapters) and how these parts are hierarchically organized, from the document itself at the very top to the atoms of content at the very bottom.  Formatting (in a broad sense) governs presentation of the document: which fonts to use to display it, in what tone of voice to read it aloud, how to break it into pages for printing, and so on.

Here enters the Great Markup Controversy, which was one of the main driving forces behind the creation of SGML and whose backwash is still disturbing the HTML community.  Its essence can be reduced, without much oversimplification, to the question of which of the two markup types is more important and should be given priority.

Many contemporary text processing systems are heavily biased toward marking up the formatting aspects of a document first (this sort of markup is often called procedural or presentational).  The reason is simple: what an average user needs most often is a formatted document, not a structural diagram of its parts.

However, the bitter truth is that procedural markup may actually impede using and reusing the document if not accompanied, and even preceded, by proper structural (often called descriptive or generic) markup.  For instance, if a document file contains instructions to set a line of text in Times Roman, size 12 pt, and left-aligned, but never hints that this line is a heading or a figure caption, then this markup is very restrictive.

As a result, you won't be able to change the formatting of all headings in a document at once.  You'll have difficulty exporting the document into another format or another medium (such as a voice synthesis system), which may use different means of formatting headings.  You cannot even generate a table of contents automatically.  To put it simply, unless you know what a given part of the text is supposed to be, information on how to render it is of very limited value.
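
The table-of-contents problem can be sketched in a few lines of Python.  This is only an illustration under stated assumptions: the role names "heading" and "para", the sample text, and the function table_of_contents are all hypothetical and do not come from any real markup system.

```python
# Descriptive markup: every block carries a logical role.
# The role names "heading" and "para" are hypothetical, chosen for illustration.
descriptive = [
    ("heading", "Procedural and Descriptive Markup"),
    ("para", "Any document consists of content and markup."),
    ("heading", "Why Structure Matters"),
    ("para", "Structure makes automatic processing possible."),
]

# Procedural markup: the same blocks, described only by how they should look.
# A heading and a paragraph set in the same font are indistinguishable here.
times_12 = {"font": "Times Roman", "size": 12, "align": "left"}
procedural = [(times_12, text) for _role, text in descriptive]


def table_of_contents(blocks):
    """Return the text of every block whose role is "heading"."""
    return [text for role, text in blocks if role == "heading"]


toc = table_of_contents(descriptive)
# The procedural version has no role to filter on, so the same query finds nothing.
toc_from_procedural = table_of_contents(procedural)
```

With the descriptive version, the table of contents falls out of a single filter.  With the procedural version, the information needed to build it is simply absent, and no amount of processing can recover it reliably.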

On the other hand, after you have added information about the logical role of every element in the text, the natural next step is to attach the presentational markup to these logical tags rather than to the actual parts of the text.  Now you don't have to specify the font size for a heading in your text any more; it is enough to mark it up as a heading, and the rest is taken care of automatically (provided that someone has associated certain formatting parameters with the structural heading tag).

This concept, called "separation of presentation from content," is the major advantage of all systems that put descriptive markup first.  When separated, both content and formatting can be developed by different people and modified much more easily.  Thus, one of the roles of descriptive markup is to serve as an intermediate layer separating content and formatting of a document.
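
As a rough sketch of this separation, the formatting can be kept in a table keyed by logical role, apart from the content itself.  The role names, the style values, and the render function below are assumptions made for the example, not part of SGML or of any particular style sheet language.

```python
# Content with descriptive markup only; no formatting appears in it.
# Role names and style values are hypothetical.
document = [
    ("heading", "Procedural and Descriptive Markup"),
    ("para", "Any document consists of content and markup."),
    ("heading", "Why Structure Matters"),
]

# Formatting lives in a separate table keyed by logical role.
print_styles = {
    "heading": {"font": "Times Roman", "size": 18},
    "para": {"font": "Times Roman", "size": 12},
}


def render(blocks, styles):
    """Pair each block's text with the formatting assigned to its role."""
    return [(text, styles[role]) for role, text in blocks]


# Restyle every heading at once by changing one entry; the content is untouched.
large_print_styles = dict(print_styles, heading={"font": "Helvetica", "size": 24})

printed = render(document, print_styles)
large_print = render(document, large_print_styles)
```

The same document list feeds both renderings, so the person who writes the content and the person who maintains the style tables never have to edit the same data.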

It should be admitted that text processing systems that are in common use now (such as office word processors) do not completely ignore the benefits of descriptive markup.  The named styles that you apply to paragraphs in, say, Microsoft Word, represent some sort of descriptive markup units with certain formatting tags associated with each of them.  Moreover, users can create new styles as needed for their documents.  This provides for a certain level of separation of presentation from content.

However, such a solution is only partial because style tags do not impose any restrictions on the structure of your document.  For example, nothing stops you from assigning a heading style to a paragraph inside a figure caption or a footnote, which makes little sense.  Also, you are in no way discouraged from making direct changes to text attributes, such as font face or size, thereby overriding their values in a style.  Styles in word processors are mere containers for presentation attributes and not a means to impose some prescribed structure on a document being created or processed.

You might wonder, do we really need to impose any structure on the document contents?  Yes, and here's why: You can't predict what uses will be made of your document tomorrow or in a year, what formats it'll need to be converted to, or what media it'll be put onto.  By using a strictly defined set of hierarchical descriptive tags, you ensure that the text can be processed automatically without any need to manually disambiguate cases such as a heading inside a footnote.  I could say that descriptive markup reveals the immaterial soul of a document so that any program or person can then conveniently incarnate it into a body of choice.
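
To make the idea of a prescribed structure concrete, here is a minimal sketch of a checker that rejects a heading inside a footnote.  The tag names, the ALLOWED_CHILDREN table, and the validate function are hypothetical; a real SGML system expresses such rules in its own declaration syntax and is considerably richer than this.

```python
# Which tags may appear directly inside which (a hypothetical rule set).
ALLOWED_CHILDREN = {
    "document": {"section"},
    "section": {"heading", "para", "footnote", "section"},
    "footnote": {"para"},
    "heading": set(),
    "para": set(),
}


def validate(node):
    """Return a list of structure errors for a (tag, children) tree."""
    tag, children = node
    errors = []
    for child in children:
        child_tag = child[0]
        if child_tag not in ALLOWED_CHILDREN.get(tag, set()):
            errors.append(f"<{child_tag}> is not allowed inside <{tag}>")
        # Keep descending so that every violation is reported, not just the first.
        errors.extend(validate(child))
    return errors


well_formed = ("document", [
    ("section", [
        ("heading", []),
        ("para", []),
        ("footnote", [("para", [])]),
    ]),
])

heading_in_footnote = ("document", [
    ("section", [
        ("heading", []),
        ("footnote", [("heading", [])]),
    ]),
])

ok_errors = validate(well_formed)
bad_errors = validate(heading_in_footnote)
```

Because the rules are data rather than convention, a program can check them before any processing starts, and the ambiguous case never reaches the tools downstream.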

The provision for automatic processing is the advantage that outshines all others.  It is difficult to imagine how many resources humankind spends annually on the preparation, processing, and interchange of documents.  Office computing and desktop publishing software made this work easier, but, in many cases, proprietary and presentation-oriented tools create more obstacles in the flow of documents than they deliver benefits.  An open and extensible system of descriptive markup would thus be invaluable in many situations.

To summarize, what we need is a markup system that focuses on the structure of a document rather than its formatting.  It should allow us to build a hierarchy of descriptive tags so that they serve not only to separate and describe different parts of a document but also to formally prescribe its structure.

An equally important requirement is that the system should provide for easy extension and modification.  Ideally, a user should be able to define a completely new set of tags if such a need arises.  Finally, this system should not be proprietary; it is important that anyone be free to create and use markup tools based on this system and to produce software implementing these tools.

SGML is the system designed to satisfy all of these requirements, as well as many others.  SGML is strictly descriptive and contains no means to mark up presentational aspects of documents.  However, SGML can be easily interfaced to external procedural markup systems and style sheets.

It is in customizability that SGML reveals its real power.  In fact, SGML is not a markup system by itself; it is, rather, a metasystem enabling users to create such systems for particular types of documents.  Its flexible syntax makes it possible to build markup languages (HTML being one example) to match any imaginable demand.  Moreover, any single SGML document can be provided with its own "local" markup definitions fine-tuned for the particular purpose.

Just like HTML, SGML is a computer language rather than a data format.  This means that you can create SGML files manually in a text editor, although there exist software tools that facilitate the task.  A piece of software that reads and analyzes an SGML document (for example, for transformation or validation) is called an SGML parser.  A parser by itself, however, is not very useful because of the purely descriptive nature of SGML, so most often a parser is a part of a bigger document processing or browsing application.
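
The division of labor between a parser and the application around it can be illustrated with Python's standard html.parser module.  Note the assumptions: this module is a lenient HTML tokenizer, not an SGML parser, and it performs no validation; the OutlineParser class and the sample markup are invented for the example.

```python
from html.parser import HTMLParser


class OutlineParser(HTMLParser):
    """Record each piece of text together with the tags that enclose it."""

    def __init__(self):
        super().__init__()
        self.stack = []   # currently open tags, outermost first
        self.events = []  # (path of enclosing tags, text) pairs

    def handle_starttag(self, tag, attrs):
        self.stack.append(tag)

    def handle_endtag(self, tag):
        # Only pop when the end tag matches; this sketch ignores stray tags.
        if self.stack and self.stack[-1] == tag:
            self.stack.pop()

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.events.append(("/".join(self.stack), text))


parser = OutlineParser()
parser.feed("<body><h1>Markup</h1><p>Content and <em>markup</em>.</p></body>")
parser.close()

# The parser only reports structure; deciding what to do with it
# (display, conversion, indexing) is left to the surrounding application.
headings = [text for path, text in parser.events if path.endswith("h1")]
```

The parser's output is just a description of what it found.  Everything a reader would recognize as useful behavior, such as extracting the headings here, belongs to the code that consumes that description.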


Created: Jun. 15, 1997
Revised: Jun. 16, 1997