SGML: Meet Your Maker
As I have mentioned before, SGML, the Standard Generalized Markup Language, is a means of defining markup languages. A lot of the concepts in HTML, such as elements, attributes and tags, are inherited from SGML.
SGML is used to create a formal definition of HTML. All of the characteristics that we mentioned in the previous tutorial, such as which elements can be contained where, which tags can be omitted and which elements accept which attributes, are described using SGML. Why is this important?
It is important because you can then use programs to make sure your HTML document is properly written. These programs, called validators, check whether HTML documents conform to the syntax rules for HTML, and point out any errors. This makes it easy to spot errors (such as forgetting the greater-than symbol to end a tag, forgetting an end-tag that isn't implied or nesting elements improperly) and correct them. SGML also offers other exciting capabilities, such as easily converting your document to other formats suitable for different uses such as printing. In short, SGML allows us to define and check a specific syntax for HTML.
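To give a feel for the kind of error a validator catches, here is a toy sketch in Python. It is not a validator and reads no DTD: the set REQUIRES_END_TAG is a hand-picked, hypothetical subset of elements whose end tags may not be omitted, and the names NestingChecker and check are made up for this illustration. It only reports improper nesting and missing end-tags for those elements.

```python
from html.parser import HTMLParser

# Hypothetical, hand-picked subset of elements whose end tags are required.
# A real validator would take this information from the DTD instead.
REQUIRES_END_TAG = {"a", "b", "em", "i", "strong"}


class NestingChecker(HTMLParser):
    """Track open elements on a stack and record nesting problems."""

    def __init__(self):
        super().__init__()
        self.open_elements = []
        self.errors = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser hands us tag names already converted to lower case.
        if tag in REQUIRES_END_TAG:
            self.open_elements.append(tag)

    def handle_endtag(self, tag):
        if tag not in REQUIRES_END_TAG:
            return
        if self.open_elements and self.open_elements[-1] == tag:
            self.open_elements.pop()
        else:
            self.errors.append("unexpected </%s>" % tag)


def check(html):
    """Return a list of error messages for the given HTML fragment."""
    checker = NestingChecker()
    checker.feed(html)
    checker.close()
    errors = list(checker.errors)
    # Anything still on the stack never received its end tag.
    errors += ["missing </%s>" % tag for tag in checker.open_elements]
    return errors


print(check("<B><I>fine</I></B>"))
print(check("<B><I>improperly nested</B></I>"))
```

The first call reports nothing, while the second complains about the B element being closed before the I element inside it. A real validator does far more, because it works from the full formal definition of HTML rather than from a short list.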
You don't need to learn SGML in order to author HTML documents. Although a cursory understanding of its workings is useful, it is by no means necessary. The only thing you have to know is how to insert Document Type Declarations in your HTML documents. Thankfully, this is the easy part.
We have already seen document type declarations in our examples. Here's the one we've been using:
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0//EN" "http://www.w3.org/TR/REC-html40/strict.dtd">
This is the Document Type Declaration that indicates that the document follows HTML 4.0. You can just copy the above verbatim and place it at the top of each of your documents. It tells any program that processes your document what language it is in (HTML), which version of this language it is in (4.0) and where to find information about the syntax of this language (the URL you see in there).
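To make the three pieces of information concrete, here is a minimal Python sketch of how a program might pick a declaration apart. The pattern DOCTYPE_RE and the function parse_doctype are names invented for this example, and the pattern only handles the PUBLIC form used in this tutorial; real SGML tools parse declarations far more thoroughly.

```python
import re

# Matches the "PUBLIC" form of a document type declaration: the root element
# name, a quoted public identifier and an optional quoted URL.
# This is an illustrative assumption, not a complete SGML grammar.
DOCTYPE_RE = re.compile(
    r'<!DOCTYPE\s+(?P<root>\w+)\s+PUBLIC\s+"(?P<public_id>[^"]*)"'
    r'(?:\s+"(?P<system_id>[^"]*)")?\s*>',
    re.IGNORECASE,
)


def parse_doctype(text):
    """Return the parts of the first declaration found in text, or None."""
    match = DOCTYPE_RE.search(text)
    if match is None:
        return None
    return match.groupdict()


declaration = ('<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0//EN" '
               '"http://www.w3.org/TR/REC-html40/strict.dtd">')
parts = parse_doctype(declaration)
print(parts["root"])       # the language: HTML
print(parts["public_id"])  # names the version: -//W3C//DTD HTML 4.0//EN
print(parts["system_id"])  # where the syntax definition lives
```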
I mentioned earlier that W3C specifications try to reflect common practice as much as possible as well as preserve backward compatibility with earlier versions of HTML. Hence some parts of the specification are deprecated. These parts are there either because they reflect common practice that is discouraged but sometimes required, or because they represent parts of the specification that are being abandoned for alternatives but are still in wide use. The version of HTML that includes all the deprecated parts is called HTML 4.0 Transitional, since it represents a transitional phase between what exists now and what will hopefully be used when HTML 4.0 is widely adopted.
Since, as we mentioned, writing strictly to the specification is next to impossible, this is the version you'll be using most of the time, until this industry shapes up and begins acting consistently. This is the Document Type Declaration for HTML 4.0 Transitional:
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
There are a couple of other Document Type Declarations that you might use in the future, but we'll deal with those when we need them.
A useful note at this point is that SGML does not entirely define HTML as a language. It only defines the largest part of its syntax. For instance, SGML tools cannot check whether the value of an HREF attribute is a valid URL or not. For anything that SGML cannot describe, the HTML specification is there to clarify and provide the definitive answer.
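As an illustration of a check that lies outside what SGML describes, here is a rough Python sketch that collects HREF values and flags the ones that do not look like URLs. The function looks_like_url and the class HrefCollector are hypothetical names, and the test itself is a crude assumption of mine (no whitespace, and a host name for http and https URLs), not the actual URL rules the specification refers to.

```python
from html.parser import HTMLParser
from urllib.parse import urlsplit


def looks_like_url(value):
    """Crude plausibility test; an assumption, not the real URL syntax."""
    if not value or any(ch.isspace() for ch in value):
        return False
    try:
        parts = urlsplit(value)
    except ValueError:
        return False
    # An http or https URL without a host name is almost certainly a typo.
    if parts.scheme in ("http", "https") and not parts.netloc:
        return False
    return True


class HrefCollector(HTMLParser):
    """Collect HREF attribute values that fail the plausibility test."""

    def __init__(self):
        super().__init__()
        self.suspect = []

    def handle_starttag(self, tag, attrs):
        # Attribute names arrive in lower case; a missing value is None.
        for name, value in attrs:
            if name == "href" and not looks_like_url(value):
                self.suspect.append(value)


collector = HrefCollector()
collector.feed('<A HREF="http://www.w3.org/">W3C</A> '
               '<A HREF="http:/ /broken">oops</A>')
print(collector.suspect)
```

Both anchors are perfectly acceptable as far as the syntax of the A element goes; only the second one is reported here, because its HREF value contains a space.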
Both Explorer and Navigator blatantly ignore Document Type Declarations. This is a shame, since this would be the logical mechanism to use in order to rid ourselves of all the improper extensions in the browsers. As another WebReference columnist once said, however, who said logic was part of this industry?
This gives us a hint of the answer to the question I posed earlier: How do we write HTML? Let's see if we can figure out what it means to write proper HTML.