The recent XML buzz in Internet media can
easily perplex an inexperienced user. One of the biggest myths
about the new language is that you can easily and automatically convert
valid HTML to XML. To be sure, you can, and the result will
parse as well-formed XML
or even pass validation as valid XML. But
what is such a conversion worth?
XML is not about syntactic innovations such as quotes around
attribute values and trailing slashes in empty tags. XML's goal is to
comprehensively mark up all details of a given unit of information,
without mixing data belonging to different units or different aspects of
one unit. From this viewpoint, a tag-wise conversion of "real-world"
HTML, with its hopeless medley of logical and visual elements, to
XML doesn't make any sense at all. On most sites, HTML bears
little relation to the logical structure of pages and,
strictly speaking, little to their presentation either: it does not
describe the formatting of the pages, but only emulates it by
using tables, invisible spacers, and similar hacks.
So, we have to forget about the XML promise for
now, until we take the trouble of reformulating all our
data consistently, in both its presentation and (more importantly)
its content aspects. This is well known to those who take XML
seriously and are aware of what it can offer, and a growing number of new
document collections and software tools are being built from the ground up
using XML-inspired approaches. However, the huge legacy of
existing HTML documents needs special treatment.
As you may have guessed, modular HTML is an essential transition stage
on the way to XML. Just as you can update all instances of one module
throughout the site by a global search-and-replace, you can use the same
technique to replace your HTML modules with logical XML tags. For
example, the XML for the above heading module could look like this:
<FRAMED-HEADING>Details</FRAMED-HEADING>
This expression is not only correct XML, but most importantly, it
perfectly fits into the ideology of generalized markup, as all traces of
presentation machinery are eliminated and what remains is a purely
logical declaration stating what this element is, not how it is
formatted. (Admittedly, the name of the XML element,
FRAMED-HEADING, was coined in acknowledgement of its intended
visual presentation, but this is done only to preserve some consistency
in the source markup, while the actual formatting of this element may, in
time, deviate quite far from the original.)
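
As a rough illustration of such a global search-and-replace, here is a minimal Python sketch. The HTML pattern for the heading module (a one-cell table with a `framed-heading` class) is purely hypothetical and stands in for whatever markup a given site actually uses. The function name `convert_modules` is likewise an assumption made for this example. The sketch also rewrites the heading text from all caps to initial caps, and it does not attempt to handle nested tags or HTML-only entities.

```python
import re
from string import capwords

# Hypothetical HTML module for a framed heading. The real module on a
# given site will look different, so this pattern is only a placeholder.
MODULE_PATTERN = re.compile(
    r'<table\s+class="framed-heading">\s*<tr>\s*<td>([^<]*)</td>\s*</tr>\s*</table>',
    re.IGNORECASE,
)


def convert_modules(html):
    """Replace every instance of the heading module with a logical XML tag."""

    def replace(match):
        # All caps was a visual choice in the HTML; store initial caps instead.
        title = capwords(match.group(1))
        return f"<FRAMED-HEADING>{title}</FRAMED-HEADING>"

    return MODULE_PATTERN.sub(replace, html)


page = '<p>Intro</p><table class="framed-heading"><tr><td>DETAILS</td></tr></table>'
print(convert_modules(page))
# prints: <p>Intro</p><FRAMED-HEADING>Details</FRAMED-HEADING>
```

Because every instance of the module shares the same markup, a single pattern is enough to convert them all. That is exactly the benefit of having made the HTML modular first.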
Note in particular that the text part of the heading is kept unchanged
in the conversion except for one difference: in HTML, the heading
was in all caps, while in XML it is in the conventional initial-caps
form. The all-caps spelling in the HTML is dictated by purely visual
considerations, so this aspect was deemed irrelevant in the
purely logical XML markup. The stylesheet to be attached to this
document will have to take care, among other things, of capitalizing the
content of all FRAMED-HEADING elements for display.
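
To make the division of labor concrete, the following Python sketch imitates what such a stylesheet would do at display time. It is only a stand-in: a real stylesheet would express the rule declaratively (in CSS, for instance, with the `text-transform` property), and the `PAGE` and `P` elements and the `render_for_display` function are hypothetical names invented for this example.

```python
import xml.etree.ElementTree as ET


def render_for_display(xml_source):
    """Capitalize the text of every FRAMED-HEADING element for display.

    The stored XML is a separate string and is never modified; only the
    returned, display-ready copy carries the all-caps spelling.
    """
    root = ET.fromstring(xml_source)
    for heading in root.iter("FRAMED-HEADING"):
        if heading.text:
            heading.text = heading.text.upper()
    return ET.tostring(root, encoding="unicode")


# PAGE and P are hypothetical wrapper elements used only for this example.
doc = "<PAGE><FRAMED-HEADING>Details</FRAMED-HEADING><P>Body text</P></PAGE>"
print(render_for_display(doc))
# prints: <PAGE><FRAMED-HEADING>DETAILS</FRAMED-HEADING><P>Body text</P></PAGE>
```

The source document keeps the logical, initial-caps form, so changing the visual convention later means changing only this one rule.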