The Flesh and the Soul of Information. The Stairway of Abstractions | WebReference

  The Stairway of Abstractions

Let's take a page of a document (the page is the most common target form, although documents can also exist in non-paged media, e.g. aural ones) and consider the different ways in which this page could be represented in a computer.  The first step in digitizing a paper page is usually scanning, which results in a bitmap file: an array of pixels representing small rectangular areas of the paper plane.  In a way, this representation is already an abstraction (remember, there were no pixels on the paper), but a very low-level one: it does not allow easy access to the text of the page, nor does it make it possible to alter the appearance of the document in any significant way (although in this respect computer bitmaps are admittedly more flexible than paper pages).

The next step in extracting the informational essence of a document would be storing it in some word processor format, such as MS Word, which contains both the text of the document and its formatting and, in principle, allows you to separate the two by exporting the plain text part of the document.  Although much more flexible than a bitmap, this solution still does not achieve the proper goal of separating presentation from content: the plain text that you export is only a part of the document's content, and once the text is exported, you cannot recombine it with the formatting other than by a tedious manual procedure.
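As a rough illustration of this loss, the following Python sketch models a word processor document as a list of formatted runs. This is a deliberately simplified assumption (real formats such as MS Word are far more complex), and the names `Run` and `export_plain_text` are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Run:
    """A stretch of text plus its formatting (hypothetical, simplified model)."""
    text: str
    bold: bool = False
    italic: bool = False


def export_plain_text(runs):
    """Keep only the characters; every formatting attribute is dropped."""
    return "".join(run.text for run in runs)


document = [
    Run("The Stairway of Abstractions", bold=True),
    Run("\n"),
    Run("Each step heightens the level of "),
    Run("abstraction", italic=True),
    Run("."),
]

plain = export_plain_text(document)
print(plain)
```

The exported string no longer records which words were bold or italic, or where one run ended and the next began, so nothing in it tells a program how to re-apply the formatting.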

SGML's relation to conventional word processor formats is similar to the relation of these formats to a bitmap image of the page.  Thanks to its long history, wide support, and status as an international standard, SGML is the solution for storing and managing the abstract informational "souls" of documents completely divorced from their presentation "bodies."  I would like to refer you to my SGML chapter in HTML Unleashed, where the importance of such "psyche manipulation" is justified in more detail; here, suffice it to say that the main premises of SGML and its separation ideology are accessibility, portability, and, as a result, longevity of information.  SGML repositories are like a digital heaven, where the immaterial souls of documents hover, ready to incarnate into bodies of choice.

Thus, we see how each step of the stairway, leading from a material representation of a document through bitmaps and word processor formats to SGML, improves the flexibility and processability of data by raising the level of abstraction applied to it.  At the same time, each step decreases the total amount of information stored (it is enough to compare the file sizes of a TIFF scan, a DOC file, and an SGML document).  The pieces of information that are lost at each step are what make up the presentation part, or the "body," of the document; when converting the document back into material form (a process in which we usually end up with a bitmap again, because only bitmaps can be printed or displayed on the screen), we need to supply this missing information somehow.
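The idea of supplying the missing "body" at output time can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the element names `title` and `para`, the two style tables, and the `render` function are all hypothetical and do not correspond to any particular SGML tool.

```python
# The "soul": logical elements only, with no formatting information.
content = [
    ("title", "The Stairway of Abstractions"),
    ("para", "Each step heightens the level of abstraction."),
]

# Two interchangeable "bodies": each maps an element name to a rendering rule.
plain_style = {
    "title": lambda text: text.upper() + "\n",
    "para": lambda text: text + "\n",
}
html_style = {
    "title": lambda text: "<h1>" + text + "</h1>\n",
    "para": lambda text: "<p>" + text + "</p>\n",
}


def render(content, style):
    """Combine the stored content with a separately supplied presentation."""
    return "".join(style[name](text) for name, text in content)


print(render(content, plain_style))
print(render(content, html_style))
```

The same stored content yields two different presentations, and neither style table had to be stored with the document itself.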

It is important to note that SGML is not, by itself, a tool for separating the logical and presentational aspects of documents; it is intended only for storing structured information of various kinds.  An equally important point is that this structured information can be not only a "spiritual" outline of a document, but also, for example, an inventory of its "bodily" formatting features.  Although a typical SGML file looks like plain text broken into named parts (called elements), it is more convenient to think of an SGML document as a database whose structure can vary widely in rigidity (e.g. it may have optional fields, default values, and variable field lengths).  In such a database, you can store virtually anything; however, it is up to you to develop the database structure and rules (called a DTD, or document type definition) and to prepare your information for storage.
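To make the database analogy concrete, here is a minimal Python sketch of a rule set with a required field, an optional field, and a default value. It is a toy stand-in for the role a DTD plays, not SGML syntax, and the field names and the `validate` function are hypothetical.

```python
# A toy stand-in for a DTD: for each field, whether it is required and
# which default applies when it is omitted (all names are hypothetical).
SCHEMA = {
    "title": {"required": True},
    "author": {"required": False},                     # optional field
    "language": {"required": False, "default": "en"},  # default value
}


def validate(record, schema=SCHEMA):
    """Return a completed copy of record, or raise ValueError if it breaks the rules."""
    unknown = set(record) - set(schema)
    if unknown:
        raise ValueError("undeclared fields: " + ", ".join(sorted(unknown)))
    result = {}
    for name, rule in schema.items():
        if name in record:
            result[name] = record[name]
        elif rule["required"]:
            raise ValueError("missing required field: " + name)
        elif "default" in rule:
            result[name] = rule["default"]
    return result


print(validate({"title": "The Stairway of Abstractions"}))
```

A record that omits the optional `author` is accepted and receives the default `language`, while a record without a `title`, or with a field the rules do not declare, is rejected.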


Created: Apr. 19, 1998
Revised: Apr. 19, 1998