Close all paragraphs, list items, table cells, and other nonempty elements.
The first motivation is simply XML compatibility. XML parsers require that each start-tag be matched by a corresponding end-tag.
However, there's a strong additional reason. Many documents do not display as intended in classic HTML when the end-tags are omitted. The problem is not that browsers do not know how or where to insert end-tags. It's that authors often do not arrange the tags properly. All too often, the boundaries of an unclosed HTML element do not fall where the author expects. The result can be a document that appears quite different from what is expected. Indentation problems are the most common symptom (elements are not indented that should be, or elements are indented too far). However, all sorts of display problems can result. CSS is extremely hard to create and debug in the face of improperly closed elements.
The trade-offs are few and minimal. The resultant documents may be slightly larger. If you're not serving gigabytes per day, this is not worth worrying about.
Manually, you simply need to inspect each file and determine where the end-tags belong. For example, consider this table modeled after one in the HTML 4 specification:
Only the </table> end-tag is present. All the other end-tags are implied. A browser can probably figure this out. A human author might not and is likely to insert new content in the wrong place. Add end-tags after each element, like so:
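As a minimal sketch of the difference, assume a made-up two-row table (not the one from the HTML 4 specification). The Python below feeds both forms to a standard XML parser; the names OMITTED, CLOSED, and is_well_formed are hypothetical and chosen only for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical two-row table in which only </table> is written out.
OMITTED = """<table>
<tr><th>Name<th>Cups
<tr><td>Alice<td>10
</table>"""

# The same table with every end-tag made explicit.
CLOSED = """<table>
<tr><th>Name</th><th>Cups</th></tr>
<tr><td>Alice</td><td>10</td></tr>
</table>"""


def is_well_formed(markup: str) -> bool:
    """Return True if an XML parser accepts the markup."""
    try:
        ET.fromstring(markup)
        return True
    except ET.ParseError:
        return False


print(is_well_formed(OMITTED))  # False: </table> arrives while <td> is open
print(is_well_formed(CLOSED))   # True
```

A check like this only says whether the end-tags are present and balanced. It cannot tell you whether they sit where the author intended, which is why the manual inspection still matters.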
Paragraphs are worth special attention here. When paragraph end-tags are omitted, the <p> start-tag usually serves as an end-tag rather than a start-tag. You'll commonly see content such as this tidbit from Through the Looking Glass:
When encountering text such as this, you'll want to turn each <p> into a </p>, and then add the missing start-tags like so:
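For simple cases this transformation can be mechanized. The following is a rough sketch, not a general-purpose fixer: it assumes the input is a fragment containing only paragraph text, that every <p> is a bare separator with no attributes, and that no </p> tags are present. The function name close_paragraphs and the sample text are made up for illustration.

```python
import re


def close_paragraphs(text: str) -> str:
    """Wrap each run of text separated by a bare <p> in <p>...</p>."""
    # Each bare <p> marks the boundary between two paragraphs.
    chunks = re.split(r"<p>", text, flags=re.IGNORECASE)
    # Drop empty chunks, e.g. when the text already begins with <p>.
    paragraphs = [chunk.strip() for chunk in chunks if chunk.strip()]
    return "\n".join(f"<p>{p}</p>" for p in paragraphs)


SAMPLE = "First paragraph of text.\n<p>\nSecond paragraph.\n<p>\nThird paragraph."
print(close_paragraphs(SAMPLE))
```

Note that the first chunk gets its own start-tag here, which is exactly the paragraph that automatic tools tend to miss.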
Tidy and TagSoup can fix this. However, they usually incorrectly guess the proper location of the start-tag and produce markup such as this:
Tidy doesn't add the closing empty paragraph, but it still fails to find the start of the first paragraph. You can tell Tidy to wrap paragraphs around orphan text blocks using the --enclose-block-text option with the value y:
This doesn't matter for basic browser display, but it matters a great deal if you've assigned any specific CSS style rules to the p element. Furthermore, it can apply special formatting intended for the first paragraph of a chapter or section to the second instead.
Usually this happens only to the first paragraph in a section. However, if the runs of paragraphs are interrupted by a div, table, blockquote, or other element, there is likely such a block after each such block-level element.
Consequently, after running TagSoup over a page, search for empty paragraphs. Anytime you find one, it means there's probably a paragraph-less block of text earlier in the document that you should enclose in a new p element. However, this is tricky because often the start-tag and end-tag are on different lines. The following regular expression will find most occurrences:
This expression will find any empty paragraphs that have attributes:
However, such paragraphs weren't created by Tidy or TagSoup, so you'll probably want to leave them in.
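The two searches just described might look something like the following in Python. The patterns are assumptions reconstructed from the description, not necessarily the exact expressions the author had in mind, and the sample markup is invented.

```python
import re

# Assumed pattern: an attribute-free <p> followed only by whitespace and </p>.
# \s also matches newlines, so the two tags may sit on different lines.
EMPTY_P = re.compile(r"<p>\s*</p>", re.IGNORECASE)

# Assumed pattern: the same, but the start-tag carries one or more attributes.
EMPTY_P_WITH_ATTRS = re.compile(r"<p\s+[^>]*>\s*</p>", re.IGNORECASE)

SAMPLE = """Orphan text with no start-tag.
<p>
</p>
<p>A real paragraph.</p>
<p class="spacer"></p>
"""

print(EMPTY_P.findall(SAMPLE))             # ['<p>\n</p>']
print(EMPTY_P_WITH_ATTRS.findall(SAMPLE))  # ['<p class="spacer"></p>']
```

Keeping the two patterns separate makes it easy to review the attribute-free matches, which likely came from a tool, without touching the attributed ones that an author probably added on purpose.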
Close every element within its parent element.
Different browsers do not build the same trees from documents containing overlapping elements. Consequently, the same JavaScript can behave very differently from one browser to the next.
Furthermore, small changes in a document with overlap can make radical changes in the DOM tree that even a single browser builds. Consequently, JavaScript built on top of such documents is fragile. CSS is likewise fragile. JavaScript, CSS, and other programs that read a document's DOM are hard to create, debug, and maintain in the face of overlapping elements.
Sometimes the nature of the text really does call for overlap: for instance, when a quote begins in one paragraph and ends in another. This comes up frequently in Biblical scholarship. Not all text fits neatly into a tree.
Unfortunately, HTML, XML, and XHTML cannot handle overlap in any reasonable fashion. If you're doing scholarly textual analysis, you may need something more powerful still. However, this is rarely a concern for simple web publication. You can usually hack around the problem well enough for browser display by using more elements than may logically be called for.
A validator will report all areas where overlap is a problem. However, overlap is so confusing to tools that they may not diagnose it properly or in an obvious fashion. Different validators will report problems in different locations, and a single validator may report several errors related to one occurrence. Sometimes the problem will be indicated as an unclosed element or an end-tag without a start-tag, or both. For example:
Furthermore, an overlap problem may cause a parser to miss the starts or ends of other elements, and it may not be able to recover. It is very common for overlap to cause a cascading sequence of progressively more serious errors for the rest of the document. Thus, you should start at the beginning and fix one error at a time. Often, fixing an overlap problem eliminates many other error messages.
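To see why a single overlap tends to produce more than one message, here is a small stack-based checker built on Python's standard html.parser module. It is a sketch rather than a validator: it assumes all end-tags are already explicit (so an omitted </li> would also be flagged), and the names OverlapFinder and find_overlap are hypothetical.

```python
from html.parser import HTMLParser

# Elements that never take an end-tag, so they are not tracked on the stack.
VOID = {"area", "base", "br", "col", "hr", "img", "input", "link", "meta"}


class OverlapFinder(HTMLParser):
    """Record end-tags that do not match the innermost open element."""

    def __init__(self):
        super().__init__()
        self.open_tags = []  # stack of currently open element names
        self.problems = []   # (line, message) pairs

    def handle_starttag(self, tag, attrs):
        if tag not in VOID:
            self.open_tags.append(tag)

    def handle_endtag(self, tag):
        if tag in VOID:
            return
        line, _ = self.getpos()
        if self.open_tags and self.open_tags[-1] == tag:
            self.open_tags.pop()
        elif tag in self.open_tags:
            inner = self.open_tags[-1]
            self.problems.append(
                (line, f"</{tag}> closes while <{inner}> is still open"))
            # Recover by discarding everything opened since the matching start-tag.
            while self.open_tags.pop() != tag:
                pass
        else:
            self.problems.append((line, f"</{tag}> has no matching start-tag"))


def find_overlap(markup):
    """Return a list of (line, message) pairs for mismatched end-tags."""
    finder = OverlapFinder()
    finder.feed(markup)
    finder.close()
    return finder.problems


for line, message in find_overlap("<p><q>One</p>\n<p>Two</q></p>"):
    print(line, message)
```

The one overlapping quote in the sample yields two reports, an unclosed element on line 1 and an end-tag without a start-tag on line 2, which mirrors the kind of double reporting described above.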
Repairing overlap is not hard. Sometimes the overlap is trivial, as when the end-tag for the parent element immediately precedes the end-tag for the child element. Then you just have to swap the end-tags. For example, change this:
to this:
If the overlap extends into another element, you close the overlapping element inside its first parent and reopen it in the last. For example, suppose you have these two paragraphs containing one quote:
Change them to two paragraphs, each containing a quote:
If there are intervening elements, you'll need to create new elements inside those as well.
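Both repairs can be illustrated with before-and-after pairs. The markup below is invented for this sketch, the wrapping div exists only to give the XML parser a single root element, and quotes_per_paragraph is a hypothetical helper that confirms each repaired paragraph really contains its own quote.

```python
import xml.etree.ElementTree as ET

# Trivial overlap: the two end-tags are simply in the wrong order.
SWAP_BEFORE = "<p>Some <em>emphasized text.</p></em>"
SWAP_AFTER = "<p>Some <em>emphasized text.</em></p>"

# A quote that starts in one paragraph and ends in the next.
SPLIT_BEFORE = ("<div><p>He said, <q>This is the start.</p>\n"
                "<p>This is the end.</q></p></div>")
# Close the quote inside the first paragraph and reopen it in the second.
SPLIT_AFTER = ("<div><p>He said, <q>This is the start.</q></p>\n"
               "<p><q>This is the end.</q></p></div>")


def quotes_per_paragraph(markup):
    """Parse as XML and count the q children of each p; None if not well-formed."""
    try:
        root = ET.fromstring(markup)
    except ET.ParseError:
        return None
    return [len(p.findall("q")) for p in root.iter("p")]


print(quotes_per_paragraph(SWAP_BEFORE))   # None
print(quotes_per_paragraph(SWAP_AFTER))    # [0]
print(quotes_per_paragraph(SPLIT_BEFORE))  # None
print(quotes_per_paragraph(SPLIT_AFTER))   # [1, 1]
```

The split version uses one more q element than the text logically calls for, which is the price of fitting the quote into a tree.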
Tidy and TagSoup can fix technical overlap problems, but not especially well, and the result is usually not what you would expect. For example, Tidy will not always reopen an overlapping element inside the next element. It turns this:
into this:
It completely loses the quote in the second paragraph. TagSoup keeps the quote in the second paragraph but introduces a quote around the boundary whitespace between the two paragraphs:
Consequently, I prefer to fix these overlap problems by hand if there aren't too many of them. You're more likely to reproduce the original intent that way.
This chapter is an excerpt from the book, Refactoring HTML: Improving the Design of Existing Web Applications by Elliotte Rusty Harold, published by Addison-Wesley Professional, May 2008, ISBN 0321503635, Copyright 2008 Pearson Education, Inc.