Convert < to &lt;.
Although some browsers can recover from an unescaped less-than sign some of the time, not all can. An unescaped less-than sign is more likely than not to cause content to be hidden in the browser. Even if you aren't transitioning to full XHTML, this one is a critical fix.
This change has no real trade-offs; it can only improve your web pages. However, you do need to be careful about embedded JavaScript within pages. In these cases, sometimes the less-than sign cannot be escaped. You can either move the script to an external document where the escaping is not necessary or reverse the sense of the comparison.
Because this is a real bug that does cause problems on pages, it's unlikely to show up in a lot of places. You can usually find all the occurrences and fix them by hand.
I don't know of a single regular expression that will find every case. However, a few will serve to find most. The first thing to look for is any less-than sign followed by whitespace. This is never legal in HTML. This regular expression will find those: <\s
If you're not using any embedded JavaScript, you can search for <(\s) and replace it with &lt;\1. However, if you're using JavaScript, you need to be more careful and should probably let Tidy or TagSoup do the work.
If your pages involve mathematics at all, it's also worth doing a search for a < followed by a digit: <[0-9]
However, a validator such as xmllint or HTML Validator should easily find all cases of these, along with a few cases the simple search will miss.
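The two searches described above can be combined into a small script. What follows is only a sketch in JavaScript, not a replacement for a validator: the helper names findSuspectLessThans and escapeLooseLessThans are hypothetical, and the replacement step assumes the page contains no embedded JavaScript.

```javascript
// Sketch: flag less-than signs that are followed by whitespace or a digit.
// Both function names are hypothetical; they do not come from any library.
function findSuspectLessThans(html) {
  const hits = [];
  html.split("\n").forEach((text, index) => {
    // Combines the two searches from the text: `<\s` and `<[0-9]`.
    // `$` also catches a `<` that is the last character on its line.
    const pattern = /<(?=\s|[0-9]|$)/g;
    let match;
    while ((match = pattern.exec(text)) !== null) {
      hits.push({ line: index + 1, column: match.index + 1 });
    }
  });
  return hits;
}

// Search for `<(\s)` and replace it with `&lt;` plus the captured whitespace.
// JavaScript writes the back-reference as `$1` rather than `\1`.
// Only safe on pages without embedded scripts.
function escapeLooseLessThans(html) {
  return html.replace(/<(\s)/g, "&lt;$1");
}
```

Reporting line and column numbers, rather than rewriting everything automatically, keeps the fix-by-hand workflow the text recommends; the digit case in particular is left for manual review.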
Embedded JavaScript presents a problem here. JavaScript does not recognize &lt; as a less-than sign. Inside JavaScript, you have to use the literal character. A less-than sign can usually be recast as a greater-than sign with the arguments reversed. For example, instead of writing if (x < 7) you write if (7 > x).
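The same reversal carries over to loop conditions and to less-than-or-equal tests. The short sketch below illustrates this; the function names isBelowSeven and sumBelow are made up for the example.

```javascript
// Equivalent to the less-than test from the text, with the operands swapped
// so that no less-than sign appears in the embedded script.
function isBelowSeven(x) {
  return 7 > x;
}

// A counting loop written the same way: "limit > i" instead of the usual form.
// A less-than-or-equal test would become ">=" with the operands swapped.
function sumBelow(limit) {
  let total = 0;
  for (let i = 0; limit > i; i++) {
    total += i;
  }
  return total;
}
```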
However, I normally just rely on placing the script in an external file or wrapping the script body in an XML comment (<!-- ... // -->) instead.
This is a truly ugly hack and one I cringe to even suggest, but it is what seems to work and what browsers expect and deal with, and it is well-formed.
A lot of these problems can spread out across a site when the site is dynamically generated from a database and the scripts or templates that generate it do not sufficiently clean the data they're working with. A typical SQL database has no trouble storing a string such as x>y in a VARCHAR field. However, when you take data out of a database you have to clean it first by escaping any such characters. Most major templating languages have functions for doing exactly this. For instance, in PHP the htmlspecialchars function converts the five reserved characters (>, <, &, ', and ") into the equivalent entity references (the single quote is converted only when the ENT_QUOTES flag is in effect, which is the default from PHP 8.1 onward). Just make sure you use it. Even if you think there's no possible way the data can contain reserved characters such as <, I still recommend cleaning it. It doesn't take long, and it can plug some nasty security holes that arise from people deliberately injecting weird data into your system.
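For a template layer written in JavaScript, the cleaning step might look roughly like the sketch below. The name escapeHtml is an assumption made for illustration, not a built-in; it plays the role that htmlspecialchars plays in PHP.

```javascript
// Sketch of an output-escaping helper (hypothetical name, not a standard API).
// The ampersand must be replaced first; otherwise the ampersands introduced
// by the later replacements would themselves be escaped a second time.
function escapeHtml(value) {
  return String(value)
    .replace(/&/g, "&amp;")
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;")
    .replace(/"/g, "&quot;")
    .replace(/'/g, "&#39;");
}
```

The helper is meant to be applied at the point where a value is written into the page, so that the database keeps the raw string and every output path escapes it exactly once.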
You do not need to escape greater-than signs, although you can. The only situation where this is mandatory is when the three-character string ]]> appears in regular content. (That's the CDATA section closing delimiter.) This is likely to happen only if you're writing an XML tutorial. Nonetheless, if you're worried about someone attempting to inject bad data into your system, you can use a similar approach to change > to &gt;.
Convert & to &amp;.
Although most browsers can handle a raw ampersand followed by whitespace, an ampersand not followed by whitespace confuses quite a few. An unescaped ampersand can hide content from the reader. Even if you aren't transitioning to full XHTML, this refactoring is an important fix.
This change has no real trade-offs; it can only improve your web pages.
However, you do need to be careful about embedded JavaScript within pages. In these cases, the ampersand usually cannot be escaped. Sometimes you instead can use an external script where the escaping is not necessary. Other times, you can hide the script inside comments where the parser will not worry about the ampersands.
Because this is a bug that results in visible problems, there usually aren't many cases of this. You can typically find all the occurrences and fix them by hand.
I don't know of a single regular expression that will find all unescaped ampersands. However, a few simple expressions will usually sniff them all out. First, look for any ampersand followed by whitespace. This is never legal in HTML. This regular expression will find those: &\s
If the pages don't contain embedded JavaScript, simply search for &(\s) and replace it with &amp;\1. A validator such as xmllint or HTML Validator will easily find all cases of these, along with a few cases the simple search will miss. However, if pages do contain JavaScript, you must be more careful and should let Tidy or TagSoup do the work.
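A somewhat broader search than &\s is to look for any ampersand that does not begin something shaped like an entity or character reference. The sketch below goes beyond the expressions given in the text and is an illustration only: the names BARE_AMPERSAND, countBareAmpersands, and escapeBareAmpersands are hypothetical, the pattern does not check that a named entity is actually defined, and it should not be run over embedded scripts.

```javascript
// An ampersand counts as already escaped when it starts something shaped like
// a named reference (&name;), a decimal reference (&#38;), or a
// hexadecimal reference (&#x26;). Everything else is treated as bare.
const BARE_AMPERSAND = /&(?!(?:[A-Za-z][A-Za-z0-9]*|#[0-9]+|#[xX][0-9A-Fa-f]+);)/g;

// How many bare ampersands does this text contain?
function countBareAmpersands(text) {
  const matches = text.match(BARE_AMPERSAND);
  return matches === null ? 0 : matches.length;
}

// Replace each bare ampersand with &amp; and leave existing references alone.
function escapeBareAmpersands(text) {
  return text.replace(BARE_AMPERSAND, "&amp;");
}
```

Because existing references are left untouched, running the replacement a second time does not double-escape anything.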
Embedded JavaScript presents a special problem here. JavaScript does not recognize &amp; as an ampersand. JavaScript code must use the literal & character. I normally place the script in an external file or wrap the script body in an XML comment (<!-- ... // -->) instead.
If a site is dynamically generated from a database, this problem can become more frequent. A SQL database has no trouble storing a string such as "A&P" in a field, and indeed it is the unescaped string that should be stored.
When you receive data from a database or any other external source, clean it first by escaping these ampersands. For example, in a Java environment, the Apache Commons Lang library includes a StringEscapeUtils class that can encode raw data using either XML or HTML rules.
Do not forget to escape ampersands that appear in URL query strings. In particular, each & that separates one query parameter from the next must be written as &amp; when the URL appears in the page. This is true even inside href attributes of a elements.
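As an illustration of the query-string case, the sketch below builds a link from a raw URL. The URL, the label, and the helper names escapeMarkup and link are all made up for the example; the URL is stored and passed around with a plain & and is escaped only when it is written into the markup.

```javascript
// Sketch: escape the characters that matter in text and in a double-quoted
// attribute value. The ampersand is handled first to avoid double escaping.
function escapeMarkup(value) {
  return String(value)
    .replace(/&/g, "&amp;")
    .replace(/</g, "&lt;")
    .replace(/"/g, "&quot;");
}

// Build an a element whose href carries the escaped form of the URL.
function link(url, label) {
  return '<a href="' + escapeMarkup(url) + '">' + escapeMarkup(label) + "</a>";
}

// Hypothetical URL with two query parameters separated by a raw ampersand.
const html = link("http://www.example.com/search?q=cats&page=2", "Page 2");
```

A browser resolves &amp; back to & when it reads the attribute, so the request that reaches the server still contains the original query string.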