The first 32 Unicode characters with code points from 0 to 31 are known as
the C0 controls. They were originally defined in ASCII to control teletypes
and other monospace dumb terminals. Aside from the tab, carriage return, and
line feed, they have no obvious meaning in text. Since XML is text, it does not
include binary characters such as NUL (#x00), BEL (#x07), DC1 (#x11) through
DC4 (#x14), and so forth. These control characters are historical relics. XML 1.0
does not allow them. This is a good
thing. Although dumb terminals and binary-hostile gateways are far less common
today than they were twenty years ago, they are still used, and
passing these characters through equipment that expects to see plain text
can have nasty consequences, including disabling the screen. (One common
problem that still occurs is accidentally paging a binary file on a console.
This is generally quite ugly and often disables the console.)
A few of these characters occasionally do appear in non-XML text data. For example, the form feed (#x0C) is sometimes used to indicate a page break. Thus moving data from a non-XML system such as a BLOB or CLOB field in a database into an XML document can unexpectedly cause well-formedness errors. Text may need to be cleaned before it can be added to an XML document. However, the far more common problem is that a document’s encoding is misidentified, for example, defaulted to UTF-8 when it’s really UTF-16 or ISO-8859-1. In this case, the parser will notice unexpected nulls or byte sequences that are not legal in the assumed encoding and throw a well-formedness error.
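The cleaning step mentioned above might look something like the following Python sketch. The function name `strip_c0_controls` and its `replacement` parameter are hypothetical, chosen only for illustration. The sketch handles only the C0 range discussed here, not every character XML 1.0 forbids.

```python
import re

# C0 controls that XML 1.0 rejects: everything from 0x00 to 0x1F
# except tab (0x09), line feed (0x0A), and carriage return (0x0D).
_FORBIDDEN_C0 = re.compile(r"[\x00-\x08\x0B\x0C\x0E-\x1F]")


def strip_c0_controls(text: str, replacement: str = "") -> str:
    """Remove (or replace) the C0 controls that XML 1.0 does not allow."""
    return _FORBIDDEN_C0.sub(replacement, text)


# Hypothetical text pulled from a CLOB field, with a form feed and a BEL in it.
raw = "Page one\x0cPage two\x07"
print(repr(strip_c0_controls(raw, " ")))
```

Whether to delete a control outright or swap in a space depends on the data. A form feed between pages probably deserves a space or a line break, while a stray BEL can simply go.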
XML 1.1 fortunately still does not allow raw binary data in an XML document. However, it does allow you to use character references to escape the C0 controls such as form feed and BEL. The parser will resolve them into the actual characters before reporting the data to the client application. You simply can’t include them directly. For example, the following document uses form feeds to separate pages.
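As a rough illustration of the escaping described above, the following Python sketch writes text containing form feeds into an XML 1.1 document using character references. The helper names, the `rhymes` root element, and the placeholder page text are all assumptions made for this example. It covers only the C0 range, and it builds the document as a string rather than relying on any particular parser's XML 1.1 support.

```python
import re
from xml.sax.saxutils import escape

# C0 controls that XML 1.1 accepts only as character references:
# everything from 0x01 to 0x1F except tab, line feed, and carriage return.
_ESCAPABLE_C0 = re.compile(r"[\x01-\x08\x0B\x0C\x0E-\x1F]")


def escape_c0_for_xml11(text: str) -> str:
    """Escape markup characters, then turn C0 controls into &#x...; references."""
    if "\x00" in text:
        # The null is forbidden in XML 1.1 even as a character reference.
        raise ValueError("NUL cannot be represented in XML")
    escaped = escape(text)  # handles &, <, and > before any references are added
    return _ESCAPABLE_C0.sub(lambda m: "&#x%X;" % ord(m.group()), escaped)


def build_xml11_document(text: str, root: str = "rhymes") -> str:
    """Wrap the escaped text in a root element with an XML 1.1 declaration."""
    return '<?xml version="1.1"?>\n<%s>%s</%s>\n' % (
        root, escape_c0_for_xml11(text), root)


# Two placeholder pages separated by a form feed (U+000C).
print(build_xml11_document("First rhyme\x0cSecond rhyme"))
```

The printed document declares version 1.1 and carries the form feed as `&#xC;`. An XML 1.1 parser would hand the client application a real form feed at that position.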

However, this style of page break died out with the line printer. Modern systems use stylesheets or explicit markup to indicate page boundaries. For example, you might place each separate page inside a page element or add a pagebreak element where you wanted the break to occur, as shown below.
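If the data arrives with form feeds already in it, the conversion to explicit markup is mechanical. The Python sketch below splits on the form feed and wraps each piece in a `page` element. The `rhymes` root element and the function name `pages_to_xml` are assumptions for illustration, not names taken from any schema.

```python
import xml.etree.ElementTree as ET


def pages_to_xml(text: str, root_name: str = "rhymes") -> str:
    """Replace form-feed page breaks with one <page> element per page."""
    root = ET.Element(root_name)
    for chunk in text.split("\f"):  # "\f" is the form feed, U+000C
        page = ET.SubElement(root, "page")
        page.text = chunk.strip()
    return ET.tostring(root, encoding="unicode")


# Placeholder content: two pages separated by a form feed.
print(pages_to_xml("First rhyme\fSecond rhyme"))
# <rhymes><page>First rhyme</page><page>Second rhyme</page></rhymes>
```

Because no control characters survive the split, the result is a legal XML 1.0 document as well as a legal XML 1.1 one.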

Better yet, you might not change the markup at all, just write a stylesheet that assigns each rhyme to a separate page. Any of these options would be superior to using form feeds. Most uses of the other C0 controls are equally obsolete.
There is one exception. You still cannot embed a null in an XML document, not even with a character reference. Allowing this would have caused massive problems for C, C++, and other languages that use null-terminated strings. Because the null remains forbidden, it’s still not possible to directly embed binary data in XML. You have to encode it using Base64 or some similar format first. (See Item 19.)
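A minimal Python sketch of the Base64 approach follows. The element name `data` and both function names are hypothetical. Base64 output uses only letters, digits, `+`, `/`, and `=`, so nulls in the input never reach the XML.

```python
import base64
import xml.etree.ElementTree as ET


def embed_binary(data: bytes, element_name: str = "data") -> str:
    """Base64-encode arbitrary bytes and wrap them in a single element."""
    elem = ET.Element(element_name)
    elem.text = base64.b64encode(data).decode("ascii")
    return ET.tostring(elem, encoding="unicode")


def extract_binary(xml_text: str) -> bytes:
    """Reverse embed_binary: parse the element and decode its content."""
    return base64.b64decode(ET.fromstring(xml_text).text)


payload = b"\x00\x01\x02\xff"  # includes a null, which XML itself forbids
document = embed_binary(payload)
print(document)
assert extract_binary(document) == payload
```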
Created: March 27, 2003
Revised: October 25, 2003
URL: http://webreference.com/programming/xml/1