Globalize your Web Applications: The Universal Character Set | WebReference

By Rob Gravelle


The first three articles of the Globalize your Web Applications series focused on the role of locale in developing global web apps. In particular, we examined locale support in the PHP language. Whereas that entailed a lot of programming code, today's article deals with something that sits at the forefront of your web apps: character sets. More specifically, we'll be taking a look at the Universal Character Set (UCS) and its role in creating multilingual web pages.

About the Unicode Standard

The Universal Character Set (UCS), defined by ISO/IEC 10646 and kept synchronized with the Unicode Standard, is a character coding system designed to support the worldwide exchange, processing, and display of written texts in the numerous languages of the modern world. Unicode is designed to allow a single document to contain text from many scripts and languages, and to allow that document to be used on computers with operating systems in any language while still remaining legible. Hence, it is ideally suited to the World Wide Web.

The HTML 4.0 Specification made a major step towards internationalizing the World Wide Web by adopting the Universal Character Set (UCS) as the document character set for HTML. RFC 2070, Internationalization of the Hypertext Markup Language, has also been incorporated into HTML 4.0, which now includes provisions for languages that are written right-to-left, such as Arabic and Hebrew, for appropriate punctuation, and for the combining of letters and other idiosyncrasies. Recent versions of Internet Explorer go even further, with support for Mongolian, which is written top-to-bottom.

Adding Unicode characters to Web pages

Here are three ways of adding Unicode characters to your web page when you only want to insert a few, such as for mathematical symbols or a phrase in a different language:

  1. Character Entity References:
    HTML 4 defines 252 characters that can be included in an HTML file by typing a symbolic name between an ampersand (&) and a semicolon (;). For example, &amp; would display an ampersand character. These character entity references should work in any HTML file, regardless of the document's character encoding. One caveat is that not all characters have an entity reference. For those, you'll have to use method 2 or 3.

  2. Numeric (Decimal) Character References:
    A second means of entering any Unicode character in an HTML file is to take its decimal code point and place it between an ampersand and a hash (&#) at the front and a semicolon (;) at the end. For example, &#38; would again display an ampersand character (&). Like character entity references, numeric character references should also work in all HTML files, regardless of the page's character encoding.

  3. Hexadecimal Character References:
    Web browsers also recognize hexadecimal numbers. To denote a character in hexadecimal format, add a lowercase x after the ampersand and hash (&#), followed by the hexadecimal code point and a semicolon (;). For example, &#x26; displays the ampersand character.
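
The three notations can be generated mechanically from a character's code point. The following is a minimal Python sketch, not part of the original series (which used PHP). The function name character_references is made up for illustration. It relies on the standard library's html.entities.codepoint2name table, which covers the HTML 4 named entities only:

```python
import html
from html.entities import codepoint2name


def character_references(ch: str) -> dict:
    """Return the entity, decimal, and hexadecimal reference for one character.

    Hypothetical helper for illustration; "entity" is None when HTML 4
    defines no symbolic name for the character.
    """
    code_point = ord(ch)
    name = codepoint2name.get(code_point)  # e.g. 38 -> "amp"
    return {
        "entity": f"&{name};" if name else None,
        "decimal": f"&#{code_point};",
        "hex": f"&#x{code_point:X};",
    }


refs = character_references("&")
print(refs)

# Every available form decodes back to the original character.
for form in refs.values():
    if form is not None:
        assert html.unescape(form) == "&"
```

A character with no symbolic name, such as the Cyrillic letter ж, yields None for the entity, which is the case where only methods 2 and 3 apply.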

Special Unicode Characters in HTML

Certain characters need special handling within an HTML document. The first category is control characters. The second is characters that you cannot type on your keyboard or display using your PC's installed fonts.

C0 and C1 Control Characters

The C0 and C1 control code sets define control codes for use in text by computer systems. Control characters are non-printable characters that are typically used for communication and device control, as format effectors, and as information separators. Examples of C0 codes (decimal 0 to 31) include 7 for the bell, 8 for a BACKSPACE, and 13 for a Carriage Return. Some C1 control codes (decimal 128 to 159) include 134 and 135 (hexadecimal 86 and 87) for Start and End of Selected Area respectively, as well as 141, which is a Reverse Line Feed. When the browser comes across an unprintable control character, it may display an empty box such as these:

(backspace)
(escape)
 (delete)

Only three control characters are used in HTML. The remaining C0 and C1 control characters should not be used in an HTML document. The valid control characters and their interpretation are:

  • Horizontal Tab (HT - 9 decimal): Converted into a space by the browser in all contexts except when enclosed between <PRE> tags. Within <PRE> tags, the tab should move the horizontal position on the same line to the next tab stop, which falls at a multiple of 8 character positions.

  • Line Feed (LF - 10 decimal): Interpreted as a space in all contexts except when enclosed between <PRE> tags. Within <PRE> tags, the line feed character should be interpreted as a shift to the start of a new line as it normally would be.

  • Carriage Return (CR - 13 decimal): Interpreted as a space in all contexts except when enclosed between <PRE> tags. Within <PRE> tags, the carriage return character should be interpreted as a shift to the start of the line, as usual.
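
The rules above lend themselves to a simple check. The sketch below is one possible approach in Python, with made-up function names. It assumes that "control character" means Unicode general category Cc, which covers the C0 and C1 ranges plus DELETE (127). It reports every control character other than HT, LF, and CR, and it also computes the 8-column tab stop described for <PRE> content:

```python
import unicodedata

# The three control characters that may appear in HTML: HT, LF, CR.
ALLOWED_CONTROLS = {9, 10, 13}


def disallowed_controls(text: str) -> list:
    """Return (index, code point) pairs for control characters that
    should not appear in an HTML document (hypothetical helper)."""
    return [
        (index, ord(ch))
        for index, ch in enumerate(text)
        if unicodedata.category(ch) == "Cc" and ord(ch) not in ALLOWED_CONTROLS
    ]


def next_tab_column(column: int, tab_width: int = 8) -> int:
    """Return the zero-based column reached by a tab typed at `column`."""
    return (column // tab_width + 1) * tab_width


# A backspace (8) and a reverse line feed (141) are flagged;
# the tab and line feed are not.
print(disallowed_controls("a\tb\x08c\x8d\n"))

# A tab typed at column 2 moves to column 8, matching str.expandtabs.
print(next_tab_column(2), "ab\tc".expandtabs(8).index("c"))
```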

The following table shows how to denote the permitted control characters, along with several related space characters:

HTML Control Characters

Function                  Entity   Decimal   Hexadecimal
Em space (not collapsed)  &emsp;   &#8195;   &#x2003;
En space                  &ensp;   &#8194;   &#x2002;
Non-breaking space        &nbsp;   &#160;    &#x00A0;
Horizontal tab            &Tab;    &#09;     &#x0009;
Line feed                 N/A      &#10;     &#x000A;
Space                     N/A      &#32;     &#x0020;
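
One way to confirm that the notations in each row are equivalent is to decode them and compare the results. This short Python sketch uses the standard library's html.unescape, which follows HTML5 parsing rules (so it recognizes &Tab; as well). The TABLE_ROWS dictionary and decode_row function are made-up names that simply restate the table:

```python
import html

# Rows copied from the table; rows with no entity list two forms only.
TABLE_ROWS = {
    "em space": ("&emsp;", "&#8195;", "&#x2003;"),
    "en space": ("&ensp;", "&#8194;", "&#x2002;"),
    "non-breaking space": ("&nbsp;", "&#160;", "&#x00A0;"),
    "horizontal tab": ("&Tab;", "&#09;", "&#x0009;"),
    "line feed": ("&#10;", "&#x000A;"),
    "space": ("&#32;", "&#x0020;"),
}


def decode_row(forms) -> set:
    """Decode each reference; equivalent forms collapse to one character."""
    return {html.unescape(form) for form in forms}


for label, forms in TABLE_ROWS.items():
    decoded = decode_row(forms)
    assert len(decoded) == 1, label  # all notations agree
    print(f"{label}: U+{ord(decoded.pop()):04X}")
```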
