SVG, Unicode, and XSLT - Part 3 of Chapter 7 from Perl Graphics Programming (1/5)
Perl Graphics Programming, Chapter 7: Creating SVG with Perl
In January 1998, RFC 2277 (the IETF policy on character sets and languages) recognized that the Internet is international and required that Internet protocols be able to use the ISO 10646 character set, whose repertoire is kept in step with Unicode, with UTF-8 as the one encoding every protocol must support. Sounds great; how do we do that in an SVG image?
Unicode is a standard set of character codes for representing multilingual text. In the early days of computing, vendors invented their own character encodings; it wasn't until the 1960s that the American standards body now known as ANSI published ASCII (first in 1963, with the widely cited revision in 1968), a 7-bit table of 128 codes covering the unaccented English letters, digits, punctuation, and control characters. In the '80s, the ISO 8859 family of 8-bit standards extended ASCII toward other scripts: ISO-8859-1 covers the Latin alphabets of Western Europe, and other parts of the family provide separate tables for Cyrillic, Arabic, Greek, and Hebrew. Because each part is its own 256-code table, a single document could not freely mix them. Unicode is the modern synthesis of the standard encodings that came before it, with the goal of supporting all the world's languages in one character set.
The early versions of Unicode proposed a 16-bit code with room for 65,536 characters. The code space has since been expanded to the range U+0000 through U+10FFFF, which allows for 1,114,112 code points and leaves room for historic and ancient scripts. Strictly speaking, Unicode encodes characters rather than glyphs; a glyph is the shape that a particular font draws for a character. More information on Unicode is available at the Unicode Consortium's web site, http://www.unicode.org/.
The Unicode standard defines three encoding forms:
- In UTF-8, each character is represented as a variable-length sequence of one to four bytes. Characters in the ASCII set are encoded in a single byte with the same value they have in ASCII, so if you are using only these characters, a UTF-8 file is byte-for-byte identical to an ASCII file. UTF-8 is widely used on the Web and is useful for keeping Unicode applications compatible with older software.
- In UTF-16, each character is encoded as one or two 16-bit code units. A single unit covers the first 65,536 code points, which include just about every character in common use; the remaining, less frequently used characters are encoded as a pair of 16-bit units called a surrogate pair.
- In UTF-32, every character is encoded as a single 32-bit unit. This fixed width makes indexing simple, but it uses four bytes even for ASCII text, so the scheme is rarely used for storage or interchange.
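The size differences between the three encoding forms are easy to check from Perl. The following is a minimal sketch, assuming Perl 5.8 or later, where the Encode module ships with the core distribution; the helper name encoded_lengths and the sample characters are made up for illustration. The big-endian variants are used so that no byte order mark is added to the counts.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical helper: return a hash reference giving the number of
# bytes the string occupies in each of the three encoding forms.
sub encoded_lengths {
    my ($text) = @_;
    my %len;
    for my $enc ('UTF-8', 'UTF-16BE', 'UTF-32BE') {
        # encode() returns a string of bytes, so length() counts bytes here
        $len{$enc} = length(encode($enc, $text));
    }
    return \%len;
}

# Sample characters chosen to fall in different ranges of the code space
my @samples = (
    [ 'Latin capital A (U+0041)',   "\x{41}"    ],
    [ 'Greek small alpha (U+03B1)', "\x{3B1}"   ],
    [ 'Euro sign (U+20AC)',         "\x{20AC}"  ],
    [ 'Musical G clef (U+1D11E)',   "\x{1D11E}" ],
);

for my $sample (@samples) {
    my ($name, $char) = @$sample;
    my $len = encoded_lengths($char);
    printf "%-28s UTF-8: %d  UTF-16: %d  UTF-32: %d bytes\n",
        $name, $len->{'UTF-8'}, $len->{'UTF-16BE'}, $len->{'UTF-32BE'};
}
```

The letter A takes one byte in UTF-8, while the euro sign takes three bytes in UTF-8 but only two in UTF-16. The G clef lies outside the first 65,536 code points, so it needs four bytes in all three forms: a four-byte sequence in UTF-8, a surrogate pair in UTF-16, and a single 32-bit unit in UTF-32.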
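The Adobe SVG viewer supports both the UTF-8 and UTF-16 character encodings. Perl's Unicode support dates from the 5.005_50 development release and became considerably more robust in the 5.6 and 5.8 releases. Internally, Perl stores strings that contain characters beyond the 8-bit range in a UTF-8-based format, so we rarely have to worry about the low-level details of the encoding, as long as we tell Perl which encoding to use when reading or writing a file.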
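As an illustration of how little encoding work is needed in practice, here is a minimal sketch of writing non-ASCII text into an SVG file. It assumes Perl 5.8 or later for the :encoding output layer; the subroutine name svg_with_text, the output filename unicode.svg, and the sample greeting are hypothetical choices, not part of any library.

```perl
use strict;
use warnings;

# Hypothetical helper: wrap a string in a minimal SVG document.
# The XML declaration names the encoding the file will be written in.
sub svg_with_text {
    my ($text) = @_;
    # Escape the characters that are special in XML character data
    for ($text) { s/&/&amp;/g; s/</&lt;/g; s/>/&gt;/g; }
    return <<"SVG";
<?xml version="1.0" encoding="UTF-8"?>
<svg xmlns="http://www.w3.org/2000/svg" width="300" height="100">
  <text x="20" y="60" font-size="24">$text</text>
</svg>
SVG
}

# A Greek greeting ("Geia sou") written with \x{...} escapes so that
# this source file itself stays plain ASCII
my $greeting = "\x{393}\x{3B5}\x{3B9}\x{3AC} \x{3C3}\x{3BF}\x{3C5}";

my $file = 'unicode.svg';    # assumed output name

# The :encoding(UTF-8) layer converts Perl's internal characters to
# UTF-8 bytes as they are written
open(my $out, '>:encoding(UTF-8)', $file) or die "Cannot write $file: $!";
print {$out} svg_with_text($greeting);
close($out) or die "Cannot close $file: $!";
```

Declaring the encoding in two places is deliberate: the output layer controls the bytes Perl actually writes, and the XML declaration tells the viewer how to read them. If the two disagree, the viewer is likely to show the wrong characters or reject the file.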
Created: February 26, 2003
Revised: February 26, 2003