Translating Between Unicode and Non-Unicode Character Sets in Java | WebReference

By Rob Gravelle


Character encoding makes it possible to store and transmit human-readable data (numbers and text) in computers by mapping numeric values to characters. A collection of such mappings is called a character set. The character set used by early desktop computers, for example, was ASCII. In the ASCII character set, the word "Help" is represented by the numbers 72, 101, 108, and 112.

My previous article, Globalize Your Web Applications: The Universal Character Set, introduced some of the numerous character sets beyond those of Western languages. In it, I discussed the Universal Character Set, or Unicode, originally designed as a 16-bit character encoding by the Unicode Technical Committee (UTC) to support the world's major languages. I also explained how to display Unicode characters in HTML using character references.

In this article, I explore the Java APIs that translate characters, strings, and text streams, and show how these classes work together to convert from one encoding to another.

The Need for Conversion

Systems around the world use a variety of character encodings. Currently, few of these encodings conform to Unicode. Because a Java application expects characters in Unicode, the text data it gets from a system must be converted into Unicode, and vice versa. Data in text files is automatically converted to Unicode when the data's encoding matches the default file encoding of the Java Virtual Machine (JVM). In this case, you don't need to do any additional work. You can identify the default file encoding by creating an OutputStreamWriter without specifying an encoding and asking it for the name of the encoding it uses.
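As a minimal sketch of that check, the class below creates an OutputStreamWriter over an in-memory stream and reads back its encoding name. The class and method names are made up for illustration, and the comparison with Charset.defaultCharset() is an addition of mine rather than part of the original text. Note that getEncoding() may report a historical name (such as "UTF8" or "Cp1252") rather than the canonical one.

```java
import java.io.ByteArrayOutputStream;
import java.io.OutputStreamWriter;
import java.nio.charset.Charset;

// Hypothetical demo class; the name is illustrative only.
class DefaultEncodingDemo {

    // Returns the encoding name used by a writer created without an explicit charset.
    // The writer wraps an in-memory stream, so leaving it unclosed leaks nothing.
    static String writerEncoding() {
        OutputStreamWriter out = new OutputStreamWriter(new ByteArrayOutputStream());
        return out.getEncoding();
    }

    public static void main(String[] args) {
        // The values printed depend on the platform and JVM settings.
        System.out.println("Writer encoding: " + writerEncoding());
        System.out.println("Default charset: " + Charset.defaultCharset().name());
    }
}
```

The output varies from machine to machine, which is exactly why code that exchanges text with other systems should not rely on the default.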

If the default file encoding differs from the encoding of the text data you want to process, then you must perform the conversion yourself. You might need to do this when processing text from another country or computing platform.

Unicode Character Representations

In Java, the char data type is based on the original Unicode specification, which defined characters as fixed-width, 16-bit entities. The Unicode Character Set, or Universal Character Set (UCS) as it's commonly known, contains more than 100,000 abstract characters, each identified by an unambiguous name and an integer number called its code point. A code point is the value that a character is given in the Unicode standard. Code points are written as hexadecimal numbers with the prefix "U+". For example, in the word "Help", "H" is U+0048, "e" is U+0065, "l" is U+006C, and "p" is U+0070. The code points are split into 17 sections called planes. Each plane holds 65,536 code points. The first plane, which holds the most commonly used characters, is known as the Basic Multilingual Plane.
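To make the U+ notation concrete, here is a small sketch that walks a string one code point at a time and prints each value in that form. The class and method names are hypothetical, chosen only for this illustration.

```java
// Hypothetical demo class; the name is illustrative only.
class CodePointDemo {

    // Formats every code point in the text using the U+XXXX notation.
    static String describe(String text) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < text.length(); ) {
            int codePoint = text.codePointAt(i);
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("U+%04X", codePoint));
            // Advance by one or two chars, depending on the code point.
            i += Character.charCount(codePoint);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(describe("Help")); // prints U+0048 U+0065 U+006C U+0070
    }
}
```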

Unicode was later extended beyond 16 bits when it became apparent that a 16-bit number is too small to accommodate all the characters required to represent the world's languages. The range of legal code points now runs from U+0000 to U+10FFFF. Code points in this range, other than the surrogate values described below, are known as Unicode scalar values. UTF-32, a four-byte (32-bit) encoding form, can represent every Unicode character as a single number. The original 16-bit range (U+0000 to U+FFFF) is what is known as the Basic Multilingual Plane (BMP). Characters whose code points are greater than U+FFFF are called supplementary characters. Supplementary characters are comparatively rare in practice. To handle them, the Java 2 platform uses the UTF-16 representation: each supplementary character is stored as a pair of char values, the first from the high-surrogates range (\uD800-\uDBFF) and the second from the low-surrogates range (\uDC00-\uDFFF).
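The following sketch shows what a surrogate pair looks like from Java code. It uses U+1D11E (MUSICAL SYMBOL G CLEF) as a sample supplementary character. That choice, and the class and method names, are mine and purely illustrative.

```java
// Hypothetical demo class; the name is illustrative only.
class SurrogatePairDemo {

    // U+1D11E MUSICAL SYMBOL G CLEF, a supplementary character picked as an example.
    static final int G_CLEF = 0x1D11E;

    // Builds a String that holds exactly one code point.
    static String fromCodePoint(int codePoint) {
        return new String(Character.toChars(codePoint));
    }

    public static void main(String[] args) {
        String clef = fromCodePoint(G_CLEF);

        // One character to a reader, but two char values in the String.
        System.out.println("char count:       " + clef.length());                           // 2
        System.out.println("code point count: " + clef.codePointCount(0, clef.length()));   // 1

        System.out.printf("high surrogate:   0x%04X%n", (int) clef.charAt(0));              // 0xD834
        System.out.printf("low surrogate:    0x%04X%n", (int) clef.charAt(1));              // 0xDD1E
        System.out.printf("code point:       U+%X%n", clef.codePointAt(0));                 // U+1D11E
    }
}
```

Because length() counts char values rather than characters, code that must handle supplementary characters is usually better off iterating with codePointAt() and Character.charCount() than indexing with charAt().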

Byte Array and String Conversion

If a byte array contains non-Unicode text, you can convert the text to Unicode with one of the String constructors. Conversely, you can convert a String object into a byte array of non-Unicode characters with the String.getBytes() method. When invoking either of these, you specify the encoding identifier as one of the parameters. The rest of this section converts characters between UTF-8 and Unicode. UTF-8 is a transmission format for Unicode that is safe for UNIX file systems.

To convert a String object to UTF-8, call the getBytes() method with the appropriate encoding identifier as a parameter. The getBytes() method returns an array of bytes in UTF-8 format. To create a String object from an array of encoded bytes, supply the encoding identifier to the String constructor. The code that makes these calls should be enclosed in a try block, because both calls throw an UnsupportedEncodingException if the specified encoding is not supported.
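Here is a minimal sketch of that round trip. The helper names (encode, decode, toHex) and the sample string are assumptions made for illustration. The two library calls are String.getBytes(String) and the String(byte[], String) constructor, each wrapped in a try block as described above.

```java
import java.io.UnsupportedEncodingException;

// Hypothetical demo class; the helper names are illustrative only.
class Utf8RoundTripDemo {

    // Converts a String to bytes in the named encoding.
    static byte[] encode(String text, String encoding) {
        try {
            return text.getBytes(encoding);
        } catch (UnsupportedEncodingException e) {
            throw new IllegalArgumentException("Unsupported encoding: " + encoding, e);
        }
    }

    // Converts bytes in the named encoding back into a String.
    static String decode(byte[] bytes, String encoding) {
        try {
            return new String(bytes, encoding);
        } catch (UnsupportedEncodingException e) {
            throw new IllegalArgumentException("Unsupported encoding: " + encoding, e);
        }
    }

    // Renders bytes as space-separated, two-digit hexadecimal values.
    static String toHex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // "A", e with circumflex, n with tilde, u with diaeresis, "C"
        String original = "A\u00EA\u00F1\u00FCC";

        byte[] utf8Bytes = encode(original, "UTF-8");
        String roundTrip = decode(utf8Bytes, "UTF-8");

        System.out.println("UTF-8 bytes: " + toHex(utf8Bytes));          // 41 C3 AA C3 B1 C3 BC 43
        System.out.println("Round trip:  " + original.equals(roundTrip)); // true
    }
}
```

The three accented characters each take two bytes in UTF-8, so the five-character string becomes eight bytes. On Java 7 and later, the overloads that take a Charset, such as getBytes(StandardCharsets.UTF_8), avoid the checked exception altogether.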