Tutorial 17: Shady Characters

Selecting an encoding

So, which character encoding should you use for your documents? That's actually quite a thorny issue. Different character encodings cover different sets of characters. There are a few encodings that cover all of UCS, and these are preferable, but it really all depends on your software.

A lot of modern computer software tends to second-guess you when it comes to encodings. Your best bet is to use software (text editor, HTML editor, Web server) that has a clear idea of what character encoding it is using. Personally, I use GNU Emacs with the MULE multi-language extensions, which go quite overboard in allowing you to select possibly every encoding under the sun, but that might not be quite your cup of tea. Most notably, beware of operating systems like Microsoft Windows that tend to take charge of character encoding behind the scenes without letting your program into the whole process (kind of like the WTO). Windows tends to use a whole bunch of proprietary encodings that are similar, but not quite identical, to some popular encodings. The best-known example is Windows-1252, which is eerily similar to ISO Latin-1 but has a couple of very significant incompatibilities. If you use Windows software to create your documents, make sure to specify that you're using a Windows encoding; we'll see how to do this later on.
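To make the Latin-1 versus Windows-1252 mismatch concrete, here is a small Python sketch. It is only an illustration, not part of any authoring tool: the sample bytes are made up, and it relies on Python's built-in "cp1252" and "latin-1" codecs. The two encodings agree on accented letters such as 0xE9, but disagree on the byte range 0x80 to 0x9F, where Windows-1252 places printable characters (curly quotes, the euro sign) and ISO Latin-1 places control codes.

```python
# Hypothetical sample: bytes as a Windows editor might save them.
# 0x93 and 0x94 are curly quotes and 0x80 is the euro sign in Windows-1252.
raw = b"\x93quoted\x94 \x80"

as_cp1252 = raw.decode("cp1252")    # read as Windows-1252
as_latin1 = raw.decode("latin-1")   # read as ISO Latin-1 (ISO-8859-1)

# Windows-1252 yields printable punctuation; Latin-1 yields C1 control codes.
print(repr(as_cp1252))
print(repr(as_latin1))

# Outside 0x80-0x9F the two encodings agree, e.g. 0xE9 is e-acute in both.
accented = b"caf\xe9"
same_in_both = accented.decode("cp1252") == accented.decode("latin-1")
print(same_in_both)
```

A document saved as Windows-1252 but labelled as Latin-1 will therefore look fine until it contains one of these characters, which is why the mislabelling often goes unnoticed.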

In short, you need to select an encoding that (a) covers all the characters you want to use, (b) is understood by whatever software you use to author your documents, and (c) is fairly common. Common encodings include the ISO-8859-x series of encodings, Shift_JIS and EUC-JP for Japanese, and the various Unicode encodings (UTF-7, UTF-8, UTF-16, etc.).
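Criterion (a) is easy to check mechanically. The following Python sketch tries to encode a piece of text in each of a handful of candidate encodings and reports which ones can represent every character. The function name encodings_covering and the default candidate list are my own choices for illustration; the codec names are the ones Python's standard library uses.

```python
def encodings_covering(text, candidates=("ascii", "iso-8859-1", "iso-8859-7",
                                         "shift_jis", "euc_jp",
                                         "utf-8", "utf-16")):
    """Return the candidate encodings that can represent every character in text."""
    covering = []
    for name in candidates:
        try:
            text.encode(name)
        except UnicodeEncodeError:
            # At least one character has no representation in this encoding.
            continue
        covering.append(name)
    return covering


# French text needs more than ASCII; Japanese text needs more than Latin-1.
print(encodings_covering("caf\u00e9"))
print(encodings_covering("\u65e5\u672c\u8a9e"))
```

The Unicode encodings should appear in every result, which is the practical meaning of "covers all of UCS". Criteria (b) and (c) still have to be judged by hand, since they depend on your software and your audience.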



