HTML Unleashed. Internationalizing HTML: Language Identification
Character set problems constitute only one part of the HTML internationalization issue. Almost equally important is the problem of identifying the language of a document. Many aspects of document presentation depend not only on the character set, but also on the language of the text.
For example, as I've mentioned before, the same ideographs are used in many Far East languages, but in each language they are rendered with slightly different glyphs and pronounced quite differently. Also, different languages using the same character set may differ greatly with respect to hyphenation, spacing, use of punctuation, and so on.
To this end, HTML 4.0 introduces the new LANG attribute, which can be used with most HTML elements to describe the language of the element contents. A "language" in this context is defined as "spoken (or written) by human beings for communication of information to other human beings; computer languages are explicitly excluded." For example:
<P LANG="fr">Ce paragraphe est en français</P>
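Because LANG can appear on most elements, a document may set one language for the whole page and override it for individual parts. The following is a minimal sketch, assuming the inheritance behavior described in the HTML 4.0 specification, where an element without its own LANG takes the language of its nearest ancestor that has one. The page title and the sample text are made up for illustration.

```html
<HTML LANG="en">
<HEAD><TITLE>Mixed-language page</TITLE></HEAD>
<BODY>
<!-- No LANG here, so the paragraph inherits "en" from the HTML element -->
<P>This paragraph is treated as English.</P>
<!-- LANG="fr" overrides the inherited value for this paragraph only -->
<P LANG="fr">Ce paragraphe est en français, avec un mot
<!-- A nested element can switch the language again for a single word -->
<EM LANG="en">English</EM> au milieu.</P>
</BODY>
</HTML>
```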
The LANG attribute may take as its value a two-letter abbreviated code (or tag) for the language. The list of these codes is defined by the ISO 639 standard; these codes should not be confused with country codes (for example, uk as a language code means Ukrainian, not United Kingdom).
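To make the distinction between language codes and country codes concrete, here is a short sketch using two ISO 639 codes. The sample sentences are illustrative only.

```html
<!-- "uk" is the ISO 639 language code for Ukrainian, not a country code -->
<P LANG="uk">Це речення написано українською мовою.</P>
<!-- English text is tagged "en" regardless of the country it comes from -->
<P LANG="en">This sentence is written in English.</P>
```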
Also, extended identifiers may be used to designate different dialects or writing systems of a language, identify the country in which it is used, and so forth. These extended identifiers are based on two-letter codes with the addition of subtags separated by a hyphen (-), for example:
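As an illustration, the sketch below uses a few extended identifiers of the kind RFC 1766 describes, each made of a two-letter language code followed by a two-letter country subtag. These particular values and sentences were chosen for this example and are not taken from the original list.

```html
<!-- Same language, different countries: spelling conventions differ -->
<P LANG="en-US">The color of the theater curtain is gray.</P>
<P LANG="en-GB">The colour of the theatre curtain is grey.</P>
<!-- French as used in Canada -->
<P LANG="fr-CA">Ce paragraphe est en français canadien.</P>
```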
A registry of such extended language identifiers is maintained by IANA. All LANG values are case insensitive; their complete syntax is defined by RFC 1766. Another useful resource is the document where most known languages are listed along with the character sets they use.
Revised: Jun. 16, 1997