HTML Unleashed: Internationalizing HTML
Language-Specific Presentation Markup
In a multilanguage environment, you may need to specify in HTML
some aspects of text presentation, such as the writing direction (left
to right or right to left), punctuation peculiarities, and so on.
These aspects can usually be derived from the language of the text
(see the preceding section), but sometimes you
may need to specify this information without specifying the language,
or to override the language's default values. Also, some
presentation aspects (such as quotation marks) require additional
markup even if a language is specified.
RFC 2070 introduces and HTML 4.0 adopts a whole bouquet of new HTML
elements, attributes, and entities for this sort of presentation
markup. These new features are summarized in the following sections.
While most Western languages are written from left to right,
languages such as Arabic and Hebrew are written from right to left.
When such text is intermingled with text of the
opposite direction (resulting in bidirectional, or BIDI,
text), special markup may be necessary to resolve ambiguity.
The Unicode standard has a number of direction-related provisions. Each
Unicode character is assigned a bidirectional category
that may take a number of different values, such as
left-to-right, right-to-left, number separator, or neutral (for
example, white space). Some characters (such as parentheses)
are marked as mirrored: their appearance depends on the text
direction (in right-to-left text, an opening parenthesis should take
the appearance of a closing one and vice versa).
To support this behavior, RFC 2070 introduces directional markup
tools of three types. The first type consists of the left-to-right and
right-to-left marks, which behave exactly like zero-width spaces having
the corresponding direction properties. These marks are taken directly
from the Unicode inventory, so in HTML they are implemented as entities
expanding into the corresponding Unicode characters:
<!ENTITY lrm CDATA "&#8206;" -- left-to-right mark -->
<!ENTITY rlm CDATA "&#8207;" -- right-to-left mark -->
Direction marks may be used when, for example, a double quote
(which doesn't have a direction of its own, but is not a mirrored
character either) sits between a Latin and a Hebrew character; in
this situation, the displayed position of the quote depends on whether
it is assumed to belong to the left-to-right or the right-to-left text
stream. By placing an invisible direction mark
(&lrm; or &rlm;) on one side of the
quote, you can ensure that the quote is surrounded by characters of
the same directionality, thereby resolving the ambiguity.
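As a minimal sketch of this technique, the two fragments below differ
only in the direction mark placed next to the quote. The Hebrew word
(written as numeric character references) and the Latin letters are
placeholder text chosen for illustration, not taken from RFC 2070:

```html
<!-- Placeholder text: a Hebrew word followed by a quote and Latin letters -->

<!-- &rlm; after the quote: the quote sits between two right-to-left
     characters, so it is treated as part of the Hebrew run -->
<P>&#1513;&#1500;&#1493;&#1501;"&rlm;abc</P>

<!-- &lrm; before the quote: the quote sits between two left-to-right
     characters, so it is treated as part of the Latin run -->
<P>&#1513;&#1500;&#1493;&#1501;&lrm;"abc</P>
```

The marks themselves are invisible; only the position of the quote
relative to the two runs should change.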
The second type of direction markup is represented by the new
DIR attribute that, like LANG, can be used
with nearly all HTML tags to indicate the writing direction of the
text in the element's contents. Sometimes you may need to
indicate the basic writing direction of a piece of text; also,
explicit direction markup is critical when there are two or more
levels of nested contra-directional text (for an example, refer to
RFC 2070).
The two possible values for the DIR attribute are the
strings rtl (right-to-left) and ltr
(left-to-right). As with CSS properties, when no existing element
delimits the text in question, you can apply the DIR attribute
by wrapping the text in a SPAN element, which serves as a
neutral container. If the DIR attribute is omitted,
the element inherits the writing direction of its parent
element. The default direction of the entire HTML document is left
to right.
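The following fragment is an illustrative sketch of these rules; the
text content is a placeholder, and only the DIR values matter:

```html
<!-- A right-to-left paragraph containing a left-to-right phrase -->
<P DIR="rtl">
  &#1513;&#1500;&#1493;&#1501;
  <!-- SPAN has no meaning of its own; it only carries the DIR attribute -->
  <SPAN DIR="ltr">HTML 4.0</SPAN>
  &#1513;&#1500;&#1493;&#1501;
</P>

<!-- No DIR here: the paragraph inherits its direction from its parent,
     which by default is left to right -->
<P>Plain left-to-right text.</P>
```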
For brevity, definitions of the DIR and LANG attributes are packed
into one parameter entity in the HTML 4.0 DTD:
<!ENTITY % i18n
"lang NAME #IMPLIED -- RFC 1766 language value --
dir (ltr|rtl) #IMPLIED -- default directionality --"
>
Later, the %i18n; entity is added to the
ATTLIST declarations for the majority of HTML elements.
Finally, the third type of direction markup is represented by the
new phrase-level BDO element (BDO stands for
BiDirectional Override). It is used when a mix of
left-to-right and right-to-left characters should be displayed in a
single direction, overriding the intrinsic directional properties of
the characters. For the BDO element, DIR
is the only obligatory attribute.
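A short sketch of the override, using placeholder text (the Hebrew
letters are written as numeric character references):

```html
<!-- Hebrew letters would normally run right to left; BDO forces them
     to be laid out left to right, in the order they are stored -->
<P>Stored order: <BDO DIR="ltr">&#1513;&#1500;&#1493;&#1501;</BDO></P>

<!-- The reverse case: Latin letters forced into right-to-left order -->
<P>Reversed: <BDO DIR="rtl">abc</BDO></P>
```

An override of this kind can be handy when you want to show characters
in their stored sequence rather than in their natural reading order.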
In some writing systems (most notably Arabic), a letter's glyph may
differ depending on the context, that is, on whether the
letter is preceded or followed by other letters. Arabic letters
are modeled after handwritten cursive prototypes, so a letter in the
middle of a word is drawn joined to its neighbors and therefore may
look quite different from its isolated form.
As a rule, software capable of displaying Arabic handles these
differences automatically. But sometimes it's necessary to control the
joining behavior, for example, to show a standalone letter with its
cursive joiners. For this, Unicode provides two special characters,
both invisible and of zero width: the first forces joining
of adjacent characters where normally no joining would occur, and the
second prevents joining that would normally take place. HTML 4.0
provides access to these characters via the
&zwj; and &zwnj; mnemonic character
entities:
<!ENTITY zwnj CDATA "&#8204;" -- zero width non-joiner -->
<!ENTITY zwj CDATA "&#8205;" -- zero width joiner -->
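As an illustrative sketch, assuming a user agent that supports Arabic
shaping, the joiner can be used to show the contextual forms of a
single letter. The letters chosen here (heh, &#1607;, and beh,
&#1576;) are placeholders:

```html
<!-- The Arabic letter heh on its own, and with joiners on either side -->
<P>Isolated form: &#1607;</P>
<P>Initial form: &#1607;&zwj;</P>
<P>Medial form: &zwj;&#1607;&zwj;</P>
<P>Final form: &zwj;&#1607;</P>

<!-- beh and heh would normally join; &zwnj; between them keeps them apart -->
<P>&#1576;&zwnj;&#1607;</P>
```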
A number of different styles exist for rendering quotation marks around
short, in-text quotations. While the English language always uses
quotes “like this”, French has « comme ça », and
German prefers „wie hier“. Moreover, nested quotations sometimes
use different styles; for example, Russian tradition uses French
quotes (without separating spaces) on the upper level and German
quotes for quotations within quotations. Finally, it is
desirable to be able to render the same text with rich quotes
(“ ”) in a
graphics environment but with plain double quotes of 7-bit ASCII
("") in text-mode browsers.
To account for these differences, HTML 4.0 offers the new
phrase-level Q element whose content is surrounded by a
pair of quotation marks rendered in accordance with the language of
the text, the level of nesting, and the display capabilities
available. For example:
<P LANG="en">The English language always uses
quotes <Q>like this</Q>,
French has <Q LANG="fr">comme ça</Q>,
and German prefers <Q LANG="de">wie hier</Q>.</P>
Unfortunately, this solution is not backwards compatible; most existing
software will just ignore Q tags without displaying even
the plain ASCII quotes, which can often damage the meaning of the
text. Thus, practical use of Q elements is not encouraged
until the majority of user agent software provides support for the
feature.
Alignment and Hyphenation
Traditions of using text justification modes in other languages may
be quite different from those of English. That is why RFC 2070
introduces the optional ALIGN attribute that may be used
with most block-level elements (namely P, HR,
H1 to H6, OL, UL,
DIV, MENU, LI, BLOCKQUOTE,
and ADDRESS) with the values of left,
right, center, and justify. RFC 2070
suggests that the default ALIGN value for texts with
left-to-right writing direction should be left, and for
right-to-left texts, right.
This is a significant improvement over HTML 3.2, where the list of
elements supporting this attribute is shorter (only DIV,
H1 to H6, HR, TD, and
P) and the value "justify" is not allowed.
Judging from the DTD, HTML 4.0 takes a halfway approach: it adopts the
justify option but leaves the list of elements accepting
the ALIGN attribute the same as in HTML 3.2.
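A brief sketch of the attribute in use, restricted to elements that
accept ALIGN in both HTML 3.2 and HTML 4.0 (the text content is
placeholder material):

```html
<H2 ALIGN="center">Centered heading</H2>

<!-- The justify value is the one not allowed in HTML 3.2 -->
<P ALIGN="justify">This paragraph is set flush with both margins.</P>

<!-- Right-to-left text, explicitly aligned to the right margin -->
<DIV DIR="rtl" ALIGN="right">
  <P>&#1513;&#1500;&#1493;&#1501;</P>
</DIV>
```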
As for hyphenation, user agents are supposed to apply
language-dependent rules to break words if this is necessary for
proper display. In complex or critical cases, RFC 2070
suggests that HTML authors use the mnemonic entity
&shy;, which stands for the SOFT HYPHEN
character present in Unicode as well as in all of the ISO 8859 family
and other character sets.
This invisible character marks a point where a word break can
occur; if the word is indeed broken there, the character is displayed
as an ordinary hyphen (-). Unfortunately, common browsers
do not implement this behavior; what's worse, both Netscape Navigator
and Microsoft Internet Explorer always display a hyphen (-)
in place of a soft hyphen, which prevents you from using this
character at all.
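For illustration, and assuming a user agent that implements soft
hyphens as RFC 2070 describes, a long word could be marked up as
follows (the break points are an author's choice, shown only as an
example):

```html
<!-- Each &shy; marks a permitted break point; a hyphen should appear
     only where the word is actually broken at the end of a line -->
<P>inter&shy;na&shy;tion&shy;al&shy;iza&shy;tion</P>
```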
For better hyphenation control, a new HYPH element was
proposed that can handle complex cases in which breaking a
word is accompanied by a change in its spelling (for example, the
German word backen becomes bak-ken when hyphenated).
However, the HYPH element was not included in either RFC
2070 or HTML 4.0.