8 Language information and text direction
This section of the document discusses two important issues that affect the
internationalization of HTML: specifying the language (the lang
attribute) and direction (the
dir attribute) of text in a document.
Attribute definitions
- lang = language-code [CI]
- This attribute specifies the base language of an element's attribute values
and text content. The default value of this attribute is unknown.
Language information specified via the lang
attribute may be used by a user agent to control rendering in a variety of
ways. Some situations where author-supplied language information may be helpful
include:
- Assisting search engines
- Assisting speech synthesizers
- Helping a user agent select glyph variants for high quality typography
- Helping a user agent choose a set of quotation marks
- Helping a user agent make decisions about hyphenation, ligatures, and spacing
- Assisting spell checkers and grammar checkers
The
lang attribute specifies the language of element content and
attribute values; whether it is relevant
for a given attribute depends on the syntax and semantics of the attribute and
the operation involved.
The intent of the lang attribute is to allow user agents to render
content more meaningfully based on accepted cultural practice for a given
language. This does not imply that user agents should render characters that
are atypical for a particular language in less meaningful ways; user agents
must make a best attempt to render all characters,
regardless of the value specified by lang.
For instance, if characters from the Greek alphabet appear in the midst of
English text:
<P><Q lang="en">Her super-powers were the result of
γ-radiation,</Q> he explained.</P>
a user agent (1) should try to render the English content in an appropriate
manner (e.g., in its handling of the quotation marks) and (2) must make a best
attempt to render γ even though it is not an English character.
Please consult the section on
undisplayable characters for related information.
The
lang attribute's value is a language code that identifies a natural
language spoken, written, or otherwise used for the communication of
information among people. Computer languages are explicitly excluded from
language codes.
[RFC1766] defines and explains the language codes that must be used in HTML
documents.
Briefly, language codes consist of a primary code and a possibly empty
series of subcodes:
language-code = primary-code ( "-" subcode )*
Here are some sample language codes:
- "en": English
- "en-US": the U.S. version of English.
- "en-cockney": the Cockney version of English.
- "i-navajo": the Navajo language spoken by some Native Americans.
- "x-klingon": The primary tag "x" indicates an experimental language
tag
Two-letter primary codes are reserved for [ISO639] language
abbreviations. Two-letter codes include fr (French), de (German), it (Italian),
nl (Dutch), el (Greek), es (Spanish), pt (Portuguese), ar (Arabic), he
(Hebrew), ru (Russian), zh (Chinese), ja (Japanese), hi (Hindi), ur (Urdu), and
sa (Sanskrit).
Any two-letter subcode is understood to be an [ISO3166] country
code.
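As a rough illustration of the grammar above, the following Python sketch splits a language code into its primary code and subcodes and labels the parts using only the rules stated in this section. The function names parse_language_code and describe_language_code are hypothetical, and the limit of one to eight ASCII letters per code is an assumption taken from [RFC1766] rather than from this section. Codes are lowercased because the attribute is case-insensitive ([CI]).

```python
import re

# language-code = primary-code ( "-" subcode )*
# Assumption: each code is 1 to 8 ASCII letters, the limit given in RFC 1766.
_LANGUAGE_CODE = re.compile(r"[A-Za-z]{1,8}(?:-[A-Za-z]{1,8})*")


def parse_language_code(value):
    """Split a language code into (primary_code, [subcodes]), lowercased."""
    if not _LANGUAGE_CODE.fullmatch(value):
        raise ValueError(f"not a well-formed language code: {value!r}")
    primary, *subcodes = value.lower().split("-")
    return primary, subcodes


def describe_language_code(value):
    """Label the parts of a code using only the rules stated in this section."""
    primary, subcodes = parse_language_code(value)
    notes = []
    if len(primary) == 2:
        notes.append(f"{primary}: ISO 639 language abbreviation")
    elif primary == "x":
        notes.append("x: experimental language tag")
    for subcode in subcodes:
        if len(subcode) == 2:
            notes.append(f"{subcode}: ISO 3166 country code")
    return notes
```

Codes such as "i-navajo" parse correctly but receive no labels here, since this section gives no rule for the "i" primary code or for longer subcodes.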
8.1.2 Inheritance of language codes
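An element inherits language code information according to the following
order of precedence (highest to lowest):
- The lang attribute set for the element itself.
- The closest parent element that has the lang attribute set (i.e., the
lang attribute is inherited).
- The HTTP "Content-Language" header (which may be configured in a server).
- User agent default values and user preferences.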
In this example, the primary language of the document is French ("fr"). One
paragraph is declared to be in Spanish ("es"), after which the primary language
returns to French. The following paragraph includes an embedded Japanese ("ja")
phrase, after which the primary language returns to French.
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN"
"http://www.w3.org/TR/html4/strict.dtd">
<HTML lang="fr">
<HEAD>
<TITLE>Un document multilingue</TITLE>
</HEAD>
<BODY>
...Interpreted as French...
<P lang="es">...Interpreted as Spanish...
<P>...Interpreted as French again...
<P>...French text interrupted by <EM lang="ja">some
Japanese</EM> French begins here again...
</BODY>
</HTML>
Note. Table cells may inherit lang
values not from their parent but from the first cell in a span. Please consult
the section on alignment
inheritance for details.
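The inheritance rules of this section can be sketched in Python as a walk up the element tree. This is a minimal sketch, not a description of any user agent: the Element class and the effective_lang function are hypothetical, and the sketch assumes the precedence is the element's own lang attribute, then the closest ancestor that sets lang, then an HTTP "Content-Language" value, then a user agent default. The table cell exception described in the note above is not modeled.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Element:
    """Hypothetical stand-in for a parsed element; not a real DOM API."""
    name: str
    lang: Optional[str] = None
    parent: Optional["Element"] = None


def effective_lang(element, content_language=None, ua_default=None):
    """Return the language code in effect for element, or None if unknown."""
    node = element
    while node is not None:
        if node.lang:              # own attribute first, then closest ancestor
            return node.lang
        node = node.parent
    # Assumed fallbacks: HTTP Content-Language, then a user agent default.
    return content_language or ua_default


# The multilingual example document from this section.
html = Element("HTML", lang="fr")
body = Element("BODY", parent=html)
p_spanish = Element("P", lang="es", parent=body)
p_french = Element("P", parent=body)
em_japanese = Element("EM", lang="ja", parent=p_french)
```

With this tree, effective_lang(p_french) yields "fr" through inheritance, while p_spanish and em_japanese report their own values.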
8.1.3 Interpretation of language codes
In the context of HTML, a language code should be interpreted by user agents
as a hierarchy of tokens rather than a single token. When a user agent adjusts
rendering according to language information (say, by comparing style sheet
language codes and lang values), it should always favor an exact match, but
should also consider matching primary codes to be sufficient. Thus, if the
lang attribute value of "en-US" is set for the HTML
element, a user agent should prefer style information that matches "en-US"
first, then the more general value "en".
Note. Language code hierarchies do not guarantee that
all languages with a common prefix will be understood by those fluent in one or
more of those languages. They do allow a user to request this commonality when
it is true for that user.
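The matching behavior described in this section (prefer an exact match, then accept a matching primary code) might be sketched as follows. The function name best_language_match and the idea of a list of available style sheet language codes are assumptions made for illustration only.

```python
def best_language_match(lang_value, available_codes):
    """Return the available code that best matches lang_value, or None."""
    wanted = lang_value.lower()
    # Language codes are case-insensitive, so compare lowercased forms.
    by_lowercase = {code.lower(): code for code in available_codes}
    if wanted in by_lowercase:
        # An exact match is always favored.
        return by_lowercase[wanted]
    # Otherwise a match on the primary code is considered sufficient.
    primary = wanted.split("-")[0]
    return by_lowercase.get(primary)
```

For a lang value of "en-US", style information labeled "en-US" is chosen when present; if only "en" is available, that more general value is used instead.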
Attribute definitions
- dir = LTR | RTL [CI]
- This attribute specifies the base direction of directionally neutral text
(i.e., text that doesn't have inherent directionality as defined in
[UNICODE]) in an element's content and attribute values. It also specifies
the directionality of tables.
Possible values:
- LTR: Left-to-right text or table.
- RTL: Right-to-left text or table.
In addition to specifying the language of a document with the lang
attribute, authors may need to specify the base
directionality (left-to-right or right-to-left) of portions of a
document's text, of table structure, etc. This is done with the dir
attribute.
The [UNICODE] specification assigns directionality to characters and
defines a (complex) algorithm for determining the proper directionality of
text. If a document does not contain a displayable right-to-left character, a
conforming user agent is not required to apply the [UNICODE] bidirectional
algorithm. If a document contains right-to-left characters, and if the user
agent displays these characters, the user agent must use the bidirectional
algorithm.
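Whether a piece of text contains right-to-left characters can be checked against the Unicode character database. The sketch below uses Python's standard unicodedata module; the function name contains_right_to_left is hypothetical, and the sketch assumes that the bidirectional classes "R" and "AL" are the ones of interest. It does not test whether a character is displayable, so it is only an approximation of the condition stated above.

```python
import unicodedata

# Bidirectional classes that Unicode assigns to right-to-left characters:
# "R" (for example Hebrew letters) and "AL" (Arabic letters).
_RTL_CLASSES = {"R", "AL"}


def contains_right_to_left(text):
    """Return True if any character in text has a right-to-left class."""
    return any(unicodedata.bidirectional(ch) in _RTL_CLASSES for ch in text)
```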
Although Unicode specifies special characters that deal with text direction,
HTML offers higher-level markup constructs that do the same thing: the dir
attribute (do not confuse with the DIR element) and the BDO
element. Thus, to express a Hebrew quotation, it is more intuitive to write
<Q lang="he" dir="rtl">...a Hebrew quotation...</Q>
than the equivalent with Unicode references:
&#x202B;&#x05F4;...a Hebrew quotation...&#x05F4;&#x202C;
User agents must not use the lang
attribute to determine text directionality.
The
dir attribute is inherited and may be overridden. Please consult the
section on the inheritance of text direction
information for details.
The following example illustrates the expected behavior of the bidirectional
algorithm. It involves English, a left-to-right script, and Hebrew, a
right-to-left script.
Consider the following example text:
english1 HEBREW2 english3 HEBREW4 english5 HEBREW6
The characters in this example (and in all related examples) are stored in
the computer the way they are displayed here: the first character in the file
is "e", the second is "n", and the last is "6".
Suppose the predominant language of the document containing this paragraph
is English. This means that the base direction is left-to-right. The correct
presentation of this line would be:
english1 2WERBEH english3 4WERBEH english5 6WERBEH
         <------          <------          <------
            H                H                H
------------------------------------------------->
                        E
The dotted lines indicate the structure of the sentence: English
predominates and some Hebrew text is embedded. Achieving the correct
presentation requires no additional markup since the Hebrew fragments are
reversed correctly by user agents applying the bidirectional algorithm.
If, on the other hand, the predominant language of the document is Hebrew,
the base direction is right-to-left. The correct presentation is therefore:
6WERBEH english5 4WERBEH english3 2WERBEH english1
        ------->         ------->         ------->
           E                E                E
<-------------------------------------------------
                        H
In this case, the whole sentence has been presented as right-to-left and the
embedded English sequences have been properly reversed by the bidirectional
algorithm.
The Unicode bidirectional algorithm requires a base text direction for text
blocks. To specify the base direction of a block-level element, set the
element's
dir attribute. The default value of the dir
attribute is "ltr" (left-to-right text).
When the
dir attribute is set for a block-level element, it remains in effect
for the duration of the element and any nested block-level elements. Setting
the
dir attribute on a nested element overrides the inherited value.
To set the base text direction for an entire document, set the dir
attribute on the HTML element.
For example:
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN"
"http://www.w3.org/TR/html4/strict.dtd">
<HTML dir="RTL">
<HEAD>
<TITLE>...a right-to-left title...</TITLE>
</HEAD>
...right-to-left text...
<P dir="ltr">...left-to-right text...</P>
<P>...right-to-left text again...</P>
</HTML>
Inline elements, on the other hand, do not inherit the dir
attribute. This means that an inline element without a dir
attribute does not open an additional level of embedding with
respect to the bidirectional algorithm. (Here, an element is considered to be
block-level or inline based on its default presentation. Note that the INS and DEL
elements can be block-level or inline depending on their context.)
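The difference between block-level and inline elements with respect to dir can be sketched as follows. The Element class and both function names are hypothetical, and the block flag simply records whether an element is block-level by its default presentation, as described above.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Element:
    """Hypothetical stand-in for a parsed element; not a real DOM API."""
    name: str
    block: bool                       # block-level by default presentation?
    dir: Optional[str] = None
    parent: Optional["Element"] = None


def base_direction(element):
    """Direction in effect: nearest dir on the element or an ancestor."""
    node = element
    while node is not None:
        if node.dir:
            return node.dir.lower()   # dir values are case-insensitive
        node = node.parent
    return "ltr"                      # default value of the dir attribute


def opens_embedding(element):
    """An inline element opens an embedding level only with its own dir."""
    return not element.block and element.dir is not None


# The right-to-left example document, plus two illustrative SPAN elements.
html = Element("HTML", True, dir="RTL")
p_ltr = Element("P", True, dir="ltr", parent=html)
p_plain = Element("P", True, parent=html)
span_plain = Element("SPAN", False, parent=p_plain)
span_rtl = Element("SPAN", False, dir="RTL", parent=p_ltr)
```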
8.2.3 Setting the direction of embedded text
The [UNICODE] bidirectional algorithm automatically reverses embedded
character sequences according to their inherent directionality (as illustrated
by the previous examples). However, in general only one level of embedding can
be accounted for. To achieve additional levels of embedded direction changes,
you must make use of the dir attribute on an inline element.
Consider the same example text as before:
english1 HEBREW2 english3 HEBREW4 english5 HEBREW6
Suppose the predominant language of the document containing this paragraph
is English. Furthermore, the above English sentence contains a Hebrew section
extending from HEBREW2 through HEBREW4 and the Hebrew section contains an
English quotation (english3). The desired presentation of the text is thus:
english1 4WERBEH english3 2WERBEH english5 6WERBEH
                 ------->
                    E
         <-----------------------
                    H
------------------------------------------------->
                        E
To achieve two embedded direction changes, we must supply additional
information, which we do by delimiting the second embedding explicitly. In this
example, we use the SPAN element and the dir attribute to mark up the text:
english1 <SPAN dir="RTL">HEBREW2 english3 HEBREW4</SPAN> english5 HEBREW6
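The presentations shown in these examples can be reproduced with a small toy model. The sketch below is not the [UNICODE] bidirectional algorithm: it only follows the convention of the examples, in which a word beginning with an uppercase letter stands for right-to-left text, and it represents an element with a dir attribute as a (direction, items) tuple. The function name display_order and this data layout are assumptions made for illustration.

```python
def display_order(items, base="ltr"):
    """Return the left-to-right screen order for one embedding level.

    Each entry of items is either a word (str) or a tuple
    (direction, nested_items) standing for an element with a dir attribute.
    """
    resolved = []  # pairs of (is_rtl, text as it appears on screen)
    for item in items:
        if isinstance(item, str):
            is_rtl = item[0].isupper()
            resolved.append((is_rtl, item[::-1] if is_rtl else item))
        else:
            direction, nested = item
            direction = direction.lower()
            resolved.append((direction == "rtl",
                             display_order(nested, direction)))

    # Group neighbours that share a direction into runs.
    runs = []
    for is_rtl, text in resolved:
        if runs and runs[-1][0] == is_rtl:
            runs[-1][1].append(text)
        else:
            runs.append((is_rtl, [text]))

    # A right-to-left run is laid out from right to left, so its pieces are
    # reversed; the runs themselves then follow the base direction.
    pieces = [" ".join(reversed(texts) if is_rtl else texts)
              for is_rtl, texts in runs]
    if base.lower() == "rtl":
        pieces.reverse()
    return " ".join(pieces)


flat = ["english1", "HEBREW2", "english3", "HEBREW4", "english5", "HEBREW6"]
marked_up = ["english1", ("RTL", ["HEBREW2", "english3", "HEBREW4"]),
             "english5", "HEBREW6"]
```

Calling display_order(flat, "ltr") and display_order(flat, "rtl") gives the two unmarked presentations, and display_order(marked_up, "ltr") gives the presentation obtained with the SPAN markup.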
Authors may also use special Unicode characters to achieve multiple embedded
direction changes. To achieve left-to-right embedding, surround embedded text
with the characters LEFT-TO-RIGHT EMBEDDING ("LRE", hexadecimal 202A) and POP
DIRECTIONAL FORMATTING ("PDF", hexadecimal 202C). To achieve right-to-left
embedding, surround embedded text with the characters RIGHT-TO-LEFT EMBEDDING
("RLE", hexadecimal 202B) and PDF.
Using HTML directionality markup with Unicode
characters. Authors and designers of authoring software should be
aware that conflicts can arise if the dir attribute is used on inline
elements (including BDO) concurrently with the corresponding
[UNICODE] formatting characters. Preferably one or the other should be used
exclusively. The markup method offers a better guarantee of document structural
integrity and alleviates some problems when editing bidirectional HTML text
with a simple text editor, but some software may be more apt at using the
[UNICODE] characters. If both methods are used, great care should be
exercised to ensure proper nesting of markup and directional embedding or
override; otherwise, rendering results are undefined.
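For authoring software that chooses the Unicode characters rather than markup, surrounding a fragment with the embedding characters might look like the sketch below. The constant names and the embed function are hypothetical; the code points are those of LEFT-TO-RIGHT EMBEDDING, RIGHT-TO-LEFT EMBEDDING, and POP DIRECTIONAL FORMATTING.

```python
LRE = "\u202A"  # LEFT-TO-RIGHT EMBEDDING
RLE = "\u202B"  # RIGHT-TO-LEFT EMBEDDING
PDF = "\u202C"  # POP DIRECTIONAL FORMATTING


def embed(text, direction):
    """Surround text with the embedding character for direction and PDF."""
    openers = {"ltr": LRE, "rtl": RLE}
    return openers[direction.lower()] + text + PDF


# Character-based counterpart of the SPAN dir="RTL" example.
line = "english1 " + embed("HEBREW2 english3 HEBREW4", "RTL") + " english5 HEBREW6"
```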
8.2.4 Overriding the bidirectional algorithm: the BDO element
Start tag: required, End tag: required
Attribute definitions
- dir = LTR | RTL [CI]
- This mandatory attribute specifies the base direction of the element's text
content. This direction overrides the inherent directionality of characters as
defined in [UNICODE]. Possible values:
- LTR: Left-to-right text.
- RTL: Right-to-left text.
Attributes defined elsewhere
The bidirectional algorithm and the dir attribute generally suffice to
manage embedded direction changes. However, some situations may arise when the
bidirectional algorithm results in incorrect presentation. The BDO
element allows authors to turn off the bidirectional algorithm
for selected fragments of text.
Consider a document containing the same text as before:
english1 HEBREW2 english3 HEBREW4 english5 HEBREW6
but assume that this text has already been put in visual order. One reason
for this may be that the MIME standard ([RFC2045],
[RFC1556]) favors visual order, i.e., that right-to-left character
sequences are inserted right-to-left in the byte stream. In an email, the above
might be formatted, including line breaks, as:
english1 2WERBEH english3
4WERBEH english5 6WERBEH
This conflicts with the [UNICODE] bidirectional
algorithm, because that algorithm would invert 2WERBEH,
4WERBEH, and 6WERBEH a second time, displaying the Hebrew words
left-to-right instead of right-to-left.
The solution in this case is to override the bidirectional algorithm by
putting the email excerpt in a PRE element (to conserve line breaks) and each
line in a BDO element, whose dir attribute is set to LTR:
<PRE>
<BDO dir="LTR">english1 2WERBEH english3</BDO>
<BDO dir="LTR">4WERBEH english5 6WERBEH</BDO>
</PRE>
This tells the bidirectional algorithm "Leave me left-to-right!" and would
produce the desired presentation:
english1 2WERBEH english3
4WERBEH english5 6WERBEH
The
BDO element should be used in scenarios where absolute control over
sequence order is required (e.g., multi-language part numbers). The
dir attribute is mandatory for this element.
Authors may also use special Unicode characters to override the
bidirectional algorithm: LEFT-TO-RIGHT OVERRIDE (hexadecimal 202D) or
RIGHT-TO-LEFT OVERRIDE (hexadecimal 202E). The POP DIRECTIONAL FORMATTING
(hexadecimal 202C) character ends either bidirectional override.
Note. Recall that conflicts can arise if the dir
attribute is used on inline elements (including BDO) concurrently with the
corresponding [UNICODE] formatting characters.
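An authoring tool could at least verify that the Unicode formatting characters in a text fragment are properly paired before mixing them with markup. The following sketch is an authoring-time check only, under the assumption that every embedding or override character should be closed by its own POP DIRECTIONAL FORMATTING character; the function name formatting_is_balanced is hypothetical, and the check does not examine how the characters nest relative to elements.

```python
# Characters that open an embedding or an override (hexadecimal 202A, 202B,
# 202D, 202E) and the character that closes either one (hexadecimal 202C).
OPENERS = {"\u202A", "\u202B", "\u202D", "\u202E"}
PDF = "\u202C"


def formatting_is_balanced(text):
    """Return True if every opener has a later PDF and no PDF is unmatched."""
    depth = 0
    for ch in text:
        if ch in OPENERS:
            depth += 1
        elif ch == PDF:
            if depth == 0:
                return False   # a PDF with nothing left to close
            depth -= 1
    return depth == 0
```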
Bidirectionality and character encoding. According to
[RFC1555] and [RFC1556], there are special conventions for the use of
"charset" parameter values to indicate bidirectional treatment in MIME mail, in
particular to distinguish between visual, implicit, and explicit
directionality. The parameter value "ISO-8859-8" (for Hebrew) denotes visual
encoding, "ISO-8859-8-i" denotes implicit bidirectionality, and "ISO-8859-8-e"
denotes explicit directionality.
Because HTML uses the Unicode bidirectionality algorithm, conforming
documents encoded using ISO 8859-8 must be labeled as "ISO-8859-8-i". Explicit
directional control is also possible with HTML, but cannot be expressed with
ISO 8859-8, so "ISO-8859-8-e" should not be used.
The value "ISO-8859-8" implies that the document is formatted visually,
misusing some markup (such as
TABLE with right alignment and no line wrapping)
to ensure reasonable display on older user agents that do not handle
bidirectionality. Such documents do not conform to the present specification.
If necessary, they can be made to conform to the current specification (and at
the same time will be displayed correctly on older user agents) by adding BDO
markup where necessary. Contrary to what is said in
[RFC1555] and [RFC1556], ISO-8859-6 (Arabic) does not denote
visual ordering.
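The "charset" conventions described above amount to a small lookup table, sketched here in Python. The names bidi_treatment and label_conforms are hypothetical. Labels are compared without regard to case, on the assumption that "charset" parameter values are case-insensitive.

```python
# Bidirectional treatment implied by each Hebrew "charset" label.
_BIDI_TREATMENT = {
    "iso-8859-8": "visual",
    "iso-8859-8-i": "implicit",
    "iso-8859-8-e": "explicit",
}


def bidi_treatment(charset):
    """Return "visual", "implicit", "explicit", or None for other labels."""
    return _BIDI_TREATMENT.get(charset.strip().lower())


def label_conforms(charset):
    """True only for the label required of documents encoded in ISO 8859-8."""
    return bidi_treatment(charset) == "implicit"
```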
Since ambiguities sometimes arise as to the directionality of certain
characters (e.g., punctuation), the [UNICODE] specification
includes characters to enable their proper resolution. Also, Unicode includes
some characters to control joining behavior where this is necessary (e.g., some
situations with Arabic letters). HTML 4 includes character references for these characters.
The following DTD excerpt presents some of the directional entities:
<!ENTITY zwnj CDATA "&#8204;"--=zero width non-joiner-->
<!ENTITY zwj CDATA "&#8205;"--=zero width joiner-->
<!ENTITY lrm CDATA "&#8206;"--=left-to-right mark-->
<!ENTITY rlm CDATA "&#8207;"--=right-to-left mark-->
The zwnj entity is used to block joining behavior in contexts
where joining will occur but shouldn't. The zwj entity does the
opposite; it forces joining when it wouldn't occur but should. For example, the
Arabic letter "HEH" is used to abbreviate "Hijri", the name of the Islamic
calendar system. Since the isolated form of "HEH" looks like the digit five as
employed in Arabic script (based on Indic digits), in order to prevent
confusing "HEH" as a final digit five in a year, the initial form of "HEH" is
used. However, there is no following context (i.e., a joining letter) to which
the "HEH" can join. The zwj character provides that context.
Similarly, in Persian texts, there are cases where a letter that normally
would join a subsequent letter in a cursive connection should not. The
character zwnj is used to block joining in such cases.
The other characters, lrm and rlm, are used to
force directionality of directionally neutral characters. For example, if a
double quotation mark comes between an Arabic (right-to-left) and a Latin
(left-to-right) letter, the direction of the quotation mark is not clear (is it
quoting the Arabic text or the Latin text?). The lrm and
rlm characters have a directional property but no width and no word/line
break property. Please consult [UNICODE] for more
details.
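The properties of these four characters can be inspected with Python's standard library, as in the sketch below. The helper name describe_entity is hypothetical; html.entities.name2codepoint and the unicodedata functions are standard library facilities. The bidirectional class shows why lrm and rlm can force a direction, while zwnj and zwj carry no direction of their own.

```python
import unicodedata
from html.entities import name2codepoint


def describe_entity(name):
    """Return (code point, Unicode name, bidirectional class) for an entity."""
    char = chr(name2codepoint[name])
    return (f"U+{ord(char):04X}",
            unicodedata.name(char),
            unicodedata.bidirectional(char))


for entity in ("zwnj", "zwj", "lrm", "rlm"):
    print(entity, *describe_entity(entity))
```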
Mirrored character glyphs. In general, the
bidirectional algorithm does not mirror character glyphs but leaves them
unaffected. The exceptions are characters such as parentheses (see
[UNICODE], table 4-7). In cases where mirroring is desired, for example for
Egyptian Hieroglyphs, Greek Boustrophedon, or special design effects, this
should be controlled with styles.
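Which characters carry the mirrored property can be looked up in the Unicode character database. The sketch below uses Python's unicodedata.mirrored; the wrapper name is_mirrored is hypothetical.

```python
import unicodedata


def is_mirrored(char):
    """Return True if Unicode marks char as mirrored in bidirectional text."""
    return unicodedata.mirrored(char) == 1
```

Parentheses report True, whereas ordinary letters such as "a" report False.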
In general, using style sheets to change an element's visual rendering from
block-level to inline or vice-versa is straightforward. However, because the
bidirectional algorithm relies on the
inline/block-level distinction, special care must be taken during the
transformation.
When an inline element that does not have a dir attribute is transformed to
the style of a block-level element by a style sheet, it inherits the dir
attribute from its closest parent block element to define the base direction of
the block.
When a block element that does not have a dir attribute is transformed to
the style of an inline element by a style sheet, the resulting presentation
should be equivalent, in terms of bidirectional formatting, to the formatting
obtained by explicitly adding a
dir attribute (assigned the inherited value) to
the transformed element.