writing system information



Complex-text Languages

In the languages of the western world based on the Latin, Cyrillic and Greek scripts, there is no difference between how text is stored for data processing and how it is presented on a display or a printer. The text is read on horizontal lines from left to right, the lines progress from top to bottom and the characters are stored in a manner identical to how they are presented.

Not all the languages of the world have these characteristics.

In this document, complex-text languages are defined as those languages whose text has a different layout when presented than when it is stored for data processing. The term layout, used here interchangeably with the term format, refers to the shape of the characters and the direction of portions of the text.

An additional characteristic of complex-text languages (with the exception of Vietnamese) is that they do not distinguish between upper-case and lower-case characters.
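The absence of case can be checked against the Unicode character database. The following is a minimal sketch using Python's standard unicodedata module; the sample characters and the helper name has_case are chosen here for illustration and do not come from this document.

```python
import unicodedata

def has_case(ch: str) -> bool:
    """Return True if the character is an upper-case or lower-case letter."""
    return unicodedata.category(ch) in ("Lu", "Ll")

# Illustrative sample characters (an assumption of this sketch).
samples = {
    "Hebrew alef": "\u05d0",
    "Arabic alef": "\u0627",
    "Thai ko kai": "\u0e01",
    "Hangul syllable ga": "\uac00",
    "Vietnamese e with circumflex and dot below": "\u1ec7",
}

for name, ch in samples.items():
    print(f"{name}: category={unicodedata.category(ch)}, has_case={has_case(ch)}")
```

Only the Vietnamese letter, which is based on the Latin script, reports a case; the other samples are classified as letters without case.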

Typical complex-text languages are those with a bi-directional script. Usually they are written from right to left, with some portions of text, such as numbers and embedded Latin-based text, written from left to right. Bi-directional languages include the languages of the Middle East and Africa (Arabic, Hebrew, Urdu, Farsi, Yiddish, and so on). Other complex-text languages include some languages of Asia that do not limit their encoding to a double-byte scheme (Thai, Lao, Vietnamese, Korean, and so on).

There is nothing in these languages themselves that is more complex than in the Latin-based languages; they are special only in that the presented text does not necessarily look identical to the text as stored.

Though the term complex is used to describe the text of the bi-directional and some other Asian languages, enabling a program to work in these languages is relatively simple, once the peculiarities of these languages are understood.

Layout Transformations and Related Attributes

To enter, process and present a text in a complex-text language, it is necessary to perform transformations between the processing layouts and the presentation layouts. The processing layout is the layout of text when stored or processed. The presentation layout is the layout of text when presented on a display or a printer.

These transformations have to take into account specific text attributes, including directionality, shaping, composition of characters and national numbers. Text attributes that describe bi-directional writing systems are defined in Bi-directional Languages.

An internationalized application must be designed to deal automatically with this kind of transformation and related attributes.

Bi-directional Languages

The bi-directional languages are used mainly in the Middle East. They include Arabic, Urdu, Farsi, Hebrew and Yiddish.1

In a bi-directional language, the general flow of text proceeds horizontally from right to left, but numbers are written from left to right, the same way as they are written in English. In addition, if an English or another left-to-right language text (addresses, acronyms or quotations) is embedded, it is also written from left to right.
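The difference between stored order and presented order can be sketched in a few lines of Python. This is a deliberately simplified illustration, not the Unicode Bidirectional Algorithm: it assumes a right-to-left paragraph, treats digits and Latin letters as left-to-right, and resolves spaces and punctuation with a single rule. The function name visual_order and the sample string are assumptions made for this example.

```python
import unicodedata

RTL_CLASSES = {"R", "AL"}        # Hebrew letters, Arabic letters
LTR_CLASSES = {"L", "EN", "AN"}  # Latin letters, European and Arabic-Indic digits

def visual_order(stored: str) -> str:
    """Return the characters in left-to-right display order for a
    right-to-left paragraph (simplified sketch)."""
    # Step 1: classify each character as "L", "R" or neutral (None).
    dirs = []
    for ch in stored:
        cls = unicodedata.bidirectional(ch)
        if cls in LTR_CLASSES:
            dirs.append("L")
        elif cls in RTL_CLASSES:
            dirs.append("R")
        else:
            dirs.append(None)

    # Step 2: a run of neutrals joins a left-to-right segment only when it
    # has left-to-right characters on both sides; otherwise it follows the
    # right-to-left paragraph direction.
    n = len(dirs)
    resolved = dirs[:]
    i = 0
    while i < n:
        if dirs[i] is None:
            j = i
            while j < n and dirs[j] is None:
                j += 1
            before = dirs[i - 1] if i > 0 else "R"
            after = dirs[j] if j < n else "R"
            fill = "L" if before == "L" and after == "L" else "R"
            for k in range(i, j):
                resolved[k] = fill
            i = j
        else:
            i += 1

    # Step 3: reverse the whole line, then restore each left-to-right run
    # to its original internal order.
    out, run = [], []
    for ch, d in zip(reversed(stored), reversed(resolved)):
        if d == "L":
            run.append(ch)
        else:
            out.extend(reversed(run))
            run = []
            out.append(ch)
    out.extend(reversed(run))
    return "".join(out)

# Stored (processing) order: three Hebrew letters, a number, three Hebrew letters.
stored = "\u05d0\u05d1\u05d2 123 \u05d3\u05d4\u05d5"
presented = visual_order(stored)
```

In the result, the Hebrew letters appear in reverse of their stored order, while the digits 123 keep the order in which they were stored.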

Aspects of Bi-directional Language Writing Systems

This section discusses aspects of bi-directional text related to directionality, shaping and national numbers, as well as keyboard input and compliance with common user access guidelines. The text attributes described here also pertain to some degree to other complex-text languages such as the languages of Asia (for example, Thai, Lao, Korean).

Bi-directionality

In the context of bi-directionality, the following are key concepts:
Segments
Global orientation
Text-types and associated reordering methods
Symmetrical swapping.
These attributes are described below.

Segments

A bi-directional text may consist of a main part that has one directionality (for example, an Arabic text written from right to left), and portions that have an opposite directionality (for example, an English address written from left to right). The portion of text with a different directionality is called a segment. A bi-directional text thus might have a body of right-to-left text with embedded left-to-right segments. Sometimes a segment with one directionality might itself have embedded or nested within it an additional segment with an opposite directionality. It is conceptually possible to have many levels of nesting; in most cases, however, there are no more than two levels.
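As a rough illustration of segments, the following Python sketch splits stored text into runs of uniform directionality. It is a simplification under stated assumptions: the helper names direction and split_segments are hypothetical, the base direction is assumed to be right to left, and only one level of embedding is detected, since deeper nesting cannot be inferred from the characters alone.

```python
import unicodedata
from itertools import groupby

def direction(ch: str):
    """Classify a character as "RTL", "LTR" or neutral (None)."""
    cls = unicodedata.bidirectional(ch)
    if cls in ("R", "AL"):
        return "RTL"
    if cls in ("L", "EN", "AN"):
        return "LTR"
    return None

def split_segments(text: str, base: str = "RTL"):
    """Split stored text into (direction, substring) segments."""
    dirs = [direction(ch) for ch in text]
    # A neutral character (space, punctuation) joins its neighbours when
    # both sides agree; otherwise it takes the base direction.
    for i, d in enumerate(dirs):
        if d is None:
            prev = next((x for x in reversed(dirs[:i]) if x), base)
            nxt = next((x for x in dirs[i + 1:] if x), base)
            dirs[i] = prev if prev == nxt else base
    # Group consecutive characters that share a direction.
    return [
        (d, "".join(ch for _, ch in group))
        for d, group in groupby(zip(dirs, text), key=lambda pair: pair[0])
    ]

# Right-to-left text with an embedded left-to-right address.
sample = "\u05d0\u05d1 10 Main St \u05d2\u05d3"
segments = split_segments(sample)
```

Here the address "10 Main St" is reported as a single left-to-right segment embedded in the right-to-left body.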

One level of nesting is necessary for the entry of numbers within Arabic or Hebrew text. To simulate bi-directional scripts in the following examples, Hebrew and Arabic text is