Annex B. Localization: Technical Aspects (Unicode, Planes)

This annex discusses the technical details in more depth. The aim is to give implementers the information they need to start localization work; it is not intended to be a hands-on cookbook.

Unicode

As a universal character set intended to include all the characters of the world, Unicode originally assigned code points as 16-bit integers, which allows up to 65,536 characters to be encoded. However, owing to the huge set of CJK (Chinese, Japanese and Korean) characters, this became insufficient, and by Unicode 3.0 the code space had been extended to 21 bits, which supports up to 1,114,112 code points.

Planes

A Unicode code point is a numeric value between 0 and 10FFFF (hexadecimal). The code space is divided into planes of 64K (65,536) code points each. In Unicode 4.0, the planes with allocated characters are Planes 0, 1, 2 and 14.

Plane 0, ranging from 0000 to FFFF, is called the Basic Multilingual Plane (BMP). It is the set of characters assigned by the previous 16-bit scheme.

Plane 1, ranging from 10000 to 1FFFF and called the Supplementary Multilingual Plane (SMP), is dedicated to lesser-used historic scripts, special-purpose invented scripts and special notations. These include Gothic, Shavian and musical symbols. Many more historic scripts may be encoded in this plane in the future.

Plane 2, ranging from 20000 to 2FFFF and called Supplementary Ideographic Plane (SIP), is the spillover allocation area for those CJK characters that cannot fit into the blocks for common CJK characters in the BMP. Plane 14, ranging from E0000 to EFFFF and called Supplementary Special-purpose Plane (SSP), is for some control characters that do not fit into the small areas allocated in the BMP.

Two more planes, Plane 15 and Plane 16, are reserved for private use. The standard assigns no characters in them.
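Because every plane holds exactly 10000 (hexadecimal) code points, the plane of a code point can be read off its upper bits. The following is a minimal Python sketch of that arithmetic; the function name plane_of and the sample code points are chosen here for illustration only.

```python
def plane_of(code_point: int) -> int:
    """Return the plane number (0 to 16) that contains the code point."""
    if not 0 <= code_point <= 0x10FFFF:
        raise ValueError("outside the Unicode code space")
    # Each plane holds 0x10000 code points, so the plane is the value
    # of the bits above the low 16.
    return code_point >> 16


# Sample code points: Latin A (BMP), Gothic AHSA (SMP),
# the first CJK Extension B ideograph (SIP), a tag character (SSP).
for cp in (0x0041, 0x10330, 0x20000, 0xE0001):
    print(f"U+{cp:04X} is in Plane {plane_of(cp)}")
```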

Basic Multilingual Plane

The Basic Multilingual Plane (BMP), or Plane 0, is the plane most commonly used in general documents. Code points are allocated for common characters in contemporary scripts with exactly the same set as ISO/IEC 10646-1, as summarized in Figure 2 (below). Note that the code points from E000 to F8FF are reserved for vendors' private use. No character is assigned in this area.

Character Encoding

There are several ways of encoding Unicode strings for information interchange. One may simply represent each character as a fixed-size integer (a wide character); ISO/IEC 10646 defines these forms as UCS-2 and UCS-4, which use 2-byte and 4-byte integers respectively [6], with UCS-2 covering the BMP only. The common practice, however, is to use one of the transformation formats UTF-8, UTF-16 and UTF-32, which are built from 8-bit, 16-bit and 32-bit code units respectively [7]. UTF-8 and UTF-16 are variable-length, while UTF-32 is fixed-width. There is also UTF-7 for e-mail transports that are strictly 7-bit, but UTF-8 is safe in most cases.
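To make the difference between the three forms concrete, the sketch below counts how many code units one short string needs in each of them, using Python's built-in codecs. The helper name code_units and the sample string are assumptions made for this illustration; the "-le" codec variants are used only so that no byte order mark is added to the count.

```python
def code_units(text: str, encoding: str, unit_bytes: int) -> int:
    """Number of code units of the given width that the encoded text occupies."""
    return len(text.encode(encoding)) // unit_bytes


# Three characters: Latin A (U+0041), Thai KO KAI (U+0E01), Gothic AHSA (U+10330)
sample = "A\u0E01\U00010330"

print("UTF-8 :", code_units(sample, "utf-8", 1), "code units of 8 bits")
print("UTF-16:", code_units(sample, "utf-16-le", 2), "code units of 16 bits")
print("UTF-32:", code_units(sample, "utf-32-le", 4), "code units of 32 bits")
```

Only UTF-32 has as many code units as characters; in UTF-8 and UTF-16 the count depends on which characters the string contains.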

  • UTF-32

    UTF-32 is the simplest Unicode encoding form. Each Unicode code point is represented directly by a single 32-bit unsigned integer. It is, therefore, a fixed-width character encoding form. This makes UTF-32 an ideal form for APIs that pass single character values. However, it is inefficient in terms of storage for Unicode strings.

  • UTF-16

    UTF-16 encodes code points in the range 0000 to FFFF (i.e. the BMP) as a single 16-bit unsigned integer.

    Code points in the supplementary planes are instead represented as pairs of 16-bit unsigned integers. These pairs of code units are called surrogate pairs. The values used for the surrogate pairs are in the range D800–DFFF, which is not assigned to any character, so UTF-16 readers can easily distinguish a single code unit from a surrogate pair. The Unicode Standard [8] provides more details on surrogates.

    UTF-16 is a good choice for keeping general Unicode strings, as it is optimized for characters in the BMP, which is used in roughly 99 percent of Unicode text. For such text it consumes about half of the storage required by UTF-32.

  • UTF-8

    To meet the requirements of legacy byte-oriented, ASCII-based systems, UTF-8 is defined as a variable-width encoding form that preserves ASCII compatibility. It uses one to four 8-bit code units to represent a Unicode character, depending on the code point value. The code points between 0000 and 007F are encoded in a single byte, making any ASCII string valid UTF-8. Beyond the ASCII range, the non-ideographic characters between 0080 and 07FF are encoded with two bytes. The remaining BMP code points between 0800 and FFFF, including Indic scripts and CJK ideographs, are encoded with three bytes. Supplementary characters beyond the BMP require four bytes. The Unicode Standard [9] provides more details on UTF-8.

    UTF-8 is typically the preferred encoding form for the Internet. Its ASCII compatibility helps a great deal in migrating from old systems. UTF-8 also has the advantage of being byte-serialized and friendly to the APIs of C and other programming languages. For example, traditional string comparison by byte value still works with UTF-8, and it orders strings by code point.
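The four UTF-8 ranges described above can be sketched as a small hand-written encoder. This is an illustrative Python sketch, not a replacement for a library codec: the function name utf8_encode and the word list are invented for the example, and the result is checked against Python's built-in encoder. The last lines also illustrate the byte-wise comparison property, which gives code point order rather than a language-aware collation.

```python
def utf8_encode(cp: int) -> bytes:
    """Encode one code point as UTF-8 using the four length ranges."""
    if not 0 <= cp <= 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
        raise ValueError("not an encodable Unicode code point")
    if cp <= 0x7F:        # 0000-007F: one byte, identical to ASCII
        return bytes([cp])
    if cp <= 0x7FF:       # 0080-07FF: two bytes
        return bytes([0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)])
    if cp <= 0xFFFF:      # 0800-FFFF: three bytes
        return bytes([0xE0 | (cp >> 12),
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])
    return bytes([0xF0 | (cp >> 18),       # 10000-10FFFF: four bytes
                  0x80 | ((cp >> 12) & 0x3F),
                  0x80 | ((cp >> 6) & 0x3F),
                  0x80 | (cp & 0x3F)])


# Compare the sketch with Python's own encoder, one sample per range.
for cp in (0x0041, 0x00E9, 0x0E01, 0x10330):
    assert utf8_encode(cp) == chr(cp).encode("utf-8")
    print(f"U+{cp:04X} -> {utf8_encode(cp).hex(' ')}")

# Sorting by raw UTF-8 bytes gives the same order as sorting by code point.
words = ["zebra", "\u0E01\u0E32", "apple", "\u00E9cole", "\U00010330"]
assert sorted(words, key=lambda w: w.encode("utf-8")) == sorted(words)
```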

In short, UTF-8 is the most widely adopted encoding form of Unicode.
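For comparison with UTF-8, the surrogate pair mechanism described under UTF-16 above reduces to simple arithmetic on the code point. The Python sketch below shows that arithmetic; the function names to_surrogates and from_surrogates are hypothetical and exist only for this example.

```python
def to_surrogates(cp: int) -> tuple[int, int]:
    """Split a supplementary code point into a (high, low) surrogate pair."""
    if not 0x10000 <= cp <= 0x10FFFF:
        raise ValueError("only supplementary code points need surrogates")
    offset = cp - 0x10000             # 20-bit value
    high = 0xD800 + (offset >> 10)    # top 10 bits, range D800-DBFF
    low = 0xDC00 + (offset & 0x3FF)   # bottom 10 bits, range DC00-DFFF
    return high, low


def from_surrogates(high: int, low: int) -> int:
    """Recombine a surrogate pair into the original code point."""
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


# U+1D11E MUSICAL SYMBOL G CLEF, one of the musical symbols in Plane 1
high, low = to_surrogates(0x1D11E)
print(f"U+1D11E -> {high:04X} {low:04X}")
assert from_surrogates(high, low) == 0x1D11E
```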

Character Properties

In addition to code points, Unicode also provides a database of character properties called the Unicode Character Database (UCD) [10], which consists of a set of files describing the following properties:

  • Name.
  • General category (classification as letters, numbers, symbols, punctuation, etc.).
  • Other important general characteristics (white space, dash, ideographic, alphabetic, noncharacter, deprecated, etc.).
  • Character shaping (bidi category, shaping, mirroring, width, etc.).
  • Case (upper, lower, title, folding; both simple and full).
  • Numeric values and types (for digits).
  • Script and block.
  • Normalization properties (decompositions, decomposition type, canonical combining class, composition exclusions, etc.).
  • Age (version of the standard in which the code point was first designated).
  • Boundaries (grapheme cluster, word, line and sentence).
  • Standardized variants.

The database is useful for Unicode implementation in general. It is available at the Unicode.org Web site. The Unicode Standard [11] provides more details of the database.
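Many programming environments ship a compiled copy of part of this database. As a hedged illustration, Python's standard unicodedata module exposes several of the properties listed above. Note that it bundles whatever UCD version the Python release was built with, which is newer than Unicode 4.0, and that the helper name describe is invented for this sketch.

```python
import unicodedata


def describe(ch: str) -> dict:
    """Collect a few UCD properties of a single character."""
    return {
        "name": unicodedata.name(ch, "<unnamed>"),
        "category": unicodedata.category(ch),         # general category
        "bidi": unicodedata.bidirectional(ch),        # bidi category
        "combining": unicodedata.combining(ch),       # canonical combining class
        "width": unicodedata.east_asian_width(ch),    # East Asian width
        "mirrored": unicodedata.mirrored(ch),         # 1 if mirrored in bidi text
        "numeric": unicodedata.numeric(ch, None),     # numeric value, if any
        "decomposition": unicodedata.decomposition(ch),
    }


# Latin letter, Arabic-Indic digit, combining accent, CJK ideograph
for ch in ("A", "\u0663", "\u0301", "\u4E2D"):
    print(f"U+{ord(ch):04X}", describe(ch))

print("UCD version bundled with this Python:", unicodedata.unidata_version)
```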

Technical Reports

In addition to the code points, encoding forms and character properties, Unicode also provides some technical reports that can serve as implementation guidelines. Some of these reports have been included as annexes to the Unicode standard, and some are published individually as Technical Standards.

In Unicode 4.0, the standard annexes are:

  • UAX 9: The Bidirectional Algorithm

    Specifications for the positioning of characters flowing from right to left, such as Arabic or Hebrew.

  • UAX 11: East Asian Width

    Specifications of an informative property of Unicode characters that is useful when interoperating with legacy East Asian character sets.

  • UAX 14: Line Breaking Properties

    Specification of line breaking properties for Unicode characters as well as a model algorithm for determining line break opportunities.

  • UAX 15: Unicode Normalization Forms

    Specifications for four normalized forms of Unicode text. With these forms, equivalent text (canonical or compatibility) will have identical binary representations. When implementations keep strings in a normalized form, they can be assured that equivalent strings have a unique binary representation.

  • UAX 24: Script Names

    Assignment of script names to all Unicode code points. This information is useful in mechanisms such as regular expressions, where it produces much better results than simple matches on block names.

  • UAX 29: Text Boundaries

    Guidelines for determining default boundaries between certain significant text elements: grapheme clusters ("user characters"), words and sentences.
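As an illustration of the normalization forms specified in UAX 15 above, the following Python sketch compares strings that look the same but are stored differently. It relies on the standard unicodedata.normalize function; the helper name equivalent and the sample strings are assumptions made for this example.

```python
import unicodedata


def equivalent(a: str, b: str, form: str = "NFC") -> bool:
    """True if the two strings are identical after normalization to `form`."""
    return unicodedata.normalize(form, a) == unicodedata.normalize(form, b)


composed = "\u00E9"       # LATIN SMALL LETTER E WITH ACUTE, one code point
decomposed = "e\u0301"    # letter e followed by COMBINING ACUTE ACCENT

print(composed == decomposed)              # raw comparison of code points
print(equivalent(composed, decomposed))    # canonical equivalence, via NFC

# Compatibility equivalence: the "fi" ligature matches "fi" only under
# the compatibility forms (NFKC, NFKD), not under the canonical ones.
ligature = "\uFB01"
for form in ("NFC", "NFD", "NFKC", "NFKD"):
    print(form, equivalent(ligature, "fi", form))
```

Keeping stored strings in a single form, for example NFC, means that a plain binary comparison is enough to detect canonically equivalent text.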

The individual technical standards are:

  • UTS 6: A Standard Compression Scheme for Unicode

    Specifications of a compression scheme for Unicode and sample implementation.

  • UTS 10: Unicode Collation Algorithm

    Specifications for how to compare two Unicode strings while conforming to the requirements of the Unicode Standard. The UCA also supplies the Default Unicode Collation Element Table (DUCET) as the data specifying the default collation order for all Unicode characters.

  • UTS 18: Unicode Regular Expression Guidelines

    Guidelines on how to adapt regular expression engines to use Unicode.

All Unicode Technical Reports are accessible from the Unicode.org web site [12].

[6] UCS is the acronym for Universal Multiple-Octet Coded Character Set.

[7] UTF is the acronym for Unicode (UCS) Transformation Format.

[8] The Unicode Consortium, The Unicode Standard, Version 4.0, pp. 76–77.

[9] The Unicode Consortium, The Unicode Standard, Version 4.0, pp. 77–78.

[10] Ibid., pp. 95–104.

[11] Unicode.org, "Unicode Technical Reports"; available from www.unicode.org/reports/index.html.

[12] Unicode.org, "Unicode Technical Reports"; available from www.unicode.org/reports/index.html.

