As mentioned earlier, the GNU C library is internationalized according to POSIX and ISO/IEC 14652. Both locale specifications are discussed in this section.
Locale Naming
A locale is described by its language, country and character set. The naming convention given in the OpenI18N guideline [26] is:
lang_territory.codeset[@modifiers]
where
- lang is a two-letter language code defined in ISO 639:1988. Three-letter codes from ISO 639-2 are also allowed when no two-letter code exists. The ISO 639-2 Registration Authority at the Library of Congress [27] has a complete list of language codes.
- territory is a two-letter country code defined in ISO 3166 [28].
- codeset describes the character set used in the locale.
- modifiers add more information for the locale by setting options (turn on flags or use equal sign to set values). Options are separated by commas. This part is optional and implementation-dependent. Different I18N frameworks provide different options.
For example:
- fr_CA.ISO-8859-1 = French language in Canada using the ISO-8859-1 character set.
- th_TH.TIS-620 = Thai language in Thailand using TIS-620 encoding.
If territory or codeset is omitted, default values are usually resolved by means of locale aliasing.
Note that for the GNU/Linux desktop, the modifiers part is not supported yet. Locale modifiers for the X Window System are set through the XMODIFIERS environment variable instead.
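To make the naming convention concrete, here is a minimal sketch that splits a locale name into its four parts. The struct `locale_parts` and the functions `copy_part()` and `parse_locale_name()` are hypothetical helpers written for this illustration; they are not part of glibc, and the fixed buffer sizes are an assumption.

```c
#include <string.h>

/* Hypothetical container for the parts of a locale name. */
struct locale_parts {
    char lang[16];
    char territory[16];
    char codeset[32];
    char modifiers[32];
};

/* Copy len bytes of src into dst, truncating to fit, and terminate dst. */
static void copy_part(char *dst, size_t dstsize, const char *src, size_t len)
{
    if (len >= dstsize)
        len = dstsize - 1;
    memcpy(dst, src, len);
    dst[len] = '\0';
}

/* Split "lang_territory.codeset@modifiers"; absent parts become "". */
void parse_locale_name(const char *name, struct locale_parts *out)
{
    size_t total = strlen(name);
    size_t us  = strcspn(name, "_.@");  /* end of lang */
    size_t dot = strcspn(name, ".@");   /* end of territory */
    size_t at  = strcspn(name, "@");    /* end of codeset */

    copy_part(out->lang, sizeof out->lang, name, us);

    if (name[us] == '_')
        copy_part(out->territory, sizeof out->territory,
                  name + us + 1, dot - us - 1);
    else
        out->territory[0] = '\0';

    if (name[dot] == '.')
        copy_part(out->codeset, sizeof out->codeset,
                  name + dot + 1, at - dot - 1);
    else
        out->codeset[0] = '\0';

    if (at < total)
        copy_part(out->modifiers, sizeof out->modifiers,
                  name + at + 1, total - at - 1);
    else
        out->modifiers[0] = '\0';
}
```

Calling `parse_locale_name("th_TH.TIS-620", &p)` would fill `p.lang` with `th`, `p.territory` with `TH` and `p.codeset` with `TIS-620`, leaving `p.modifiers` empty.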
Character Sets
The character set is part of the locale definition. It defines the repertoire of characters as well as how each one is encoded for information interchange. In the GNU C library (glibc), locales are described in terms of Unicode.
A new character set is described as a Unicode subset, with each element mapped to the byte string that encodes it in the target character set. For example, the UTF-8 encoding is described like this:
    ...
    <U0041> /x41 LATIN CAPITAL LETTER A
    <U0042> /x42 LATIN CAPITAL LETTER B
    <U0043> /x43 LATIN CAPITAL LETTER C
    ...
    <U0E01> /xe0/xb8/x81 THAI CHARACTER KO KAI
    <U0E02> /xe0/xb8/x82 THAI CHARACTER KHO KHAI
    <U0E03> /xe0/xb8/x83 THAI CHARACTER KHO KHUAT
    ...
The first column is the Unicode value, the second is the encoded byte string, and the rest of the line is a comment.
As another example, the TIS-620 encoding for Thai is a simple 8-bit, single-byte encoding. The first half of the code table is the same as ASCII, and the second half encodes the Thai characters, starting with the first character at 0xA1. Therefore, the character map looks like:
    ...
    <U0041> /x41 LATIN CAPITAL LETTER A
    <U0042> /x42 LATIN CAPITAL LETTER B
    <U0043> /x43 LATIN CAPITAL LETTER C
    ...
    <U0E01> /xa1 THAI CHARACTER KO KAI
    <U0E02> /xa2 THAI CHARACTER KHO KHAI
    <U0E03> /xa3 THAI CHARACTER KHO KHUAT
    ...
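The two character maps can be reproduced in code. The sketch below encodes a Unicode value as UTF-8 and as TIS-620; `encode_utf8_bmp()` and `encode_tis620()` are hypothetical names, and the Thai ranges used (U+0E01 to U+0E3A and U+0E3F to U+0E5B) are an assumption based on the standard TIS-620 layout rather than on the charmap excerpt above.

```c
#include <stddef.h>

/* Encode a Basic Multilingual Plane code point as UTF-8.
   Returns the number of bytes written (1 to 3), or 0 if unsupported. */
size_t encode_utf8_bmp(unsigned long cp, unsigned char out[3])
{
    if (cp < 0x80) {
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800) {
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp < 0x10000) {
        if (cp >= 0xD800 && cp <= 0xDFFF)
            return 0;                       /* surrogates are not characters */
        out[0] = (unsigned char)(0xE0 | (cp >> 12));
        out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    }
    return 0;                               /* outside the BMP: not handled here */
}

/* Encode a code point as one TIS-620 byte.
   Returns 1 on success, 0 if the character is not in TIS-620. */
int encode_tis620(unsigned long cp, unsigned char *out)
{
    if (cp < 0x80) {                        /* first half: same as ASCII */
        *out = (unsigned char)cp;
        return 1;
    }
    if ((cp >= 0x0E01 && cp <= 0x0E3A) || (cp >= 0x0E3F && cp <= 0x0E5B)) {
        *out = (unsigned char)(cp - 0x0E00 + 0xA0);   /* U+0E01 -> 0xA1 */
        return 1;
    }
    return 0;
}
```

For U+0E01 (THAI CHARACTER KO KAI) this yields the bytes e0 b8 81 in UTF-8 and the single byte a1 in TIS-620, matching the charmap lines.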
POSIX Locales
According to POSIX, standard C library functions are internationalized according to the following categories:
| Category | Description |
|---|---|
| LC_CTYPE | character classification |
| LC_COLLATE | string collation |
| LC_TIME | date and time format |
| LC_NUMERIC | number format |
| LC_MONETARY | currency format |
| LC_MESSAGES | messages in locale language |
Setting Locale
A C application can set the current locale with the setlocale() function (declared in locale.h). The first argument indicates the category to be set; alternatively, LC_ALL sets all categories. The second argument is the name of the locale to be chosen; an empty string ("") means the locale is taken from the system environment settings.
Therefore, the program initialization of a typical internationalized C program may appear as follows:
    #include <locale.h>
    ...
    const char *prev_locale;
    prev_locale = setlocale (LC_ALL, "");
and the environment variables are looked up to determine the appropriate locale as follows:
- If LC_ALL is defined, it shall be used as the locale name.
- Otherwise, if the variable named after a category (LC_CTYPE, LC_COLLATE, LC_MESSAGES and so on) is defined, it shall be used as the locale name for that category.
- For categories still undefined by the above checks, if LANG is defined, it is used as the locale name.
- For categories still undefined by the above checks, the "C" (or "POSIX") locale shall be used.
The "C" or "POSIX" locale is a dummy locale in which all behaviours are C defaults (e.g. ASCII sort for LC_COLLATE).
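The lookup order can be sketched as a small pure function. `resolve_locale()` and `is_set()` are hypothetical helpers for illustration only; setlocale() performs this resolution internally. The sketch assumes that an empty value counts as undefined, which is how POSIX treats these variables.

```c
#include <stddef.h>

/* NULL and "" both count as "not defined". */
static int is_set(const char *value)
{
    return value != NULL && value[0] != '\0';
}

/* Pick the locale name for one category, given the values of LC_ALL,
   the category's own variable (e.g. LC_TIME) and LANG. */
const char *resolve_locale(const char *lc_all,
                           const char *lc_category,
                           const char *lang)
{
    if (is_set(lc_all))
        return lc_all;          /* LC_ALL overrides everything */
    if (is_set(lc_category))
        return lc_category;     /* then the per-category variable */
    if (is_set(lang))
        return lang;            /* then LANG */
    return "C";                 /* finally the default locale */
}
```

A caller could pass in the real environment, for example `resolve_locale(getenv("LC_ALL"), getenv("LC_TIME"), getenv("LANG"))` with stdlib.h included.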
LC_CTYPE
LC_CTYPE defines character classification for functions declared in ctype.h:
- iscntrl()
- isgraph()
- isprint()
- isspace()
- ispunct()
- isalnum()
- isalpha()
- isdigit()
- isxdigit()
- islower()
- isupper()
- tolower()
- toupper()
Since glibc is Unicode-based, and all character sets are defined as Unicode subsets, it makes no sense to redefine character properties in each locale. Typically, the LC_CTYPE category in most locale definitions refers to the default definition (called "i18n").
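As a brief illustration of the classification functions, the sketch below counts alphabetic bytes and upper-cases a string. `count_alpha()` and `upcase_in_place()` are hypothetical names. The results depend on the LC_CTYPE setting in effect; the cast to unsigned char is needed because these functions are undefined for negative char values, which matters for 8-bit character sets such as TIS-620.

```c
#include <ctype.h>
#include <stddef.h>

/* Count the bytes that the current LC_CTYPE classifies as alphabetic. */
size_t count_alpha(const char *s)
{
    size_t n = 0;

    for (; *s != '\0'; s++)
        if (isalpha((unsigned char)*s))
            n++;
    return n;
}

/* Convert a string to upper case according to the current LC_CTYPE. */
void upcase_in_place(char *s)
{
    for (; *s != '\0'; s++)
        *s = (char)toupper((unsigned char)*s);
}
```

In the "C" locale only the ASCII letters are alphabetic, so `count_alpha("ab1 c")` returns 3.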
LC_COLLATE
C functions that are affected by LC_COLLATE are strcoll() and strxfrm().
- strcoll() compares two strings in a similar manner to strcmp() but in a locale-dependent way. Note that the behaviour of strcmp() never changes under different locales.
- strxfrm() transforms a string into a form that can be compared using plain strcmp(), giving the same result as comparing the original strings directly with strcoll().
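The relationship between the two functions can be sketched as follows. `compare_via_strxfrm()` and `sign()` are hypothetical helpers; the sketch transforms both strings and compares the resulting keys with strcmp(), which should order them the same way strcoll() does under the current LC_COLLATE.

```c
#include <stdlib.h>
#include <string.h>

/* Reduce a comparison result to -1, 0 or 1. */
static int sign(int v)
{
    return (v > 0) - (v < 0);
}

/* Compare a and b through their strxfrm() keys.
   Falls back to strcoll() if memory cannot be allocated. */
int compare_via_strxfrm(const char *a, const char *b)
{
    /* With a size of 0, strxfrm() only reports the key length. */
    size_t na = strxfrm(NULL, a, 0) + 1;
    size_t nb = strxfrm(NULL, b, 0) + 1;
    char *ka = malloc(na);
    char *kb = malloc(nb);
    int result;

    if (ka == NULL || kb == NULL) {
        result = sign(strcoll(a, b));
    } else {
        strxfrm(ka, a, na);
        strxfrm(kb, b, nb);
        result = sign(strcmp(ka, kb));
    }
    free(ka);
    free(kb);
    return result;
}
```

Transforming once and comparing many times is the usual reason to prefer strxfrm(), for example when sorting a large list, since each strcoll() call repeats the locale-dependent work.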
The LC_COLLATE specification is the most complicated of all locale categories. There is a separate standard for collating Unicode strings, called ISO/IEC 14651 International String Ordering [29]. The glibc default locale definition is based on this standard. Locale developers may consider investigating the Common Tailorable Template (CTT) defined there before beginning their own locale definition.
In the CTT, collation is done through multiple passes. Character weights are defined in multiple levels (four levels for ISO/IEC 14651). Some characters can be ignored (by using "IGNORE" as the weight) in the first passes and brought into consideration in later passes for finer adjustment. Please see the ISO/IEC 14651 document for more details.
LC_TIME
LC_TIME allows localization of date/time strings formatted by the strftime() function. Days of the week and months can be translated into the locale language, and appropriate date and time formats can be defined.
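A minimal sketch of locale-dependent date formatting follows. `format_date_in()` is a hypothetical helper; it switches LC_TIME to the named locale (changing the process-wide setting) and formats the weekday and month names with strftime(). The format string is just one possible choice.

```c
#include <locale.h>
#include <time.h>

/* Format *tm as "weekday, day month year" using the named locale.
   Returns the number of bytes written, or 0 if the locale is not
   available or the buffer is too small. */
size_t format_date_in(const char *locale_name, const struct tm *tm,
                      char *buf, size_t size)
{
    if (setlocale(LC_TIME, locale_name) == NULL)
        return 0;                       /* locale not installed */
    return strftime(buf, size, "%A, %d %B %Y", tm);
}
```

With the "C" locale the names come out in English; with a locale such as th_TH, if it is installed, the same call would produce the translated day and month names.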
LC_NUMERIC & LC_MONETARY
Each culture uses different conventions for writing numbers, namely the decimal point, the thousands separator and digit grouping. This is covered by LC_NUMERIC.
LC_MONETARY defines the currency symbols used in the locale as per ISO 4217, as well as the format in which monetary amounts are written. A single function, localeconv() in locale.h, is defined for retrieving information from both locale categories. Glibc provides an extra function, strfmon() in monetary.h, for formatting monetary amounts as per LC_MONETARY, but this is not a standard C function.
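The following sketch reads a few fields from the structure returned by localeconv(). `describe_conventions()` is a hypothetical helper, and the three fields shown are only a sample of what struct lconv contains.

```c
#include <locale.h>
#include <stdio.h>

/* Write a short summary of the current numeric and monetary
   conventions into buf; returns the snprintf() result. */
int describe_conventions(char *buf, size_t size)
{
    const struct lconv *lc = localeconv();

    return snprintf(buf, size,
                    "decimal_point=\"%s\" thousands_sep=\"%s\" "
                    "int_curr_symbol=\"%s\"",
                    lc->decimal_point,      /* from LC_NUMERIC */
                    lc->thousands_sep,      /* from LC_NUMERIC */
                    lc->int_curr_symbol);   /* from LC_MONETARY */
}
```

In the "C" locale the decimal point is "." and the other two fields are empty strings; after `setlocale(LC_ALL, "")` the values reflect the user's environment.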
LC_MESSAGES
LC_MESSAGES is mostly used for message translation purposes. The only use in POSIX locale is the description of a yes/no answer for the locale.
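The yes/no description can be used from C roughly as sketched below. `is_affirmative()` is a hypothetical helper; it fetches the locale's affirmative-answer pattern with nl_langinfo(YESEXPR) and matches the user's input against it with the POSIX regex functions.

```c
#include <langinfo.h>
#include <regex.h>
#include <stddef.h>

/* Returns 1 if answer matches the locale's "yes" pattern,
   0 if it does not, and -1 if the pattern cannot be compiled. */
int is_affirmative(const char *answer)
{
    regex_t re;
    int rc;

    /* YESEXPR is an extended regular expression. */
    if (regcomp(&re, nl_langinfo(YESEXPR), REG_EXTENDED | REG_NOSUB) != 0)
        return -1;
    rc = regexec(&re, answer, 0, NULL, 0);
    regfree(&re);
    return rc == 0;
}
```

In the "C" locale the pattern accepts answers beginning with y or Y; other locales may accept the local word for "yes" as well.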
ISO/IEC 14652
The ISO/IEC 14652 Specification method for cultural conventions [30] is basically an extended POSIX locale specification. In addition to the six POSIX categories, it introduces six more:
| Category | Description |
|---|---|
| LC_PAPER | paper size |
| LC_NAME | personal name format |
| LC_ADDRESS | address format |
| LC_TELEPHONE | telephone number |
| LC_MEASUREMENT | measurement units |
| LC_IDENTIFICATION | locale identification and version |
All of these categories are already supported by glibc. C applications can retrieve all locale information using the nl_langinfo() function.
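A minimal sketch of querying locale data with nl_langinfo() follows. `copy_langinfo()` is a hypothetical helper; it copies the result at once because the returned pointer may refer to storage that later calls overwrite. Only standard langinfo.h items are used here; the glibc-specific items for the additional categories are not shown.

```c
#include <langinfo.h>
#include <stdio.h>

/* Copy one locale item into buf; returns the snprintf() result. */
int copy_langinfo(nl_item item, char *buf, size_t size)
{
    return snprintf(buf, size, "%s", nl_langinfo(item));
}
```

For example, `copy_langinfo(CODESET, buf, sizeof buf)` retrieves the name of the current character set, `DAY_1` the name of the first day of the week (Sunday), and `RADIXCHAR` the decimal point character.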
Building Locales
To build a locale, a locale definition file describing data for the ISO/IEC 14652 locale categories must be prepared (see the standard document for the file format). In addition, when defining a new character set, a charmap file must be created for it; this gives every character a symbolic name and describes its encoded byte string.
In general, glibc uses UCS symbolic names (<Uxxxx>) in locale definitions, for convenience in generating locale data for any charmap. The actual locale data used by C programs is in binary form. The locale definition must be compiled with the localedef command, which accepts arguments like this:
    localedef [-f <charmap>] [-i <input>] <name>
For example, to build the th_TH locale from the locale definition file th_TH using the TIS-620 charmap:
    # localedef -f TIS-620 -i th_TH th_TH
The charmap file may be installed in the /usr/share/i18n/charmaps directory, and the locale definition file in the /usr/share/i18n/locales directory, for further reference.
The locale command can be used with the "-a" option to list all installed locales and the "-m" option to list supported charmaps. Issuing the command without arguments shows the locale categories selected by the environment settings.
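A program can also check at run time whether a locale is available. The sketch below is one possible approach and not part of the localedef or locale tools: the hypothetical helper `locale_is_available()` tries to select the locale with setlocale() and then restores the previous setting.

```c
#include <locale.h>
#include <stdlib.h>
#include <string.h>

/* Returns 1 if setlocale() accepts the named locale, 0 otherwise.
   The locale in effect before the call is restored afterwards. */
int locale_is_available(const char *name)
{
    const char *current = setlocale(LC_ALL, NULL);  /* query only */
    char *saved;
    int ok;

    if (current == NULL)
        return 0;
    /* Copy the name: the returned string may be overwritten
       by the next setlocale() call. */
    saved = malloc(strlen(current) + 1);
    if (saved == NULL)
        return 0;
    strcpy(saved, current);

    ok = setlocale(LC_ALL, name) != NULL;
    setlocale(LC_ALL, saved);
    free(saved);
    return ok;
}
```

After building th_TH with localedef, `locale_is_available("th_TH")` would be expected to return 1 on that system.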
[26] Library of Congress, ISO 639-2 Registration Authority; available from lcweb.loc.gov/standards/iso639-2.
[27] ISO, ISO 3166 Maintenance agency (ISO 3166/MA) - ISO's focal point for country codes; available from www.iso.org/iso/en/prodsservices/iso3166ma/index.html.
[28] ISO/IEC, ISO/IEC JTC1/SC22/WG20 - Internationalization; available from anubis.dkuug.dk/jtc1/sc22/wg20.
[29] ISO/IEC, ISO/IEC JTC1/SC22/WG20 - Internationalization; available from anubis.dkuug.dk/jtc1/sc22/wg20.
[30] Ibid.
