This section introduces a few terms and explains a few concepts to help understand the character processing portions of this
document.
13.10.1.1 Character Set
A finite set of different characters used for the representation, organization, or control of data. In this specification,
the term “character set” is used without any relationship to code representation or associated encoding. Examples of character
sets are the English alphabet, Kanji or sets of ideographic characters, corporate character sets (commonly used in Japan),
and the characters needed to write certain European languages.
13.10.1.2 Coded Character Set, or Code Set
A set of unambiguous rules that establishes a character set and the one-to-one relationship between each character of the
set and its bit representation or numeric value. In this specification, the term “code set” is used as an abbreviation for
the term “coded character set.” Examples include ASCII, ISO 8859-1, JIS X0208 (which includes Roman characters, Japanese hiragana,
Greek characters, Japanese kanji, etc.) and Unicode.
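As a minimal illustration of the character-to-value mapping that a coded character set establishes, the following Python sketch looks up the byte values that three code sets assign to a character. The codec names (`ascii`, `latin-1`, `cp037`) are Python's own and the helper `code_values` is hypothetical; neither is defined by this specification.

```python
# A coded character set fixes the numeric value assigned to each character.
# The same letter can therefore have different values in different code sets.

def code_values(char, code_set):
    """Return the byte values that `code_set` assigns to `char`."""
    return list(char.encode(code_set))

E_ACUTE = "\u00e9"  # LATIN SMALL LETTER E WITH ACUTE

ascii_a = code_values("A", "ascii")          # ASCII
latin1_e = code_values(E_ACUTE, "latin-1")   # ISO 8859-1
ebcdic_a = code_values("A", "cp037")         # one EBCDIC code page

print([hex(v) for v in ascii_a], [hex(v) for v in latin1_e], [hex(v) for v in ebcdic_a])

# The relationship is one-to-one within a code set: the value maps back
# to the same character.
restored = bytes(latin1_e).decode("latin-1")
```

The letter “A” has the value 0x41 in ASCII but 0xC1 in the EBCDIC code page, which is why a code set must be agreed upon before bytes can be interpreted as characters.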
13.10.1.3 Code Set Classifications
Some language environments distinguish between byte-oriented and “wide characters.” The byte-oriented characters are encoded
in one or more 8-bit bytes. A typical single-byte encoding is ASCII as used for western European languages like English. A
typical multi-byte encoding which uses from one to three 8-bit bytes for each character is eucJP (Extended UNIX Code - Japan,
packed format) as used for Japanese workstations.
Wide characters are a fixed 16 or 32 bits long, and are used for languages like Chinese, Japanese, etc., where the number
of combinations offered by 8 bits is insufficient and a fixed-width encoding is needed. A typical example is Unicode (a “universal”
character set defined by the Unicode Consortium, which uses an encoding scheme identical to ISO 10646 UCS-2, or 2-byte
Universal Character Set encoding). An extended encoding scheme for Unicode characters is UTF-16 (UCS Transformation Format,
16-bit representations).
The C language has data types char for byte-oriented characters and wchar_t for wide characters. The language definition for
C states that the sizes for these characters are implementation-dependent. Some environments do not distinguish between byte-oriented
and wide characters (e.g., Ada and Smalltalk). Here again, the size of a character is implementation-dependent. The following
table illustrates code set classifications as used in this document.
Table 13-3 Code Set Classification
Orientation | Code Element Encoding | Code Set Examples | C Data Type |
byte-oriented | single-byte | ASCII, ISO 8859-1 (Latin-1), EBCDIC, ... | char |
byte-oriented | multi-byte | UTF-8, eucJP, Shift-JIS, JIS, Big5, ... | char[] |
non-byte-oriented | fixed-length | ISO 10646 UCS-2 (Unicode), ISO 10646 UCS-4, UTF-16, ... | wchar_t |
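The classification above can be observed directly by measuring how many 8-bit bytes each kind of code set uses per character. The sketch below is illustrative only: it relies on Python's codec names for the code sets, uses `utf-32-be` as a stand-in for UCS-4, and reads the platform's C type sizes through `ctypes`. The helper `encoded_length` is hypothetical.

```python
import ctypes

LATIN_A = "A"
KANJI = "\u65e5"  # a common kanji, U+65E5

def encoded_length(char, code_set):
    """Number of 8-bit bytes `code_set` uses for a single character."""
    return len(char.encode(code_set))

# Byte-oriented, single-byte: every character occupies exactly one byte.
single_byte = encoded_length(LATIN_A, "latin-1")

# Byte-oriented, multi-byte: the length varies from character to character.
# Each entry is (bytes for "A", bytes for the kanji).
multi_byte = {
    cs: (encoded_length(LATIN_A, cs), encoded_length(KANJI, cs))
    for cs in ("utf-8", "euc_jp", "shift_jis", "big5")
}

# Non-byte-oriented, fixed-length: four bytes for every character.
fixed = (encoded_length(LATIN_A, "utf-32-be"), encoded_length(KANJI, "utf-32-be"))

# Sizes of the C types in the last column; wchar_t depends on the platform.
char_size = ctypes.sizeof(ctypes.c_char)
wchar_size = ctypes.sizeof(ctypes.c_wchar)

print(single_byte, multi_byte, fixed, char_size, wchar_size)
```

The `wchar_size` value differs between platforms, which matches the statement that the sizes of these C types are implementation-dependent.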
13.10.1.4 Narrow and Wide Characters
Some language environments distinguish between “narrow” and “wide” characters. Typically the narrow characters are considered
to be 8 bits long and are used for western European languages like English, while the wide characters are 16 or 32 bits long
and are used for languages like Chinese, Japanese, etc., where the number of combinations offered by 8 bits is insufficient.
However, as noted above, there are common encoding schemes in which Asian characters are encoded using multi-byte code sets,
and it is incorrect to assume that Asian characters are always encoded as “wide” characters.
Within this specification, the general terms “narrow character” and “wide character” are only used in discussing OMG IDL.
13.10.1.5 Char Data and Wchar Data
The phrase “char data” in this specification refers to data whose IDL types have been specified as char or string. Likewise
“wchar data” refers to data whose IDL types have been specified as wchar or wstring.
13.10.1.6 Byte-Oriented Code Set
An encoding of characters where the numeric code corresponding to a character code element can occupy one or more bytes. A
byte as used in this specification is synonymous with octet, which occupies 8 bits.
13.10.1.7 Multi-Byte Character Strings
A character string represented in a byte-oriented encoding where each character can occupy one or more bytes is called a multi-byte
character string. Typically, wide characters are converted to this form from a (fixed-width) process code set before transmitting
the characters outside the process (see below about process code sets). Care must be taken to correctly process the component
bytes of a character’s multi-byte representation.
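The caution about component bytes can be made concrete with a short Python sketch. It assumes UTF-8 as the multi-byte code set and an arbitrary split point, as might occur when a string arrives in two buffers; the variable names are illustrative.

```python
import codecs

text = "\u65e5\u672c"        # two kanji
data = text.encode("utf-8")  # multi-byte string: 3 bytes per kanji, 6 in total

# Cutting at an arbitrary byte offset splits the second character.
first_chunk, second_chunk = data[:4], data[4:]

# Decoding the first chunk on its own fails: it ends in the middle of a character.
try:
    first_chunk.decode("utf-8")
    naive_ok = True
except UnicodeDecodeError:
    naive_ok = False

# An incremental decoder buffers the incomplete character until the
# remaining bytes arrive.
decoder = codecs.getincrementaldecoder("utf-8")()
part1 = decoder.decode(first_chunk)
part2 = decoder.decode(second_chunk, final=True)

print(naive_ok, part1 + part2 == text)
```

Any code that truncates, splits, or indexes a multi-byte string by byte offset needs the same care, because a byte boundary is not necessarily a character boundary.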
13.10.1.8 Non-Byte-Oriented Code Set
An encoding of characters where the numeric code corresponding to a character code element occupies a fixed 16 or 32 bits.
13.10.1.9 Char and Wchar Transmission Code Set (TCS-C and TCS-W)
These two terms refer to code sets that are used for transmission between ORBs after negotiation is completed. As the names
imply, the first one is used for char data and the second one for wchar data. Each TCS can be byte-oriented or non-byte-oriented.
13.10.1.10 Process Code Set and File Code Set
Processes generally represent international characters in an internal fixed-width format which allows for efficient representation
and manipulation. This internal format is called a “process code set.” The process code set is irrelevant outside the process,
and hence to the interoperation between CORBA clients and servers through their respective ORBs.
When a process needs to write international character information out to a file, or communicate with another process (possibly
over a network), it typically uses a different encoding called a “file code set.” In this specification, unless otherwise
indicated, all references to a program’s code set refer to the file code set, not the process code set. Even when a client
and server are located physically on the same machine, it is possible for them to use different file code sets.
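The distinction can be sketched in Python, where the in-memory `str` type plays the role of the process code set and an explicit encoding plays the role of the file code set. The helpers `to_file` and `from_file`, and the choice of UTF-8 and ISO 8859-1 as the two file code sets, are assumptions made for illustration.

```python
# In-process text: its internal representation is invisible outside the process.
message = "caf\u00e9"

def to_file(text, file_code_set):
    """Convert in-process text to bytes in the given file code set."""
    return text.encode(file_code_set)

def from_file(data, file_code_set):
    """Interpret external bytes according to the given file code set."""
    return data.decode(file_code_set)

# Two programs on the same machine may use different file code sets.
client_bytes = to_file(message, "utf-8")    # b'caf\xc3\xa9'
server_bytes = to_file(message, "latin-1")  # b'caf\xe9'

# Reading with the matching file code set restores the text.
round_trip = from_file(client_bytes, "utf-8")

# Reading with the wrong file code set silently yields different characters.
garbled = from_file(client_bytes, "latin-1")

print(client_bytes, server_bytes, round_trip == message, garbled)
```

The mismatch in the last step raises no error, which is why the code set in use has to be communicated explicitly and cannot be inferred from the bytes.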
13.10.1.11 Native Code Set
A native code set is the code set which a client or a server uses to communicate with its ORB. There might be separate native
code sets for char and wchar data.
13.10.1.12 Transmission Code Set
A transmission code set is the commonly agreed upon encoding used for character data transfer between a client’s ORB and a
server’s ORB. There are two transmission code sets established per session between a client and its server, one for char data
(TCS-C) and the other for wchar data (TCS-W). Figure 13-6 illustrates these relationships:
[Figure: native code set | ORB | transmission code sets | ORB | native code set]
Figure 13-6 Transmission Code Sets
The intent is for TCS-C to be byte-oriented and TCS-W to be non-byte-oriented. However, this specification does allow both
types of characters to be transmitted using the same transmission code set. That is, the selection of a transmission code
set is orthogonal to the wideness or narrowness of the characters, although a given code set may be better suited for either
narrow or wide characters.
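The independence of the transmission code set from character width can be sketched as follows. The `TransmissionCodeSets` class and the two encode helpers are hypothetical, the code set choices are examples only, and the sketch deliberately ignores the actual marshaling rules (length prefixes, alignment, byte order) that apply on the wire.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class TransmissionCodeSets:
    """Hypothetical holder for the two code sets agreed for one session."""
    tcs_c: str  # code set for char and string data
    tcs_w: str  # code set for wchar and wstring data

def encode_char_data(session, text):
    """Encode char data with the session's TCS-C."""
    return text.encode(session.tcs_c)

def encode_wchar_data(session, text):
    """Encode wchar data with the session's TCS-W."""
    return text.encode(session.tcs_w)

# The intended arrangement: byte-oriented TCS-C, non-byte-oriented TCS-W.
typical = TransmissionCodeSets(tcs_c="latin-1", tcs_w="utf-16-be")

# Also permitted: one code set carrying both kinds of data.
shared = TransmissionCodeSets(tcs_c="utf-8", tcs_w="utf-8")

KANJI = "\u65e5"
typical_char = encode_char_data(typical, "abc")
typical_wchar = encode_wchar_data(typical, KANJI)
shared_char = encode_char_data(shared, "abc")
shared_wchar = encode_wchar_data(shared, KANJI)

print(typical_char, typical_wchar, shared_char, shared_wchar)
```

In the shared case the wide character still travels correctly, but as a three-byte sequence; whether that is a good fit depends on the data, as the text above notes.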
13.10.1.13 Conversion Code Set (CCS)
With respect to a particular ORB’s native code set, the conversion code set is the set of other (target) code sets for which
the ORB can convert all code points or character encodings between the native code set and that target code set. For each
code set in its CCS, the ORB maintains appropriate translation or conversion procedures and advertises the ability to use
that code set for transmitted data in addition to the native code set.
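To show how native and conversion code sets relate, here is a simplified Python sketch of choosing a transmission code set from what each side advertises. The `CodeSetComponent` class, the function name, and in particular the order in which candidates are tried are assumptions of this sketch. They are not taken from this section and should not be read as the normative negotiation procedure.

```python
from dataclasses import dataclass, field

@dataclass(frozen=True)
class CodeSetComponent:
    """Hypothetical record of what one ORB advertises for one kind of data."""
    native: str
    conversion: frozenset = field(default_factory=frozenset)

def choose_transmission_code_set(client, server):
    """Pick a code set both sides can handle, or None if there is none.

    Assumed preference order: avoid conversion entirely, then convert on
    one side only, then fall back to a code set both sides can convert to.
    """
    if client.native == server.native:
        return client.native      # no conversion needed
    if server.native in client.conversion:
        return server.native      # client converts
    if client.native in server.conversion:
        return client.native      # server converts
    common = sorted(client.conversion & server.conversion)
    return common[0] if common else None  # both convert, or no match

client = CodeSetComponent("latin-1", frozenset({"utf-8"}))
server = CodeSetComponent("euc_jp", frozenset({"utf-8", "shift_jis"}))
chosen = choose_transmission_code_set(client, server)
print(chosen)
```

Here neither side can convert to the other's native code set, so the only candidate is the code set that appears in both conversion code sets.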