13.10.2 Code Set Conversion Framework

13.10.2.1 Requirements

The file code set that an application uses is often determined by the platform on which it runs. In Japan, for example, Japanese EUC is used on Unix systems, while Shift-JIS is used on PCs. Code set conversion is therefore required to enable interoperability across these platforms. This proposal defines a framework for the automatic conversion of code sets in such situations. The requirements of this framework are:

1. Backward compatibility. In previous CORBA specifications, IDL type char was limited to ISO 8859-1. The conversion framework should be compatible with existing clients and servers that use ISO 8859-1 as the code set for char.

2. Automatic code set conversion. To facilitate development of CORBA clients and servers, the ORB should perform any necessary code set conversions automatically and efficiently. The IDL type octet can be used if necessary to prevent conversions.

3. Locale support. An internationalized application determines the code set in use by examining the LOCALE string (usually found in the LANG environment variable), which may be changed dynamically at run time by the user. Example LOCALE strings are fr_FR.ISO8859-1 (French, used in France with the ISO 8859-1 code set) and ja_JP.ujis (Japanese, used in Japan with the EUC code set and X11R5 conventions for LOCALE). The conversion framework should allow applications to use the LOCALE mechanism to indicate supported code sets, and thus select the correct code set from the registry.

4. CMIR and SMIR support. The conversion framework should be flexible enough to allow conversion to be performed either on the client or server side. For example, if a client is running in a memory-constrained environment, then it is desirable for code set converters to reside in the server and for a Server Makes It Right (SMIR) conversion method to be used. On the other hand, if many servers are executed on one server machine, then converters should be placed in each client to reduce the load on the server machine. In this case, the conversion method used is Client Makes It Right (CMIR).

13.10.2.2 Overview of the Conversion Framework

Both the client and server indicate a native code set indirectly by specifying a locale. The exact method for doing this is language-specific, such as the XPG4 C/C++ function setlocale. The client and server use their native code set to communicate with their ORB. (Note that these native code sets are in general different from process code sets and hence conversions may be required at the client and server ends.)

The conversion framework is illustrated in Figure 13-7. The server-side ORB stores a server’s code set information in a component of the IOR multiple-component profile structure (see Section 13.6.2, “Interoperable Object References: IORs? on page 13-14 )1. The code sets actually used for transmission are carried in the service context field of an IOP (Inter-ORB Protocol) request header (see Section 13.7, “Service Context? on page 13-28 and Section 13.10.2.5, “GIOP Code Set Service Context? on page 13-44). Recall that there are two code sets (TCS-C and TCS-W) negotiated for each session.

1. Version 1.1 of the IIOP profile body can also be used to specify the server’s code set information, as this version introduces an extra field that is a sequence of tagged components.

Figure 13-7 Code Set Conversion Framework Overview

If the native code sets used by a client and server are the same, then no conversion is performed. If the native code sets are different and the client-side ORB has an appropriate converter, then the CMIR conversion method is used. In this case, the server’s native code set is used as the transmission code set. If the native code sets are different and the client-side ORB does not have an appropriate converter but the server-side ORB does have one, then the SMIR conversion method is used. In this case, the client’s native code set is used as the transmission code set.

The conversion framework allows clients and servers to specify a native char code set and a native wchar code set, which determine the local encodings of IDL types char and wchar, respectively. The conversion process outlined above is executed independently for the char code set and the wchar code set. In other words, the algorithm that is used to select a transmission code set is run twice, once for char data and once for wchar data.

The rationale for selecting two transmission code sets rather than one (which is typically inferred from the locale of a process) is to allow efficient data transmission without any conversions when the client and server have identical representations for char and/or wchar data. For example, when a Windows NT client talks to a Windows NT server and they both use Unicode for wide character data, it becomes possible to transmit wide character data from one to the other without any conversions. Of course, this becomes possible only for those wide character representations that are well-defined, not for any proprietary ones. If a single transmission code set was mandated, it might require unnecessary conversions. (For example, choosing Unicode as the transmission code set would force conversion of all byte-oriented character data to Unicode.)

13.10.2.3 ORB Databases and Code Set Converters

The conversion framework requires an ORB to be able to determine the native code set for a locale and to convert between code sets as necessary. While the details of exactly how these tasks are accomplished are implementation-dependent, the following databases and code set converters might be used:

• Locale database. This database defines a native code set for a process. This code set could be byte-oriented or non-byte-oriented and could be changed programmatically while the process is running. However, for a given session between a client and a server, it is fixed once the code set information is negotiated at the session’s setup time.

• Environment variables or configuration files. Since the locale database can only indicate one code set while the ORB needs to know two code sets, one for char data and one for wchar data, an implementation can use environment variables or configuration files to contain this information on native code sets.

• Converter database. This database defines, for each code set, the code sets to which it can be converted. From this database, a set of “conversion code sets? (CCS) can be determined for a client and server. For example, if a server’s native code set is eucJP, and if the server-side ORB has eucJP-to-JIS and eucJP-to-SJIS bilateral converters, then the server’s conversion code sets are JIS and SJIS.

• Code set converters. The ORB has converters which are registered in the converter database.

13.10.2.4 CodeSet Component of IOR Multi-Component Profile

The code set component of the IOR multi-component profile structure contains:

• server’s native char code set and conversion code sets, and

• server’s native wchar code set and conversion code sets.

Both char and wchar conversion code sets are listed in order of preference. The code set component is identified by the following tag:

const IOP::ComponentID TAG_CODE_SETS = 1;

This tag has been assigned by OMG (See Section 13.6.6, “Standard IOR Components? on page 13-19.). The following IDL structure defines the representation of code set information within the component:

module CONV_FRAME { // IDL typedef unsigned long CodeSetId; struct CodeSetComponent {

CodeSetId native_code_set;

sequence<CodeSetId> conversion_code_sets; }; struct CodeSetComponentInfo {

CodeSetComponent ForCharData;CodeSetComponent ForWcharData;};};

Code sets are identified by a 32-bit integer id from the OSF Character and Code Set Registry (See Section 13.10.5.1, “Character and Code Set Registry? on page 13-50 for further information). Data within the code set component is represented as a structure of type CodeSetComponentInfo, and is encoded as a CDR encapsulation. In other words, the char code set information comes first, then the wchar information, represented as structures of type CodeSetComponent.

A null value should be used in the native_code_set field if the server desires to indicate no native code set (possibly with the identification of suitable conversion code sets).

If the code set component is not present in a multi-component profile structure, then the default char code set is ISO 8859-1 for backward compatibility. However, there is no default wchar code set. If a server supports interfaces that use wide character data but does not specify the wchar code sets that it supports, client-side ORBs will raise exception INV_OBJREF, with standard minor code 1.

If a client application invokes an operation which results in an attempt by the client ORB to marshal wchar or wstring data for an in parameter (or to unmarshal wchar or wstring data for an in/out parameter, out parameter or the return value), and the associated Object Reference does not include a codeset component, then the client ORB shall raise the INV_OBJREF standard system exception with standard minor code 2 as a response to the operation invocation.

13.10.2.5 GIOP Code Set Service Context

The code set GIOP service context contains:

• char transmission code set, and

• wchar transmission code set

in the form of a code set service. This service is identified by:

const IOP::ServiceID CodeSets = 1;

The following IDL structure defines the representation of code set service information:

module CONV_FRAME { // IDL typedef unsigned long CodeSetId; struct CodeSetContext {

CodeSetId char_data;CodeSetId wchar_data;};};

Code sets are identified by a 32-bit integer id from the OSF Character and Code Set Registry (See Section 13.10.5.1, “Character and Code Set Registry? on page 13-50 for further information).

Note – A server’s char and wchar Code set components are usually different, but under some special circumstances they can be the same. That is, one could use the same code set for both char data and wchar data. Likewise the CodesetIds in the service context don’t have to be different.

13.10.2.6 Code Set Negotiation

The client-side ORB determines a server’s native and conversion code sets from the code set component in an IOR multi-component profile structure, and it determines a client’s native and conversion code sets from the locale setting (and/or environment variables/configuration files) and the converters that are available on the client. From this information, the client-side ORB chooses char and wchar transmission code sets (TCS-C and TCS-W). For both requests and replies, the char TCS-C determines the encoding of char and string data, and the wchar TCS-W determines the encoding of wchar and wstring data.

Code set negotiation is not performed on a per-request basis, but only when a client initially connects to a server. All text data communicated on a connection are encoded as defined by the TCSs selected when the connection is established.

Figure 13-8 illustrates, there are two channels for character data flowing between the client and the server. The first, TCS-C, is used for char data and the second, TCS-W, is used for wchar data. Also note that two native code sets, one for each type of data, could be used by the client and server to talk to their respective ORBs (as noted earlier, the selection of the particular native code set used at any particular point is done via setlocale or some other implementation-dependent method).

Figure 13-8 Transmission Code Set Use

Let us look at an example. Assume that the code set information for a client and server is as shown in the table below. (Note that this example concerns only char code sets and is applicable only for data described as chars in the IDL.)

	Client	Server

Native code set:	SJIS	eucJP
Conversion code sets:	eucJP, JIS	SJIS, JIS

The client-side ORB first compares the native code sets of the client and server. If they are identical, then the transmission and native code sets are the same and no conversion is required. In this example, they are different, so code set conversion is necessary. Next, the client-side ORB checks to see if the server’s native code set, eucJP, is one of the conversion code sets supported by the client. It is, so eucJP is selected as the transmission code set, with the client (i.e., its ORB) performing conversion to and from its native code set, SJIS, to eucJP. Note that the client may first have to convert all its data described as chars (and possibly wchar_ts) from process codes to SJIS first.

Now let us look at the general algorithm for determining a transmission code set and where conversions are performed. First, we introduce the following abbreviations:

• CNCS - Client Native Code Set;

• CCCS - Client Conversion Code Sets;

• SNCS - Server Native Code Set;

• SCCS - Server Conversion Code Sets; and

• TCS - Transmission Code Set.

The algorithm is as follows:

if (CNCS==SNCS) TCS = CNCS; // no conversion required else { if (elementOf(SNCS,CCCS)) TCS = SNCS; // client converts to server’s native code set else if (elementOf(CNCS,SCCS)) TCS = CNCS; // server converts from client’s native code set

else if (intersection(CCCS,SCCS) != emptySet) {

TCS = oneOf(intersection(CCCS,SCCS));

// client chooses TCS, from intersection(CCCS,SCCS), that is

// most preferable to server;

// client converts from CNCS to TCS and server

// from TCS to SNCS

else if (compatible(CNCS,SNCS)) TCS = fallbackCS; // fallbacks are UTF-8 (for char data) and // UTF-16 (for wchar data) else raise CODESET_INCOMPATIBLE exception; }

The algorithm first checks to see if the client and server native code sets are the same. If they are, then the native code set is used for transmission and no conversion is required. If the native code sets are not the same, then the conversion code sets are examined to see if

1. the client can convert from its native code set to the server’s native code set,

2. the server can convert from the client’s native code set to its native code set, or

3. transmission through an intermediate conversion code set is possible.

If the third option is selected and there is more than one possible intermediate conversion code set (i.e., the intersection of CCCS and SCCS contains more than one code set), then the one most preferable to the server is selected.2

If none of these conversions is possible, then the fallback code set (UTF-8 for char data and UTF-16 for wchar data— see below) is used. However, before selecting the fallback code set, a compatibility test is performed. This test looks at the character sets encoded by the client and server native code sets. If they are different (e.g., Korean and French), then meaningful communication between the client and server is not possible and a CODESET_INCOMPATIBLE exception is raised. This test is similar to the DCE compatibility test and is intended to catch those cases where conversion from the client native code set to the fallback, and the fallback to the server native code set would result in massive data loss. (See Section 13.10.5, “Relevant OSFM Registry Interfaces? on page 13-50 for the relevant OSF registry interfaces that could be used for determining compatibility.)

A DATA_CONVERSION exception is raised when a client or server attempts to transmit a character that does not map into the negotiated transmission code set. For example, not all characters in Taiwan Chinese map into Unicode. When an attempt is made to transmit one of these characters via Unicode, an ORB is required to raise a DATA_CONVERSION exception, with standard minor code 1.

In summary, the fallback code set is UTF-8 for char data (identified in the Registry as 0x05010001, “X/Open UTF-8; UCS Transformation Format 8 (UTF-8)"), and UTF-16 for wchar data (identified in the Registry as 0x00010109, "ISO/IEC 10646-1:1993; UTF-16, UCS Transformation Format 16-bit form"). As mentioned above the fallback code set is meaningful only when the client and server character sets are compatible, and the fallback code set is distinguished from a default code set used for backward compatibility.

If a server’s native char code set is not specified in the IOR multi-component profile, then it is considered to be ISO 8859-1 for backward compatibility. However, a server that supports interfaces that use wide character data is required to specify its native wchar code set; if one is not specified, then the client-side ORB raises exception INV_OBJREF, with standard minor code set to 1.

Similarly, if no char transmission code set is specified in the code set service context, then the char transmission code set is considered to be ISO 8859-1 for backward compatibility. If a client transmits wide character data and does not specify its wchar transmission code set in the service context, then the server-side ORB raises exception BAD_PARAM, with standard minor code set to 23.

To guarantee “out-of-the-box? interoperability, clients and servers must be able to convert between their native char code set and UTF-8 and their native wchar code set (if specified) and Unicode. Note that this does not require that all server native code sets be mappable to Unicode, but only those that are exported as native in the IOR. The server may have other native code sets that aren’t mappable to Unicode and those can

2.Recall that server conversion code sets are listed in order of preference.

be exported as SCCSs (but not SNCSs). This is done to guarantee out-of-the-box interoperability and to reduce the number of code set converters that a CORBA-compliant ORB must provide.

ORB implementations are strongly encouraged to use widely-used code sets for each regional market. For example, in the Japanese marketplace, all ORB implementations should support Japanese EUC, JIS and Shift JIS to be compatible with existing business practices.