On the Development and Deployment of Unicode Based Multilingual ...

By David Sims,2014-09-12 14:03

    Unicode and IBM WebSphere

    On the Development and Deployment of

    Unicode Based Multilingual Web Applications

    in IBM WebSphere Application Server

    Kentaro Noji Debasish Banerjee

    Globalization Center of Competency WebSphere Development

    Yamato Software Laboratory IBM Rochester

    IBM Japan, Ltd. IBM Corporation

    Abstract. With the advent and popularity of Internet-based e-commerce products, the need to develop multilingual Unicode-based applications is becoming increasingly important. The IBM WebSphere® application server is very well suited for the development and deployment of multilingual Unicode-based applications, both traditional and Web-based. The globalization mechanism embedded in the Web container of the WebSphere application server allows one to develop internationalized Servlets and JSPs to serve documents in any language and code set of choice, including Unicode-based multilingual documents. The Web container provides unique features for code set customization and fine-tuning. A system administrator can map language names to code sets of choice, including Unicode, and the IANA code set names of Asian ideographic languages can be fine-tuned to correspond to the Java™ Development Kit (JDK) converters of choice. The present paper describes some important technical considerations behind the development and deployment of multilingual Unicode-based Java™ 2 Enterprise Edition (J2EE) compliant Web applications. WebSphere's unique globalization mechanism, including the code set customization, is also explained with accompanying examples of a Servlet and a JSP for serving multilingual Unicode-based documents. The ongoing and future internationalization work in the WebSphere application server is also highlighted.

    1. Introduction

    The IBM WebSphere® Application Server, Version 4.0, provides a Java™ 2 Enterprise Edition (J2EE) 1.2 [7] compliant environment for the development and deployment of enterprise applications covering a wide variety of back-ends and front-ends. Ideally, all the business and presentation logic should use Unicode [11] for uniform and unrestricted processing and representation of characters from any language in the world. Indeed, all the Java™ based server-side business components deployed in WebSphere internally use Unicode, and Unicode is the process code set of Java. Unfortunately, not all back-ends (databases, transaction processing monitors, etc.) and front-ends (application client GUIs, browsers, etc.) use Unicode, so they may lack Unicode handling or presentation capabilities. To interface with legacy applications, WebSphere application components may also have to use native code sets.

    Internet-based eCommerce applications are becoming increasingly popular, and IBM WebSphere, Version 4.0, offers a powerful environment for hosting such applications. The users of an eCommerce application can be located in any country and can potentially use any code set, including Unicode, for communicating with the server-side business logic.

    19th International Unicode Conference 1 San Jose, California, Sept. 2001

    Clearly, a globalized server-side Web application should provide support for multiple code sets, and it should be able to receive and send data in any selected code set, including Unicode. IBM WebSphere's Web container provides a unique customizable and fine-tunable code set selection mechanism for hosting Servlets and JSPs, the two J2EE server-side Web components. The present paper describes the motivation and actual implementation behind this code set selection mechanism, along with appropriate examples.

    Section 2 illustrates a general globalized eCommerce environment. Section 3 describes the code set selection mechanism embedded inside IBM WebSphere's Web container. Section 4 contains examples illustrating the code set selection mechanism. Section 5 mentions the future globalization intentions of IBM WebSphere, and finally Section 6 presents our conclusions. A few configuration files and configuration procedures appear in the Appendices.

    2. A Globalized eCommerce Environment

    Figure 1 illustrates a typical large eCommerce deployment scenario, which may have clients and servers situated in various geographically distinct locations. A Web browser can access any Web server application program, and a server-side Web application should be able to communicate with any browser client located anywhere in the world. IBM WebSphere Application Server can naturally assume the role of servers like A, B, C or D.

    [Figure 1 diagram not reproduced. Labels recoverable from the original: clients and servers for English, French, Japanese, Korean, and French in Canada; Web App. Server A; Web App. Server B; Server C; Server D; back-end services (Database, Messaging, EJB, Web Services); protocols HTTP, XML, SMTP, JDBC, and IIOP.]

    Figure 1. A large eCommerce deployment scenario

    Servers A and C serve multilingual Web content to the requesting Web clients, while servers B and D only participate in intra-server communications, and can process and serve multilingual content to other servers. To communicate effectively and reliably in a multilingual environment, a receiver should know the code set of the incoming request. If all the server-side components are written in Java, the intra-server communication will take place in Unicode, and no special consideration is needed for code set determination. But for a server like A or C that communicates with clients, it is strictly necessary to determine the input and output code sets associated with requests and responses.

    3. Ascertaining Code Sets in IBM WebSphere

    Servlets and JSPs usually communicate with clients using the HTTP protocol [2]. This section describes how the IBM Web container (Version 4.0) attempts to determine the input and output code sets associated with HTTP-based communications between browser clients and Servlets or JSPs.

    3.1 Code set of an HTTP Request

    HTTP input data can be encoded in any valid IANA [3] code set. Inside a Servlet or a JSP, the HTTP input data is usually obtained by invoking the getParameter() family of methods available in the javax.servlet.ServletRequest interface. The entire request body can also be obtained using the java.io.BufferedReader object returned by the javax.servlet.ServletRequest.getReader() method. All of these methods return data encoded in the UCS-2 variant of Unicode (Java's internal process code set), so the Web container has to convert the input HTTP data to UCS-2. To perform a proper conversion, the Web container has to know the encoding of the input HTTP request so that it can invoke the appropriate JDK converter.
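    Why the input encoding matters can be seen in a minimal sketch that decodes the same request bytes with two different code sets. The class and method names are hypothetical and stand in for what a container does internally; only java.nio.charset from the standard library is used.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical stand-in for the container's "bytes to Java String" step.
final class RequestDecodingSketch {

    /** Decodes raw request bytes using the code set the container believes the client used. */
    static String decode(byte[] rawRequestBytes, Charset inputCodeSet) {
        return new String(rawRequestBytes, inputCodeSet);
    }

    public static void main(String[] args) {
        // U+00E9 (e with acute accent) as a UTF-8 client would send it: bytes 0xC3 0xA9.
        byte[] raw = "\u00e9".getBytes(StandardCharsets.UTF_8);

        String right = decode(raw, StandardCharsets.UTF_8);      // one character, U+00E9
        String wrong = decode(raw, StandardCharsets.ISO_8859_1); // two characters, U+00C3 U+00A9

        System.out.println("correct code set: " + right.length() + " character(s)");
        System.out.println("wrong code set:   " + wrong.length() + " character(s)");
    }
}
```

    Decoding with the wrong code set does not fail; it silently yields different characters, which is why the container needs a reliable way to choose the converter.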

    Theoretically speaking, an HTTP request may have a 'Content-Type' header optionally containing a 'charset' attribute. For example, an HTTP client can transmit the header "Content-Type: text/html; charset=ISO-8859-2" along with a GET request. The Web container can then easily convert the ISO-8859-2 encoded data to UCS-2.
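    Extracting the 'charset' attribute from such a header value can be sketched as follows. This is an illustrative parser with hypothetical names, not the WebSphere implementation, and it handles only the simple "type; name=value" form with optional double quotes.

```java
import java.util.Optional;

// Hypothetical helper: pulls the charset attribute out of a Content-Type header value.
final class ContentTypeCharset {

    static Optional<String> charsetOf(String contentType) {
        if (contentType == null) {
            return Optional.empty();
        }
        String[] parts = contentType.split(";");
        // parts[0] is the media type; attributes start at index 1.
        for (int i = 1; i < parts.length; i++) {
            String attribute = parts[i].trim();
            int eq = attribute.indexOf('=');
            if (eq > 0 && attribute.substring(0, eq).trim().equalsIgnoreCase("charset")) {
                String value = attribute.substring(eq + 1).trim();
                // Strip optional surrounding double quotes.
                if (value.length() >= 2 && value.startsWith("\"") && value.endsWith("\"")) {
                    value = value.substring(1, value.length() - 1);
                }
                return value.isEmpty() ? Optional.empty() : Optional.of(value);
            }
        }
        return Optional.empty();
    }
}
```

    For the header value in the example above, charsetOf("text/html; charset=ISO-8859-2") yields "ISO-8859-2"; a value with no 'charset' attribute yields an empty result, which is the case the rest of this section deals with.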

    Unfortunately, like all the other HTTP headers, this 'Content-Type' header is optional, and the presence of the 'charset' component in a 'Content-Type' header is optional too. In fact, neither Netscape nor Microsoft® Internet Explorer, the two most popular browsers, transmits 'Content-Type' HTTP headers containing any 'charset' attribute. The question naturally arises: In the absence of any explicit code set information in the HTTP request, how can a Web container perform an appropriate UCS-2 conversion?

    Web containers available in the market have followed various ad-hoc strategies to arrive at a value of the input code set, though some of them are arguably wrong. Some of the strategies that we have seen or have heard of are:

    - If available, use the value of the 'Accept-Charset' HTTP header as the value of the input encoding. This approach is incorrect: 'Accept-Charset' is not intended to specify the encoding of the input request.

    - Use the default JDK converter for conversion to UCS-2. This approach assumes the input code set to be identical to that of the 'file.encoding' system property of the Web container's Java™ Virtual Machine (JVM), and it may not work in multilingual environments. It may also create trouble in EBCDIC environments (System/390®).

    - Always use the ISO-8859-1 → UCS-2 converter. Obviously, this approach may not work for non-Latin1 clients.

3.2 Deciding on the Input Code Set

    If the input request does not explicitly specify the code set value using the "Content-Type" HTTP header, there is no simple but definitive way to arrive at a value of the input encoding. A Web container can only apply heuristic strategies to arrive at a reasonable value of the input code set using indirect avenues. The following sketches the heuristic strategy followed by the IBM Web container. The strategy is divided into four sequential steps. If the Web container decides on the input code set at a particular step, the succeeding steps are skipped.

    Step 3.2.1 If the 'Content-Type' HTTP header is present and contains the 'charset' attribute, the value of the 'charset' attribute is the input code set.

    Step 3.2.2 Try to determine the input code set from the locale associated with the HTTP request. The locale of the javax.servlet.http.HttpServletRequest object may be determined from the 'Accept-Language' HTTP header [2, 6, 7]. The input locale is mapped to a code set using "encoding.properties", an IBM WebSphere-provided properties file for mapping locales to IANA charsets.

    Figure 2 illustrates a sample mapping. Appendix A shows a typical 'encoding.properties' file.

    Locale Name    IANA Charset Name
    -----------    -----------------
    en             ISO-8859-1
    cs             ISO-8859-2
    ja             Shift_JIS
    ko             EUC-KR
    zh             GB2312
    zh_TW          Big5

    Figure 2. Sample mapping rules in encoding.properties

    Step 3.2.3 Look for "default.client.encoding", a Web container-specific JVM system property. If present, use that value as the input code set.

    Step 3.2.4 As the final recourse, just use ISO-8859-1 as the input code set.
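    Steps 3.2.1 to 3.2.4 can be sketched as a single fall-through function. This is an illustration under stated assumptions, not the container's code: the class and parameter names are hypothetical, the contents of "encoding.properties" are passed in as a plain map, and the fallback from a full locale name (such as zh_TW) to the bare language (zh) is an assumption about how the lookup behaves.

```java
import java.util.Locale;
import java.util.Map;

// Hypothetical sketch of the four-step input code set heuristic.
final class InputCodeSetResolver {

    static final String FINAL_FALLBACK = "ISO-8859-1";

    static String resolve(String contentTypeCharset,
                          Locale requestLocale,
                          Map<String, String> encodingProperties,
                          String defaultClientEncoding) {
        // Step 3.2.1: an explicit charset attribute on Content-Type wins.
        if (contentTypeCharset != null && !contentTypeCharset.isEmpty()) {
            return contentTypeCharset;
        }
        // Step 3.2.2: map the request locale to a code set.
        if (requestLocale != null) {
            String byFullLocale = encodingProperties.get(requestLocale.toString());
            if (byFullLocale != null) {
                return byFullLocale;
            }
            // Assumed fallback: try the language alone, e.g. ja_JP -> ja.
            String byLanguage = encodingProperties.get(requestLocale.getLanguage());
            if (byLanguage != null) {
                return byLanguage;
            }
        }
        // Step 3.2.3: the default.client.encoding property, if set.
        if (defaultClientEncoding != null && !defaultClientEncoding.isEmpty()) {
            return defaultClientEncoding;
        }
        // Step 3.2.4: final recourse.
        return FINAL_FALLBACK;
    }
}
```

    With the Figure 2 mappings, a request carrying no charset and a Japanese locale resolves to Shift_JIS, while a locale absent from the table falls through to the default client encoding or, failing that, ISO-8859-1.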

3.3 Deciding on the Output Code Set

    As with input requests, on the output side a Servlet has to convert UCS-2 encoded data to an output code set before sending it to the browser. If a Servlet or a JSP developer explicitly specifies a 'charset' attribute by invoking the javax.servlet.ServletResponse.setContentType() method, the output code set is known. In the absence of a ServletResponse.setContentType() invocation, again there is no clear way to arrive at a value for the output code set. To decide the value of the output encoding, the IBM Web container uses the following heuristic strategy. If the Web container decides on the output code set at a particular step, the succeeding steps are skipped.

    Step 3.3.1 If the Servlet or JSP developer has explicitly specified a 'charset' attribute, use the value of the attribute as the output code set.

    Step 3.3.2 If the Servlet or JSP developer has explicitly invoked the javax.servlet.ServletResponse.setLocale() API, use "encoding.properties" to map the specified locale to a code set.

    Step 3.3.3 Use ISO-8859-1 as the value of the output code set.
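    The three output steps can be sketched in the same way. The names below are hypothetical, the "encoding.properties" contents are again passed in as a map, and the language-only fallback is an assumption rather than documented behavior.

```java
import java.util.Locale;
import java.util.Map;

// Hypothetical sketch of the three-step output code set heuristic.
final class OutputCodeSetResolver {

    static String resolve(String explicitCharset,
                          Locale responseLocale,
                          Map<String, String> encodingProperties) {
        // Step 3.3.1: charset given through setContentType().
        if (explicitCharset != null && !explicitCharset.isEmpty()) {
            return explicitCharset;
        }
        // Step 3.3.2: locale given through setLocale(), mapped via encoding.properties.
        if (responseLocale != null) {
            String mapped = encodingProperties.get(responseLocale.toString());
            if (mapped == null) {
                mapped = encodingProperties.get(responseLocale.getLanguage()); // assumed fallback
            }
            if (mapped != null) {
                return mapped;
            }
        }
        // Step 3.3.3: default.
        return "ISO-8859-1";
    }

    /** Builds the Content-Type value a response would carry for the chosen code set. */
    static String contentTypeHeader(String mediaType, String codeSet) {
        return mediaType + "; charset=" + codeSet;
    }
}
```

    Unlike the input side, there is no "default.client.encoding" step here; an unmapped locale falls directly to ISO-8859-1.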

    3.4 Fine-Tuning Code Set Converters

    The code set names used in Internet protocols must be registered in the IANA charset database. For certain language environments, the official IANA charset names may have more than one JDK converter associated with them. For example, the most popular code set in Japanese PC environments is Shift-JIS, and there exist a large number of Shift-JIS converters. In fact, JDK presently supports Cp943, Cp943C, Cp942, Cp942C, SJIS, and MS932 converters. All of these converters are for UCS-2 ↔ Shift-JIS conversions. These converters are very similar but not identical. Figure 3 depicts four variants of "UCS-2 → Shift_JIS" conversions for the "\u2015\uff5e\u2225\uff0d\uffe4\u2014\u301c\u2016\u2212\u00a6" string using the native2ascii command of JDK V1.3.

    Figure 3. Sample Conversions

    JDK equates Shift-JIS to MS932, but some Web container installations may want to use Cp943C or SJIS for conversion to or from UCS-2. For fine-tuning the selection of input and output code set converters, IBM WebSphere provides converter.properties, a properties file for mapping IANA charset names to JDK converters. Figure 4 depicts a sample mapping, and a typical converter.properties file appears in Appendix A.

    IANA Charset Name    JDK Converter
    -----------------    -------------
    Shift_JIS            Cp943C
    EUC-JP               Cp33722C

    Figure 4. Sample mapping rules in converter.properties

    To take converter.properties into consideration, the following fine-tuning step is added in our input and output code set determination strategies.

    Fine-Tuning Step

    Search converter.properties for a match with the IANA code set name. If there is a match, use the corresponding JDK converter for conversions to and from UCS-2; otherwise use the original IANA name as the JDK converter.
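    The fine-tuning step amounts to a lookup with the IANA name itself as the default. The sketch below assumes that converter.properties uses ordinary Java properties syntax ("name=value"), which is an assumption since the file itself appears only in Appendix A; the class and method names are hypothetical.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Properties;

// Hypothetical sketch of the converter.properties lookup.
final class ConverterFineTuning {

    /** Parses properties text; stands in for reading converter.properties from disk. */
    static Properties parse(String propertiesText) {
        Properties properties = new Properties();
        try {
            properties.load(new StringReader(propertiesText));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return properties;
    }

    /** Returns the mapped JDK converter name, or the IANA name itself when no mapping exists. */
    static String converterFor(String ianaCharsetName, Properties converterProperties) {
        return converterProperties.getProperty(ianaCharsetName, ianaCharsetName);
    }
}
```

    With the Figure 4 mappings loaded, converterFor("Shift_JIS", ...) returns Cp943C, while an unlisted name such as UTF-8 is passed through unchanged.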

    3.5 Customization

    The IBM Web container determines the input and output code sets based on the various internationalization configuration parameters as detailed in Sections 3.2, 3.3, and 3.4. All of these internationalization configuration parameters are customizable by system administrators.

    Both 'encoding.properties', the mapping from locale to IANA charset, and 'converter.properties', the mapping from IANA charset to JDK converters, are exposed as properties files, and both can be altered to suit specific Web container installations.

    For example, in a Japanese PC-based environment, the "ja → Shift_JIS" mapping should suffice, whereas in a Linux client environment, the mapping should be changed to "ja → EUC-JP". If all the Japanese Web content is encoded in UTF-8, the mapping rule must be changed to "ja → UTF-8" for that particular installation.

    In a pure Unicode-based environment, all Web input is encoded in UTF-8. The IBM Web container can easily set the input code set to UTF-8 for specific languages. The system administrator simply has to specify UTF-8 in the 'encoding.properties' file for the appropriate languages. Entries for new locales can also be added easily. The "default.client.encoding" Web container property should be used as a "catch-all", and it is recommended that it be set to UTF-8. The input code set for any unusual locale (for example, various Indic locales) will then automatically default to UTF-8.

    Certain environments may need customization of the "converter.properties" file. As mentioned in Section 3.4, in Japanese environments, the Shift_JIS code set corresponds to more than one JVM converter. In fact, Shift-JIS can really be considered a vendor-unique code set, where the actual character sets and the "Shift_JIS ↔ UCS-2" mappings depend on the vendor-specific implementations.

    If one needs to follow the JIS (Japanese Industrial Standard) or the UTC (Unicode Technical Committee) standard Shift_JIS code set conversion rules, it may suffice to map the Shift_JIS entry of 'converter.properties' to the SJIS converter. As a side effect, some vendor-specific characters defined in Microsoft® Windows or for the Macintosh may simply disappear. Figure 5 shows some NEC-defined characters, which will be filtered out by JDK's SJIS converter.

     Figure 5. Some NEC special characters filtered out by Java SJIS converter

    If a particular installation needs to use an IBM-defined code conversion rule, especially for using IBM back-end data storage (DB2®, IMS, etc.), Shift_JIS should be mapped to Cp943C, or some important characters may be corrupted in the Web application.

    4. Examples

    This section briefly describes illustrative examples using a Servlet and a JSP serving data in Unicode. The Unicode data is represented as escaped Unicode sequences. The variable unicode_data in Examples 1 and 2 represents arbitrary data from a Shift_JIS database. The unicode_data string is sent in the Shift_JIS encoding, using the IANA charset parameter explicitly specified in the setContentType() call. Figures 6 and 7 show the results as displayed in MS Internet Explorer without and with fine-tuning.

    Example 1. Servlet

    import java.io.IOException;
    import java.io.PrintWriter;
    import javax.servlet.ServletException;
    import javax.servlet.http.HttpServlet;
    import javax.servlet.http.HttpServletRequest;
    import javax.servlet.http.HttpServletResponse;

    public class Sample extends HttpServlet {

        // unicode_data is an example of a telephone number in Unicode. Normally, a Unicode
        // string is transmitted via JDBC, HTTP communication, and so on. Here we present a
        // simulation using an escaped sequence.
        String unicode_data = "\u96fb\u8a71(Phone)\uff17\uff12\uff13\u2212\uff13\uff12\uff15\uff16";

        public void doGet(HttpServletRequest request, HttpServletResponse response)
                throws ServletException, IOException {
            // unicode_data is converted to Shift_JIS using the JDK converter.
            response.setContentType("text/html; charset=Shift_JIS");
            PrintWriter pw = response.getWriter();
            // Minimal HTML page wrapper with the title "Sample".
            pw.println("<html>");
            pw.println("<head><title>");
            pw.println("Sample");
            pw.println("</title></head><body>");
            pw.println(unicode_data);
            pw.println("</body></html>");
        }
    }

    Example 2. JSP

    <%@ page contentType="text/html;charset=Shift_JIS" %>
    <html>
    <head><title>Sample</title></head>
    <body>
    <%
        String unicode_data =
            "\u96fb\u8a71(Phone)\uff17\uff12\uff13\u2212\uff13\uff12\uff15\uff16";
        out.println(unicode_data);
    %>
    </body>
    </html>

    Figure 6. Result of Examples 1 and 2

    Without the proper use of the "converter.properties" file, the minus sign of the telephone number gets displayed as a question mark in Figure 6, because JDK's Shift_JIS converter maps the Unicode minus sign to an unassigned Shift_JIS code point. But using the "Shift_JIS → Cp943C" fine-tuning, the telephone number gets displayed properly as shown in Figure 7.

    Figure 7. Result of Examples 1 and 2 with fine-tuning
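    The question-mark effect can be reproduced in a small sketch. Because the set of Shift_JIS-family converters and their exact mappings vary between JDK releases, the example deliberately uses ISO-8859-1 as a stand-in output code set: it is guaranteed to exist on every Java platform and, like the converter discussed above, has no code point for U+2212 MINUS SIGN. The class and method names are hypothetical.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical sketch: what happens to a character the output converter cannot map.
final class UnmappableCharacterSketch {

    /** True when every character of the text has a code point in the output code set. */
    static boolean canRepresent(String text, Charset outputCodeSet) {
        return outputCodeSet.newEncoder().canEncode(text);
    }

    /** Encodes and decodes again, mimicking what the browser ends up displaying. */
    static String roundTrip(String text, Charset codeSet) {
        // String.getBytes(Charset) substitutes the charset's replacement bytes
        // for unmappable characters instead of throwing.
        return new String(text.getBytes(codeSet), codeSet);
    }

    public static void main(String[] args) {
        String phone = "723\u2212325"; // digits around U+2212 MINUS SIGN
        System.out.println(canRepresent(phone, StandardCharsets.ISO_8859_1));
        System.out.println(roundTrip(phone, StandardCharsets.ISO_8859_1));
    }
}
```

    A pre-flight canRepresent() check of this kind is one way to detect, before a response is written, that the selected converter will lose characters; a UTF-8 round trip of the same string is lossless.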

    Figure 8 illustrates an example of the mapping rules between Unicode and the Shift_JIS family of encodings in Java. The "MINUS SIGN (0x817C): character name of JIS X0208" is frequently used in databases or text data, here as the telephone number separator character. The JIS X0208:1997 standard specifies that the code point of the minus sign is 0x817C in the Shift_JIS encoding. However, the mapping rule differs within the Shift_JIS family of converters in JDK, and sometimes the minus sign is not preserved in round trips and is displayed incorrectly (see Figure 8). Using the 'converter.properties' file, IBM WebSphere provides a solution to the Shift_JIS code set conversion problem. It should be mentioned, however, that the use of the UTF-8 code set for HTTP communication perhaps provides a more elegant solution to the problems associated with UCS-2 conversions in certain Asian ideographic language environments.

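    [Figure 8 diagram not reproduced. Labels recoverable from the original: Database (UDB DB2), Servlet (WebSphere), and Web Browser (Mac/Win), connected through the Shift_JIS, SJIS, and Cp943C converters, with the code points 817C, U+FF0D, U+2212, and "?" marked along the paths.]

    817C      Minus Sign (JIS X0208:1997)
    U+2212    Minus Sign (Unicode V3.0)
    U+FF0D    Fullwidth Hyphen-Minus (Unicode V3.0)

    Figure 8. Round trip of the "-" sign.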

    5. Input Code Set in Servlet 2.3

    The issue of 'input code set determination' of an HTTP request has created some confusion among Web container developers. As mentioned in Section 3.1, some Web containers are (or were) following highly questionable strategies for arriving at a value for the input code set. Probably as a result of this, and also for maintaining portability across different Web containers, the emerging Servlet 2.3 specification [8] has attempted to address the issue of 'Request data encoding' by

    - mandating that in the absence of any 'charset' information in the HTTP header, ISO-8859-1 will be the default encoding of the HTTP request, and

    - introducing the new javax.servlet.ServletRequest.setCharacterEncoding() method.

    Some may think that [8] has simply shifted the complex burden of ‗input code set determination‘ from the Web container developer to the Servlet or JSP programmers. The central question still remains: How can an application programmer developing Servlets or JSPs figure out the input encodings in non-Latin1 multilingual environments in order to know the parameter when calling the newly introduced method?

    In a future release of WebSphere, IBM plans to implement the Servlet 2.3 (and JSP 1.2) specifications. To aid the Servlet (and JSP) programmers, so that they do not have to worry about input encoding in most situations, IBM intends to provide a special deployment descriptor for Servlets and JSPs as a simple extension to the J2EE 1.3 specifications [6]. This deployment descriptor can be described informally as:

    <!--
    The servlet element contains the declarative data for a servlet or a JSP.
    -->

    <!ELEMENT servlet (icon?, servlet-name, display-name?, description?,
        (servlet-class|jsp-file), init-param*, load-on-startup?, run-as?,
        security-role-ref*, request-encoding?)>

    <!--
    The request-encoding element must be one of the following:

        J2EE
        IBMWAS

    with J2EE as the default.

    If J2EE is specified, in the absence of any explicit ServletRequest.setCharacterEncoding() API invocation, ISO-8859-1 encoding for the request data will be assumed by the IBM Web container, if the 'charset' information is also missing in the "Content-Type" HTTP header.

    If IBMWAS is specified, in the absence of any explicit ServletRequest.setCharacterEncoding() API invocation, the IBM Web container will use steps 3.2.1 to 3.2.4 (see Section 3.2) to decide on the input encoding of the request data.
    -->

    When available, an application programmer can deploy Servlets and JSPs using the IBMWAS request-encoding value in IBM WebSphere. In most cases the programmer will then not have to worry about the 'input code set', and can concentrate on the business logic of the application. The IBM Web container will ascertain the input encoding based on its internationalization configuration.
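    The intended difference between the two request-encoding values can be sketched as a small decision function. This is an illustration of the rules described above, not an IBM API: all names are hypothetical, and the outcome of steps 3.2.1 to 3.2.4 is supplied by the caller as a Supplier so that it is only evaluated when the IBMWAS mode needs it.

```java
import java.util.function.Supplier;

// Hypothetical sketch of how the request-encoding descriptor value selects a policy.
final class RequestEncodingPolicy {

    enum Mode { J2EE, IBMWAS }

    static String inputEncoding(Mode mode,
                                String explicitEncoding,      // from setCharacterEncoding(), or null
                                String contentTypeCharset,    // from the Content-Type header, or null
                                Supplier<String> webSphereHeuristic) { // steps 3.2.1 to 3.2.4
        // An explicit setCharacterEncoding() call takes precedence in both modes.
        if (explicitEncoding != null) {
            return explicitEncoding;
        }
        if (mode == Mode.IBMWAS) {
            return webSphereHeuristic.get();
        }
        // J2EE mode: header charset if present, otherwise the Servlet 2.3 default.
        return contentTypeCharset != null ? contentTypeCharset : "ISO-8859-1";
    }
}
```

    Under this reading, a request with no charset information decodes as ISO-8859-1 in J2EE mode, whereas the same request in IBMWAS mode is decoded according to the installation's locale mappings and default client encoding.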

    6. Conclusions

    The present paper described the heuristic strategies used by IBM WebSphere to determine the input and output code sets associated with HTTP requests and responses. The strategies use customizable 'locale → code set' and 'code set → converter' mapping tables. The 'locale → code set' mapping is also mentioned in [4], and is used in Tomcat's [5] Servlet 2.2 implementation for determining the code sets of the HTTP responses.

    In contrast to Tomcat, the IBM Web container's use of mapping functions is completely flexible. For example, the 'ja → Shift_JIS' mapping is hard-wired in Tomcat [5]. In a Japanese Linux or some other environment, if a 'ja → EUC-JP' mapping is desired for some reason, nothing much can be done in Tomcat without explicit programmer intervention, because the mapping table is compiled into the Web container's implementation. In IBM WebSphere, the system administrator can simply make a minor adjustment in the "encoding.properties" file, thereby providing EUC-JP encoded
