By Bonnie Carpenter,2014-07-04 01:25

Hanzi-to-Pinyin/Zhuyin Converter (H2X)

1 Overview and Goals

H2X is a Hanzi-to-Pinyin (H2P) conversion system for Simplified Chinese (SC) and a Hanzi-to-Zhuyin (H2Z) conversion system for Traditional Chinese (TC). It can eventually be expanded to other romanization systems, such as Yale and Wade-Giles. Collectively, we refer to such a hanzi transcription system as H2X, where X stands for any phonemic transcription, such as pinyin, zhuyin or romanized Cantonese. Below, X is often referred to as the reading.

H2X can be used, among other things, to:

• Aid native speakers in reading difficult names or characters
• Aid learners in reading Chinese texts
• Enable ambiguous search based on homophones (explained below)
• Sort hanzi by pinyin or zhuyin (useful for name lists and the like)

2 Conversion Ambiguity

An obvious and major issue with H2X is the one-to-many ambiguity of thousands of characters, the so-called polyphonic hanzi (多音字 duōyīnzì), such as lè and yuè for 乐, resulting in numerous homophones. The disambiguation strategy for accurate H2X conversion is to tokenize the text so as to isolate individual words, and then to look each word up in a word-level H2X mapping table, which almost completely eliminates ambiguity. This requires the following components:

1. Simple word tokenizer
2. Word-level H2X mapping tables
3. Character-level H2X mapping tables
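The three components can be sketched together as follows. This is only a minimal illustration, not the CJKI algorithm: the table entries, the names `WORD_TABLE`, `CHAR_TABLE`, `tokenize` and `convert`, and the greedy longest-match segmentation are all assumptions made for the example. It shows how a word-level lookup resolves the polyphonic character 乐 (lè or yuè), with the character-level table used only as a fallback.

```python
# Toy tables: illustrative entries only, not actual CJKI data.
WORD_TABLE = {
    "音乐": "yīnyuè",
    "快乐": "kuàilè",
    "网路": "wǎnglù",
}
CHAR_TABLE = {
    "乐": ["lè", "yuè"],  # first entry is assumed to be the most common reading
    "音": ["yīn"],
    "快": ["kuài"],
    "很": ["hěn"],
    "听": ["tīng"],
}
MAX_WORD_LEN = max(len(word) for word in WORD_TABLE)


def tokenize(text):
    """Greedy longest-match segmentation against WORD_TABLE."""
    tokens, i = [], 0
    while i < len(text):
        for size in range(min(MAX_WORD_LEN, len(text) - i), 0, -1):
            chunk = text[i:i + size]
            # A single character is always accepted as a last resort.
            if size == 1 or chunk in WORD_TABLE:
                tokens.append(chunk)
                i += size
                break
    return tokens


def convert(text):
    """Word-level lookup first, character-level fallback second."""
    readings = []
    for token in tokenize(text):
        if token in WORD_TABLE:
            readings.append(WORD_TABLE[token])
        else:
            # Unknown characters are passed through unchanged.
            readings.append(CHAR_TABLE.get(token, [token])[0])
    return " ".join(readings)


print(convert("听音乐"))  # 乐 is read yuè inside the word 音乐
print(convert("很快乐"))  # 乐 is read lè inside the word 快乐
```

Because the word table is consulted before the character table, the tokenizer only has to be accurate enough to isolate the words that carry a polyphonic character.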

There are two kinds of homophonic ambiguity (the implications of which are described below):

1. Homotonic: reading and tone are identical, such as 网陆 (resulting from an input error) and 网路, both wǎnglù.
2. Heterotonic: reading identical but tone different, such as 网炉 wǎnglú (input error) and 网路 wǎnglù.
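The distinction between the two kinds can be made concrete with a small classifier. The sketch below assumes readings are stored as (syllable, tone number) pairs; that representation, the `READINGS` table and the function name `homophone_kind` are hypothetical choices for this example rather than part of any H2X specification.

```python
# Readings as (toneless syllable, tone number) pairs.
# This representation and the entries are assumptions of the sketch.
READINGS = {
    "网路": (("wang", 3), ("lu", 4)),
    "网陆": (("wang", 3), ("lu", 4)),
    "网炉": (("wang", 3), ("lu", 2)),
    "网络": (("wang", 3), ("luo", 4)),
}


def homophone_kind(word_a, word_b):
    """Return 'homotonic', 'heterotonic', or None if the words are not homophones."""
    a, b = READINGS[word_a], READINGS[word_b]
    if a == b:
        return "homotonic"      # same syllables, same tones
    if [syllable for syllable, _ in a] == [syllable for syllable, _ in b]:
        return "heterotonic"    # same syllables, different tones
    return None


print(homophone_kind("网路", "网陆"))
print(homophone_kind("网路", "网炉"))
print(homophone_kind("网路", "网络"))
```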

Note that for the purposes of converting to the correct reading, the tokenizer need not be as robust as for other applications, since the goal is not to extract tokens per se but to segment just accurately enough so that the correct reading is determined. Thus the H2X tokenizer can be based on a simplified tokenization algorithm, which CJKI can provide, independently of the main tokenizer.

3 Ambiguous Search

H2X conversion on both the homotonic and heterotonic levels can have a major benefit: enabling ambiguous search as well as retrieval of documents even if the keywords are input erroneously, such as 网陆 for 网路. The system should thus support four conversion modes:

1. Toneless pinyin for SC
2. Toned pinyin for SC
3. Toneless zhuyin for TC
4. Toned zhuyin for TC

This means that if the search engine is properly tuned, it will retrieve not only homotonic homophone pairs like 网路/网陆 (both wǎnglù), but also heterotonic pairs like 网路/网炉 (wǎnglù vs. wǎnglú), in which the tones are different but the readings are identical.
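One way to picture the toned and toneless modes is as two index keys derived from the same reading. The sketch below covers pinyin only and is a rough illustration under stated assumptions: `WORD_READINGS`, `strip_tones`, `reading_key`, `build_index` and `search` are hypothetical names, and the table is a toy. Tone marks are removed with Python's standard `unicodedata` module, leaving the diaeresis of ü untouched.

```python
import unicodedata
from collections import defaultdict

# Combining macron, acute, caron and grave: the four pinyin tone marks.
# The diaeresis (U+0308) is deliberately excluded so that ü survives.
TONE_MARKS = {"\u0304", "\u0301", "\u030c", "\u0300"}

# Hypothetical word-level readings (toy data).
WORD_READINGS = {
    "网路": "wǎnglù",
    "网陆": "wǎnglù",
    "网炉": "wǎnglú",
    "网络": "wǎngluò",
}


def strip_tones(pinyin):
    """Remove tone marks from toned pinyin, e.g. wǎnglù becomes wanglu."""
    decomposed = unicodedata.normalize("NFD", pinyin)
    kept = "".join(ch for ch in decomposed if ch not in TONE_MARKS)
    return unicodedata.normalize("NFC", kept)


def reading_key(word, toned):
    """Index key for a word: toned pinyin or toneless pinyin."""
    reading = unicodedata.normalize("NFC", WORD_READINGS[word])
    return reading if toned else strip_tones(reading)


def build_index(words, toned):
    """Group words under their toned or toneless reading."""
    index = defaultdict(set)
    for word in words:
        index[reading_key(word, toned)].add(word)
    return index


def search(index, query, toned):
    """Return every indexed word that shares the query's key."""
    return index.get(reading_key(query, toned), set())


toned_index = build_index(WORD_READINGS, toned=True)
toneless_index = build_index(WORD_READINGS, toned=False)
print(search(toned_index, "网陆", toned=True))       # homotonic matches only
print(search(toneless_index, "网炉", toned=False))   # heterotonic matches as well
```

A toned index recovers only homotonic errors such as 网陆, while a toneless index also recovers heterotonic ones such as 网炉, at the cost of more candidates per key.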

4 Features of H2X Converter

The system should eventually have the following capabilities/features:

1. The source string is first extracted by the (simple) tokenizer.
2. The string is looked up in a comprehensive word-level H2X mapping table covering general vocabulary, proper nouns and technical terms.
3. Support both query errors and document errors.
4. Character-level H2X mapping tables. The readings (pinyin or zhuyin) have been proofread and fine-tuned over the years and include the following features:
    a. The first reading has been carefully selected to ensure it is the most common.
    b. The order of the other readings in the case of one-to-many mappings is based on frequency of use.
    c. Rarity flags enable selecting a mode in which rare and historical readings are ignored so as to reduce ambiguity (at the slight risk of error).
    d. Possibly, provide flags to indicate order of priority when a reading is used in names as opposed to general vocabulary.

    e. A major feature is that SC readings are clearly distinguished from TC readings when they are heterotonic, e.g. SC qī as opposed to TC ㄑㄧˊ (zhuyin for qí) for 期. See details at:

5. The H2X conversion algorithm that (eventually) supports the following features:
    a. Word-level conversion
    b. Character-level conversion
    c. Picklist of candidates in case of one-to-many ambiguities
    d. Option to ignore rare/historical readings
    e. Possibly, option to fine-tune output to proper nouns
    f. Select SC or TC reading
    g. Output in zhuyin
    h. Output in any romanization system, such as Wade-Giles and Yale
    i. Output in IPA broad transcription
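Several of these features (frequency-ordered readings, rarity flags, the candidate picklist, and SC/TC selection) can be pictured with a small character-level table. The sketch below is hypothetical throughout: the `Reading` class, the table layout, the function names and the entries are assumptions for illustration, not the format of the CJKI tables. Both locales are stored as pinyin here for simplicity; conversion of the output to zhuyin would be a separate step.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Reading:
    pinyin: str
    rare: bool = False  # rarity flag: rare or historical reading


# Hypothetical character-level table: character -> locale -> ordered readings.
# Readings are assumed to be listed from most to least common.
CHAR_TABLE = {
    "乐": {"SC": [Reading("lè"), Reading("yuè"), Reading("yào", rare=True)]},
    "期": {"SC": [Reading("qī")], "TC": [Reading("qí")]},
}


def picklist(char, locale="SC", include_rare=True):
    """All candidate readings for a character, in table order."""
    readings = CHAR_TABLE.get(char, {}).get(locale, [])
    return [r.pinyin for r in readings if include_rare or not r.rare]


def best_reading(char, locale="SC"):
    """Most common non-rare reading, or None if the table has no entry."""
    candidates = picklist(char, locale, include_rare=False)
    return candidates[0] if candidates else None


print(picklist("乐"))                      # full picklist, rare reading included
print(picklist("乐", include_rare=False))  # rare and historical readings ignored
print(best_reading("期", "SC"), best_reading("期", "TC"))
```

Keeping the rarity flag on each reading, rather than in a separate table, lets the same ordered list serve both the full picklist and the reduced-ambiguity mode.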

5 Resources for H2X Conversion

CJKI can provide the following comprehensive mapping tables and a robust algorithm for H2X conversion:

1. SC-to-pinyin word mapping table
2. TC-to-zhuyin word mapping table
3. SC/TC to/from pinyin/zhuyin character mapping table
4. H2X conversion algorithm with advanced options

For your reference, if in the future you wish to support Hanzi-to-Cantonese conversion, we can also provide the following:

1. Hanzi-to-Cantonese mapping table
2. Mapping table for eight Cantonese romanization systems
3. Hanzi-to-Cantonese conversion algorithm
