By Patricia Harper,2014-08-27 20:31
18 views 0
word-filedoc ...

Journal of Chinese Language and Computing 16 (4): 225-237 225

    A New Dictionary-based Word Alignment Algorithm

    ***Zhiming Xu , Jonathan J. Webster and Chunyu Kit *Department of Chinese, Translation and Linguistics, City university of Hong Kong,

    Tat Chee Ave., Kowloon, Hong Kong

    School of Computer Science and Technology, Harbin Institute of Technology, P. R.C.


    Box 321, Harbin Institute of Technology, Harbin, China 150001


    This paper presents our recent work on dictionary-based word alignment for the English-Chinese bilingual corpus of Hong Kong legal texts. It is intended to address a number of critical issues involved in word alignment, including monolingual pre-processing for identifying text items for alignment, proper formulation of similarity measure for bilingual word pairs, and the novel treatment of many-to-many word alignment. A dictionary-based word alignment algorithm (DictAlign) is proposed and tested with the bilingual texts of Hong Kong laws. Experimental results show that this approach achieves an impressive performance: 95.70% precision and 68.39% coverage rate for English, and 95.73% precision and 64.87% coverage rate for Chinese.


word alignment, word similarity

1. Introduction

    The main task of word alignment is to identify translation equivalents between lexical items in bilingual parallel texts (or bitexts). In addition to individual words, i.e., single-word units, the lexical items to be aligned also include multi-word units, such as

226 Zhiming Xu, Jonathan J. Webster and Chunyu Kit

    compound words, terms, idiomatic expressions and some text chunks of other types. The lexical items for alignment are referred to as link units (Ahrenberg, Magnus, et al, 2000; Tiedemann, 1999). Bilingual word alignment usually assumes a segment-aligned bitext as input, where bitext segments may be mutual translations of clauses, sentences, and sentence sequences (Tiedemann, 1999). In general word alignment work involves the following typical tasks: (1) acquisition of necessary resources, including bilingual dictionaries, bilingual terminology and monolingual stop (or functional) word lists; (2) segmentation of monolingual sentences into link units; and (3) alignment of bilingual link units in a given bitext.

    Word alignment systems may aim at different purposes, e.g., full-text word alignment vs. bilingual lexicon extraction (Ahrenberg, Magnus, et al, 2000). Our recent work, as part of an ongoing research project on EBMT for Hong Kong legal texts, is aimed at full-text word alignment. The outcomes of this work are expected to facilitate text alignment at higher linguistic levels, such as phrase/chunk alignment and sentence alignment, and translation selection (Och and Ney, 2003; Knight, 1999; Wu, 1996).

    Our word alignment work addresses both single-word and multi-word link units. The multi-word link units sometimes appear in corpus in discontinuous form (parts of them are separated by some determinative, adjectives and so on). It is not surprising that they are segmented into fragments. Also, because of the imperfect segmentation of link units, mostly due to segmentation ambiguities, continuous multi-word units may be segmented into fragments. Problems of this kind make the alignment of these link units more complicated than the one-to-one word alignment. Once a pair of multi-word link units are both segmented into several fragments, the many-to-many alignment problem arises, although the alignment of other types, including one-to-one, one-to-many and many-to-one alignment, are more popular in practice.

    In this paper we will present our recent work on (1) the segmentation of English and Chinese monolingual Hong Kong legal texts into link units and (2) the word alignment on these bilingual link units. The remaining sections of the paper are organized as follows. Section 2 introduces our probabilistic approach to link unit segmentation. Section 3 formulates an empirical similarity measure for identifying word pairs of mutual translation with the aid of an available bilingual dictionary. Section 4 proposes a dictionary-based word alignment algorithm, called DictAlign. It copes with all types of alignment of discontinuous link units. Section 5 presents our experimental results on the BLIS corpus, a complete collection of English-Chinese bilingual texts of Hong Kong laws, using this algorithm. It achieves an alignment precision of 97.7%, with a coverage rate of 64.6% and

A New Dictionary-based Word Alignment Algorithm 227

    67.0%, respectively, for the Chinese and English texts in the input bilingual corpus. Section 6 concludes the paper and discusses possible directions for future work.

2. Identification of monolingual ling units

2.1 Pre-processing

    The proper segmentation of monolingual texts into link units is an essential task to word alignment if we intend to extend word alignment to handle multi-word link units in addition to single-word link units. Two typical preprocessing tasks are involved in this stage of work for the identification of such monolingual link units, namely, lemmatization and tokenization.

    In our work, the process of lemmatization takes place with the preprocessing of English texts, in order to convert variant word forms back to their lemmas, or base forms. Our lemmatization module exploits a number of morphological rules to carry out the conversion for regular nouns and verbs. Irregular nouns and verbs are handled individually with the aid of a mapping table of irregular word forms and their lemmas.

    Tokenization is a process to recognize text tokens for further processing. Here tokens can be viewed as atomic text units that form the larger text units such as words for later processing. They are not to be further decomposed into smaller units. In our work, a token in Chinese is a Chinese character of fixed length (2 bytes) it is worth noting that the

    tokenization here for Chinese does not involve word segmentation; a token in English is a single word, which is generally delimited by spaces and punctuations. A link unit in both languages is a sequence of one or more than one token. For example, in the pair of link units 香港”/ “Hong Kong”, “香港” consists of two tokens “” and “”, and “Hong

    Kong ” of two tokens “Hong” and “Kong”.

2.2 Statistical approach to link unit identification

After the pre-processing phase, each sentence in an input text is converted into a ST

    ntoken sequenceStttt??. The task of statistical link unit segmentation is to find a 121n

    mlink unit sequence over this token sequence Suuuu??that maximizes the 121m

228 Zhiming Xu, Jonathan J. Webster and Chunyu Kit

    mprobability. We use a link unit-based unigram as language model and pu()1

    mmdefine, where , and is the frequency of the pununu()()/()nu()pupu()()(iii1iuT1i

    link unit in. uTi

    The language model is trained on the unsegmented corpus using EM algorithm T

    (Dempster, 1977). The training process is similar to the EM-training presented in Kit et al. (2003) for probabilistic Chinese word segmentation. We train unigram models for English and Chinese using the BLIS corpus, and utilize them respectively to identity link units in each language.

3. Bilingual word similarity

    A machine-readable bilingual dictionary is the most critical lexical resource in our dictionary-based approach to word alignment. In general, word alignment approaches can be categorized into two types: statistical-based, as illustrated, for example, in Brown et al. (1993) and Melamed (2000), and dictionary-based, as we practice here. We expect that a dictionary-based word alignment approach can achieve a better performance than statistical word alignment methods, because the lexical resources that it makes use of is more reliable than pure statistical information. The lexical resources that DictAlign uses include a Chinese-English dictionary of frequent words and a bilingual term list.

    Given a bilingual sentence pair, each as a sequence of monolingual link units, the (,)ST

    task of word alignment is to identify as many correspondent pairs of link units as possible in the sentence pair. For each link unit in source-language sentence, we want to find sS

    sits counterpart in the target sentence. This counterpart of is called its in-context T

    translation (ICT), denoted as . On the other hand, its dictionary translation as icts()

    extracted from an available bilingual dictionary is called its dictionary translation (DT), denoted as (Ker and Chang, 1997). dts()

    For any bilingual link unit pair in, we need to estimate the likelihood that (,)st(,)ST

     is mutual translation. We can estimate such likelihood in terms of their bilingual (,)st

    word similarity, denoted as . Statistical measures, such as mutual information, simst(,)

    dice coefficients and other statistical metrics, are commonly used to estimate the statistical association strength or the distribution similarity of a word pair. Here we take a

    229 A New Dictionary-based Word Alignment Algorithm

    tdictionary-based approach to estimate the string similarity between and for the icts()

    purpose of determining whether are mutual translation. (,)st

    String similarity can be computed in terms of the longest common subsequence ratio (LCSR), as illustrated in Tiedemann (1999) for the purpose of determining whether a word pair is a cognate pair. Here, we also resort to LCSR for estimating the string similarity

    tbetween and . With regard to the symmetry in word alignment, we define the icts()

    bilingual word similarity of a link unit pair as follows: (,)st

    LCSRsdttLCSRdtst(,())((),)simst(,) (1) 2

    lengthLongestCommonSubsequencexy((,))where. gives a score simst(,)LCSRxy(,)max((),())lengthxlengthy

    to indicate likelihood that represents a mutual translations. is symmetrical, (,)stsimst(,)

    i.e., . To avoid unexpected noises for the word alignment, if the text simstsimts(,)(,)

    strings in the pair or do not include one another, then we (,())sdtt((),)dtst

    set. simst(,)0

4. The DictAlign algorithm

    Source ssssss1 2 3 4 5 6

    Target tttttt1 2 3 4 5 6

    Figure 1. Direct link digraph of word alignment

Figure 1 shows an bilingual sentence pair, where and (,)STSsss{,,,}12m

    , each being a set of content-word link units after the monolingual link Tttt{,,,}12n

    unit segmentation phase and filtering out functional words. DictAlign, our dictionary-based word alignment algorithm, is designed to find as many proper pairs of link units as possible in the product set. ST

230 Zhiming Xu, Jonathan J. Webster and Chunyu Kit

4.1 Baseline algorithm of word alignment

    In order to evaluate the performance of the DictAlign algorithm by comparison, we firstly implement a one-to-one word alignment algorithm as a baseline, using dice coefficients as similarity measure. It is given as follows.

    Given a pair of sentences , each as a sequence of content-word link units, (S,T)

    S{s,s,;,s}T{t,t,;,t}, and , the similarity for each sim(s,t)12m12n

    link unit pair in is first computed. (s,t)ST

    sim(s,t)0(s,t)argmaxsim(s,t)(1) Select the link unit pair , if . If ijij(s,t)ST

    not available, then exit.

    (s,t)TT{t}SS{s}(2) Align , and let , . ijji

    (3) Go to (1).

4.2 DictAlign word alignment algorithm

    The bilingual dictionary used in our research subsumes a bilingual glossary of Hong Kong laws, with many multi-word terms that sometimes appear in a discontinuous form in the corpus. It is also quite common that even the continuous ones are segmented into several fragments by the process of link unit segmentation, mostly due to segmentation ambiguities. If a multi-word unit and its ICT are both in fragments, we have a many-to-many mapping problem in our alignment work. If only one of the two counterparts is in fragments, we have a one-to-many (or many-to-one) mapping problem, a special case of many-to-many mapping. According to our observation, such special cases take place more often than the many-to-many alignment, which takes place occasionally. In order to decide whether two given sets of such fragments should be aligned with each other we need to (,)setset12

    setsss{,}setttt{,}calculate their similarity. Assuming and, each 1mmm2nnn12k12l

    being a sequence of segmented fragments, we define their similarity as follows:

    simsetsetsimsssttt(,)(,) (2) 12mmmnnn1212kl

    owhere denotes string concatenation. The computation involved in (2) is identical to that for (1), except the concatenation before the computation. It is worth noting, however, that

    231 A New Dictionary-based Word Alignment Algorithm this formula is effective in handling fragmental in-vocabulary (IV) link units, not any out-of-vocabulary (OOV) ones. Given a candidate pair of fragmental link units, if one is IV

    in this language and the other OOV in another language, they may only be partially aligned using (2).

    Firstly, to align fragmental link units, we need to construct the map set for mapu()

    Teach link unit in or. For each, sSSi

    ; For each tT, mapstTsSsimstsimst(){|,(,)(,)}?;;;!jii

    tmaptsStTsimstsimst(){|,(,)(,)}?;;;!. That is, a link unit in takes maps()jj

    ts?,s as the most possible. Their relation is denoted by a direct link: , as in ictt()

    ts?,Figure 1. The length of is defined as. Once the map set of each link unit simts(,)

    Tin or is constructed, we select the longest direct link st?, from all the direct Sij

    links in , with simst(,)0. If no direct link available, we are done; otherwise, for STij

    a selected direct linkst?,, we seek the pair of ij

    mapu()(,)argmax(,)setsetsimxyelements, where denotes the power set of 212mapt()maps()jixy;,(,)22

    . Then we align , and let , . Next we mapu()(,)setsetSSset??TTset??1212

    Tre-construct the map set of each unaligned ling unit in or, and iterate the above S

    alignment process until no more alignment is available.

    With the above process, to tackle many-to-many word alignment problem, the DictAlign algorithm is formulated as follows:

    Given a pair of sentences(S,T), each as a sequence of content-word link units,

    S{s,s,;,s}T{t,t,;,t}, and , the similarity sim(s,t) for each 12m12n

    link unit pair in is first computed. (s,t)ST

    tTsS(1) For each link unit and, construct their map sets. ji

    sim(s,t)0(s,t)argmaxsim(s,t)(2) Select the link unit pair , if . If ijij(s,t)ST

    not available, then exit.

    (3) Select the element pair . (set,set)argmaxsim(x,y)12map(t)map(s)ji(x,y)22

232 Zhiming Xu, Jonathan J. Webster and Chunyu Kit

    (set,set)SSsetTTset(4) Align , and let , . 1212

    (5) Go to (1).

    We gave an example to illustrate how DictAlign works step by step. Here is a pair of bilingual sentences as followings.

    English: Get the children up.

    Chinese: 孩子起床

    During monolingual link unit identification phase, the English sentence may be segmented into a few of fragments, “get / the / children / up/.”, “get up” appears in incontinuous form. How can we recognize “get” and “up ” as a whole multi-word unit, and

    align them with “起床”? Compared with the statistical approach, the dictionary knowledge is more reliable. In this paper, we use the dictionary knowledge to distinguish if some fragments consist of a multi-word unit so that we filter out impossible fragment combinations.

    The pair of sentences is segmented into a sequence of fragments

    English: Get /the /children /up/.

    Chinese: /孩子/起床/

    Their fragments will generate a pair of link unit sets:

    S = {“get”, “the”, “children”, “up”}.

    T = {“”, “孩子”, “起床”}

    DictAlign will align them along the following steps.

    1. Using Equation.1, we compute word similarity between any pair of link units; these similarity values construct a similarity matrix on. ST

    2. Select a pair of link units that are of maximum word similarity. Here, we assume similarity (“起床”, “up”) is maximum, because DictAlign is kind of

    best-first algorithm, it will select the pair of words,“起床”and “up” to align them firstly.

    3. Construct the map set for each element of S or T by use of the following equations.

    map(c){eT|;cS,sim(c,e)sim(c,e)} For ;cSii

    map(e){cS|;eT,sim(c,e)sim(c,e)} For ;eTjj

    sEach link unit in map(s) takes as the most possible . For example, ict(t)t

    Map(“起床”) = {“get”, “up”}

    Map(“up”) = {“起床”}

    map(u)4. Construct the power set of the map (u), and denote it as. It constructs the space 2

    of possible fragment combinations; we select a pair of fragment combinations from the

    233 A New Dictionary-based Word Alignment Algorithm space, and align them.

    map("起床")The power set of Map(“起床”), 2 {{“get”}, { “up” }, { “get”, “up” }}.

    map("up")The power set of Map(“up”), {{ “起床”}}. 2

    map("起床")map("up")Now, we will select the link unit pair (u,v) from , which maximizes 22

    similarity (u,v). here, (u,v) = ({ “get”, “up” },{ “起床”}).

    5. Finally, we align ({ “get”, “up” },{ “起床”}), and move them into aligned result. Then

    iterate the above process.

5. Experimental results

    This section presents our experimental results using the DictAlign algorithm on English-Chinese bilingual texts of Hong Kong laws. The bilingual dictionary we use includes 134,497 bilingual word pairs and 25214 bilingual legal term pairs. A stop list is used to filter out function (or non-content) words in each language. The training corpus consists of 238,271 aligned bilingual sub-paragraph pairs from BLIS corpus, among which many are subparagraph legal text items. In total, 38MB English texts and 23MB Chinese texts are used, respectively, to train the unigram models for link unit segmentation for the two languages. A sub-corpus of the first 205 sub-paragraph pairs is selected as test data to evaluate the performances of the DictAlign algorithm and the baseline word alignment method. Their performance is presented in Table 1 for comparison. We can see that DictAlign achieves a significantly higher precision than the baseline, although its coverage rate is lower. The majority of alignment errors that DictAlign makes are resulted from partial-matching. On the contrary, the majority of alignment errors by the baseline alignment are mismatches. The precision and coverage rate of word alignment are defined as follows.

    numberofcorrectlyalignedcontentlinkunitsintestcorpus (3) precisionnumberofalignedcontentlinkunitsintestcorpus

    numberofalignedcontentlinkunitsintestcorpuscoveragerate (4) numberofcontentlinkunitsintestcorpus

234 Zhiming Xu, Jonathan J. Webster and Chunyu Kit

    Table 1: The alignment performance: DictAlign vs. the baseline

    Algorithm DictAlign Baseline

    Language English Chinese English Chinese

    Words 4970 5506 (4970) (5506) Test corpus Link units 1838 1950 (1838) (1950)

    Aligned link 1257 1265 1747 1747


    Partial-matching 40 40 36 36 Experimental

    results Mismatching 14 14 262 262

    Coverage rate 68.39% 64.87% 95.05% 89.59%

    Precision 95.70% 95.73% 83.00% 83.00%

    The example given below illustrates how the link unit segmentation and the word alignment are carried step by step.

    (1) Input: a pair of un-segmented sub-paragraphs

     本條例旨在綜合和修訂有關法例的釋疑?適用範圍?釋義的法律; 設立




     To consolidate and amend the law relating to the construction, application and interpretation of laws, to make general provisions with regard thereto, to define terms and expressions used in laws and public documents, to make general provision with regard to public officers, public contracts and civil and criminal proceedings and for purposes and for matters incidental thereto or connected therewith.

    (2) Link unit segmentation result


    |法律| |設立|關於|這些|事宜||一般條文|||法例||公共文件|中的|




     |to|consolidate|and|amend|the|law|relating to|the|construction|,|application|and |interpretation|of|law|,|to|make general provision|with|regard|thereto|,|to|define|terms|and |expressions|used|in|law|and|public|documents|,|to|make general provision|with regard to|public|officers|,|public contract|and|civil|and|criminal proceedings|and|for|purposes|and |for|matter|incidental|thereto|or|connected|therewith|.

    (3) Word alignment output from DictAlign

Report this document

For any questions or suggestions please email