By Melissa Duncan,2014-08-06 18:11

    Semantic Computation in Chinese Question-Answering System

    Li Sujian (李素建), Zhang Jian (张健), Huang Xiong (黄雄) and Bai Shuo (白硕)

    Software Department, Institute of Computing Technology, Chinese Academy of Sciences,

    P.O.Box 2704, Beijing 100080, P.R.China


    Abstract This paper introduces semantic computation into our Chinese Question-Answering system. Based on two language resources, Hownet and Cilin, we present an approach to computing the similarity and relevancy between words. Using these results, we calculate the relevancy between two sentences and then select the optimal answer to the query. The calculation is quantitative and can be incorporated into QA systems easily, avoiding some difficulties of conventional NLP. Finally, we present experiments showing that the results are satisfactory.

    Keywords similarity, relevancy, Hownet, Question Answering, Natural Language Processing

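    Abstract (translated from the Chinese) This paper introduces a method for implementing semantic computation and integrating it into a Chinese question-answering (QA) system. Based on two ...

    Keywords (translated from the Chinese) similarity; relevancy; Hownet; question answering; natural language processing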

    1 Introduction

    With the explosion of information available on the Internet, a Question-Answering (QA) system can help users find what closely matches their needs. Since both questions and answers are mostly expressed in natural language, QA methodologies have to incorporate NLP (Natural Language Processing) techniques, including syntactic and semantic computation. Encouraged by the Text Retrieval Conference (TREC) and the Message Understanding Conferences (MUCs), some QA systems have achieved good performance [1]. However, these systems mainly target English. In this paper, based on the characteristics of the Chinese language and some language resources, we build a Chinese Question-Answering system through the computation of semantic similarity and relevancy.

    2 Overview of Language Resources

    Hownet is a free Chinese-English bilingual resource that was recently released on the Internet [2, 3, 4]. It is a knowledge base describing relations between concepts and relations between the attributes of concepts. In our Chinese QA system we mainly use this knowledge base, which includes 66,681 concepts. Every word sense is represented by a combination of several sememes. A sememe is a basic semantic unit that is indivisible in Hownet. From an ontological point of view, about 1,500 sememes are extracted to compose an elementary set that forms the basis of the Chinese vocabulary, just as a little over 100 chemical elements constitute all the substances in nature. We formalise several definitions in Hownet as follows:

    $$ SS = \{s_1, s_2, \ldots, s_n\}, \quad n = 1541 $$
    $$ WS = \{c_1, c_2, \ldots, c_m\}, \quad m = 66{,}681 $$
    $$ REL = \{\texttt{*}, \texttt{@}, \texttt{?}, \texttt{!}, \texttt{\~{}}, \texttt{\#}, \texttt{\$}, \texttt{\%}, \texttt{\^{}}, \texttt{\&}, \mathrm{NULL}\} $$
    $$ c_i = r_{i1} s_{i1},\ r_{i2} s_{i2},\ \ldots,\ r_{ik} s_{ik}, \quad r_{it} \in REL,\ s_{it} \in SS\ (1 \le t \le k) $$

    where $SS$ is the set of sememes, which has 1,541 elements; $WS$ is the set of word senses in Hownet, whose size is 66,681; and $REL$ is the set of symbols that describe relations between a concept and a sememe or relations between sememes. The definition of every word sense $c_i$, that is, every concept, is composed of $k$ items, each of which consists of a relation symbol in $REL$ and a sememe in $SS$.

    The other language resource available in our system is the Chinese thesaurus "Cilin" [5], which provides a semantic classification of Chinese words. It comprises 12 major categories, 94 medium categories, and 1,428 minor categories. The minor categories are further divided into synsets according to meaning. Every synset includes several words with the same or similar meanings. This hierarchical classification embodies the synonymy and hyponymy relations and is convenient for the expansion and semantic computation of word senses. We formalise several definitions as follows:

    $$ WS' = \{c_1, c_2, \ldots, c_{m'}\}, \quad m' = 61{,}125 $$
    $$ SC = \{sc_1, sc_2, \ldots, sc_p\}, \quad p = 11{,}832 $$

    where $WS'$ is the set of word senses in Cilin, whose size is 61,125, and $SC$ is the set of synsets, whose size is 11,832.

    The two language resources introduced above are a great help in computing the semantic similarity and relevancy of two Chinese words.
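The Hownet definitions above map onto a few small data structures. The following Python sketch is only an illustration of that formalism: the class name `WordSense`, the helper `is_well_formed`, and the toy sememes and definition are hypothetical and are not taken from Hownet's actual files.

```python
from dataclasses import dataclass
from typing import List, Optional, Set, Tuple

# REL as listed in the text; None stands in for NULL (no relation symbol).
REL = {"*", "@", "?", "!", "~", "#", "$", "%", "^", "&", None}


@dataclass(frozen=True)
class WordSense:
    """A word sense c_i: k items, each a (relation symbol, sememe) pair."""
    word: str
    items: Tuple[Tuple[Optional[str], str], ...]

    def sememes(self) -> List[str]:
        """Return the sememes s_i1 .. s_ik in definition order."""
        return [sememe for _, sememe in self.items]


def is_well_formed(sense: WordSense, sememe_set: Set[str]) -> bool:
    """True if every relation symbol is in REL and every sememe is in SS."""
    return all(rel in REL and sememe in sememe_set for rel, sememe in sense.items)


# Hypothetical toy data: a three-element sememe set and one made-up definition.
SS = {"human", "occupation", "act"}
toy_sense = WordSense("toy-word", ((None, "human"), ("#", "occupation"), ("*", "act")))

print(toy_sense.sememes())
print(is_well_formed(toy_sense, SS))
```

Storing the definition as an ordered tuple of pairs keeps the relation symbol attached to its sememe, which is what the later weighting of relations relies on.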



    [Figure 1: block diagram. Recoverable labels: Information Retrieval; NLP Module (Segmenting, Entity Recognition, Semantic Annotation); Semantic Computation; Selection; Retrieval Results]

    Figure 1. System Structure

    3 System Description

    At present, the processing mechanism of most QA systems is based on sentences [6], while also absorbing techniques from information retrieval, information extraction and natural language processing [7]. As shown in Figure 1, given the large quantity of information on the Internet, keywords and mood words extracted from the query are fed to the Information Retrieval process to narrow the search scope; at the same time, sentences whose mode or negative/positive mood is inconsistent with the query sentence are filtered out. The retrieved results and the query are then submitted together to the natural language processing modules: the segmentation module, the entity recognition module, and the semantic annotation module. After processing by these modules, we obtain semantically annotated sentences, which enter the semantic computation (SC) module. The SC module computes the relevancy of each sentence pair, and we then select the sentence pairs with the largest relevancy.

    In Figure 1, the thicker the line, the more information it represents. The language resources include Hownet and Cilin. Owing to the characteristics of the Chinese language, sentences must first be segmented. During or after segmentation, named entities are also picked out, and semantic annotation is performed on the segmented words and named entities. The three natural language modules do not have explicit boundaries. Based on the semantic information collected by the three NLP modules, we conduct semantic computation between the query and the relevant sentences. The main function of the semantic computation module is to compute the relevancy value of each sentence pair and sort the pairs by it. This paper mainly discusses the techniques of semantic computation.

    4 Semantic Computation

    Semantic computation is the kernel of our system and is conducted in three steps. First, the similarity and associativity between sememes are computed. Second, the similarity and relevancy between words are computed. Finally, based on the results of the first two steps, we calculate the relevancy between sentences and obtain the sentence pairs with the maximal relevancy.

    4.1 Similarity and Associativity between Sememes

    In Hownet, the relations among sememes are built through several feature files. The sememes in one feature file form a tree structure. Figure 2 shows a sample of the nodes that belong to the feature files. Relations between sememes can be obtained from these hierarchical trees, and based on these relations we can compute the similarity and associativity between sememes. Every node is called a main sememe. Every main sememe is followed by some sememes in square brackets, which serve as its explanation and are called explanatory sememes. Every explanatory sememe is usually preceded by a symbol that describes its relation to the main sememe. Both main sememes and their explanatory sememes have hyponyms and hypernyms, so we can obtain the associativity between sememes in different feature files. It follows that all the sememes in Hownet form a network structure.

    In Figure 2, the relation between a main sememe and its hypernym or hyponym is called a Vertical Relation, and we measure sememes linked by vertical relations with similarity. Other relations, which span different feature structures, are called Horizontal Relations and are measured by the associativity between sememes.

    - entity|实体
        - thing|万物 [#time|时间, #space|空间]
            - physical|物质 [!appearance|外观]
                - animate|生物 [*alive|活着, !age|年龄, *die|, *metabolize|代谢]
                    - AnimalHuman|动物 [!sex|性别, *AlterLocation|变空间位置, *StateMental|精神状态]
                        - human| [!name|姓名, !wisdom|智慧, !ability|能力, !occupation|职位, *act|行动]
                            - humanized|拟人 [fake|]
                        - animal| [^*GetKnowledge|认知]
                            - beast|走兽 [^*GetKnowledge|认知]

    - event|事件
        - static|静态
            - relation|关系

    Figure 2. A Sample Tree Structure of Feature Sememes
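The node notation of Figure 2 (a main sememe followed by bracketed explanatory sememes, each optionally prefixed by relation symbols) is regular enough to parse mechanically. The sketch below is a minimal illustration under that reading of the figure; the function name `parse_node` and the regular expression are hypothetical, and the real Hownet feature files may use a different layout.

```python
import re
from typing import List, Tuple

# Main sememe, then an optional "[...]" group holding the explanatory sememes.
NODE = re.compile(r"^\s*(?P<main>[^\[\]]+?)\s*(?:\[(?P<expl>.*)\])?\s*$")
RELATION_SYMBOLS = "*@?!~#$%^&"


def parse_node(line: str) -> Tuple[str, List[Tuple[str, str]]]:
    """Split one node line into (main sememe, [(relation symbols, explanatory sememe)])."""
    text = line.strip().lstrip("-").strip()  # drop a leading list bullet, if any
    match = NODE.match(text)
    if match is None:
        raise ValueError(f"unrecognised node line: {line!r}")
    main = match.group("main").strip()
    pairs: List[Tuple[str, str]] = []
    for item in (match.group("expl") or "").split(","):
        item = item.strip()
        if not item:
            continue
        rel = ""
        # Peel off leading relation symbols; some entries carry two (e.g. "^*").
        while item and item[0] in RELATION_SYMBOLS:
            rel += item[0]
            item = item[1:]
        pairs.append((rel, item))
    return main, pairs


main, explanatory = parse_node("- animate|生物 [*alive|活着, !age|年龄]")
print(main)
print(explanatory)
```

An empty relation string corresponds to the NULL relation, as in the entry `fake|` of the figure.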

    For two sememes in the tree structures of Figure 2, there are three possible relations:

    1. The two sememes are in different trees, in which case the similarity is 0;
    2. The two sememes have at least one common ancestral node but lie in different branches of it;
    3. One sememe is an ancestral node of the other.

    We then compute the similarity between sememes according to Equation (1):

    $$ sim(s_1, s_2) = \begin{cases} \dfrac{[\ldots]}{dist(s_1, s_2)}, & t(s_1) = t(s_2) \\ 0, & t(s_1) \neq t(s_2) \end{cases} \qquad s_1, s_2 \in SS \tag{1} $$

    where, for any two sememes $s_1$ and $s_2$ in the sememe set $SS$, $sim(s_1, s_2)$ is the similarity between $s_1$ and $s_2$, and $t(s_1) = t(s_2)$ means that the two sememes are in the same tree structure, in which case their similarity is inversely proportional to their distance (the constant numerator is not legible in this copy and is left as $[\ldots]$).

    As in the structure of Figure 2, the explanatory sememes build a bridge between sememes in different trees. For example, there should be some relation between the sememes 'animate|生物' and 'alive|活着', which have no similarity at all. Here we introduce a new measure, associativity, to represent relations that span different trees. In doing so, the tree structure becomes a net structure. To compute associativities, we expand the current sememe in two directions. One is to expand to the hypernyms of its explanatory sememes, which is called Horizontal Associative Expansion (HAE); the other is to expand to the explanatory sememes of its hypernyms, which is called Vertical Associative Expansion (VAE). We compute associativities according to Equation (2):

    $$ ext(s_i) = \{ s_j \mid REL(s_i, s_j) \} $$
    $$ Asso(s_1, s_2) = \sum_{s_i \in ext(s_1)} w_i \, sim(s_i, s_2) + \sum_{s_j \in ext(s_2)} w_j \, sim(s_1, s_j) \tag{2} $$

    where $ext(s_i)$ is the extension set of the sememe $s_i$, which includes both HAE and VAE. We assign a weight to every relation in $REL$ that describes how strongly this kind of relation influences the associativity. In computing the associativity between $s_1$ and $s_2$, the first term represents the associativity between $s_2$ and the extension set of $s_1$, and the second term that between $s_1$ and the extension set of $s_2$.

    4.2 Similarity and Relevancy between Words

    In Section 2 we introduced two kinds of language resources. With Hownet it is easy to construct a net structure for sememes and then obtain their similarities and associativities. However, because every word sense is composed of sememes, it is difficult to use Hownet to expand a word sense to similar or identical word senses. We therefore use the second language resource, Cilin, to expand concepts. Figure 3 shows a sample structure of concepts in Cilin. Every node is a semantic class. The nearer a node is to the root, the more abstract the concept it represents. Unlike in Hownet, not every node in the structure represents a concrete word sense; only a leaf node is a collection of Chinese words with the same or similar sense. Similarly to the computation for sememes, we have the following equation:

    $$ sim(c_1, c_2) = \begin{cases} \dfrac{[\ldots]}{dist(c_1, c_2)}, & t'(c_1) = t'(c_2) \\ 0, & t'(c_1) \neq t'(c_2) \end{cases} \qquad c_1, c_2 \in WS' \tag{3} $$

    where $c_1$ and $c_2$ are any two word senses in Cilin, and $t'(c_1) = t'(c_2)$ means that the two concepts belong to a common semantic class, in which case their similarity is inversely proportional to their distance (the constant numerator is again not legible in this copy).

    Here we adopt a measure, relevancy, to represent the associative relation between word senses. The goal of computing the similarity and associativity between sememes is to obtain the relevancy of word senses according to the equations in (4):

    $$ Rele(c_1, c_2) = Rele(def(c_1), def(c_2)) $$
    $$ Rele(def(c_1), def(c_2)) = \sum_{s_i \in def(c_1)} \max_{s_j \in def(c_2)} Rele(s_i, s_j) $$
    $$ def(c_i) = \{ s \mid REL(c_i, s) \} $$
    $$ Rele(s_i, s_j) = w_s \, sim(s_i, s_j) + w_a \, Asso(s_i, s_j) \tag{4} $$

    where $Rele(c_1, c_2)$ is the relevancy between two word senses $c_1$ and $c_2$, and $def(c)$ is the set of explanatory sememes for the word sense $c$. $w_s$ and $w_a$ are the weights of the similarity and the associativity between sememes respectively, from which we obtain the relevancy between sememes $Rele(s_i, s_j)$. To get the relevancy of two sets of sememes, we pick out the sememe pairs with maximal value and sum them up.
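The sememe-level and word-sense-level measures can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the authors' implementation: the parameter `scale` stands in for the similarity numerator, which the text only describes as a constant of inverse proportionality; identical sememes are treated as being at distance 1 because the text does not define that case; extension sets (HAE and VAE) are assumed to be precomputed and passed in as `ext` with one weight per entry; and the toy taxonomy is hypothetical.

```python
from typing import Dict, List, Optional, Sequence, Tuple

Parent = Dict[str, Optional[str]]         # sememe -> hypernym (None at a root)
Ext = Dict[str, List[Tuple[str, float]]]  # sememe -> [(extension sememe, relation weight)]


def path_to_root(parent: Parent, s: str) -> List[str]:
    """Return s followed by its hypernyms up to the root of its tree."""
    path: List[str] = []
    node: Optional[str] = s
    while node is not None:
        path.append(node)
        node = parent[node]
    return path


def dist(parent: Parent, a: str, b: str) -> Optional[int]:
    """Number of edges between a and b, or None if they are in different trees."""
    steps_from_a = {node: i for i, node in enumerate(path_to_root(parent, a))}
    for j, node in enumerate(path_to_root(parent, b)):
        if node in steps_from_a:
            return steps_from_a[node] + j
    return None


def sim(parent: Parent, a: str, b: str, scale: float = 1.0) -> float:
    """Similarity: inversely proportional to distance, 0 across trees."""
    d = dist(parent, a, b)
    if d is None:
        return 0.0
    return scale / max(d, 1)  # assumption: identical sememes count as distance 1


def asso(parent: Parent, ext: Ext, a: str, b: str, scale: float = 1.0) -> float:
    """Associativity: weighted similarity through each sememe's extension set."""
    first = sum(w * sim(parent, s, b, scale) for s, w in ext.get(a, []))
    second = sum(w * sim(parent, a, s, scale) for s, w in ext.get(b, []))
    return first + second


def rele_sememes(parent: Parent, ext: Ext, a: str, b: str,
                 w_s: float = 1.0, w_a: float = 1.0) -> float:
    """Relevancy of two sememes: weighted sum of similarity and associativity."""
    return w_s * sim(parent, a, b) + w_a * asso(parent, ext, a, b)


def rele_senses(parent: Parent, ext: Ext, def1: Sequence[str], def2: Sequence[str],
                w_s: float = 1.0, w_a: float = 1.0) -> float:
    """Relevancy of two word senses: sum over def1 of the best match in def2."""
    return sum(
        max((rele_sememes(parent, ext, si, sj, w_s, w_a) for sj in def2), default=0.0)
        for si in def1
    )


# Hypothetical toy taxonomy: two trees, bridged by one extension entry.
parent = {"entity": None, "animate": "entity", "human": "animate",
          "event": None, "alive": "event"}
ext = {"animate": [("alive", 0.5)]}

print(sim(parent, "animate", "alive"))        # different trees
print(asso(parent, ext, "animate", "alive"))  # bridged through ext("animate")
```

In the toy data the two sememes sit in different trees, so their similarity is zero while their associativity is not, which is the situation the 'animate' and 'alive' example in the text describes.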

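    - A|human
        - Aa|General
            - Aa01|mass
            - Aa02|first person
            - Aa03|second person
            - Aa04|third person
            - ...
        - Ab|sex, age
        - Ac|build
        - Ad|country
        - Ae|profession
        - ...
    - B|thing
    - C|time&space
    - ...

    Figure 3. A Sample Structure in Cilin

    4.3 Relevancy between Sentences

    We assume that the filtered sentences $S_1$ and $S_2$ have been segmented, resolved anaphorically and annotated semantically. Then $S_1$ and $S_2$ can be regarded as two sequences of $m$ and $n$ keywords: $w_{11} w_{12} \ldots w_{1m}$ and $w_{21} w_{22} \ldots w_{2n}$.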



    Figure 4. Word Pairs in Two Sentences

    To compute the relevancy of a sentence pair, we use the similarity and relevancy of word pairs. We select the word pairs that contribute most to the relevancy of the sentence pair. These word pairs are connected with lines in Figure 4. We use a dynamic programming algorithm to compute the relevancy of a sentence pair according to the following equations:

    $$ \begin{aligned} Rele(S_1, S_2) &= M_{m,n} \\ 1/d_{ij} &= \alpha \, Sim(w_{1i}, w_{2j}) + \beta \, Rele(w_{1i}, w_{2j}) \\ M_{0,j} &= M_{i,0} = 0 \\ M_{1,1} &= 1/d_{1,1} \\ M_{i,j} &= \max_{1 \le k \le n} \{ 1/d_{ik} + M_{i-1,j} \} \end{aligned} \tag{5} $$

    where $\alpha$ and $\beta$ are weights that represent the degree to which the similarity and relevancy of words contribute to the relevancy of the sentence pair, and $d_{ij}$ is the semantic distance between the $i$-th word in the first sentence and the $j$-th word in the second sentence. Following the recursive equation, we finally obtain the value of $M_{m,n}$, which represents the relevancy between the two sentences $S_1$ and $S_2$.

    After we obtain the relevancy of all sentence pairs, we compare their values. The larger the relevancy, the more relevant the two sentences. The sentence with the largest relevancy is returned as the answer to the query.

    5 Experiments and Discussion

    The semantic computation contains three steps, and every step makes use of the results of the previous one. The three steps conform to the characteristics of the Chinese language: from morphemes to words to phrases.

    We ran experiments on every step above, and the results are satisfactory, reflecting the correlation between elements at every step. Here are some examples: Table 1 illustrates the similarity and associativity of some example sememe pairs, and the examples in Table 2 demonstrate the similarity and relevancy of some word pairs.
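The last of the three steps, scoring a sentence pair from word-pair scores and picking the best candidate, can be sketched as follows. This is a hedged illustration rather than the system's algorithm: the recurrence used here is a standard non-crossing alignment recurrence (as in longest-common-subsequence scoring) chosen as a stand-in, and it may differ from the paper's own recurrence. The names `make_score`, `sentence_relevancy` and `best_answer`, and the toy word-level functions, are hypothetical.

```python
from typing import Callable, List, Sequence

WordScore = Callable[[str, str], float]


def make_score(sim: WordScore, rele: WordScore,
               alpha: float = 1.0, beta: float = 1.0) -> WordScore:
    """Combine word similarity and relevancy into one pair score (1/d_ij)."""
    def score(a: str, b: str) -> float:
        return alpha * sim(a, b) + beta * rele(a, b)
    return score


def sentence_relevancy(s1: Sequence[str], s2: Sequence[str], score: WordScore) -> float:
    """Best total score over non-crossing word pairings (assumes scores >= 0)."""
    m, n = len(s1), len(s2)
    M = [[0.0] * (n + 1) for _ in range(m + 1)]  # M[0][j] = M[i][0] = 0
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            paired = M[i - 1][j - 1] + score(s1[i - 1], s2[j - 1])
            M[i][j] = max(M[i - 1][j], M[i][j - 1], paired)
    return M[m][n]


def best_answer(query: Sequence[str], candidates: List[Sequence[str]],
                score: WordScore) -> Sequence[str]:
    """Return the candidate sentence with the largest relevancy to the query."""
    return max(candidates, key=lambda c: sentence_relevancy(query, c, score))


# Hypothetical word-level measures: exact match counts as similar,
# and one hand-picked pair counts as relevant.
def toy_sim(a: str, b: str) -> float:
    return 1.0 if a == b else 0.0


RELATED = {frozenset(("doctor", "patient"))}


def toy_rele(a: str, b: str) -> float:
    return 0.5 if frozenset((a, b)) in RELATED else 0.0


score = make_score(toy_sim, toy_rele)
query = ["doctor", "hospital"]
candidates = [["nurse", "school"], ["patient", "hospital"]]
print(best_answer(query, candidates, score))
```

Because the total is a sum over paired words, longer sentences can accumulate larger values, so relevancy values are only comparable among candidates for the same query.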

    Table 1: Examples of sememe pairs with their similarity and associativity

    Sememe1             Sememe2          Sim      |  Sememe1             Sememe2             Asso
    Discuss|            Debate|辩论      0.80     |  material|材料       Consume|摄取        0.35
    TalkNonsense|瞎说   Debate|辩论      0.32     |  Human|              Act|行动            0.80
    Throw|              Spread|          0.40     |  Produce|制造        Software|软件       0.40
    Throw|              cook|吐出        0.533    |  Compile|编辑        Software|软件       0.80
    Dream|做梦          Cool|制冷        0.114    |  Planting|栽植       FlowerGrass|花草    0.80
    Mental|精神         Machine|机器     0.267    |  CauseToLive|使活    FlowerGrass|花草    0.267

    Table 2: Examples of word pairs with their similarity and relevancy

    Word1              Word2             Sim      |  Word1                       Word2              Rele
    摇动 (shake)       晃动 (rock)       0.90     |  致意 (give one's regards)   恰巧 (by chance)   0.0
    摇动 (shake)       移动 (move)       0.64     |  (smile)                     实行 (implement)   0.267
    病人 (patient)     医院 (hospital)   0.00     |  病人 (patient)              医院 (hospital)    51.995
    医生 (doctor)      病人 (patient)    0.410    |  医生 (physician)            生病 (be ill)      50.107
    医生 (doctor)      护士 (nurse)      0.64     |  勤劳 (diligent)             富裕 (wealthy)     51.307
    揣测 (guess)       了解 (know)       0.512    |  贫穷 (poor)                 懒惰 (lazy)        51.657
    揣测 (guess)       推想 (suppose)    0.90     |  勤劳 (diligence)            贫穷 (poor)        27.457
    反常 (abnormal)    奇怪 (strange)    0.64     |  (write)                     作者 (author)      33.000

    In Table 1 and Table 2, because the weights differ, the different measures are on different quantitative scales, and values should be compared vertically, within one column.

    We use the IR module to retrieve 20 relevant documents and extract 50 sentences from each on average, so for every query sentence there are about 1,000 candidate sentences. We calculate and sort these 1,000 relevancy values between the retrieved sentences and the query sentence, and finally take the one or more sentences with the largest value as answers. We illustrate the effect of our Q-A system with 5 queries. 93 people were selected to evaluate whether the answers are reasonable. The evaluation uses the following simplified standard: if a person thinks the answer is reasonable, the score is incremented by 1; otherwise, the score remains unchanged. The maximal score that one answer can get is therefore 93. In Table 3, the first column gives the number of the query sentence; the second is the number of retrieved sentences; the third is the largest relevancy obtained by semantic computation; and the last column records the score of the answer.

    Table 3. Results of several queries in the Q-A system

    Query No.    Relevant sentences    Largest relevancy    Score
    1st          1,029                 205.127              89
    2nd          986                   232.411              93
    3rd          997                   334.826              92
    4th          1,003                 602.133              93
    5th          1,002                 603.329              91

    From Table 3, we can see that most people find the answers reasonable. The largest relevancy values differ considerably from query to query because our computation depends on the length and the words of a sentence.

    6 Conclusions

    This paper mainly introduces the application of semantic computation in our Question-Answering system. We compute the similarity and relevancy between words, and obtain the optimal result by calculating the relevancy between sentences. Our method conforms to the characteristics of the Chinese language, combining semantic information with computation at three levels and avoiding many complexities of language processing. At the same time, the intermediate results, such as the similarity and associativity between sememes and the similarity and relevancy between word senses, are also very helpful in other research fields, e.g. polysemous disambiguation, clustering, and bilingual alignment, to name a few.


    Acknowledgements

    The authors would like to thank Dr. Qun Liu, Dr. Song Lu, and Ms. Yan Liang for their help with this work, and the anonymous reviewers for their valuable comments on this paper.

    References

    [1] E. Voorhees, 1999, The TREC-8 Question Answering Track Report, National Institute of Standards and Technology, page 77.

    [2] Dong Zhendong, 1999, Hownet.

    [3] Zhou Qiang, Feng Songyan, 2000, Building a relation network representation for how-net, Proceedings of 2000 International Conference on Multilingual Information Processing, Urumqi, China, pp. 139-145.

    [4] Gan K. W., Wong P. W., 2000, Annotating information structures in Chinese texts using HowNet, Second Chinese Language Processing Workshop, Hong Kong, China, pp. 85-92.

    [5] Mei Jiaju, 1983, Chinese thesaurus "Tongyici Cilin", Shanghai thesaurus Press.

    [6] B. Katz, 1997, From Sentence Processing to Information Access on the World Wide Web, AAAI Spring Symposium on Natural Language Processing for the World Wide Web, Stanford University, Stanford CA.

    [7] Rohini Srihari, Wei Li, 1999, Information Extraction Supported Question Answering (Cymfony Inc.), Proceedings of the 8th Text Retrieval Conference (TREC-8), National Institute of Standards and Technology, Gaithersburg MD.

    Li Sujian received her B.S. and M.S. degrees in computer science from Shandong University of Technology in 1996 and 1999, respectively. She is now a doctoral candidate pursuing her Ph.D. degree in computer science at the Institute of Computing Technology, Chinese Academy of Sciences. Her current research interests include machine translation, natural language processing, knowledge discovery, and machine learning.

    Zhang Jian received his B.S. degree in physical oceanography from Ocean University of Qingdao, China, in 1998, and his M.S. degree in computer science from the Institute of Computing Technology, Chinese Academy of Sciences, in 2001. He is now a Ph.D. student at the School of Computer Science, Carnegie Mellon University. His research interests include machine learning, information retrieval and data mining.

    Huang Xiong received his B.S. and M.S. degrees from Peking University in 1992 and 1995, respectively, and his Ph.D. degree from Beijing University of Aeronautics and Astronautics in 1999. From May 1999 to May 2001 he conducted postdoctoral research at the Institute of Computing Technology. His major interests lie in the analysis and design of combinatorial algorithms, computational complexity, Web information retrieval and Web application development.

    Bai Shuo received his M.S. and Ph.D. degrees in computer science from Peking University in 1987 and 1990, respectively. He then conducted postdoctoral research in the Mathematics Department of Peking University. He has published more than 60 papers in refereed journals and conferences. His research interests are computational linguistics, natural language processing and network security.
