An Efficient Centroid Based Chinese Web Page
Liu Hui Peng RanYe Shaozhi Li Xing
ABSTRACT technologies coming out every year. The reasons may be, In this paper, we present an efficient centroid based Chinese firstly there still exist many problems to be addressed to apply web page classifier that has achieved satisfactory performance an algorithm to a system design and implementation, and on real data and runs very fast in practical use. Except for its secondly, many algorithms attained perfect results on public clear design, this classifier has some creative features: Chinese corpora, but failed to achieve satisfactory results on real data. word segmentation and noise filtering technology in Considering web page categorization, a subfield of text 2preprocessing module; combined Statistics feature selection ?categorization, the task is more difficult, since there is a great method; adaptive factors to improve categorization diversity among the web pages in terms of document length, performance. Another advantage of this system is its optimized style, and content. Another aspect of a web page is its fast implementation. Finally we show performance results of update status, with the topic, trend and sometimes even experiments on a corpus from Peking University of China, and writing style changing quickly with time. At the same time, some discussions. corpora of web page are less than professional document or
high-quality content corpora that are used mainly for Digital
Library or algorithm testing. But there are also some Categories and Subject Descriptors advantages of web pages. For example, web pages have special G.3 [Probability and Statistics]: Probabilistic algorithms –structures and links, which provide useful information of their 2Statistics, Mutual Information; I.5.1 [Pattern Recognition]: ?classes, and several systems have exploited this feature to Models – Statistical; I.5.2 [Pattern Recognition] Design classify web pages.  Methodology – Classifier design and evaluation, Feature
Turning our attention further to the Chinese Web Page evaluation and selection; I.5.4 [Pattern Recognition]
Categorization, we could easily find another two obstacles: the Applications –Text processing. need of segmentation and the lack of Chinese corpora. Chinese
web page classification mainly adopts algorithms like k-General Terms Nearest Neighbor, Naïve Beyes, Centroid based method and Algorithms, Design, Experimentation, Performance etc, with some exploitations of hyperlinks and structures
information to improve classification accuracy.  Keywords In this paper we present the detailed introduction of a Chinese 2Text categorization, Centroid based, Feature Selection, ?web page categorization system that has excellent statistics, Chinese word segmentation performances on real data at a very fast speed. Our approach
has many advantages such as a clear system design, Combined 2Statistics Feature Selection, optimized implementation and ?1. INTRODUCTION
some other new features. We live in a world of information explosion. The phenomenal
growth of the Internet has resulted in the availability of huge We carried out our experiments on a corpus provided by amount of online information. Therefore the ability to catalog Peking University of China, and we attended the Chinese web and organize online information automatically by computers is thpage categorization competition held by them on March 15, highly desirable. Automatic text categorization is such a 2003. This corpus is available to public, and many Chinese research field that begins with the emergence of Digital classification systems have done experiments on it, hence the Library, flourished in Web environment, and increasingly result could be compared and retested. became a common network application.
We have laid out the rest of the paper as follows. In Section 2, Numerous text categorization algorithm and classifier systems we outline the system architecture, and briefly introduce the have emerged during the past twenty years. Nowadays the function of each module, which helps to understand the whole most popular algorithms are Rocchio algorithm , Naïve process. In Section 3, we give detailed information on the new Bayes method , Support Vector Machine , Boosting 2features of our system, among which combined Statistics ?method , k-Nearest Neighbor  and so on. Comparatively,
algorithm is introduced and analyzed. Sections 4 show some practical classification systems, especially those can provide
tricks of our system‟s implementation. In Section 5 we present stable services are much less than the new classification
experimental results and some analysis. At last Section 6 Copyright is held by the author/owner(s) provides conclusions and future plan. Asia Pacific Advanced Network 2003, 25-29 August 2003, Busan, Republic of Korea. Network Research Workshop 2003, 27 August 2003, Busan, Republic of Korea.
1 This classifier ranked first at Chinese web page categorization competition during the meeting of National Symposium on Search ththEngine and Web Mining, which was hosted by Peking University of China on March 14 –15, 2003. Further information about the
meeting please refer to http://net.cs.pku.edu.cn/~sedb2002/.
2. ARCHITECTURE the test vector and the centroid vector of each class, and then The system is divided into two parts: the training part and the choosing one or two classes as the decision. We just use vector testing part. Training part functions as reading the prepared dot product as the similarity measure. training samples, implementing a series preprocessing, and Feedback This module aims to make use of testing statistics to then extracting characteristics of predefined classes and model improve classifier applicability for practical data. Adaptive for class decision. Testing part is used to label input testing factors will be optimized through user feedback. Correctly samples by the model generated in training part, and after this decided document could be added as new training sample, and process, some statistics are generated to evaluate the thus may revise feature set. performance of the classifier. The testing result affect training
part by the feedback module, which could utilize testing 3. ADVANCED FEATURES statistics to adjust some empirical parameters and the weight This classifier has some new features compared with other of selected features. This makes the classifier more adaptable systems that we have studied. Especially for Chinese web page to the changing data source, especially web. classification, this classifier shows good performance as high
precision and very fast speed. For each module of the system,
whose main function has been introduced scarcely in, we have
done many-detailed work to test its intermediate result and
tried our best to improve its performance.
3.1 Preprocessing Techniques
As we have discussed above, word stems is regarded as
features, the representation units of text. However, unlike
English and other Indo-European languages, Chinese texts do
not have a natural delimiter between words. As a consequence, word segmentation becomes one of the major issues of
preprocessing. Another problem is caused by the particular Figure 1. Architecture of classifier
characteristics of Web pages, the large diversification of their Preprocessing To extract useful information from web page format, length and content. We also see a lot of hyperlinks in structure, generate attribute-value form texts, and fulfill the web pages. Some are related to the content of web pages, and Chinese word segmentation. During this process, the statistical others, such as advertisement, are not. We call the irrelevant data about the training set is recorded into a database, information “noise”, and it is certain that in web pages there including keyword and its Term Frequency and Inverse exists an amount of noise that must be filtered before Document Frequency (TF-IDF) , and each text is saved in a document representation. new keyword-frequency form. Here frequency refers to times
of this keyword appearing in this single text. 3.1.1 Chinese word segmentation Feature Selection Originally each keyword is looked on as a Not using the traditional dictionary that set up by professionals, feature in the text space, and thus each text can be represented we establish our own dictionary based on log data from search as a point in such space. Due to the computation difficulty that engine. We extract words and phrases from log data and huge dimension causes and considerable amount of redundant exploit the query times as their frequency. This dictionary information that bad features bring with, feature selection is contains 17000 word or phrases, and  said that considering necessary. In this system, a novel combined 2both word and phrase together will effectively enhance statistics ?categorization precision, and also guarantee independency method is adopted, and we choose features in subspace of each between different features. As an illustration, “搜索引擎 class. (search engine)”will not be segmented as “搜索 (search) ”Vector Generation We represent a text using Vector Space and“引擎 (engine) ”, thus “搜索引擎”will be treated as a Model (VSM), so that training samples will be single feature in text space. It is a very interesting way that we transformed as a vector in subspace of their own class, and use dictionary obtained from web service to analyze web pages, testing samples in subspace of every class. Each dimension is a so that it has better quality than others in adaptability to the keyword chosen as a feature. The weight of each feature is changing web, simplicity to manage and update, and computed by the popular TF-IDF equation . categorization accuracy. For segmenting word, we use
Maximum Matching Method , which is to scan a sequence Model Generation To calculate the centroid of each class. of Chinese letters and returns the longest matched word. Centroid means the vector that has the same dimension with Although it is simple and not as accurate as more advanced sample texts and can represent a special class. The Model file method such as Association-Backtracking Method, its fast contains parameters as class number, feature dimension, speed attracts us and we find its result satisfactory on our training set size, and the centroids. dictionary. Formatting & Segmentation To extract information from
HTML structured file and segment the Chinese paragraph into 3.1.2 Noise filtering word sequence. This work is done using our own Chinese Firstly, stop word, those common words that have not tendency Segmentation algorithm and dictionary, which will be to any class, such as empty word and pronoun, will be removed. introduced in the later part. Our Chinese stop word list is mainly from Chinese statistical Classification To decide which class a new vector should dictionary, combined with high-frequency words in web pages belong to. This is done by computing the similarities between as „copyright‟, „homepage‟ and etc; secondly advertising links
3.2.2 Mutual Information (MI) should be deleted, which is usually appeared in commercial The idea of this method is to measure how dependent a word sites and homepages. After studying main commercial sites in and a class on each other by Mutual Information. According to China, such as Sina (http://www.sina.com.cn), 263 the definition of Mutual Information, this method can be (http://www.263.net) and Sohu (http://www.sohu.com.cn), we expressed as equation (2). find a rule that the length of most advertising or unrelated
P(t|c)P(t,c)links are relatively shorter than related links (see Figure 2). rr (2) I(t,c)?log?logAnd the keywords of advertising links are usually in a limited P(t)P(t)?P(c)rrrset. Based on above research, we set 10 as the threshold length
of link. If a link is shorter than 10 letters or it contains It is interesting that MI has two properties, which happen to 2keywords listed in the above limited set, it will be considered compensate for limitations of statistics method. One is for ?as noising link and discarded. high frequency word in other class but not the studied class,
is low and is high, so that is I(t,c)P(t|c)P(t)rr
comparatively small; the other is for words with the same
, those with higher total frequency will be given lower P(t|c)r
23.2.3 Combined statistics ?
2The above analysis shows that statistics and Mutual ?
Information algorithm have complementary properties, so we
put forward a combined feature selection method that has Figure 2. Comparison of related and unrelated link demonstrated its advantages in our system. 2?3.2 Combined Statistics Feature 2The combined statistics can be formulated as equation (3). ?Selection 2 (3) W(t,c)????(t,c)?(1??)?I(t,c)0???1We adopt a new feature selection algorithm that combines 2Statistics method with Mutual Information (MI) method, ? is the combined weight of word t to class c, and we W(t,c)and this combination successfully retained the merit of each use this weight to select features for each class. In our system algorithm and compensate for their limitations. the best result is achieved when happen to be 0.93. ?
2?3.2.1 Statistics 3.3 Subspace A table of events occurrence helps to explain the idea behind Traditional VSM generates a space that all training samples this method. And these two events refer to a word t and a class could be represented in one single space, and this space is c. A is the number of documents with t and c co-occur, B is the called total space. However in Chinese web page number of documents with t occurs but not c, C is the reverse categorization, we find that there are obstacles in using total of B, and D is the number of documents with neither t nor c space. An obvious problem emerged when the number of occurs. Here the number of documents can also be represented predefined classes or total training samples is large. To include as probability, and the corresponding values are proportional. adequate represented features for each class, total feature N refers to the total number of files in the dataset you are dimension will be extremely high, and such high dimension choosing features. This algorithm could be formulated by bring about much difficulty in computation. If we reduce the equation (1). dimension to a practical level, precision will decline due to the
sparse document vector and inadequate knowledge to tell apart 2N?AD?CB?? (1) 2tc????,different classes. ????????A?C?B?D?A?B?C?DTo address this problem, we adopt subspace method. Feature 2Statistics is based on the hypothesis that high frequency ?subset is chosen for each class, and training samples in this
class are represented as points in a subspace generated by this words, whatever class they are in, are more helpful to
feature subset. As illustrated by Figure 3, each subspace has distinguish different classes. Yang compared main feature 2much fewer features than total space, and these features could selection technologies and demonstrated that Statistics ?reveal the character of the special class more comprehensively method outperformed other methods on public corpora . and accurately. On the other hand, training samples could be And a number of Chinese text categorization systems have expressed as vectors that have a predestined direction, and this adopted this method . representation tends to preserve class-related information. However this method has its limitation. Firstly when A->0 and Therefore, it is not only easier to accomplish the computation, B->N, which means a word common in other class, but not but also to discriminate different classes. 2appearing in the studied class, will be relatively large, ???t,c
so the weight of common words of other classes often precede
many important words in this class; secondly when A->0 and 2B->0, ->0, showing that low frequency words in this ???t,c
class are tend to be removed, which will cause much
To explain how these factors affect the final decision of class
label, we first present equation (4) expressing how to compute
the weight of a feature.
N (4) (,)?(,)?log(?0.01)Wtdfreqtdnt
It is the TF-IDF method, refers to times of word t freq(t,d)
appearing in document d, N and here are confined within nt
one class, respectively meaning total number of files and
number of files with word t appearing in this class. If word t is a VIP word, then equation (4) is changed to equation (5).
Figure 3. Sketch Map of Totals pace and Subspace '(,)?_[_]?Wtdclassweightclassid (5) N3.4 Adaptive Factors _[_]?(,)?log(?0.01)VIPfactorclassidfreqtdnThe task we are dealing with is to automatic classify huge tamount and highly dynamic web pages, hence a classifier
trained on public corpus is often not compatible with real data, 4. IMPLEMENTATION and to update the training corpus frequently so as to catch the The system is written in ANSI C. The required libraries are all changes of web data is also not practical. Although the corpus open-source and free. It is tested under Linux, with 256M we use is totally composed of Chinese web pages, it is still not Memory. It compiles successfully under Solaris and MS satisfactory because the samples distribute greatly unbalanced Windows with a few changes. And it is also easily portable among different classes, and the content of these samples and customizable. cannot cover the majority subfield of these classes. Therefore
we adopt adaptive factors to adjust the classifier model and To set up an efficient database is significant in training process, make the classifier more adaptable to the web data. especially when we are facing with massive collections. We
use Berkeley DB , the most widely used embedded data We incorporate two kinds of adaptive factors in our system. management software in the world. Berkeley DB provides a
programming library for managing (key, value) pairs, both of 3.4.1 Class Weight which can be arbitrary binary data of any length. It offers four A default hypothesis in text categorization is that the access methods, including B-trees and linear hashing, and probability of different classes is equal. However, we observed supports transactions, locking, and recovery.  Another merit that not only in real world there exists imparity between of this embedded database is that it is linked (at compile-time different classes but also our classifier is discriminatory to or run-time) into an application and act as its persistent storage some classes. For example, on one hand, web pages belong to manager.  “computer” class are more than those of “social science” class, In our system, a total training statistic DB is established, and on the other hand, our classifier tends to recognize a which contains the frequency and file numbers that a word “computer” related document as belongs to “social science” appeared in each class and all the training set. During the same class because the content of this file contains many not-explicit process, a small file DB is generated for each file recording the features and the later class covers a much wider range of word and its frequency in this file. features than computer class.
In testing process, we do not use any custom implementation in Class weight is a vector whose dimension equals with the order to avoid extra disk I/O, with the needed statistics or word number of classes. At the beginning, we set Class weight list is loaded into memory initially. We optimized each step of according to the training samples contained in each class, and testing process to improve system speed: simplifying Chinese normalized them to a real number between 0 and 1. Then in word segmentation algorithm; adopting two-level-structured open tests, we find many users to test our system with random dictionaries, making full use of Chinese letter coding rule, and selected web pages, and bring feedback results to us. We loading dictionary and other word lists into B-trees. Therefore optimized the class weight factor and achieved much higher we achieved fast speed, and it could test 3000 medium-sized precision in later open testing process. Chinese web page within 50 seconds. 3.4.2 VIP factor
5. EXPERIMENT We notice that there are some critical words that are very
important for web page categorization, such as „movie‟ for 5.1 Corpus entertainment class and „stock‟ for financial class, so we set up Currently, there is a lack of publicly available Chinese corpus a very important feature list, and accordingly an array of VIP for evaluating various Chinese text categorization systems . factors for each class. VIP factors are different among classes Although Chinese corpus is available from some famous because VIP words‟ effects on different class are not the same. English corpus resources such as TREC, whose corpus is Our definition of VIP factor is as simple as Class Weight, and mainly attained from Chinese news sites, we studied those if a word is in VIP word list, a VIP factor will be considered. corpora and found their contents to some extent outdated and Initially the factors were all the same, and were adjusted by topic-limited, so they are not suitable for building up a user feedback later. practical Chinese web page system.
2 0.259259 0.583333 0.358974 Fortunately, Peking University held a Chinese web page
categorization competition and provided a public available 3 0.812183 0.884146 0.784314 corpus as the standard (called PKU corpus for short), which
became our testbed. This corpus is created by downloading 4 0.961661 0.815718 0.882698 various Chinese web pages that cover the predefined topics. 5 0.859873 0.823171 0.841121 And there is a great diversity among the web pages in terms of
document length, style, and content. 6 0.802768 0.800000 0.801382
Our experiment is based on the corpus consisting of 11 top-7 0.658768 0.847561 0.741333 level categories and around 14000 documents. The corpus is
further partitioned into training and testing data by one 8 0.903448 0.836170 0.868508
attribute of each document, which is set by the provider. The 9 0.883978 0.695652 0.778589 training sample distribution is far from balanced, and the
documents in Class 2 cover only a small area of topics in this 10 0.735450 0.960829 0.833167 category, so we enlarged this corpus by adding several hundred 11 0.955932 0.938436 0.947103 web pages to strengthen such too weak classes a little.
Detailed information of this corpus is shown in Table 1. Micro-ave 0.862267 0.828680 0.845140
Table 1. PKU corpus statistics (revised version) From Table 2, we could find that all of the precision, recall or
F-measure are distributed much unbalanced among these 11 # Category Train Test Total classes, but the value of these three measures is comparable
for the same class. 1 Literature and Art 396 101 497
It is our first observation that the classifier‟s precision for a 2 News and Media 284 18 302
special class has a close relation with the number of training 3 Business and Economy 852 211 1063 samples in this class. Figure 4 demonstrated that for
unbalanced distributed corpus, the classes which own more 4 Entertainment 1479 369 1848 training samples are tend to achieve better result in its scale. 5 Politics and Government 341 82 423 And this phenomenon can be explained by machine learning
principle that only when the machine learn enough knowledge 6 Society and Culture 1063 290 1353 in a field, could it recognize new object of it. 7 Education 401 82 483
8 Natural Science 1788 470 2258
9 Social Science 1700 460 2160
Computer Science and 10 829 217 1046 Network
11 Medicine and Health 2240 601 2841
Total 11373 2901 14274
Common performance measures for system evaluation are: Precision (P): The proportion of the predicted documents for a Figure 4. Relationship between Classifier Performance and given category that are classified correctly. Number of Training Samples in Each Class Recall (R): The proportion of documents for a given category
that are classified correctly. Another observation is through checking the error classified F-measure: The harmonic mean of precision and recall. samples and low precision classes. We find that class 2 is 2?R?P (6) obviously difficult for the classifier, because of its lack of F?(R?P)training samples and content inconsistency in training and
testing part. Although the result seems not very attracting, we 5.3 Results and Discussion find the performance in practical use outperform the We show in Table 2 that the result of our system on previous experiment, with open testing result above 85% stably. described corpus. Micro-averaged scores are produced across
the experiments, which means that the performance measures 6. CONCLUSIONS AND FUTURE WORK are produced across the documents by adding up all the Employing classification algorithm effectively into practical documents counts across the different tests and calculated system is one of the main tasks of text categorization today. In using there summed values . this paper, we present an efficient Chinese web page Table 2. Experimental Results categorization classifier and its advantages could be concluded
as: # Precision Recall F-measure
Clear Design We have not included many extra modules in 1 0.829787 0.772277 0.800000 the system, and just follow the traditional structure of text
 Jie Chunyu, Liu Yuan, Liang Nanyuan, Analysis of categorization system. This helps to clearly define function of Chinese Automatic Segmentation Methods, Journal of each step and check the performance of each module. It is also Chinese Information Processing, 3(1):1-9, 1989. very easy to employ other methods into this system, which  Joachims, T. A Probabilistic Analysis of the Rocchio means just take place the corresponding module. Algorithm with TFIDF for Text Categorization. In Novel Technologies Involvement We believe that a perfect Proceedings of International Conference on Machine algorithm could not achieve good result if the prepared work is Learning (ICML’97), 1997. not done well. Each step of the system is significant, and  Ken Lang. NewsWeeder: Learning to filter netnews. In should provide the best service for next step. The previous Machine Learning: Proceedings of the Twelfth chapters have shown that this system has some tricks and new International Conference, Lake Taho, California, 1995. features in each module that contributes greatly to the final
performance.  Lewis, D.D. and Ringuette, M. A Comparison of Two
Learning algorithms for Text Categorization. In Third Optimized implementation Another important factor of a Annual Symposium on Document Analysis and system is its implementation. We adopt high-efficiency Information Retrieval, 81-93, 1994. database and optimized data structure and coding style, thus
the speed of this system is very fast.  Melnik S. et al. Building a Distributed Full-Text Index
for the Web. Technical report, Stanford Digital Library Above all, this is a classifier with good performance and fast
Project, July 2000. speed. It is of great practical value, and has provided services
for some search engines in China.  Pang Jianfeng, Bu Dongbo and Bai Shuo. Research and
Implementation of Text Categorization System Based on In the near future, we need to make it more customizable, VSM. Application Research of Computers, 2001. including the class structure definition and class hierarchy
scalability. Another work to do is to further strengthen the  Peking University Working Report on Information feedback effect of training process, and the first step is Retrieval. http://net.cs.pku.edu.cn/opr/fsc_0628.ppt establish user feedback interface at search engines and set up a
 Resource of Berkeley-DB. http://www.sleepycat.com/ mechanism to better utilize the information provided by users.
In this way, we could more easily update training set and  Spitters, M. Comparing feature sets for learning text adjust the distribution and content of each class. We also categorization. Proceedings of RIAO 2000, April 2000. envision being able to use unlabeled data to counter the
 Yang, Y. An Evaluation of Statistical Approaches to Text limiting effect of classes with not enough examples.
Categorization. Information Retrieval Journal, 1999,
v1(1/2): 42-49. 7. ACKNOWLEDGMENTS
We thank Dr. Li Yue for her useful suggestions. This material  Yang, Y., Jan O.Pedersen, A comparative Study on is based on hard work in part by Xu Jingfang and Wu Juan of Feature Selection in Text Categorization. Proc. of, the Tsinghua University. th14 International Conference on Machine Learning,
ICML-97, pp.412-420, 1997. 8. REFERENCES
 Cortes, C. and Vapnik, V.N. Support Vector Networks. 9. About Authors Machine Learning, 20:273-297, 1995. Liu Hui
She is pursuing master degree at DEE of Tsinghua University  Faloutsos C. and Christodoulakis S. Signature files: An and is directed by Prof. Li Xing. She is interested in access method for documents and its analytical Information Retrieval, Machine Learning and Pattern performance evaluation. ACM Transactions on Office Recognition. Information Systems, 2(4): 267-288, October 1984. Address: Room 304, Main Building, Tsinghua Univ. Beijing  Freund, Y. and Schapire, R.E. A decision-theoretic 100084, P.R.China generalization of on-line learning and an application to Telephone: 8610-62785005-525 boosting. Journal of Computer and System Sciences, Email: firstname.lastname@example.org 55(1): 119-139, 1997. Peng Ran  Giuseppe Attardi, Antonio Gull, and Fabrizio Sebastiani. She is an undergraduate student of Beihang University, and is Automatic Web Page Categorization by Link and Context doing her graduation project at Tsinghua University. Her Analysis. In Chris Hutchison and Gaetano Lanzarone, research field is mainly text categorization and machine editors, Proceedings of THAI'99, European Symposium learning. on Telematics, Hypermedia and Artificial Intelligence, Address: Room 304, Main Building, Tsinghua Univ. Beijing 105--119, Varese, IT, 1999.1. 100084, P.R.China
Telephone: 8610-62785005-525  Ji He, Ah-Hwee Tan and Chew-Lim Tan. Machine Email: email@example.com Learning Methods for Chinese Web Page Categorization.
Ye Shaozhi ACL'2000 2nd Workshop on Chinese Language
He is pursing master degree at DEE of Tsinghua University. Processing, 93-100, October 2000. Directed by Prof. Li Xing, his research area is web crawler,  Ji He, Ah-hwee Tan, Chew-lim Tan. On Machine IPv6 web development, Ftp search engine, Pattern Recognition Learning Method for Chinese Text Categorization. and distributed system. Applied Science, 18, 311-322, 2003.
(CERNET) Center. Being one of the major architects of Address: Room 321, Eastern Main Building, Tsinghua Univ. CERNET, his research interests include statistical signal Beijing 100084, P.R.China processing, multimedia communication and computer networks. Telephone: 8610-62792161 Address: Room 225, Main Building, Tsinghua Univ. Beijing firstname.lastname@example.org Email: 100084, P.R.China Li Xing Telephone: 8610-62785983He is the Professor at DEE of Tsinghua University as well as Email: email@example.com the Deputy Director of China Education and Research Network