Design and Implementation of Domain-specific Business Information
Search System in Electronic Commerce Environment
Ruijun Xia¹, Qing Wang¹, Dingwei Wang¹,², Lili Liu¹
1. Institute of System Engineering, Information College, Northeastern University, Shenyang, 110004
2. Modern Logistics Center, Shanghai University, Shanghai, 200072, China
Abstract: In an Electronic Commerce (EC) environment, the quality of business information directly affects the level of enterprise operations. This paper analyses the common methods of business information retrieval in the EC environment and designs a software system that automatically gathers business information from the Internet and lets an enterprise extract the business information it needs directly from a database. The system adopts a meta-search engine to extend the search range, and applies information retrieval, web mining and agent technology to analyze and filter the business information, thereby improving the search quality of business information.
Key Words: Electronic Commerce (EC); Business information; Information Retrieval (IR); Meta-Search Engine
This work is supported by the National Natural Science Foundation under Grant 74105110 and the Innovative Research Team Project of the National Natural Science Foundation under Grant 60821063.

1 INTRODUCTION
In recent years, the application of Electronic Commerce (EC) has become more and more widespread. Enterprises need more and more business information, such as information on raw materials, products, suppliers and customers, and use it to support enterprise decision-making. Whether an enterprise in the Electronic Commerce environment can access accurate, comprehensive and necessary business information in time therefore bears on the success or failure of its Electronic Commerce operation. The enterprise must go beyond the relatively narrow operating environment of the past, and collect and use business information effectively.
In the Electronic Commerce environment, the main methods an enterprise uses to search for business information are as follows:
• Using a General-purpose Search Engine (GSE): it covers a wide range of business information but returns too many irrelevant pages, which results in low precision, and it cannot meet the personalization requirements of users.
• Logging in to the web site of an individual enterprise: it yields accurate information, such as the types and prices of that enterprise's products, but the search range is very limited, which results in low recall.
• Logging in to a large business portal website: it contains a lot of product information, but not all enterprises publish their product information on the portal, so compared with the entire business information on the Internet, the amount of information on such sites is very limited and cannot meet the requirements of enterprises.
• Using specialized search engines: they give good search results, but the number of results is limited and depends on the database of the site.
In view of the above problems, this paper designs a business information search system for the Electronic Commerce environment, which can automatically gather business information from the Internet and extract the business information demanded by the enterprise directly from the database. The system adopts a meta-search engine, which can be integrated with several General-purpose Search Engines (GSE), to extend the search range and improve recall, and applies information retrieval, web mining and agent technology to analyze and filter business information and extract customer, supplier and product information of potential value to the enterprise, thereby improving precision.

2 DOMAIN-SPECIFIC BUSINESS INFORMATION SEARCH
Many scholars have studied domain-specific business search [2-6]. Paper [2] proposed an agent-based framework for the dynamic information retrieval process to manage the business status intelligently and dynamically. Paper [3] presented a method to build a personalized domain-specific search engine, which adopted domain-based grading thesauruses and a Chinese segmentation algorithm with a disambiguation mechanism to ensure high accuracy, and exploited the retrospective, state-memory and linear nature of the segmentation algorithm to ensure the engine's efficiency. Paper [4] proposed a Hopfield neural network based business search algorithm, in which a set of extended query terms is generated automatically by the Hopfield neural network from the query keywords the user inputs. Searching general-purpose search engines with those extended query terms can extend the search range and improve search precision. Paper [5] proposed a Bayesian Network (BN) based business information retrieval model; in this model the customized query requirement of the enterprise is expressed in terms of predefined illustrative documents related to the business domain, and the similarities between the documents and the query are evaluated with the conditional probabilities among the nodes in the BN. Paper [6] proposed a method for building a domain-specific search engine based on a Meta-Search Engine (MSE) on the Internet. It selects keywords by the Odds Ratio (OR) method and weights them by the TF-IDF method. The domain query expression is derived by the Decision Tree (DT) method. Finally, it ranks the returned documents by the Extended Boolean Model. The method can effectively remedy the drawbacks of the KS method and performs better in terms of precision and recall.
Based on the study and analysis of existing theoretical research, this paper designs and implements domain-specific business search software that adopts the MSE as its framework and applies the theory to a practical system in modular form.

3 DESIGN OF DOMAIN-SPECIFIC BUSINESS INFORMATION SEARCH SYSTEM
In order to help business people obtain information such as commodities, suppliers and customers, and to provide a reference for further inquiry and commodity pricing, the system is designed to automatically collect the business information required by the enterprise from the Internet according to the characteristics of business information.

3.1 Main functions of the system
• Meta-search engine function: The user can enter several keywords belonging to the business field; the system searches for business information through several GSEs, removes duplicate and invalid pages, parses the pages, and extracts the abstract or full text of the pages.
• System search function: The system can gather relevant information from the Internet regularly and automatically according to the pre-determined system search keywords and search time, and deposit it in the database.
• User search function: The system can retrieve from the database according to the query statement entered by the user. The retrieval results are returned to the user as the business information the user requires.
• User interaction function: The user can perform basic operations such as input, output and parameter setting, feed back evaluation information about the query results, modify the parameters of the retrieval model, and modify the query expression to adapt to changes in the network environment and information requirements, so as to get more accurate and valuable business information.
• System management function: The user can initialize the necessary parameters, define the update strategy of business information, record users' visits to the system and information update times, and set user access rights.

3.2 Architecture based on MSE
This system adopts an MSE-based architecture, as shown in Fig 1. It is divided into 3 main modules: the meta-search and system search module, the user search module and the user interaction module. Each module includes various sub-modules.
This system uses Lucene [7] as its database to enhance the indexing and retrieval functions. Lucene is a Java-based full-text indexing toolkit that provides a number of API functions and a flexible (customizable) data storage structure, and can be easily embedded into various applications to provide or enhance indexing and retrieval functions. Unlike other databases, Lucene stores information in the form of index files, and its retrieval speed is higher than that of other databases. In addition, it does not adopt a B-tree structure, which causes a large number of I/O operations when the index is updated; instead it creates new small index files and then merges them into a large one, so as to enhance indexing efficiency without affecting search efficiency.
This system builds the index and implements the user search functions primarily through the APIs provided by Lucene. Information retrieval models (such as the Hopfield neural network based information retrieval model [4]) could also be added to the user search module to enable users to get more precise business information.
Fig 1. Basic architecture of business information search system

3.3 Design of the system module
Fig 2 shows the detailed functions of the system. In order to provide the management function of the system, we add the system management module. The detailed functions of each module are as follows.
3.3.1 Meta-search and system search module
This module includes 3 sub-modules: the domain-specific expression sub-module, the search engine agent sub-module and the information extraction sub-module. The functions of each sub-module are as follows.
• Domain-specific expression sub-module: (a) Selecting domain-specific keywords: adopting the Odds Ratio (OR) method to select domain-specific keywords from sample documents, and weighting the domain-specific keywords. (b) Generating the domain query expression: using the domain-specific keywords to construct the domain query expression, and modifying it according to the modification information fed back by users.
• Search engine agent sub-module: (a) Constructing the query URL: constructing query URLs from the system search keywords and the domain-specific expression, and submitting them simultaneously to several individual GSEs over HTTP. The purpose is to collect a large amount of domain-specific business information. (b) Analyzing web pages: fetching the search result pages over HTTP, analyzing the links in these pages, removing invalid and duplicate pages, and saving the remaining pages to the buffer.
• Information extraction sub-module: (a) Parsing pages: using HTML Parser to parse the pages, removing the HTML tags, and extracting the summary or full text. (b) Document preprocessing: removing punctuation, stop words, etc., and extracting important nouns and verbs as index terms. (c) Weighting the index terms: adopting the TF-IDF method to weight the index terms. (d) Similarity calculation: constructing the user query vector and the page vectors, using the cosine of the angle between two vectors to express the similarity between the user query and a page, and selecting the pages whose similarity exceeds a certain threshold as business information. (e) Page information extraction: further processing the selected pages and extracting important information such as URL, title, summary and update time. (f) Creating the index: using the IndexWriter class of Lucene to create an index for the extracted page information so that it can be retrieved by enterprise users. The index structure of business information is shown in Table 1.

Table 1. Index structure of business information
FIELD             INDEX  TOKENIZED  STORE
Document ID       NO     NO         YES
Page URL          NO     NO         YES
Page Title        YES    YES        YES
Page Summary      YES    YES        YES
Page Update Time  YES    NO         YES

The Vector Space Model with query expansion function is applied in this module (other information retrieval models could also be applied, such as the KS method based business information retrieval model and the Bayesian network based business information retrieval model [5]).

3.3.2 User search module
This module includes 2 sub-modules: the query statement processing sub-module and the information retrieval sub-module. The functions of each sub-module are as follows.
(1) Query statement processing sub-module: performing lexical analysis on the query statement entered by the user in order to extract keywords, then submitting the logical expression composed of these keywords to the query parser of Lucene.
(2) Information retrieval sub-module: (a) Query engine: calling the IndexSearcher class of Lucene to retrieve from the business information database according to the query expression, and forming the result set from all records obtained from the database. (b) Query result processing: filtering and ranking the query results according to a certain algorithm, and returning them to the enterprise user as business information.

3.3.3 User interaction module
This module includes 2 sub-modules: the user interface sub-module and the user feedback sub-module. The functions of each sub-module are as follows.
(1) User interface sub-module: (a) Submitting input information: users submit the system search keywords to the search engine agent sub-module, and submit the user query statement to the query statement processing sub-module. (b) Submitting feedback information: users evaluate the degree of similarity between the retrieval results and the query requirement, and submit the evaluation information to the user feedback sub-module. (c) Displaying result information: outputting the business information required by the enterprise user through the visual interface.
(2) User feedback sub-module: (a) Evaluation result analysis: according to the evaluation information, analyzing the retrieval results, calculating the relevant parameters, or directly putting the retrieval results into the sample document database as business information samples. (b) Domain-specific expression modification: according to the analysis result, setting the way the domain-specific expression is modified, such as modifying the weights of domain-specific keywords, updating the domain-specific expression, etc. (c) Search keywords modification: according to the analysis result, modifying the user search keywords.
Fig 2. Detailed function framework of business information search system
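The index-term weighting and similarity filtering steps of the information extraction sub-module (steps (c) and (d) in 3.3.1) can be illustrated with a minimal Python sketch. It is not the system's Java/Lucene code. The function names, the whitespace tokenizer, the smoothed IDF formula and the 0.1 threshold are all assumptions chosen for illustration; the paper does not specify them.

```python
import math
from collections import Counter


def tokenize(text):
    # Hypothetical preprocessing: lowercase and split on whitespace.
    # The system described in the paper also removes stop words and
    # keeps important nouns and verbs as index terms.
    return text.lower().split()


def idf_table(docs_tokens):
    # Smoothed inverse document frequency (an assumed variant):
    # idf(t) = log((1 + N) / (1 + df(t))) + 1
    n = len(docs_tokens)
    df = Counter(term for tokens in docs_tokens for term in set(tokens))
    return {term: math.log((1 + n) / (1 + count)) + 1
            for term, count in df.items()}


def tfidf_vector(tokens, idf):
    # Weight each term by term frequency times IDF.
    # Terms that never occur in the page collection are ignored.
    tf = Counter(tokens)
    return {t: tf[t] * idf[t] for t in tf if t in idf}


def cosine(u, v):
    # Cosine of the angle between two sparse vectors stored as dicts.
    dot = sum(w * v[t] for t, w in u.items() if t in v)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def select_pages(query, pages, threshold=0.1):
    # Keep the pages whose similarity to the query exceeds the threshold,
    # ranked from most to least similar.
    docs_tokens = [tokenize(p) for p in pages]
    idf = idf_table(docs_tokens)
    q_vec = tfidf_vector(tokenize(query), idf)
    scored = [(cosine(q_vec, tfidf_vector(tokens, idf)), page)
              for tokens, page in zip(docs_tokens, pages)]
    return sorted([s for s in scored if s[0] > threshold], reverse=True)


if __name__ == "__main__":
    pages = [
        "gasoline engine supplier price list",
        "diesel engine parts catalogue",
        "holiday travel photos",
    ]
    for score, page in select_pages("gasoline engine price", pages):
        print(f"{score:.3f}  {page}")
```

In this sketch a page that shares no term with the query gets similarity 0 and is filtered out, while the surviving pages are ordered by similarity. In the described system the selected pages would then go on to page information extraction and indexing.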
3.3.4 System management module
This module includes 4 sub-modules: the system initialization sub-module, the information updating sub-module, the log management sub-module, and the user management sub-module. The functions of each sub-module are as follows.
(1) System initialization sub-module: setting the necessary parameters, clearing the data of the business information database, etc.
(2) Information updating sub-module: setting the update strategy, update time and update mode of the business information.
(3) Log management sub-module: recording information such as users' visits to the system, information update times, etc.
(4) User management sub-module: managing the basic information of users, setting user access rights, etc.

4 IMPLEMENTATION OF DOMAIN-SPECIFIC BUSINESS INFORMATION SEARCH SYSTEM
In order to verify the design, a business information search system has been developed, which targets the domain of auto parts and integrates 3 well-known GSEs: Google, Baidu and Sogou. We adopted the Java language, with JDK 1.6 providing the Java Virtual Machine (JVM) and Eclipse 3.2 as the development platform.

4.1 Class design of main module of the system
The entire program of the system is divided into 4 packages: PageParsing, SystemSearch, UserSearch and MainInterface. The PageParsing package and the SystemSearch package implement the meta-search and system search functions, the UserSearch package implements the user search function, and the MainInterface package implements the user interaction and system management functions. The inter-relationship between the packages is described with a UML package diagram, as shown in Fig 3.
Fig 3. Package diagram of the domain-specific business search system
The core module of the system is the meta-search and system search module. This module is taken as an example to describe the implementation method of the system. The relationships of the classes in the PageParsing package and the SystemSearch package are described with UML class diagrams, as shown in Fig 4 and Fig 5 respectively. The classes in PageParsing are responsible for parsing pages, while the classes in the SystemSearch package call the WebParserWrapper class in the PageParsing package to achieve the other functions of this module.
Fig 4. Class diagram of the PageParsing package
Fig 5. Class diagram of the SystemSearch package

4.2 Operation effect of the system
Fig 6 shows the interface through which the system collects business information from the Internet. The user can choose a GSE and set search keywords; for example, choosing the Baidu GSE and setting “汽油发动机” (gasoline engine) as the keywords gives the search results shown in Fig 6.
The system can also perform automatic search. The steps are as follows:
(1) The user sets several search keywords and a search time (for example, during off-hours) on the automatic search interface, and saves them to the database.
(2) According to the search time, the system periodically extracts the search keywords from the database and submits them to the three GSEs respectively.
(3) The system analyzes and filters the search results and automatically saves domain-specific business information to the database.
Fig 6. Interface of system search
Fig 7 shows the interface through which the system retrieves business information from the database. The user can enter several keywords simultaneously for a Boolean query, filter the search results according to the update time of the pages, and rank the search results according to relevance or update time.
Fig 7. Interface of user search
This system is a preliminary experimental system. Though the interface is relatively simple, all the module functions of the business information search system have been implemented, which is sufficient to verify the effectiveness of the system design.

5 CONCLUSION
Aiming at the business information search problem in Electronic Commerce, this paper designed a software system that can automatically gather business information from the Internet and conveniently extract the information demanded by the enterprise from the database at any time, and implemented it with Java development tools. The system adopts a meta-search engine to extend the search range, and applies information retrieval, web mining and agent technology to analyze and filter the business information, thereby improving the search quality of business information. The system can be embedded into the existing management information system of an enterprise. By continually collecting business information from the Internet and building up a large business information database, it could provide comprehensive and accurate information support for enterprise decision-making in the electronic commerce environment.

REFERENCES
[1] Wang Qing, Wang Zheng, Wang Dingwei, Application of Web Mining in Business, Computer Engineering, Vol.34, No.11, 197-199, 2008.
[2] Hua Hu, Bin Xu, An Agent-based Framework for Intelligent and Dynamic Business Information Retrieval, Workshop on Intelligent Information Technology Application, DOI
[3] Lei Zhang, Yong Peng, Xiangwu Meng, and Jie Guo, Personalized Domain-specific Search Engine, Industrial Informatics, 1308-1313, 2008.
[4] Zheng Wang, Qing Wang, Dingwei Wang, Searching Business Information with Hopfield Neural Network in Electronic Commerce Environment, International Conference on Bio-Inspired Computing: Theories and Applications (BIC-TA 2007), 552-554, 2007.
[5] Zheng Wang, Qing Wang, Ding-Wei Wang, Bayesian network based business information retrieval model, Knowledge and Information Systems, DOI
[6] Zheng Wang, Qing Wang, DingWei Wang, Application of Domain-Specific Search Method in Meta-Search Engine on Internet, IMACS Multi-conference on Computational Engineering in Systems Applications, 2078-2085, 2006.
[7] Otis Gospodnetic, Erik Hatcher, Lucene in Action, O'Reilly & Associates Inc, 2005.
[8] Ricardo B Y, Berthier R N, Modern Information Retrieval, China Machine Press, Beijing, China, 2004.