DOC

2000-Diligenti-VLDB

By Travis Washington,2014-03-21 22:20
17 views 0
2000-Diligenti-VLDB

    Michelangelo Diligenti, Frans Coetzee, Steve Lawrence, C. Lee Giles, Marco Gori. Focused Crawling using Context Graphs, 26th

    International Conference on Very Large Databases, VLDB 2000, Cairo, Egypt, pp. 527534, 2000.

    Focused Crawling Using Context Graphs

     M. Diligenti, F. M. Coetzee, S. Lawrence, C. L. Giles and M. Gori

    NEC Research Institute, 4 Independence Way, Princeton, NJ 08540-6634 USA Dipartimento di Ingegneria dell’Informazione, Universita` di Siena

    Via Roma, 56 - 53100 Siena, Italy

     diligmic,gori @dii.unisi.it, coetzee,lawrence,giles @research.nj.nec.com

     such as news, financial data, entertainment and schedules become widely disseminated via the web. Search engines

    are therefore increasingly challenged when trying to main- Abstract

    tain current indices using exhaustive crawling. Even using state of the art systems such as AltaVistas Scooter, which Maintaining currency of search engine indices by reportedly crawls ten million pages per day, an exhaustive exhaustive crawling is rapidly becoming impossi- crawl of the web can take weeks. Exhaustive crawls also ble due to the increasing size and dynamic content consume vast storage and bandwidth resources, some of of the web. Focused crawlers aim to search only which are not under the control of the search engine. the subset of the web related to a specific cate- Focused crawlers [2, 3] aim to search and retrieve only gory, and offer a potential solution to the currency the subset of the world-wide web that pertains to a spe- problem. The major problem in focused crawl- cific topic of relevance. The ideal focused crawler retrieves ing is performing appropriate credit assignment the maximal set of relevant pages while simultaneously to different documents along a crawl path, such traversing the minimal number of irrelevant documents on that short-term gains are not pursued at the ex- the web. Focused crawlers therefore offer a potential so- pense of less-obvious crawl paths that ultimately lution to the currency problem by allowing for standard yield larger sets of valuable pages. To address exhaustive crawls to be supplemented by focused crawls this problem we present a focused crawling algo- for categories where content changes quickly. Focused rithm that builds a model for the context within crawlers are also well suited to efficiently generate indices which topically relevant pages occur on the web. for niche search engines maintained by portals and user This context model can capture typical link hierar- groups [4], where limited bandwidth and storage space are chies within which valuable pages occur, as well the norm [5]. Finally, due to the limited resources used by as model content on documents that frequently co- a good focused crawler, users are already using personal occur with relevant pages. Our algorithm further PC based implementations [6]. Ultimately simple focused leverages the existing capability of large search crawlers could become the method of choice for users to engines to provide partial reverse crawling capa- perform comprehensive searches of web-related materials. bilities. Our algorithm shows significant perfor- While promising, the technology that supports focused mance improvements in crawling efficiency over crawling is still in its infancy. The major open problem standard focused crawling. in focused crawling is that of properly assigning credit to all pages along a crawl route that yields a highly relevant 1 Introduction document. In the absence of a reliable credit assignment strategy, focused crawlers suffer from a limited ability to The size of the publicly indexable world-wide-web has 9sacrifice short term document retrieval gains in the interest provably surpassed one billion (10) documents [1] and as of better overall crawl performance. In particular, existing yet growth shows no sign of leveling off. Dynamic con- crawlers still fall short in learning strategies where topically tent on the web is also growing as time-sensitive materials, relevant documents are found by following off-topic pages. We demonstrate that credit assignment for focused Permission to copy without fee all or part of this material is granted pro- crawlers can be significantly improved by equipping the vided that the copies are not made or distributed for direct commercial crawler with the capability of modeling the context within advantage, the VLDB copyright notice and the title of the publication and its date appear, and notice is given that copying is by permission of the which the topical materials is usually found on the web. Very Large Data Base Endowment. To copy otherwise, or to republish, Such a context model has to capture typical link hierarchies requires a fee and/or special permission from the Endowment. within which valuable pages occur, as well as describe Proceedings of the 26th VLDB Conference, off-topic content that co-occurs in documents that are fre- Cairo, Egypt, pp. 527534, 2000.

    quently closely associated with relevant pages. We present queue. A best first search is performed by popping the next a general framework and a specific implementation of such page to analyze from the head of the queue. This strategy a context model, which we call a Context Graph. Our algo- ensures that the crawler preferentially pursues promising rithm further differs from existing focused crawlers in that crawl paths.

    it leverages the capability of existing exhaustive search en- The simplest focused crawlers use a fixed model of gines to provide partial reverse crawling capabilities. As a the relevancy class, typically encoded as a classifier, to result it has a rapid and efficient initialization phase, and is evaluate the documents that the current document links suitable for real-time services. to (henceforth referred to as children) for topical rele- The outline of the paper is as follows: Section 2 pro- vancy [2, 3, 11]. The classifier is either provided by the user vides a more detailed overview of focused crawling. Sec- in the form of query terms, or can be built from a set of seed tion 3 describes the architecture and implementation of our documents. Each link is assigned the score of the document approach. Comparisons with existing focused crawling al- to which it leads. More advanced crawlers adapt the clas- gorithms on some test crawls are shown in Section 4, and sifier based on the retrieved documents, and also model the we conclude by discussing extensions and implications in within-page context of each hyperlink. In the most com- Section 5. mon adaptive crawlers decision directed feedback is used, where documents that are marked as relevant by the clas- 2 Prior Work in Crawling sifier are also used to update the classifier. However, en- suring flexibility in the classifier, without simultaneously The first generation of crawlers [7] on which most of the corrupting the classifier, is difficult. web search engines are based rely heavily on traditional A major problem faced by the above focused crawlers graph algorithms, such as breadth-first or depth-first traver- is that it is frequently difficult to learn that some sets of sal, to index the web. A core set of URLs are used as a seed off-topic documents often lead reliably to highly relevant set, and the algorithm recursively follows hyper links down documents. This deficiency causes problems in traversing to other documents. Document content is paid little heed, the hierarchical page layouts that commonly occur on the since the ultimate goal of the crawl is to cover the whole web. Consider for example a researcher looking for pa- web. pers on neural networks. A large number of these papers are found on the home pages of researchers at computer science departments at universities. When a crawler finds the home page of a university, a good strategy would be to follow the path to the computer science (CS) department, then to the researchers’ pages, even though the university and CS department pages in general would have low rele- vancy scores. While an adaptive focused crawler described above could in principle learn this strategy, it is doubtful that the crawler would ever explore such a path in the first a) Standard Crawling b) Focus Crawling place, especially as the length of the path to be traversed increases.

    Figure 1: a) A standard crawler follows each link, typically To explicitly address this problem, Rennie and McCal- applying a breadth first strategy. If the crawler starts from lum [12] used reinforcement learning to train a crawler on a document which is i steps from a target document, all the specified example web sites containing target documents. documents that are up to i 1 steps from the starting docu- The web site or server on which the document appears is re- ment must be downloaded before the crawler hits the target. peatedly crawled to learn how to construct optimized paths b) A focused crawler tries to identify the most promising to the target documents. However, this approach places a links, and ignores off-topic documents. If the crawler starts burden on the user to specify representative web sites. Ini- from a document which is i steps from a target document, tialization can be slow since the search could result in the it downloads a small subset of all the documents that are crawling of a substantial fraction of the host web site. Fur- up to i 1 steps from the starting document. If the search thermore, this approach could face difficulty when a hier- strategy is optimal the crawler takes only i steps to discover archy is distributed across a number of sites. the target. An additional difficulty faced by existing crawlers is that A focused crawler efficiently seeks out documents about links on the web are uni-directional, which effectively re- a specific topic and guides the search based on both the con- stricts searching to top-down traversal, a process that we tent and link structure of the web [2]. Figure 1 graphically call “forward crawling” (Obtaining pages that link to a par- illustrates the difference between an exhaustive breadth- ticular document is referred to as “backward crawling”).

    first crawler and a typical focused crawler. A focused Since web sites frequently have large components that are crawler implements a strategy that associates a score with organized as trees, entering a web site at a leaf can result each link in the pages it has downloaded [8, 9, 10]. The in a serious barrier to finding closely related pages. In our links are sorted according to the scores and inserted in a example, when a researchers home page is entered, say

    2

     via a link from a list of papers at a conference site, a good Contex Graph Back Crawl Stage strategy would be for the crawler to find the department member list, and then search the pages of other researchers in the department. However, unless an explicit link exists Learning Stage Set of Classifiers C1 C2 C3 from the researchers page to the CS department member Class Decision list, existing focused crawlers cannot move up the hierar- Crawling Stage Wchy to the CS department home page. Crawl W New Document EngineW Our focused crawler utilizes a compact context repre- sentation called a Context Graph to model and exploit hi- Actual Analized document erarchies. The crawler also utilizes the limited backward Queue 1 Queue2 Queue 3 Queue 4 crawling [13, 14] possible using general search engine in- . . . .. . . . dices to efficiently focus crawl the web. Unlike Rennie and McCallums approach [12], our approach does not learn the context within which target documents are located from a small set of web sites, but in principle can back crawl a sig- nificant fraction of the whole web starting at each seed or Figure 2: Graphical representation of the Context Focused on-topic document. Furthermore, the approach is more ef- Crawler. During the initialization stage a set of context ficient in initialization, since the context is constructed by graphs are constructed for the seed documents by back- directly branching out from the good set of documents to crawling the web, and classifiers for different graph layers model the parents, siblings and children of the seed set. are trained. During the crawling stage the crawl engine and the classifiers manage a set of queues that select the next 3 The Context Focused Crawler page on which to center the crawl.

    Our focused crawler, which we call the Context Focused

    Crawler (CFC), uses the limited capability of search en- a number of pages linking to the target are first retrieved gines like AltaVista or Google to allow users to query for (these pages are called the parents of the seed page). Each pages linking to a specified document. This data can be parent page is inserted into the graph as a node and an edge used to construct a representation of pages that occur within is established between the target document node and the a certain link distance (defined as the minimum number of parent node. The new nodes compose layer 1 of the context

    link traversals necessary to move from one page to another) graph. The back-crawling procedure is repeated to search of the target documents. This representation is used to train all the documents linking to documents of layer 1. These a set of classifiers, which are optimized to detect and assign pages are incorporated as nodes in the graph and compose documents to different categories based on the expected layer 2. The back-linking process is iterated, until a user- link distance from the document to the target document. specified number of layers have been filled. In practice the During the crawling stage the classifiers are used to predict number of elements in a given layer can increase suddenly how many steps away from a target document the current when the number of layers grow beyond some limit. In retrieved document is likely to be. This information is then such a case a statistical sampling of the parent nodes, up used to optimize the search. to some system dependent limit, is kept. To simplify the There are two distinct stages to using the algorithm link structure, we also use the convention that if two docu- when performing a focused crawl session: ments in layer i can be accessed from a common parent, the parent document appears two times in the layer i 1. As 1. An initialization phase when a set of context graphs a result, an equivalent induced graph can be created where and associated classifiers are constructed for each of each document in the layer i 1 is linked to one and only the seed documents one document in the layer i. We define the depth of a context graph to be the number 2. A crawling phase that uses the classifiers to guide the

    of layers in the graph excluding the level 0 (the node stor- search, and performs online updating of the context

    ing the seed document). When N levels are in the context graphs. graph, path strategies of up to N steps can be modeled. A The complete system is shown in Figure 2. We now de- context graph of depth 2 is shown in Figure 3. scribe the core elements in detail. By constructing a context graph, the crawler gains knowledge about topics that are directly or indirectly re- 3.1 Generating the Context Graphs lated to the target topic, as well as a very simple model of

    The first stage of a crawling session aims to extract the con- the paths that relate these pages to the target documents. As text within which target pages are typically found, and en- expected, in practice, we find that the arrangement of the codes this information in a context graph. A separate con- nodes in the layers reflect any hierarchical content struc- text graph is built for every seed element provided by the ture. Highly related content typically appears near the cen- user. Every seed document forms the first node of its as- ter of the graph, while the outer layers contain more gen- sociated context graph. Using an engine such as Google, eral pages. As a result, when the crawler discovers a page

    3

     ment Frequency). TF-IDF representation [15] describes a document as a vector relative to a basis of phrases that de- fine a vocabulary V . Each element in the vector represents the frequency of occurrence of the phrase in the document, weighted according to the discrimination implied by the presence of the phrase within a reference corpus. This dis- crimination is approximated by the phrase frequency in the reference corpus. If two phrases have the same number of occurrences in a document, the TF-IDF value of the less common phrase will be higher. The TF-IDF score w of Target Example a phrase w is computed using the following function: Layer 1 d Nf w (1) w logd f wf max Layer 2 d where f w is the number of occurrences of w in a docu- d ment d, f is the maximum number of occurrences of a max phrase in a document d, N is the number of documents in

    the reference corpus and f w is the number of documents Representation of the Target Document in the corpus where the phrase w occurs at least once. We implement TF-IDF using the following steps. All Document Representations of layer 1 the documents in the seed set, as well as optionally, the Document Representations of layer 2 first layer, are concatenated into a single master document. 1. All stop-words such as “by”, “and”, or “at” are re-Figure 3: A context graph represents how a target docu- moved from the master document ment can be accessed from the web. In each node a web 2. Word stemming is performed to remove com- document representation is stored. The graph is organized mon word transformations, such as plurals or case into layers: each node of layer i is connected to one (and changes [16]; only one) node of the layer i 1 (except the single node in layer 0). There are no connections between nodes at the 3. TF-IDF computation is performed using a reference same level. The seed document is stored in layer 0. A corpus derived from an exhaustive web crawl. document is in layer i if at least i steps (link followings) The resulting set of phrases form the vocabulary V . When are needed to reach the target page starting from that docu-

    a document is retrieved, the TF-IDF score for phrases in ment.

     V that occur in the document are computed and placed in with content that occurs higher up in the hierarchy, it can a vector. We then truncate the vector representation to in- use its knowledge of the graph structure to guide the search clude only the forty highest scoring components by zeroing towards the target pages. Returning to our example of look- out the remaining components. This processing, yielding a ing for neural network pages, the context graph will typi- representation which we will refer to as reduced TF-IDF cally discover a hierarchy where levels correspond to re- representation, ensures numerical stability and high speed searcher home pages, research group pages, and ultimately of the classifiers.

    department and university home pages. Given a representation procedure, we next construct the Once context graphs for all seed documents have been classifiers. Our goal is to be able to assign any web doc- built, the corresponding layers from the different context ument to a particular layer of the merged context graph. graphs are combined, yielding a layered structure that we However, if the document is a poor fit for a given layer, we call the Merged Context Graph. wish to discard the document, and label it as one of the cat- egory “other”. The major difficulty in implementing such a strategy using a single classifier mapping a document to 3.2 Learning Classifiers for the Context Graph Ele- a set of N 2 classes corresponding to the layers 0 1 N ments and a category “other”, is the absence of a good model or

    The next stage in initialization builds a set of classifiers for training set for the category “other”. To solve this prob-

    assigning any document retrieved from the web to one of lem we use a modification of the Naive Bayes Classifier

    the layers of the merged context graph, and for quantifying for each layer. This classifier architecture provides reason- the belief in such a classification assignment. able performance, high speed, meets the requirement of our

    system that a likelihood estimate be provided for each clas- The classifiers require a feature representation of the

    sification, and is well studied [17, 18, 12]. documents on which to operate. Our present implemen-

    tation uses keyword indexing of each document using a Assume that we have a document drepresented by the i modification of TF-IDF (Term Frequency Inverse Docu- vector corresponding to the reduced TF-IDF representation

    4

    The classifier of layer 0 is used as the ultimate arbiter relative to the vocabulary V . Documents from class c , de- j of whether a document is topically relevant. The discrimi- fined to correspond to layer j, are assumed to have a prior

    probability of being found on the web which we denote nant and likelihood functions for the other layers are used

    P c . The probability that a vector element woccurs in to predict for any page how many steps must be taken be- j t

    documents of class c is P wc . To classify a page, we fore a target is found by crawling the links that appear on a j t j

    first wish to find the class c such that P c dis maxi- page. j j i

    mized. The solution is given by Bayes rule 3.3 Crawling P c dP c P dc P c P wwc (2) j i j i j j d1 dj i i N The crawler utilizes the classifiers trained during the con- where wis the k-th feature of the document d. The di i k text graph generation stage to organize pages into a se- Naive Bayes assumption ignores joint statistics and for- quence of M N 2 queues, where N is the maximum mally assumes that given the document class the features depth of the context graphs. The i-th class (layer) is asso- occur independently of each other, yielding the final solu- ciated to the i-th queue i 0 1 N. Queue number N 1 tion is not associated with any class, but reflects assignments N dto “other”. The 0-th queue will ultimately store all the re- iP c dP c P dc P c P wc (3) j i j i j jdj trieved topically relevant documents. The system is shown i k k 1 graphically in Figure 2.

    Initially all the queues are empty except for the dummy where Nindicates the number of features in the document di queue N 1, which is initialized with the starting URL of d. If N denotes the maximum depth of the context graphs, i the crawl. The crawler retrieves the page pointed to by the then N 1 discriminant functions P c dare built corre- j i URL, computes the reduced vector representation and ex- sponding the layers j 0 1 N. tracts all the hyperlinks. The crawler then downloads all The discriminant functions allow for a given page dto i the children of the current page. All downloaded pages first be assigned to one of the layers of the merged context are classified individually and assigned to the queue corre- graph, by finding the layer j for which the discriminant sponding to the winning layer, or the class “other”. Each function P c dis maximized. Subsequently, by comput- j i queue is maintained in a sorted state according to the like- ing the likelihood function P c dfor the winning layer j i lihood score associated with its constituent documents. j , and comparing this likelihood to a threshold, it is possi- When the crawler needs the next document to move to, ble to discard weakly matching pages. These pages are ef- it pops from the first non-empty queue. The documents fectively marked as “other”, which avoids the need for con- that are expected to rapidly lead to targets are therefore struction of an explicit model of all documents not in the followed before documents that will in probability require context graph. In effect, we build a set of parallel two-class more steps to yield relevant pages. However, depending on Naive Bayes classifiers, one for each layer, and select the the relative queue thresholds, frequently high-confidence winning layer by maximizing the a-posteriori likelihood of

    pages from queues representing longer download paths are the layer based on the context graph.

    retrieved. In training the classifiers, the documents that occur in

    The setting of the classifier thresholds that determine layer j of all of the seed document context graphs are com-

    bined to serve as a training data set D . The phrase prob- whether a document gets assigned to the class denoted j “other” determines the retrieval strategy. In our default im- abilities P wc are computed on the sets D by counting t j j plementation the likelihood function for each layer is ap- the occurrences of the feature wand then normalizing for t plied to all the patterns in the training set for that layer. all the words in the documents of class c : j The confidence threshold is then set equal to the minimum 1 N wdP c dt i j idD likelihood obtained on the training set for the correspond-i j P wc (4) t jV ing layer. N wdP c d V s i j idD s 1 i j During the crawling phase, new context graphs can peri- where N wdis the number of occurrences of win the odically be built for every topically relevant element found t i t

    document dand V is the number of phrases in the vocab- in queue 0. However, our focused crawler can also be con- i ulary V [17, 18, 12]. figured to ask for the immediate parents of every document The parameters P c can be calculated by estimating as it appears in queue 0, and simply insert these into the ap- j

    propriate queue without re-computing the merged context the number of elements in each of the layers of the merged

    graph and classifiers. In this way it is possible to contin- context graph. While useful when the layers do not contain excessive numbers of nodes, as previously stated practical ually exploit back-crawling at a reasonable computational limitations sometimes prevent the storage of all documents cost.

    in the outermost layers. In these cases the class probabili- ties P c are set to a fixed constant value 1 C, where C is j 4 Experimental Results the number of layers. This corresponds to maximum likeli-

    hood selection of the winning layer. In our experience, per- The core improvement of our focused crawler derives from formance is not severely impacted by this simplification. the introduction of the context graph. We therefore com-

    5

     ?20 400 10 ?30 10 350 Context?Focused Crawler ?40 10Focused Crawler 300 Breadth?First Crawler ?50 10 250 ?60 10 ?70200 10Relevance ?80150 10 ?90 10100 Context Focused Crawler Retrieved Target Documents Focused Crawler?100 10 Breadth?First Crawler50 ?110 10 200 2200 4200 6200 8200 10200 12200 0 Number of Downloads 0 2000 4000 6000 8000 10000 12000 Number of Dowloads

     Figure 5: The average relevance of the downloaded docu- Figure 4: Both the Context Focused Crawler and the stan- ments computed using a sliding window of 200 downloads, dard focused crawler are orders of magnitude more ef- illustrating the improved capability of the Context Focused ficient than a traditional breadth-first crawler when re- Crawler to remain on-topic. trieving Call for Papers” documents. The Context Fo- cused Crawler outperforms the standard crawler, retrieving each case the Context Focused Crawler maintains a signif- 50 60% more “on-topic” documents in a given period. icantly higher level of relevance than either of the other two crawlers, reflecting the ability of the crawler to use off- pare the efficiency of our Context Focused Crawler to two topic paths to find new sets of promising documents. crawlers: Our experiments showed that the overhead due to the initialization stage is negligible, especially when the over- a breadth-first crawler, using the classifier constructed all higher efficiency of the crawler is taken into account. by the Context Focused Crawler on the seed set; The improvement that results from querying for the par- a traditional focused crawler which does not use the ents of every on-topic document when it is discovered can context to search targets. This crawler evaluates all be seen in Figure 6. This approach shows bursts of rapid re- children of the current document using the same clas- trieval when back-crawling from a target document yields sifier used by the Context Focused Crawler for the a hub site. We note that for general use we ration the num- seed set, and schedules crawls based on the document ber of back-link requests to avoid over-taxing the search with the highest score. engines. Figure 7 shows a different category, “Biking”, for which As a test set we performed a focused crawl for confer- our focused crawler showed the least average performance ence announcements, and, in particular, for the “Call for improvement over standard focused crawling (although it Papers” section. We used ten examples as a seed set and still significantly outperforms the standard crawler over constructed ten context graphs of depth four. We limited most of the trial). We found that such difficult categories the number of documents in a single layer of the context are those where target content is not reliably co-located graph to 300 . The resulting data was used to learn the 5 with pages from a different category, and where common Naive Bayes classifiers associated with the five queues. hierarchies do not exist or are not implemented uniformly To evaluate our algorithm we used the accepted metric across different web-sites. It is therefore to be expected of measuring the fraction of pages that are on-topic as a that the context graph provides less guidance in such cases. function of the number of download requests. The results However, due to our architecture design, and as illustrated are shown in the Figure 4. Both of the focused crawlers by Figure 4 and Figure 7, the performance will at worst significantly outperform the standard breadth-first crawler. approach that of the standard focused crawling approach. However the Context Focused Crawler has found on av- erage 50-60% more on-topic documents than the standard focused crawler on the “Conference” task. 5 Discussion The ability of the crawlers to remain focused on “on-

    topic” documents can also fruitfully be measured by com- We presented a focused crawler that models the links and puting the average relevance of the downloaded documents. content of documents that are closely linked to target pages The relevance of a document is equated to the likelihood to improve the efficiency with which content related to a that the document has been generated by the Naive Bayes desired category can be found. Our experiments show that model of the seed set. In Figure 5 we show the average the Context Focused Crawler improves the efficiency of tra-

    relevance using a sliding window of 200 downloads. In ditional focused crawling significantly (on average about

    6

    240 110 100 Context Focused Crawler 210Focused Crawler Context Focused Crawler 90 Context Focused Crawler With BackLinks 180 80 70 150 60 120 50 90 40 30 60 On?topic retrieved documents 20 30 10 0 00 200 400 600 800 1000 1200 1400 1600 0 1000 2000 3000 4000 5000 Downloads Downloads

    Figure 7: Performance of focused crawlers on the category Figure 6: Performance of Context Focused Crawler as

    “Biking”. compared to Context Focused Crawler with BackLinks, where parents of each on-topic document are obtained from ducing another parameter). Online parameter updating of the search engines immediately after the document is dis- the classifiers using the EM approach [19] should result covered. The task is retrieval of “Call for Papers” docu- in more efficient continuous optimization of the classifier ments on the web. performance.

    50-60%), and standard breadth-first crawling by orders of Not only can our approach be used for background gath- magnitude. ering of web material, the computational and bandwidth re- The major limitation of our approach is the requirement quirements of the crawler are sufficiently modest for the for reverse links to exist at a known search engine for a crawler to be used in an interactive session over a DSL reasonable fraction of the seed set documents. In practice, or cable modem connection on a home PC. The focused this does not appear to be problem. However, even when no crawler can therefore be used as a valuable supplement to, seed documents have yet been indexed by search engines, and in some cases a replacement for, standard search engine

    the approach can be bootstrapped. In this case a content database queries. We have no doubt that further improve- model of the seed set is extracted and other high-confidence ment of focused crawling will soon make crawling not only Retrieved Traget Documents target data can be found using query modifications on a the privilege of large companies that can afford expensive search engine. The indexed target data pages returned by infrastructures, but a personal tool that is widely available the search engine can then be used to build context graphs. for retrieving information on the world wide web.

     The system can also start with a breadth-first crawl and a

    set of example documents for training the level 0 classifier, References with context graphs being built once a minimum number of [1] Web surpasses one billion documents: Inktomi/NEC target documents have been recovered. press release.available at http://www.inktomi.com, The architecture we presented already yields very good Jan 18 2000. performance, but can be fruitfully improved. Major im- provements should result from more sophisticated ap- [2] S. Chakrabarti, M. van der Berg, and B. Dom, Fo- proaches for setting the confidence thresholds of the clas- cused crawling: a new approach to topic-specific web sifiers. We are currently investigating other machine learn- resource discovery, in Proc. of the 8th International ing algorithms for setting these thresholds. A promising World-Wide Web Conference (WWW8), 1999. approach maintains a graph representation of the current [3] J. Cho, H. Garcia-Molina, and L. Page, “Efficient crawl, which a separate on-line process uses to quantify crawling through URL ordering, in Proceedings of the effect of different queue thresholds. the Seventh World-Wide Web Conference, 1998. Other areas of interest are the extension of the feature space to include page analysis of HTML structure, using [4] A. McCallum, K. Nigam, J. Rennie, and K. Sey- the statistics gained during the search to develop ranking more, “Building domain-specic search engines with procedures for presenting results to the user, and perform- machine learning techniques,” in Proc. AAAI Spring ing online classifier adaptation. At present adaptation in Symposium on Intelligent Agents in Cyberspace, our system is restricted to periodically incorporating the 1999. context graphs of newly discovered target documents that are found in queue 0, and re-building the classifier (the [5] A. K. McCallum, K. Nigam, J. Rennie, and K. Sey- tests in this paper did not use this feature, to avoid intro- more, Automating the construction of internet por-

    7

    tals with machine learning, To appear in Information [19] A. Dempster, N. Laird, and D. Rubin, “Maximum

    likelihood from incomplete data via the EM algo- Retrieval. rithm, J. R. Statist. Soc. B, vol. 39, pp. 185197, [6] M. Gori, M. Maggini, and F. Scarselli, 1977. “http://nautilus.dii.unisi.it.

    [7] O. Heinonen, K. Hatonen, and K. Klemettinen,

    “WWW robots and search engines. Seminar on Mo-

    bile Code, Report TKO-C79, Helsinki University of

    Technology, Department of Computer Science, 1996.

    [8] S. Chakrabarti, B. Dom, P. Raghavan, S. Rajagopalan,

    D. Gibson, and J. Kleinberg, Automatic resource

    compilation by analyzing hyperlink structure and as-

    sociated text, in Proc. 7th World Wide Web Confer-

    ence, Brisbane, Australia, 1998.

    [9] K. Bharat and M. Henzinger, “Improved algorithms

    for topic distillation in hyperlinked environments,

    in Proceedings 21st Int’l ACM SIGIR Conference.,

    1998.

    [10] J. Kleinberg, Authoritative sources in a hyperlinked

    environment. Report RJ 10076, IBM, May 1997,

    1997.

    [11] S. Chakrabarti, M. van den Berg, and B. Dom, “Dis-

    tributed hypertext resource discovery through exam-

    ples, in VLDB’99, Proceedings of 25th International

    Conference on Very Large Data Bases, September 7-

    10, 1999, Edinburgh, Scotland, UK (M. P. Atkinson,

    M. E. Orlowska, P. Valduriez, S. B. Zdonik, and M. L.

    Brodie, eds.), pp. 375386, Morgan Kaufmann, 1999.

    [12] J. Rennie and A. McCallum, Using reinforcement

    learning to spider the web efficiently, in Proc. Inter-

    national Conference on Machine Learning (ICML),

    1999.

    [13] K. Bharat, A. Broder, M. Henzinger, P. Kumar, and

    S. Venkatasubramanian, “The connectivity server:

    Fast access to linkage information on the web,

    WWW7/ Computer Networks 30(1-7), pp. 469477,

    1998.

    [14] S. Chakrabarti, D. Gidson, and K. McCurley, “Surfing

    backwards on the web,” in Proc 8th World Wide Web

    Conference (WWW8), 1999.

    [15] G. Salton and M. J. McGill, An Introduction to Mod-

    ern Information Retrieval. McGraw-Hill, 1983.

    [16] M. Porter, An algorithm for suffix stripping, Pro-

    gram, vol. 14, no. 3, pp. 130137, 1980.

    [17] T. M. Mitchell, Machine Learning. McGraw-Hill,

    1997.

    [18] K. Nigam, A. McCallum, S. Thrun, and T. Mitchell,

    Text classification from labeled and unlabelled doc-

    uments using EM. To appear in Machine Learning,

     1999.

    8

Report this document

For any questions or suggestions please email
cust-service@docsford.com