Inducing Domain-specific Semantic Class Taggers from (Almost) Nothing

By Jonathan Mitchell,2014-05-18 17:36

    In Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics (ACL 2010)

    Inducing Domain-specific Semantic Class Taggers from (Almost) Nothing

Ruihong Huang and Ellen Riloff

    School of Computing

    University of Utah

    Salt Lake City, UT 84112


Abstract

This research explores the idea of inducing domain-specific semantic class taggers using only a domain-specific text collection and seed words. The learning process begins by inducing a classifier that only has access to contextual features, forcing it to generalize beyond the seeds. The contextual classifier then labels new instances, to expand and diversify the training set. Next, a cross-category bootstrapping process simultaneously trains a suite of classifiers for multiple semantic classes. The positive instances for one class are used as negative instances for the others in an iterative bootstrapping cycle. We also explore a one-semantic-class-per-discourse heuristic, and use the classifiers to dynamically create semantic features. We evaluate our approach by inducing six semantic taggers from a collection of veterinary medicine message board posts.

1 Introduction

The goal of our research is to create semantic class taggers that can assign a semantic class label to every noun phrase in a sentence. For example, consider the sentence: "The lab mix was diagnosed with parvo and given abx". A semantic tagger should identify "the lab mix" as an ANIMAL, "parvo" as a DISEASE, and "abx" (antibiotics) as a DRUG. Accurate semantic tagging could be beneficial for many NLP tasks, including coreference resolution and word sense disambiguation, and many NLP applications, such as event extraction systems and question answering technology.

Semantic class tagging has been the subject of previous research, primarily under the guises of named entity recognition (NER) and mention detection. Named entity recognizers perform semantic tagging on proper name noun phrases, and sometimes temporal and numeric expressions as well. The mention detection task was introduced in recent ACE evaluations (e.g., (ACE, 2007; ACE, 2008)) and requires systems to identify all noun phrases (proper names, nominals, and pronouns) that correspond to 5-7 semantic classes.

Despite widespread interest in semantic tagging, nearly all semantic taggers for comprehensive NP tagging still rely on supervised learning, which requires annotated data for training. A few annotated corpora exist, but they are relatively small and most were developed for broad-coverage NLP. Many domains, however, are replete with specialized terminology and jargon that cannot be adequately handled by general-purpose systems. Domains such as biology, medicine, and law are teeming with specialized vocabulary that necessitates training on domain-specific corpora.

Our research explores the idea of inducing domain-specific semantic taggers using a small set of seed words as the only form of human supervision. Given an (unannotated) collection of domain-specific text, we automatically generate training instances by labelling every instance of a seed word with its designated semantic class. We then train a classifier to do semantic tagging using these seed-based annotations, using bootstrapping to iteratively improve performance.

On the surface, this approach appears to be a contradiction. The classifier must learn how to assign different semantic tags to different instances of the same word based on context (e.g., "lab" may refer to an animal in one context but a laboratory in another). And yet, we plan to train the classifier using stand-alone seed words, making the assumption that every instance of the seed belongs to the same semantic class. We resolve this apparent contradiction by using semantically unambiguous seeds and by introducing an initial context-only training phase before bootstrapping begins. First, we train a strictly contextual classifier that only
has access to contextual features and cannot see the seed. Then we apply the classifier to the corpus to automatically label new instances, and combine these new instances with the seed-based instances. This process expands and diversifies the training set to fuel subsequent bootstrapping.

Another challenge is that we want to use a small set of seeds to minimize the amount of human effort, and then use bootstrapping to fully exploit the domain-specific corpus. Iterative self-training, however, often has difficulty sustaining momentum or it succumbs to semantic drift (Komachi et al., 2008; McIntosh and Curran, 2009). To address these issues, we simultaneously induce a suite of classifiers for multiple semantic categories, using the positive instances of one semantic category as negative instances for the others. As bootstrapping progresses, the classifiers gradually improve themselves, and each other, over many iterations. We also explore a one-semantic-class-per-discourse (OSCPD) heuristic that infuses the learning process with fresh training instances, which may be substantially different from the ones seen previously, and we use the labels produced by the classifiers to dynamically create semantic features.

We evaluate our approach by creating six semantic taggers using a collection of message board posts in the domain of veterinary medicine. Our results show this approach produces high-quality semantic taggers after a sustained bootstrapping cycle that maintains good precision while steadily increasing recall over many iterations.

2 Related Work

Semantic class tagging is most closely related to named entity recognition (NER), mention detection, and semantic lexicon induction. NER systems (e.g., (Bikel et al., 1997; Collins and Singer, 1999; Cucerzan and Yarowsky, 1999; Fleischman and Hovy, 2002)) identify proper named entities, such as people, organizations, and locations. Several bootstrapping methods for NER have been previously developed (e.g., (Collins and Singer, 1999; Niu et al., 2003)). NER systems, however, do not identify nominal NP instances (e.g., "a software manufacturer" or "the beach"), or handle semantic classes that are not associated with proper named entities (e.g., symptoms).[1] ACE mention detection systems (e.g., see (ACE, 2005; ACE, 2007; ACE, 2008)) require tagging of NPs that correspond to 5-7 general semantic classes. These systems are typically trained with supervised learning using annotated corpora, although techniques have been developed to use resources for one language to train systems for different languages (e.g., (Zitouni and Florian, 2009)).

[1] Some NER systems also handle specialized constructs such as dates and monetary amounts.

Another line of relevant work is semantic class induction (e.g., (Riloff and Shepherd, 1997; Roark and Charniak, 1998; Thelen and Riloff, 2002; Ng, 2007; McIntosh and Curran, 2009)), where the goal is to induce a stand-alone dictionary of words with semantic class labels. These techniques are often designed to learn specialized terminology from unannotated domain-specific texts via bootstrapping. Our work, however, focuses on classification of NP instances in context, so the same phrase may be assigned to different semantic classes in different contexts. Consequently, our classifier can also assign semantic class labels to pronouns.

There has also been work on extracting semantically related terms or category members from the Web (e.g., (Paşca, 2004; Etzioni et al., 2005; Kozareva et al., 2008; Carlson et al., 2009)). These techniques harvest broad-coverage semantic information from the Web using patterns and statistics, typically for the purpose of knowledge acquisition. Importantly, our goal is to classify instances in context, rather than generate lists of terms. In addition, the goal of our research is to learn specialized terms and jargon that may not be common on the Web, as well as domain-specific usages that may differ from the norm (e.g., "mix" and "lab" are usually ANIMALS in our domain).

The idea of simultaneously learning multiple semantic categories to prevent semantic drift has been explored for other tasks, such as semantic lexicon induction (Thelen and Riloff, 2002; McIntosh and Curran, 2009) and pattern learning (Yangarber, 2003). Our bootstrapping model can be viewed as a form of self-training (e.g., (Ng and Cardie, 2003; Mihalcea, 2004; McClosky et al., 2006)), and cross-category training is similar in spirit to co-training (e.g., (Blum and Mitchell, 1998; Collins and Singer, 1999; Riloff and Jones, 1999; Mueller et al., 2002; Phillips and Riloff, 2002)). But, importantly, our classifiers all use the same feature set so they do not represent independent views of the data. They do, however, offer slightly different perspectives because each is attempting to recognize a different semantic class.

3 Bootstrapping an Instance-based Semantic Class Tagger from Seeds

3.1 Motivation

Our goal is to create a bootstrapping model that can rapidly create semantic class taggers using just a small set of seed words and an unannotated domain-specific corpus. Our motivation comes from specialized domains that cannot be adequately handled by general-purpose NLP systems. As an example of such a domain, we have been working with a collection of message board posts in the field of veterinary medicine. Given a document, we want a semantic class tagger to label every NP with a semantic category, for example:

[A 14yo doxy]ANIMAL owned by [a reputable breeder]HUMAN is being treated for [IBD]DISEASE with [pred]DRUG.

When we began working with these texts, we were immediately confronted by a dizzying array of non-standard words and word uses. In addition to formal veterinary vocabulary (e.g., animal diseases), veterinarians often use informal, shorthand terms when posting on-line. For example, they frequently refer to breeds using "nicknames" or shortened terms (e.g., gshep for German shepherd, doxy for dachshund, bxr for boxer, labx for labrador cross). Often, they refer to animals based solely on their physical characteristics, for example "a dlh" (domestic long hair), "a m/n" (male, neutered), or "a 2yo" (2 year old). This phenomenon occurs with other semantic categories as well, such as drugs and medical tests (e.g., pred for prednisone, and rads for radiographs).

Nearly all semantic class taggers are trained using supervised learning with manually annotated data. However, annotated data is rarely available for specialized domains, and it is expensive to obtain because domain experts must do the annotation work. So we set out to create a bootstrapping model that can rapidly create domain-specific semantic taggers using just a few seed words and a domain-specific text collection.

Our bootstrapping model consists of two distinct phases. First, we train strictly contextual classifiers from the seed annotations. We then apply the classifiers to the unlabeled data to generate new annotated instances that are added to the training set. Second, we employ a cross-category bootstrapping process that simultaneously trains a suite of classifiers for multiple semantic categories, using the positive instances for one semantic class as negative instances for the others. This cross-category training process gives the learner sustained momentum over many bootstrapping iterations. Finally, we explore two additional enhancements: (1) a one-semantic-class-per-discourse heuristic to automatically generate new training instances, and (2) dynamically created semantic features produced by the classifiers themselves. In the following sections, we explain each of these steps in detail.

3.2 Phase 1: Inducing a Contextual Classifier

The main challenge that we faced was how to train an instance-based classifier using seed words as the only form of human supervision. First, the user must provide a small set of seed words that are relatively unambiguous (e.g., "dog" will nearly always refer to an animal in our domain). But even so, training a traditional classifier from seed-based instances would likely produce a classifier that learns to recognize the seeds but is unable to classify new examples. We needed to force the classifier to generalize beyond the seed words.

Our solution was to introduce an initial training step that induces a strictly contextual classifier. First, we generate training instances by automatically labeling each instance of a seed word with its designated semantic class. However, when we create feature vectors for the classifier, the seeds themselves are hidden and only contextual features are used to represent each training instance. By essentially "masking" the seed words so the classifier can only see the contexts around them, we force the classifier to generalize.

We create a suite of strictly contextual classifiers, one for each semantic category. Each classifier makes a binary decision as to whether a noun phrase belongs to its semantic category. We use the seed words for category C_k to generate positive training instances for the C_k classifier, and the seed words for all other categories to generate the negative training instances for C_k.

We use an in-house sentence segmenter and NP chunker to identify the base NPs in each sentence and create feature vectors that represent each constituent in the sentence as either an NP or an individual word. For each seed word, the feature
vector captures a context window of 3 constituents (word or NP) to its left and 3 constituents (word or NP) to its right. Each constituent is represented with a lexical feature: for NPs, we use its head noun; for individual words, we use the word itself. The seed word, however, is discarded so that the classifier is essentially blind-folded and cannot see the seed that produced the training instance. We also create a feature for every modifier that precedes the head noun in the target NP, except for articles which are discarded. As an example, consider the following sentence:

Fluffy was diagnosed with FELV after a blood test showed that he tested positive.

Suppose that "FELV" is a seed for the DISEASE category and "test" is a seed for the TEST category. Two training instances would be created, with feature vectors that look like this, where M represents a modifier inside the target NP:

was_3 diagnosed_2 with_1 after_1 test_2 showed_3 ⇒ DISEASE
with_3 FELV_2 after_1 blood_M showed_1 that_2 he_3 ⇒ TEST

The contextual classifiers are then applied to the corpus to automatically label new instances. We use a confidence score to label only the instances that the classifiers are most certain about. We compute a confidence score for instance i with respect to semantic class C_k by considering both the score of the C_k classifier as well as the scores of the competing classifiers. Intuitively, we have confidence in labeling an instance as category C_k if the C_k classifier gave it a positive score, and its score is much higher than the score of any other classifier. We use the following scoring function:

\[ \mathrm{Confidence}(i, C_k) = \mathrm{score}(i, C_k) - \max_{j \neq k} \mathrm{score}(i, C_j) \]

We employ support vector machines (SVMs) (Joachims, 1999) with a linear kernel as our classifiers, using the SVMlin software (Keerthi and DeCoste, 2005). We use the value produced by the decision function (essentially the distance from the hyperplane) as the score for a classifier. We specify a threshold θ_cf and only assign a semantic tag C_k to an instance i if Confidence(i, C_k) ≥ θ_cf.

All instances that pass the confidence threshold are labeled and added to the training set. This process greatly enhances the diversity of the training data. In this initial learning step, the strictly contextual classifiers substantially increase the number of training instances for each semantic category, producing a more diverse mix of seed-generated instances and context-generated instances.

3.3 Phase 2: Cross-Category Bootstrapping

The next phase of the learning process is an iterative bootstrapping procedure. The key challenge was to design a bootstrapping model that would not succumb to semantic drift and would have sustained momentum to continue learning over many iterations.

Figure 1 shows the design of our cross-category bootstrapping model.[2] We simultaneously train a suite of binary classifiers, one for each semantic category, C_1 ... C_n. After each training cycle, all of the classifiers are applied to the remaining unlabeled instances and each classifier labels the (positive) instances that it is most confident about (i.e., the instances that it classifies with a confidence score ≥ θ_cf). The set of instances positively labeled by classifier C_k are shown as C_k^+ in Figure 1. All of the new instances produced by classifier C_k are then added to the set of positive training instances for C_k and to the set of negative training instances for all of the other classifiers.

[Figure 1: Cross-Category Bootstrapping]

[2] For simplicity, this picture does not depict the initial contextual training step, but that can be viewed as the first iteration in this general framework.

One potential problem with this scheme is that some categories are more prolific than others, plus we are collecting negative instances from a set of competing classifiers. Consequently, this approach can produce highly imbalanced training sets. Therefore we enforced a 3:1 ratio of negatives to positives by randomly selecting a subset of the possible negative instances. We discuss this issue further in Section 4.4.
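
The scoring rule from Section 3.2 can be illustrated with a short Python sketch. It assumes that the raw decision values of the per-category classifiers for one instance are already available as a dictionary; the function names `confidence` and `assign_tag` and the example scores are hypothetical and are not taken from the paper.

```python
from typing import Dict, Optional


def confidence(scores: Dict[str, float], category: str) -> float:
    """Confidence(i, C_k) = score(i, C_k) - max over j != k of score(i, C_j)."""
    rivals = [s for c, s in scores.items() if c != category]
    return scores[category] - max(rivals)


def assign_tag(scores: Dict[str, float], theta_cf: float) -> Optional[str]:
    """Return the category whose confidence reaches theta_cf, else None.

    Only the top-scoring category can have a non-negative confidence,
    so it is the only candidate that needs to be checked.
    """
    best = max(scores, key=lambda c: scores[c])
    if confidence(scores, best) >= theta_cf:
        return best
    return None


# Hypothetical decision values for a single noun phrase instance.
example_scores = {"ANIMAL": 2.7, "DRUG": 0.3, "DISEASE": -1.0}
tag_at_2_0 = assign_tag(example_scores, theta_cf=2.0)
tag_at_2_5 = assign_tag(example_scores, theta_cf=2.5)
```

With a threshold of 2.0 the example instance would be tagged ANIMAL, since its confidence is about 2.4; raising the threshold to 2.5 leaves it unlabeled.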

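The cross-category update and the 3:1 cap on negatives described in Section 3.3 might be sketched as follows. Instances are represented by integer IDs, and the helper names, the fixed random seed, and the toy data are assumptions made for illustration only.

```python
import random
from typing import Dict, List, Optional, Set, Tuple


def cross_category_update(positives: Dict[str, Set[int]],
                          newly_labeled: Dict[str, Set[int]]) -> None:
    """Add each classifier's newly labeled instances to its own positive set.

    Because negatives are derived from the other categories' positives,
    this also makes the new instances available as negatives elsewhere.
    """
    for category, instances in newly_labeled.items():
        positives.setdefault(category, set()).update(instances)


def training_set(category: str,
                 positives: Dict[str, Set[int]],
                 ratio: int = 3,
                 rng: Optional[random.Random] = None) -> Tuple[List[int], List[int]]:
    """Build (positive, negative) ID lists for one binary classifier.

    Negatives are the positives of all competing categories, randomly
    subsampled so that there are at most `ratio` negatives per positive.
    """
    rng = rng or random.Random(0)  # fixed seed is an assumption, for repeatability
    pos = sorted(positives[category])
    pool: Set[int] = set()
    for other, instances in positives.items():
        if other != category:
            pool |= instances
    candidates = sorted(pool)
    limit = ratio * len(pos)
    neg = candidates if len(candidates) <= limit else rng.sample(candidates, limit)
    return pos, neg


# Toy data: one ANIMAL instance, five DRUG instances, one TEST instance.
positives = {"ANIMAL": {1}, "DRUG": {2, 3, 4, 5, 6}, "TEST": {7}}
cross_category_update(positives, {"TEST": {8}})
animal_pos, animal_neg = training_set("ANIMAL", positives)
drug_pos, drug_neg = training_set("DRUG", positives)
```

In the toy data the ANIMAL classifier has one positive and seven candidate negatives, so three negatives are sampled; the DRUG classifier has fewer candidates than the cap allows, so all of them are kept.
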
Cross-category training has two advantages over independent self-training. First, as others have shown for pattern learning and lexicon induction (Thelen and Riloff, 2002; Yangarber, 2003; McIntosh and Curran, 2009), simultaneously training classifiers for multiple categories reduces semantic drift because each classifier is deterred from encroaching on another one's territory (i.e., claiming the instances from a competing class as its own). Second, similar in spirit to co-training,[3] this approach allows each classifier to obtain new training instances from an outside source that has a slightly different perspective. While independent self-training can quickly run out of steam, cross-category training supplies each classifier with a constant stream of new (negative) instances produced by competing classifiers. In Section 4, we will show that cross-category bootstrapping performs substantially better than an independent self-training model, where each classifier is bootstrapped separately.

[3] But technically this is not co-training because our feature sets are all the same.

The feature set for these classifiers is exactly the same as described in Section 3.2, except that we add a new lexical feature that represents the head noun of the target NP (i.e., the NP that needs to be tagged). This allows the classifiers to consider the local context as well as the target word itself when making decisions.

3.4 One Semantic Class Per Discourse

We also explored the idea of using a one semantic class per discourse (OSCPD) heuristic to generate additional training instances during bootstrapping. Inspired by Yarowsky's one sense per discourse heuristic for word sense disambiguation (Yarowsky, 1995), we make the assumption that multiple instances of a word in the same discourse will nearly always correspond to the same semantic class. Since our data set consists of message board posts organized as threads, we consider all posts in the same thread to be a single discourse.

After each training step, we apply the classifiers to the unlabeled data to label some new instances. For each newly labeled instance, the OSCPD heuristic collects all instances with the same head noun in the same discourse (thread) and unilaterally labels them with the same semantic class. This heuristic serves as meta-knowledge to label instances that (potentially) occur in very different contexts, thereby infusing the bootstrapping process with "fresh" training examples.

In early experiments, we found that OSCPD can be aggressive, pulling in many new instances. If the classifier labels a word incorrectly, however, then the OSCPD heuristic will compound the error and mislabel even more instances. Therefore we only apply this heuristic to instances that are labeled with extremely high confidence (θ_cf ≥ 2.5) and that pass a global sanity check, gsc(w) ≥ 0.2, which ensures that a relatively high proportion of labeled instances with the same head noun have been assigned to the same semantic class. Specifically,

\[ \mathrm{gsc}(w) = 0.1 \cdot \frac{w_{l/c}}{w_l} + 0.9 \cdot \frac{w_{u/c}}{w_u} \]

where w_l and w_u are the # of labeled and unlabeled instances, respectively, w_{l/c} is the # of instances labeled as c, and w_{u/c} is the # of unlabeled instances that receive a positive confidence score for c when given to the classifier. The intuition behind the second term is that most instances are initially unlabeled and we want to make sure that many of the unlabeled instances are likely to belong to the same semantic class (even though the classifier isn't ready to commit to them yet).

3.5 Dynamic Semantic Features

For many NLP tasks, classifiers use semantic features to represent the semantic class of words. These features are typically obtained from external resources such as WordNet (Miller, 1990). Our bootstrapping model incrementally trains semantic class taggers, so we explored the idea of using the labels assigned by the classifiers to create enhanced feature vectors by dynamically adding semantic features. This process allows later stages of bootstrapping to directly benefit from earlier stages. For example, consider the sentence:

He started the doxy on Vetsulin today.

If "Vetsulin" was labeled as a DRUG in a previous bootstrapping iteration, then the feature vector representing the context around "doxy" can be enhanced to include an additional semantic feature identifying Vetsulin as a DRUG, which would look like this:

He_2 started_1 on_1 Vetsulin_2 DRUG_2 today_3

Intuitively, the semantic features should help the classifier identify more general contextual patterns, such as "started <X> on DRUG". To create semantic features, we use the semantic tags that
have been assigned to the current set of labeled instances. When a feature vector is created for a target NP, we check every noun instance in its context window to see if it has been assigned a semantic tag, and if so, then we add a semantic feature. In the early stages of bootstrapping, however, relatively few nouns will be assigned semantic tags, so these features are often missing.

3.6 Thresholds and Stopping Criterion

When new instances are automatically labeled during bootstrapping, it is critically important that most of the labels are correct or performance rapidly deteriorates. This suggests that we should only label instances in which the classifier has high confidence. On the other hand, a high threshold often yields few new instances, which can cause the bootstrapping process to sputter and halt.

To balance these competing demands, we used a sliding threshold that begins conservatively but gradually loosens the reins as bootstrapping progresses. Initially, we set θ_cf = 2.0, which only labels instances that the classifier is highly confident about. When fewer than min new instances can be labeled, we automatically decrease θ_cf by 0.2, allowing another batch of new instances to be labeled, albeit with slightly less confidence. We continue decreasing the threshold, as needed, until θ_cf < 1.0, when we end the bootstrapping process. In Section 4, we show that this sliding threshold outperforms fixed threshold values.

4 Evaluation

4.1 Data

Our data set consists of message board posts from the Veterinary Information Network (VIN), which is a web site for professionals in veterinary medicine. Among other things, VIN hosts forums where veterinarians engage in discussions about medical issues, cases in their practices, etc. Over half of the small animal veterinarians in the U.S. and Canada use VIN. Analysis of veterinary data could not only improve pet health care, but also provide early warning signs of infectious disease outbreaks, emerging zoonotic diseases, exposures to environmental toxins, and contamination in the food chain.

We obtained over 15,000 VIN message board threads representing three topics: cardiology, endocrinology, and feline internal medicine. We did basic cleaning, removing html tags and tokenizing numbers. For training, we used 4,629 threads, consisting of 25,944 individual posts. We developed classifiers to identify six semantic categories: ANIMAL, DISEASE/SYMPTOM,[4] DRUG, HUMAN, TEST, and OTHER.

[4] We used a single category for diseases and symptoms because our domain experts had difficulty distinguishing between them. A veterinary consultant explained that the same term (e.g., diabetes) may be considered a symptom in one context if it is secondary to another condition (e.g., pancreatitis) and a disease in a different context if it is the primary diagnosis.

The message board posts contain an abundance of veterinary terminology and jargon, so two domain experts[5] from VIN created a test set (answer key) for our evaluation. We defined annotation guidelines[6] for each semantic category and conducted an inter-annotator agreement study to measure the consistency of the two domain experts on 30 message board posts, which contained 1,473 noun phrases. The annotators achieved a relatively high agreement score of .80. Each annotator then labeled an additional 35 documents, which gave us a test set containing 100 manually annotated message board posts. The table below shows the distribution of semantic classes in the test set.

Animal  Dis/Sym  Drug  Test  Human  Other
612     900      369   404   818    1723

[5] One annotator is a veterinarian and the other is a veterinary technician.

[6] The annotators were also allowed to label an NP as POS Error if it was clearly misparsed. These cases were not used in the evaluation.

To select seed words, we used the procedure proposed by Roark and Charniak (1998), ranking all of the head nouns in the training corpus by frequency and manually selecting the first 10 nouns that unambiguously belong to each category.[7] This process is fast, relatively objective, and guaranteed to yield high-frequency terms, which is important for bootstrapping. We used the Stanford part-of-speech tagger (Toutanova et al., 2003) to identify nouns, and our own simple rule-based NP chunker.

[7] We used 20 seeds for DIS/SYM because we merged two categories and for OTHER because it is a broad catch-all class.

4.2 Baselines

To assess the difficulty of our data set, we evaluated several baselines. The first baseline searches for each head noun in WordNet and labels the noun as category C_k if it has a hypernym synset corresponding to that category. We manually identified the WordNet synsets that, to the best of our ability, seem to most closely correspond
to each semantic class. We do not report WordNet results for TEST because there did not seem to be an appropriate synset, or for the OTHER category because that is a catch-all class. The first row of Table 1 shows the results, which are reported as Recall/Precision/F score.[8] The WordNet baseline yields low recall (21-32%) for every category except HUMAN, which confirms that many veterinary terms are not present in WordNet. The surprisingly low precision for some categories is due to atypical word uses (e.g., patient, boy, and girl are HUMAN in WordNet but nearly always ANIMALS in our domain), and overgeneralities (e.g., WordNet lists calcium as a DRUG).

Method               Animal     Dis/Sym    Drug       Test       Human      Other      Avg
WordNet              32/80/46   21/81/34   25/35/29   NA         62/66/64   NA         35/66/45.8
Seeds                38/100/55  14/99/25   21/97/35   29/94/45   80/99/88   18/93/30   37/98/53.1
Supervised           67/94/78   20/88/33   24/96/39   34/90/49   79/99/88   31/91/46   45/94/60.7
Ind. Self-Train I.13 61/84/71   39/80/52   53/77/62   55/70/61   81/96/88   30/82/44   58/81/67.4
------------------------------------------------------------------------------------------------
Contextual I.1       59/77/67   33/84/47   42/80/55   49/77/59   82/93/87   33/80/47   53/82/64.3
XCategory I.45       86/71/78   57/82/67   70/78/74   73/65/69   85/92/89   46/82/59   75/78/76.1
XCat+OSCPD I.40      86/69/77   59/81/68   72/70/71   72/69/71   86/92/89   50/81/62   75/76/75.6
XCat+OSCPD+SF I.39   86/70/77   60/81/69   69/81/75   73/69/71   86/91/89   50/81/62   75/78/76.6

Table 1: Experimental results, reported as Recall/Precision/F score

[8] We use an F(1) score, where recall and precision are equally weighted.

The second baseline simply labels every instance of a seed with its designated semantic class. All non-seed instances remain unlabeled. As expected, the seeds produce high precision but low recall. The exception is HUMAN, where 80% of the instances match a seed word, undoubtedly because five of the ten HUMAN seeds are 1st and 2nd person pronouns, which are extremely common.

A third baseline trains semantic classifiers using supervised learning by performing 10-fold cross-validation on the test set. The feature set and classifier settings are exactly the same as with our bootstrapped classifiers.[9] Supervised learning achieves good precision but low recall for all categories except ANIMAL and HUMAN. In the next section, we present the experimental results for our bootstrapped classifiers.

[9] For all of our classifiers, supervised and bootstrapped, we label all instances of the seed words first and then apply the classifiers to the unlabeled (non-seed) instances.

4.3 Results for Bootstrapped Classifiers

The bottom section of Table 1 displays the results for our bootstrapped classifiers. The Contextual I.1 row shows results after just the first iteration, which trains only the strictly contextual classifiers. The average F score improved from 53.1 for the seeds alone to 64.3 with the contextual classifiers. The next row, XCategory I.45, shows the results after cross-category bootstrapping, which ended after 45 iterations. (We indicate the number of iterations until bootstrapping ended using the notation I.#.) With cross-category bootstrapping, the average F score increased from 64.3 to 76.1. A closer inspection reveals that all of the semantic categories except HUMAN achieved large recall gains. And importantly, these recall gains were obtained with relatively little loss of precision, with the exception of TEST.

Next, we measured the impact of the one-semantic-class-per-discourse heuristic, shown as XCat+OSCPD I.40. From Table 1, it appears that OSCPD produced mixed results: recall increased by 1-4 points for DIS/SYM, DRUG, HUMAN, and OTHER, but precision was inconsistent, improving by +4 for TEST but dropping by -8 for DRUG. However, this single snapshot in time does not tell the full story. Figure 2 shows the performance of the classifiers during the course of bootstrapping. The OSCPD heuristic produced a steeper learning curve, and consistently improved performance until the last few iterations when its performance dipped. This is probably due to the fact that noise gradually increases during bootstrapping, so incorrect labels are more likely and OSCPD will compound any mistakes by the classifier. A good future strategy might be to use the OSCPD heuristic only during the early stages of bootstrapping when the classifier's decisions are most reliable.

[Figure 2: Average F scores after each iteration. Curves: independent self-training, cross-category bootstrapping, +OSCPD, +OSCPD+SemFeat. Axes: F measure (%) against # of iterations.]

We also evaluated the effect of dynamically created semantic features. When added to the basic XCategory system, they had almost no effect. We suspect this is because the semantic features are sparse during most of the bootstrapping process. However, the semantic features did improve performance when coupled with the OSCPD heuristic, presumably because the OSCPD heuristic aggressively labels more instances in the earlier stages of bootstrapping, increasing the prevalence of semantic class tags. The XCat+OSCPD+SF I.39 row in Table 1 shows that the semantic features coupled with OSCPD dramatically increased the precision for DRUG, yielding the best overall F score of 76.6.

We conducted one additional experiment to assess the benefits of cross-category bootstrapping. We created an analogous suite of classifiers using self-training, where each classifier independently labels the instances that it is most confident about, adds them only to its own training set, and then retrains itself. The Ind. Self-Train I.13 row in Table 1 shows that these classifiers achieved only 58% recall (compared to 75% for XCategory) and an average F score of 67.4 (compared to 76.1 for XCategory).

[...] recall steadily improves while precision stays consistently high with only a slight dropoff at the end.

[Figure 3: Recall and Precision scores during cross-category bootstrapping. Curves: Precision, Recall. x-axis: # of iterations.]

4.4 Analysis

To assess the impact of corpus size, we generated a learning curve with randomly selected subsets of the training texts. Figure 4 shows the average F score of our best system using 1/16, 1/8, 1/4, 1/2, 3/4, and all of the data. With just 1/16th of the training set, the system has about 1,600 message board posts to use for training, which yields a similar F score (roughly 61%) as the supervised baseline that used 100 manually annotated posts via 10-fold cross-validation. So with 16 times more text, seed-based bootstrapping achieves roughly the same results as supervised learning. This result reflects the natural trade-off between supervised learning and seed-based bootstrapping. Supervised learning exploits manually annotated data, but must make do with a relatively small amount of training text because manual annotations are expensive. In contrast,
One reason for the disparity is that seed-based bootstrapping exploits a small number the self-training model ended after just 13 of human-provided seeds, but needs a larger set of boot- (unannotated) texts for training because the seeds strapping cycles (I.13), given the same threshold produce relatively sparse annotations of the texts. values. To see if we could push it further, we low- An additional advantage of seed-based boot- ered the confidence threshold to 0 and it continued strapping methods is that they can easily exploit learning through 35 iterations. Even so, its final unlimited amounts of training text. For many do- score was 65% recall with 79% precision, which is mains, large text collections are readily available. still well below the 75% recall with 78% precision Figure 4 shows a steady improvement in perfor- produced by the XCategory model. These results mance as the amount of training text grows. Over- support our claim that cross-category bootstrap- all, the F score improves from roughly 61%

    ping is more effective than independently self- to

    trained models. nearly 77% simply by giving the system access to Figure 3 tracks the recall and precision scores more unannotated text during bootstrapping. of the XCat+OSCPD+SF system as bootstrap- We also evaluated the effectiveness of our slid- ping progresses. This graph shows the sustained ing confidence threshold (Section 3.6). The ta-

    ble below shows the results using fixed thresholds momentum of cross-category bootstrapping: re-
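The sliding schedule can be pictured with a few lines of Python. This is only a minimal sketch under stated assumptions: the numeric values (start at 2.0, end at 1.0, step 0.2, relax when fewer than 3000 instances are newly labeled) are the ones reported in this section, while the function name `sliding_thresholds`, its parameters, and the stopping rule are hypothetical and are not taken from the system evaluated here.

```python
from typing import Iterable, List


def sliding_thresholds(new_label_counts: Iterable[int],
                       start: float = 2.0,
                       floor: float = 1.0,
                       step: float = 0.2,
                       min_new: int = 3000) -> List[float]:
    """Return the confidence threshold in force at each bootstrapping iteration.

    new_label_counts holds, per iteration, how many instances were newly
    labeled (hypothetical input; a real system would observe these online).
    """
    threshold = start
    used = []
    for count in new_label_counts:
        used.append(threshold)
        if count < min_new:
            if threshold <= floor:
                # Assumed stopping rule: yield is low and the threshold
                # cannot be relaxed any further.
                break
            # Relax the threshold by one step, never going below the floor.
            threshold = max(floor, round(threshold - step, 1))
    return used
```

For example, `sliding_thresholds([5000, 2500, 4000, 2000])` returns `[2.0, 2.0, 1.8, 1.8]`: the threshold only drops after an iteration whose yield falls under 3000 new labels.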

The table compares fixed thresholds of 1.0, 1.5, and 2.0 with the sliding threshold, which begins at 2.0 and ends at 1.0, decreasing by 0.2 when the number of newly labeled instances falls below 3000 (i.e., < 500 per category, on average). It depicts the expected trade-off between recall and precision for the fixed thresholds, with higher thresholds producing higher precision but lower recall. The sliding threshold produces the best F score, achieving the best balance of high recall and precision.

Confidence threshold    R/P/F
1.0                     71/77/74.1
1.5                     69/81/74.7
2.0                     65/82/72.4
Sliding                 75/78/76.6

As mentioned in Section 3.3, we used 3 times as many negative instances as positive instances for every semantic category during bootstrapping. This ratio was based on early experiments where we needed to limit the number of negative instances per category because the cross-category framework naturally produces an extremely skewed negative/positive training set. We revisited this issue to empirically assess the impact of the negative/positive ratio on performance. The table below shows recall, precision, and F score results when we vary the ratio from 1:1 to 5:1. A 1:1 ratio seems to be too conservative, improving precision a bit but lowering recall. However, the difference in performance between the other ratios is small. Our conclusion is that a 1:1 ratio is too restrictive but, in general, the cross-category bootstrapping process is relatively insensitive to the specific negative/positive ratio used. Our observation from preliminary experiments, however, is that the negative/positive ratio does need to be controlled, or else the dominant categories overwhelm the less frequent categories with negative instances.

Neg:Pos    R/P/F
1:1        72/79/75.2
2:1        74/78/76.1
3:1        75/78/76.6
4:1        75/77/76.0
5:1        76/77/76.4

Figure 4: Learning Curve (F measure (%) vs. portion of the data used: 1/16, 1/8, 1/4, 1/2, 3/4, 1)

Finally, we examined performance on gendered pronouns (he/she/him/her), which can refer to either animals or people in the veterinary domain. 84% (220/261) of the gendered pronouns were annotated as ANIMAL in the test set. Our classifier achieved 95% recall (209/220) and 90% precision (209/232) for ANIMAL and 15% recall (6/41) and 100% precision (6/6) for HUMAN. So while it failed to recognize most of the (relatively few) gendered pronouns that refer to a person, it was highly effective at identifying the ANIMAL references and it was always correct when it did assign a HUMAN tag to a pronoun.

5 Conclusions

We presented a novel technique for inducing domain-specific semantic class taggers from a handful of seed words and an unannotated text collection. Our results showed that the induced taggers achieve good performance on six semantic categories associated with the domain of veterinary medicine. Our technique allows semantic class taggers to be rapidly created for specialized domains with minimal human effort. In future work, we plan to investigate whether these semantic taggers can be used to improve other tasks.

Acknowledgments

We are very grateful to the people at the Veterinary Information Network for providing us access to their resources. Special thanks to Paul Pion, DVM and Nicky Mastin, DVM for making their data available to us, and to Sherri Lofing and Becky Lundgren, DVM for their time and expertise in creating the gold standard annotations. This research was supported in part by Department of Homeland Security Grant N0014-07-1-0152 and Air Force Contract FA8750-09-C-0172 under the DARPA Machine Reading Program.

References

ACE. 2005. NIST ACE evaluation website.

ACE. 2007. NIST ACE evaluation website.

ACE. 2008. NIST ACE evaluation website.

Daniel M. Bikel, Scott Miller, Richard Schwartz, and Ralph Weischedel. 1997. Nymble: a high-performance learning name-finder. In Proceedings of ANLP-97, pages 194-201.

A. Blum and T. Mitchell. 1998. Combining Labeled and Unlabeled Data with Co-Training. In Proceedings of the 11th Annual Conference on Computational Learning Theory (COLT-98).

Andrew Carlson, Justin Betteridge, Estevam R. Hruschka Jr., and Tom M. Mitchell. 2009. Coupling semi-supervised learning of categories and relations. In HLT-NAACL 2009 Workshop on Semi-Supervised Learning for NLP.

M. Collins and Y. Singer. 1999. Unsupervised Models for Named Entity Classification. In Proceedings of the Joint SIGDAT Conference on Empirical Methods in Natural Language Processing and Very Large Corpora (EMNLP/VLC-99).

S. Cucerzan and D. Yarowsky. 1999. Language Independent Named Entity Recognition Combining Morphological and Contextual Evidence. In Proceedings of the Joint SIGDAT Conference on Empirical Methods in Natural Language Processing and Very Large Corpora (EMNLP/VLC-99).

O. Etzioni, M. Cafarella, D. Downey, A. Popescu, T. Shaked, S. Soderland, D. Weld, and A. Yates. 2005. Unsupervised named-entity extraction from the web: an experimental study. Artificial Intelligence, 165(1):91-134, June.

M.B. Fleischman and E.H. Hovy. 2002. Fine grained classification of named entities. In Proceedings of the COLING conference, August.

T. Joachims. 1999. Making Large-Scale Support Vector Machine Learning Practical. In A. Smola, B. Schölkopf, and C. Burges, editors, Advances in Kernel Methods: Support Vector Machines. MIT Press, Cambridge, MA.

S. Keerthi and D. DeCoste. 2005. A Modified Finite Newton Method for Fast Solution of Large Scale Linear SVMs. Journal of Machine Learning Research.

Mamoru Komachi, Taku Kudo, Masashi Shimbo, and Yuji Matsumoto. 2008. Graph-based analysis of semantic drift in espresso-like bootstrapping algorithms. In Proceedings of the 2008 Conference on Empirical Methods in Natural Language Processing.

Z. Kozareva, E. Riloff, and E. Hovy. 2008. Semantic Class Learning from the Web with Hyponym Pattern Linkage Graphs. In Proceedings of the 46th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies (ACL-08).

D. McClosky, E. Charniak, and M. Johnson. 2006. Effective self-training for parsing. In HLT-NAACL-2006.

T. McIntosh and J. Curran. 2009. Reducing Semantic Drift with Bagging and Distributional Similarity. In Proceedings of the 47th Annual Meeting of the Association for Computational Linguistics.

R. Mihalcea. 2004. Co-training and Self-training for Word Sense Disambiguation. In CoNLL-2004.

G. Miller. 1990. Wordnet: An On-line Lexical Database. International Journal of Lexicography, 3(4).

C. Mueller, S. Rapp, and M. Strube. 2002. Applying co-training to reference resolution. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics.

V. Ng and C. Cardie. 2003. Weakly supervised natural language learning without redundant views. In HLT-NAACL-2003.

V. Ng. 2007. Semantic Class Induction and Coreference Resolution. In Proceedings of the 45th Annual Meeting of the Association for Computational Linguistics.

Cheng Niu, Wei Li, Jihong Ding, and Rohini K. Srihari. 2003. A bootstrapping approach to named entity classification using successive learners. In Proceedings of the 41st Annual Meeting on Association for Computational Linguistics (ACL-03), pages 335-342.

M. Paşca. 2004. Acquisition of categorized named entities for web search. In Proc. of the Thirteenth ACM International Conference on Information and Knowledge Management, pages 137-145.

W. Phillips and E. Riloff. 2002. Exploiting Strong Syntactic Heuristics and Co-Training to Learn Semantic Lexicons. In Proceedings of the 2002 Conference on Empirical Methods in Natural Language Processing, pages 125-132.

E. Riloff and R. Jones. 1999. Learning Dictionaries for Information Extraction by Multi-Level Bootstrapping. In Proceedings of the Sixteenth National Conference on Artificial Intelligence.

E. Riloff and J. Shepherd. 1997. A Corpus-Based Approach for Building Semantic Lexicons. In Proceedings of the Second Conference on Empirical Methods in Natural Language Processing, pages 117-124.

B. Roark and E. Charniak. 1998. Noun-phrase Co-occurrence Statistics for Semi-automatic Semantic Lexicon Construction. In Proceedings of the 36th Annual Meeting of the Association for Computational Linguistics, pages 1110-1116.
