
# Learning to Identify Unexpected Instances in the Test Set


Xiao-Li Li, Institute for Infocomm Research, 21 Heng Mui Keng Terrace, Singapore 119613, xlli@i2r.a-star.edu.sg

Bing Liu, Computer Science Department, University of Illinois at Chicago, IL 60607-7053, liub@cs.uic.edu

See-Kiong Ng, Institute for Infocomm Research, 21 Heng Mui Keng Terrace, Singapore 119613, skng@i2r.a-star.edu.sg

1. Introduction

Poster overview: Problem Definition; Baseline: PU learning; LGN Algorithm.

Intuition:

1. A positive feature w+ will have a similar distribution in both P and U.
2. If a feature in U has a significantly different distribution in P and in U, then it is likely a negative feature w-.

These negative features are used to generate the artificial negative set A_N.

Problem: some of the test or future instances may not belong to any of the predefined classes of the original training set. The test set may contain additional unknown subclasses, or new subclasses may arise as the underlying domain evolves over time.

2. The Proposed LGN Algorithm

2.1 How to generate negative data

Idea: our strategy is to exploit the difference between the distributions of w+ and w- to generate an effective set of artificial negative documents A_N.

An entropy-based method is used to estimate whether a feature w_i in U has significantly different conditional probabilities in P and in U (i.e., Pr(w_i|+) and Pr(w_i|-)). The entropy equation is:

entropy(w_i) = - Σ_{c ∈ {+, -}} Pr(w_i|c) * log(Pr(w_i|c))

The entropy indicates whether a feature belongs to the positive or the negative class:

w_i = w-: entropy(w_i) is small, i.e., Pr(w_i|-) (w_i mainly occurring in U) is significantly larger than Pr(w_i|+).

w_i = w+: entropy(w_i) is large, i.e., Pr(w_i|+) and Pr(w_i|-) are similar.

We generate features for A_N based on the entropy information, weighted as follows:

q(w_i) = 1 - entropy(w_i) / max_{j = 1, 2, ..., |V|} entropy(w_j)
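The entropy and weight computations above can be sketched as follows. This is a minimal illustration, assuming the pair Pr(w|+), Pr(w|-) has been normalized to sum to 1 as the entropy formula implicitly requires; the feature names and probability values are invented for the example:

```python
import math

def entropy(p_pos, p_neg):
    """entropy(w) = -sum_c Pr(w|c) * log(Pr(w|c)) over c in {+, -}."""
    e = 0.0
    for p in (p_pos, p_neg):
        if p > 0:  # 0 * log(0) is taken as 0
            e -= p * math.log(p)
    return e

def weights(cond_probs):
    """cond_probs: dict feature -> (Pr(w|+), Pr(w|-)).
    Returns q(w) = 1 - entropy(w) / max_j entropy(w_j)."""
    ent = {w: entropy(pp, pn) for w, (pp, pn) in cond_probs.items()}
    max_e = max(ent.values())
    return {w: 1.0 - e / max_e for w, e in ent.items()}

# A feature distributed evenly across P and U has maximal entropy, so
# q = 0; a feature concentrated in U has low entropy, so q approaches 1.
q = weights({
    "graphics": (0.5, 0.5),   # uniform in P and U
    "hockey":   (0.05, 0.95), # mainly in U: likely negative feature
})
```

The weight q(w) thus acts as a per-feature score of how discriminative the feature is for the hidden negative class.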

LGN generates the artificial negative documents A_N according to q(w_i). In this way, those features that are deemed more discriminatory are generated more frequently in A_N:

- q(w_i) = 0: w_i occurs uniformly in both P and U, and we therefore do not generate w_i in A_N.
- q(w_i) = 1: w_i is a negative feature, and we generate it for A_N based on its distribution in U.
- 0 < q(w_i) < 1: w_i is generated, following a Gaussian distribution, according to q(w_i).
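The generation step can be sketched under simple assumptions. The summary above does not fully specify the document-length model or the Gaussian sampling detail, so the sketch below simply samples tokens with weight proportional to q(w) times the feature's frequency in U; all names and counts are illustrative:

```python
import random

def generate_negative_doc(q, freq_in_U, size=100, seed=0):
    """Sample tokens for the artificial negative document A_N.

    q:         dict feature -> q(w) weight in [0, 1]
    freq_in_U: dict feature -> occurrence count of the feature in U
    size:      number of tokens to generate

    Features with q(w) = 0 are never generated; otherwise a feature's
    sampling weight is q(w) * freq_in_U(w), so discriminative features
    that are frequent in U dominate A_N.
    """
    rng = random.Random(seed)
    candidates = [w for w in q if q[w] > 0]
    sample_weights = [q[w] * freq_in_U.get(w, 0) for w in candidates]
    return rng.choices(candidates, weights=sample_weights, k=size)

# "graphics" has q = 0 (uniform in P and U) and is excluded entirely.
doc = generate_negative_doc(
    {"graphics": 0.0, "hockey": 0.9, "nhl": 0.7},
    {"graphics": 40, "hockey": 30, "nhl": 10},
)
```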

2.2 Build Final Classifier

An NB (naive Bayes) classifier is built with the positive set P and the generated single negative document A_N to identify unexpected document instances.
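The final step can be sketched with a standard multinomial naive Bayes over two pseudo-classes, P versus A_N. This is a generic NB stand-in rather than the paper's exact estimator, and the toy documents and tokenization are assumptions for the example:

```python
import math
from collections import Counter

def train_nb(pos_docs, neg_docs):
    """Multinomial naive Bayes with Laplace smoothing, classes '+' and '-'."""
    counts = {"+": Counter(), "-": Counter()}
    for d in pos_docs:
        counts["+"].update(d)
    for d in neg_docs:
        counts["-"].update(d)
    vocab = set(counts["+"]) | set(counts["-"])
    priors = {"+": len(pos_docs), "-": len(neg_docs)}
    total = sum(priors.values())
    model = {}
    for c in ("+", "-"):
        denom = sum(counts[c].values()) + len(vocab)
        model[c] = (
            math.log(priors[c] / total),
            {w: math.log((counts[c][w] + 1) / denom) for w in vocab},
            math.log(1 / denom),  # smoothed probability for unseen words
        )
    return model

def classify(model, doc):
    """Return '+' for an expected document, '-' for an unexpected one."""
    best, best_score = None, float("-inf")
    for c, (log_prior, log_probs, log_unseen) in model.items():
        score = log_prior + sum(log_probs.get(w, log_unseen) for w in doc)
        if score > best_score:
            best, best_score = c, score
    return best

# P = positive training documents; A_N = generated negative document(s).
P = [["graphics", "image", "render"], ["image", "pixel", "graphics"]]
A_N = [["hockey", "nhl", "hockey", "goal"]]
model = train_nb(P, A_N)
```

A test document scoring higher under the A_N pseudo-class is flagged as unexpected.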


3. Experiments

The benchmark 20 Newsgroup collection is used for evaluation.

[Figure: F-scores of LGN, S-EM, Roc-SVM, PEBL, and OSVM for different percentages (5% to 100%) of unexpected documents in U in the 2-class experiments.]

100.0

80.0

60.0

LGNS-EMRoc-SVM40.0F-scorePEBLOSVM

20.0

0.0

5%10%15%20%40%60%80%100%

a% of unexpected documents

Different percentages of unexpected documents in U in the 3-classes experiments.

4. Conclusion

1. Unexpected instances may occur in real-world classification applications.
2. LGN performed significantly better than existing techniques when the proportion of unexpected instances is low. The method is also robust irrespective of the proportion of unexpected instances in the test set.
