
# CSCE 822 Information as a Measure


CSCE 822 Data Mining & Knowledge Discovery

## Information as a Measure [1]

Several books were written in the late 1950s to formally define the concept of information as a measure of surprise (or the lack of it), or as the uncertainty of outcomes. These books were inspired by earlier work by Shannon and Wiener, who independently arrived at the same expression for average information.

Let X be a random variable associated to a sample space having n mutually exclusive events, such that

$$E = [e_1, e_2, \dots, e_n] \quad \text{with} \quad \bigcup_{k=1}^{n} e_k = \Omega$$

$$P = [p_1, p_2, \dots, p_n] \quad \text{with} \quad \sum_{k=1}^{n} p_k = 1$$

Let E(X) be some function such that, if experiments are conducted many times, the averages of X will approach E(X). Shannon and Wiener suggested the expression below to quantify the average uncertainty (or chaos, or disorder, or entropy) associated to a complete sample space:

$$H(X) = -\sum_{i=1}^{n} p_i \log p_i$$

[1] A measure is a rather precise definition (involving such things as σ-algebras) which makes it difficult to understand for non-mathematicians such as myself. All we need to know here is that this and other definitions form the basis for much of what is known as Mathematical Analysis. For example, every definition of an integral is based on a particular measure. The study of measures and their application to integration is known as Measure Theory.

University of South Carolina 1 Juan E. Vargas- CSCE

For each event $e_k$ there is a value, or quantity, $x_k$, such that

$$x_k = -\log P\{e_k\} = -\log p_k$$

The term $-\log(p_k)$ is called the amount of self-information associated to the event $e_k$. The unit of information, called a bit (binary unit), is equivalent to the amount of information associated with selecting one event from the set. The average amount of information, called Entropy, is defined for a sample space of equally probable events $E_k$ as:

$$H(X) = I(E_k) = -\sum_{k=1}^{n} p_k \log p_k$$

If we have a fair coin, such that p(H) = p(T) = 1/2, then

$$H(E_k) = H(E_1 \cup E_2) = -\tfrac{1}{2}\log(\tfrac{1}{2}) - \tfrac{1}{2}\log(\tfrac{1}{2}) = \log 2 = 1 \text{ bit}$$

Note that $I(E_1) = I(E_2) = -\log(1/2) = 1$ bit. Extending this example,

if we have a sample space with $2^N$ equally probable events $E_k$ ($k = 1, 2, \dots, 2^N$), then

$$I(E_k) = -\log p_k = -\log(2^{-N}) = N \text{ bits}$$

Example:

$$E_a = [A_1, A_2], \quad P = [1/256,\ 255/256] \;\Rightarrow\; I(A_k) = 0.0369 \text{ bit}$$

$$E_b = [B_1, B_2], \quad P = [1/2,\ 1/2] \;\Rightarrow\; I(B_k) = 1 \text{ bit}$$

suggesting that it is easier to guess the value of $A_k$ than to guess the value of $B_k$.
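These figures can be checked numerically. A minimal sketch in plain Python (the labels A and B are taken from the example above):

```python
import math

def entropy(probs):
    """H = -sum p log2 p, in bits; terms with p = 0 contribute nothing."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

# One of 2^N equally probable events carries -log2(2**-N) = N bits.
N = 8
print(-math.log2(2 ** -N))        # 8.0

# The two sample spaces from the example.
H_a = entropy([1/256, 255/256])
H_b = entropy([1/2, 1/2])
print(round(H_a, 4), H_b)         # 0.0369 1.0
```

The near-certain space $E_a$ carries almost no average information, while the maximally uncertain $E_b$ carries a full bit.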


The measure H complies with the following axioms:

1. Continuity: If the probabilities of events change, the entropy associated to the system changes accordingly.

2. Symmetry: H is invariant to the order of events, i.e., $H(p_1, p_2, \dots, p_n) = H(p_2, p_1, \dots, p_n)$.

3. Extremal Value: The value of H is largest when all events are equally likely, because it is then most uncertain which event could occur.

4. Additivity: $H_2 = H_1 + p_m H_m$ when the $m$-th event is a composition of other events.
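The symmetry and extremal-value axioms are easy to check numerically; a small sketch, using an illustrative four-event distribution of my own choosing:

```python
import math

def entropy(probs):
    return -sum(p * math.log2(p) for p in probs if p > 0)

uniform = [0.25, 0.25, 0.25, 0.25]
skewed = [0.7, 0.1, 0.1, 0.1]

# Symmetry: H is invariant to the order of the events.
print(abs(entropy([0.1, 0.7, 0.1, 0.1]) - entropy(skewed)) < 1e-12)  # True

# Extremal value: the uniform distribution attains the maximum, log2(n).
print(entropy(uniform))                    # 2.0
print(entropy(skewed) < entropy(uniform))  # True
```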

The Shannon-Wiener formulation for Entropy gained popularity due to its simplicity and its axiomatic properties. To illustrate, consider an earlier definition of information, due to R. A. Fisher, who essentially defined it as an average second moment in a sample distribution with density f(x) and mean m:

$$I = \int \left[\frac{\partial \ln f(x)}{\partial m}\right]^2 f(x)\, dx$$

Thus, for example, expressing the normal distribution

$$f(x) = \frac{1}{\sigma\sqrt{2\pi}} \exp\left[-\frac{(x-m)^2}{2\sigma^2}\right]$$


in logarithmic form, deriving with respect to the mean, and taking the integral:

$$\ln f(x) = -\ln(\sigma\sqrt{2\pi}) - \frac{(x-m)^2}{2\sigma^2}$$

$$\frac{\partial \ln f}{\partial m} = \frac{x-m}{\sigma^2}$$

$$I = \frac{1}{\sigma^4}\int (x-m)^2\, \frac{1}{\sigma\sqrt{2\pi}} \exp\left[-\frac{(x-m)^2}{2\sigma^2}\right] dx = \frac{1}{\sigma^2}$$
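As a sanity check on this derivation, one can integrate the squared score against the density numerically. A sketch using a simple midpoint rule, with the mean and standard deviation chosen arbitrarily:

```python
import math

def fisher_info_normal(m, sigma, n=100_000, half_width=10.0):
    """Numerically integrate [d ln f/dm]^2 f(x) dx for a normal density.

    The score is d ln f/dm = (x - m)/sigma^2, so the integral
    should come out to 1/sigma^2."""
    a = m - half_width * sigma
    b = m + half_width * sigma
    dx = (b - a) / n
    total = 0.0
    for i in range(n):
        x = a + (i + 0.5) * dx  # midpoint of each slice
        f = math.exp(-(x - m) ** 2 / (2 * sigma ** 2)) / (sigma * math.sqrt(2 * math.pi))
        score = (x - m) / sigma ** 2
        total += score ** 2 * f * dx
    return total

print(round(fisher_info_normal(0.0, 2.0), 4))   # 0.25, i.e. 1/sigma^2
```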

The Shannon-Wiener expression for information can be generalized for 2-dimensional probability schemes and, by induction, to any n-dimensional probability scheme. Let $\Omega_1$ and $\Omega_2$ be two discrete sample spaces with sets $\{E\} = [E_1, E_2, \dots, E_n]$ and $\{F\} = [F_1, F_2, \dots, F_m]$.

We can have three complete sets of probability schemes:

$$P\{E\} = [P\{E_k\}]$$

$$P\{F\} = [P\{F_j\}]$$

$$P\{EF\} = [P\{E_k F_j\}]$$

The joint probability matrix is given by:

$$[P\{X,Y\}] = \begin{bmatrix} p\{1,1\} & p\{1,2\} & \dots & p\{1,m\} \\ p\{2,1\} & p\{2,2\} & \dots & p\{2,m\} \\ \vdots & \vdots & & \vdots \\ p\{n,1\} & p\{n,2\} & \dots & p\{n,m\} \end{bmatrix}$$

We can obtain the marginal probabilities for each variable as in:

$$P\{x_k\} = \sum_{j=1}^{m} p\{x_k, y_j\}$$

i.e., each $p(x_k)$ is the sum of row $k$ of the joint matrix:

$$[P\{X,Y\}] = \begin{bmatrix} p\{1,1\} & p\{1,2\} & \dots & p\{1,m\} \\ p\{2,1\} & p\{2,2\} & \dots & p\{2,m\} \\ \vdots & \vdots & & \vdots \\ p\{n,1\} & p\{n,2\} & \dots & p\{n,m\} \end{bmatrix} \begin{matrix} \to p(x_1) \\ \to p(x_2) \\ \vdots \\ \to p(x_n) \end{matrix}$$

$$P\{x_k\} = \sum_{\{x,y\}\setminus y} p(x,y) = \sum_{j=1}^{m} p\{x_k, y_j\}$$

$$P\{x_1\} = P\{E_1\} = P\{E_1F_1 \cup E_1F_2 \cup \dots \cup E_1F_m\} = p\{1,1\} + p\{1,2\} + \dots + p\{1,m\}$$

and

$$P\{y_j\} = \sum_{\{x,y\}\setminus x} p(x,y) = \sum_{k=1}^{n} p\{x_k, y_j\}$$

From the matrix we can also compute the total and marginal entropies, H(X), H(Y), and H(X,Y):

$$H(X,Y) = -\sum_{k=1}^{n}\sum_{j=1}^{m} p\{k,j\} \log p\{k,j\}$$

$$H(X) = -\sum_{k=1}^{n}\left(\sum_{j=1}^{m} p\{k,j\}\right) \log\left(\sum_{j=1}^{m} p\{k,j\}\right) = -\sum_{k=1}^{n} p\{x_k\} \log p\{x_k\}$$


$$H(Y) = -\sum_{j=1}^{m}\left(\sum_{k=1}^{n} p\{k,j\}\right) \log\left(\sum_{k=1}^{n} p\{k,j\}\right) = -\sum_{j=1}^{m} p\{y_j\} \log p\{y_j\}$$

Note that to obtain H we must find the corresponding $p(x_k)$ and $p(y_j)$ first. To better understand the calculations involved in H(X,Y) versus H(X) and H(Y), let m = n = 3. Then

$$H(X,Y) = -\sum_{k=1}^{3}\sum_{j=1}^{3} p\{k,j\} \log p\{k,j\}$$

$$\begin{aligned} H(X,Y) = {} & -p(1,1)\log p(1,1) - p(1,2)\log p(1,2) - p(1,3)\log p(1,3) \\ & -p(2,1)\log p(2,1) - p(2,2)\log p(2,2) - p(2,3)\log p(2,3) \\ & -p(3,1)\log p(3,1) - p(3,2)\log p(3,2) - p(3,3)\log p(3,3) \end{aligned}$$

while

$$H(X) = -\sum_{k=1}^{3}\left(\sum_{j=1}^{3} p\{k,j\}\right) \log\left(\sum_{j=1}^{3} p\{k,j\}\right)$$

$$\begin{aligned} H(X) = {} & -[p(1,1)+p(1,2)+p(1,3)]\,\log[p(1,1)+p(1,2)+p(1,3)] \\ & -[p(2,1)+p(2,2)+p(2,3)]\,\log[p(2,1)+p(2,2)+p(2,3)] \\ & -[p(3,1)+p(3,2)+p(3,3)]\,\log[p(3,1)+p(3,2)+p(3,3)] \end{aligned}$$
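The 3×3 case can be worked through in code. The joint distribution below is an arbitrary illustrative choice, not taken from the text:

```python
import math

def entropy(probs):
    return -sum(p * math.log2(p) for p in probs if p > 0)

# Hypothetical 3x3 joint distribution p{k,j}; rows index x_k, columns y_j.
P = [[0.125, 0.0625, 0.0625],
     [0.0625, 0.25, 0.0625],
     [0.0625, 0.0625, 0.25]]

# Marginals: sum across each row for p(x_k), down each column for p(y_j).
px = [sum(row) for row in P]
py = [sum(P[k][j] for k in range(3)) for j in range(3)]

H_xy = entropy([p for row in P for p in row])
H_x = entropy(px)
H_y = entropy(py)
print(H_x, H_y, H_xy)   # note H(X,Y) <= H(X) + H(Y)
```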

We can also compute the conditional entropies. Due to the addition theorem of probability, the union of the $E_k$ is the complete space:

$$\bigcup_{k=1}^{n} E_k = \Omega$$

Therefore, marginalizing over $E_k$:

$$F_j = \bigcup_{k=1}^{n} E_k F_j$$

From Bayes' theorem, $p(x,y) = p(x|y)\,p(y) = p(y|x)\,p(x)$; therefore:

$$P\{X = x_k \mid Y = y_j\} = \frac{P\{X = x_k \cap Y = y_j\}}{P\{Y = y_j\}}$$

$$p\{x_k \mid y_j\} = \frac{p\{k,j\}}{p\{y_j\}}$$

where $p\{y_j\}$ is the $j$-th marginal; and so

$$H(X|Y) = -\sum_{j=1}^{m}\sum_{k=1}^{n} p\{x_k, y_j\} \log p\{x_k \mid y_j\}$$

$$H(Y|X) = -\sum_{k=1}^{n}\sum_{j=1}^{m} p\{x_k, y_j\} \log p\{y_j \mid x_k\}$$

Note again that the two equations imply that you have to compute the marginals first. From Bayes' theorem, $p(x,y) = p(x|y)\,p(y) = p(y|x)\,p(x)$, we can write:

H(X,Y) = H(X|Y) + H(Y) = H(Y|X) + H(X)
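This identity can be verified numerically for any joint distribution. A small sketch, using an arbitrary 2×2 joint distribution of my own (not from the text):

```python
import math

def entropy(probs):
    return -sum(p * math.log2(p) for p in probs if p > 0)

# Arbitrary 2x2 joint distribution; rows index x_k, columns y_j.
P = [[0.4, 0.1],
     [0.2, 0.3]]
px = [sum(row) for row in P]
py = [P[0][0] + P[1][0], P[0][1] + P[1][1]]

H_xy = entropy([p for row in P for p in row])
# H(X|Y) = -sum p(x,y) log p(x|y), with p(x|y) = p(x,y)/p(y).
H_x_given_y = -sum(P[k][j] * math.log2(P[k][j] / py[j])
                   for k in range(2) for j in range(2))
# H(Y|X) = -sum p(x,y) log p(y|x), with p(y|x) = p(x,y)/p(x).
H_y_given_x = -sum(P[k][j] * math.log2(P[k][j] / px[k])
                   for k in range(2) for j in range(2))

print(abs(H_xy - (H_x_given_y + entropy(py))) < 1e-12)   # True
print(abs(H_xy - (H_y_given_x + entropy(px))) < 1e-12)   # True
```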

Example: Two fair ("honest") dice, X and Y, are thrown. Compute H(X,Y), H(X), H(Y), H(X|Y), and H(Y|X). The joint probability table is:

| Y\X | 1 | 2 | 3 | 4 | 5 | 6 | p(y) |
|------|------|------|------|------|------|------|------|
| 1 | 1/36 | 1/36 | 1/36 | 1/36 | 1/36 | 1/36 | 1/6 |
| 2 | 1/36 | 1/36 | 1/36 | 1/36 | 1/36 | 1/36 | 1/6 |
| 3 | 1/36 | 1/36 | 1/36 | 1/36 | 1/36 | 1/36 | 1/6 |
| 4 | 1/36 | 1/36 | 1/36 | 1/36 | 1/36 | 1/36 | 1/6 |
| 5 | 1/36 | 1/36 | 1/36 | 1/36 | 1/36 | 1/36 | 1/6 |
| 6 | 1/36 | 1/36 | 1/36 | 1/36 | 1/36 | 1/36 | 1/6 |
| p(x) | 1/6 | 1/6 | 1/6 | 1/6 | 1/6 | 1/6 | 1 |

The entropies can be calculated from the table:

$$H(X,Y) = -\sum_{i=1}^{6}\sum_{j=1}^{6} P_{ij} \log P_{ij} = -\log\frac{1}{36} = 5.17 \text{ bits}$$

$$H(X) = H(Y) = -\sum_{i=1}^{6} P_i \log P_i = -\log\frac{1}{6} = 2.58 \text{ bits}$$

$$H(X|Y) = H(Y|X) = -\sum_{i=1}^{6}\sum_{j=1}^{6} P_{ij} \log\frac{1}{6} = 2.58 \text{ bits}$$
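The dice example is small enough to verify directly:

```python
import math

# Two fair dice: all 36 joint outcomes have probability 1/36,
# and each marginal outcome has probability 1/6.
H_xy = -sum(1 / 36 * math.log2(1 / 36) for _ in range(36))
H_x = -sum(1 / 6 * math.log2(1 / 6) for _ in range(6))
H_x_given_y = -sum(1 / 36 * math.log2(1 / 6) for _ in range(36))

print(round(H_xy, 2), round(H_x, 2), round(H_x_given_y, 2))  # 5.17 2.58 2.58
```

Since the dice are independent, conditioning on Y tells us nothing about X: H(X|Y) = H(X).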

## A Measure of Mutual Information

We would like to formulate a measure of the mutual information between two symbols $(x_i, y_j)$. Solomon Kullback wrote in 1958 a book on the study of logarithmic measures of information and their application to the testing of statistical hypotheses, such as determining whether two independent random samples were drawn from the same population, whether the samples are conditionally independent, etc.

Let $H_i$ (i = 1, 2) be the hypothesis that X is from a population with probability measure $\mu_i$. Applying Bayes' theorem:

$$P(H_i \mid x) = \frac{P(H_i)\, f_i(x)}{P(H_1) f_1(x) + P(H_2) f_2(x)} \quad \text{for } i = 1, 2$$


Expanding $P(H_i \mid x)$ for i = 1, 2, solving for $f_1$ and $f_2$, and simplifying:

$$\frac{f_1(x)}{f_2(x)} = \frac{P(H_1 \mid x)}{P(H_2 \mid x)} \cdot \frac{P(H_2)}{P(H_1)}$$

taking the log we obtain

$$\log\frac{f_1(x)}{f_2(x)} = \log\frac{P(H_1 \mid x)}{P(H_2 \mid x)} - \log\frac{P(H_1)}{P(H_2)}$$

The right side of the equation is a measure of the difference between the odds in favor of H1 after the observation X=x and before the observation. Kullback defined this expression as the information in X=x for discriminating in favor of H1 against H2.

The mean information is the integral of this expression with respect to $\mu_1$, written as

$$I(1:2) = \int \log\frac{P(H_1 \mid x)}{P(H_2 \mid x)}\, d\mu_1(x) - \log\frac{P(H_1)}{P(H_2)}$$
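For discrete distributions, this discrimination information takes the form $\sum_x f_1(x)\log[f_1(x)/f_2(x)]$, now commonly known as the Kullback-Leibler divergence. A minimal sketch (the two distributions are illustrative choices of mine):

```python
import math

def discrimination_info(f1, f2):
    """Discrete form of Kullback's I(1:2) = sum f1 log2(f1/f2), in bits.
    Terms with f1 = 0 contribute nothing."""
    return sum(p * math.log2(p / q) for p, q in zip(f1, f2) if p > 0)

# Positive when the distributions differ...
print(round(discrimination_info([0.5, 0.5], [0.9, 0.1]), 3))
# ...and zero when there is nothing to discriminate.
print(discrimination_info([0.5, 0.5], [0.5, 0.5]))   # 0.0
```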

Generalizing for k-dimensional Euclidean spaces, for a space of two dimensions with elements {X, Y}, the mutual information between X and Y is given by

$$I(X:Y) = \iint f(x,y) \log\frac{f(x,y)}{g(x)\,h(y)}\, dx\, dy$$


We can think of the pair {X, Y} as the signals that a transmitter X sends to a receiver Y. At the transmitter, $p(x_i)$ conveys the prior for each signal being sent, while at the receiver, $p(x_i \mid y_j)$ is the probability that $x_i$ was sent given that $y_j$ was received. Therefore the gain in information has to involve the ratio of the final and initial ignorance, or $p(x_i \mid y_j)/p(x_i)$.

Let $\{X\} = [x_1, x_2, \dots, x_n]$ with $\sum_{i=1}^{n} p(x_i) = 1$, and $\{Y\} = [y_1, y_2, \dots, y_m]$ with $\sum_{j=1}^{m} p(y_j) = 1$.

We can re-write the mutual information I(X:Y) for the discrete case as:

$$I(X:Y) = \sum_{i}\sum_{j} p(x_i, y_j) \log\frac{p(x_i, y_j)}{P(x_i)\,P(y_j)}$$

Using $p(x,y) = p(x|y)\,p(y) = p(y|x)\,p(x)$, we can also write I(X:Y) as

$$I(X:Y) = \sum_{i}\sum_{j} p(x_i, y_j) \log\frac{p(x_i \mid y_j)}{P(x_i)}$$

We can also write I(X:Y) as an expression involving entropies:

$$I(X:Y) = H(X) + H(Y) - H(X,Y)$$
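The two forms can be checked against each other. A sketch using an arbitrary 2×2 joint distribution (not from the text):

```python
import math

def entropy(probs):
    return -sum(p * math.log2(p) for p in probs if p > 0)

# Arbitrary joint distribution; rows index x_i, columns y_j.
P = [[0.3, 0.1],
     [0.1, 0.5]]
px = [sum(row) for row in P]
py = [P[0][0] + P[1][0], P[0][1] + P[1][1]]

# Direct definition: sum p(x,y) log [ p(x,y) / (p(x) p(y)) ].
I_direct = sum(P[i][j] * math.log2(P[i][j] / (px[i] * py[j]))
               for i in range(2) for j in range(2))
# Entropy form: I(X:Y) = H(X) + H(Y) - H(X,Y).
I_entropy = entropy(px) + entropy(py) - entropy([p for row in P for p in row])

print(abs(I_direct - I_entropy) < 1e-9)   # True
```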

