CSCE 822 Information as a Measure

By Alfred Harper,2014-08-09 01:48
9 views 0
CSCE 822 Information as a Measure

CSCE 822 Data Mining & Knowledge Discovery

    1Information as a Measure

    Several books were written in the late 50’s to formally define the concept of information as a measure of surprise (or lack of) or as

    the uncertainty of outcomes. These books were inspired by earlier work by Shannon and Wiener, who independently arrived to the same expression for average information.

Let X be a random variable associated to a space having n

    mutually exclusive events, such that


    EU|E| = [e, e…e] so en;k, 123k1


    p1|P| = [p, p p … p] so ?k,,123n1k

    Let E(X) be some function such that, if experiments are conducted many times, the averages of X will approach E(X). Shannon and

    Wiener suggested the expression below to quantify the average uncertainty (or chaos, or disorder, or entropy) associated to a complete sample space :


    H(X)plogp ?ii1i

     1 A measure is a rather precise definition (involving such things as -algebras) which makes it difficult to

    understand for non-mathematicians such as myself. All we need to know here is that this and other definitions form the basis for much of what is known as Mathematical Analysis. For example, every definition of an integral is based on a particular measure. The study of measures and their application to integration is known as Measure Theory.

    University of South Carolina 1 Juan E. Vargas- CSCE

    CSCE 822 Data Mining & Knowledge Discovery For each event e there is a value, or quantity, x such that kk,


The term log (p) is called the amount of self-information k

    associated to the event eThe unit of information, called a bit k.

    (binary unit) is equivalent to the amount of information associated

    with selecting one event from the set. The average amount of information, called Entropy, is defined for a sample space of equally probable events E as: k


    H(X)I(E)plogp ?kkk1k

    If we have a fair coin, such that p(H) = p(T) = ?, then

    11111H(E)H(EE)log()log()log1 bit k1222222

    Note that I(E) = I(E) = -log(1/2) = 1 bit. Extending this example, 12

    Nif we have a sample with 2 equally probable events, E k

    N(k=1,2,…,2), then

    N bits I(E)logplog(2)Nkk


    I(A)E=[A, A] P=[1/256, 255/256] => = 0.0369 bit ak12

    I(B)E=[B, B] P=[?, ?] =>= 1 bit bk12

Suggesting that it is easier to guess the value of A than to guess k

    the value of Bk.

    University of South Carolina 2 Juan E. Vargas- CSCE

CSCE 822 Data Mining & Knowledge Discovery

    The measure H complies with the following axioms: 1. Continuity: If the probabilities of events change, the entropy

    associated to the system changes accordingly.

    2. Symmetry: H is invariant to the order of events, i.e.,

    H(p,p,…p) = H(p,p,…p). 12n21n

    3. Extremal Value: The value of H is largest when all events

    are equally likely, because it is most uncertain which event

    could occur.

    th4. Additivity: H = H + pH when the m event is a 21mm

    composition of other events.

    The Shannon-Wiener formulation for Entropy gained popularity due to its simplicity and its axiomatic properties. To illustrate, consider an earlier definition of information, due to R. A. Fisher, who essentially defined it as an average second moment in a sample distribution with density f(x) and mean m:

    ~lnf(x)2 I[]f(x)dx~m

    Thus, for example, expressing the normal distribution

    1xm2()exp[()] fx22

University of South Carolina 3 Juan E. Vargas- CSCE

CSCE 822 Data Mining & Knowledge Discovery

    in logarithmic form, deriving with respect to the mean, and taking the integral:

    1xm2 log()ln2ln()fx22

    ~lnfxm 2~m

    22(?xmxm11(? I)?dxexp[])?)?2222????

    The Shannon-Wiener expression for information can be generalized for 2-dimensional probability schemes, and by induction, to any n-dimensional probability schemes. Let and 1

     be two discrete sample spaces with sets {E} =[E,E,…,E] and 212n

    {F} =[F,F,…,F]. 12m

    We can have three complete sets of probability schemes:

P{E} = [P{E}] k

    P{F} = [P{F}] j

    P{EF}= [P{EF}] kj

    University of South Carolina 4 Juan E. Vargas- CSCE

    CSCE 822 Data Mining & Knowledge Discovery The joint probability matrix is given by:

    p{1,1}p{1,2}....p{1,m}???p{2,1}p{2,2}...p{2.m}??[P{X,Y}] ??............??p{n,1}p{n,2}...p{n,m}???

    We can obtain the marginal probabilities for each variable as in:



    p{1,1}p{1,2}....p{1,m}p(x1)???p{2,1}p{2,2}...p{2.m}p(x2)??[P{X,Y}]= ??............p(x3)??p{n,1}p{n,2}...p{n,m}p(x4)???

    mP{x}p(x,y)p{x,y} ??kkj{,}\1xyyj



    nP{y}p(x,y)p{x,y} ??jkj{,}\1xyxk

From the matrix we can also compute the total and marginal

    entropies, H(X), H(Y), H(X,Y)


    H(X,Y)p{k,j}logp{k,j} ??,,11kj

    nmmn?(?)?H(X)p{k,j}logp{k,j}p{x}logp{x} ??????kk)?,,1111kjjk?????

    University of South Carolina 5 Juan E. Vargas- CSCE

    CSCE 822 Data Mining & Knowledge Discovery

    mnnm?(? )?H(Y)p{k,j}logp{k,j}p{y}logp{y}??????jj)?,,1111??jkkj?

Note that to obtain H we must find the corresponding p(x) and k

    p(y) first. To better understand the calculations involved in H(X,Y) j

    versus H(X), and H(Y), let m=n=3. Then







    nmmn?(?)?H(X)p{k,j}logp{k,j}p{x}logp{x} ??????kk)?,,1111kjjk?????

    - [p(1,1)+p(1,2)+p(1,3)][log(p(1,1)+p(1,2)+p(1,3)] H(X)

    - [p(2,1)+p(2,2)+p(2,3)][log(p(2,1)+p(2,2)+p(2,3)]

    - [p(3,1)+p(3,2)+p(3,3)][log(p(3,1)+p(3,2)+p(3,3)]

We can also compute the conditional entropies. Due to the


    1addition theorem of probability, the union of E is k?Ekk1

    Therefore, marginalizing E k


    FEF ;jkjk1

    p(x,y)p(x|y)p(y)p(y|x)p(x)from Bayes theorem: , therefore: University of South Carolina 6 Juan E. Vargas- CSCE

    CSCE 822 Data Mining & Knowledge Discovery

    P{Xx?Yy}kj P{Xx|Yy}kjP{Yy}j

    p{k,j} p{x|y}kjp{y}j

    thwhere p{y} is the j marginal; and so j





    Note again that the two equations imply that you have to compute

    the marginals first. From Bayes theorem p(x,y)p(x|y)p(y)p(y|x)p(x)

    we can write:

    H(X,Y) = H(X|Y) + H(Y) = H(Y|X) + H(X)

    Example: Two “honest” dice, X and Y, are thrown. Compute H(X,Y), H(X), H(Y), H(X|Y) and H(Y|X). The joint probability table is:

    Y\X 1 2 3 4 5 6 e(x)

    1 1/36 1/36 1/36 1/36 1/36 1/36 1/6

    2 1/36 1/36 1/36 1/36 1/36 1/36 1/6

    3 1/36 1/36 1/36 1/36 1/36 1/36 1/6

    4 1/36 1/36 1/36 1/36 1/36 1/36 1/6

    5 1/36 1/36 1/36 1/36 1/36 1/36 1/6

    6 1/36 1/36 1/36 1/36 1/36 1/36 1/6

    f(y) 1/6 1/6 1/6 1/6 1/6 1/6 1/1

    The entropies can be calculated from the table: University of South Carolina 7 Juan E. Vargas- CSCE

CSCE 822 Data Mining & Knowledge Discovery

    6611 bits H(X,Y)Ploglog5.17??ij3636ij,,11

    611H(X)H(Y)Ploglog2.58 bits ?i66i1

    661 bits H(X|Y)H(Y|X)Plog2.58??ij6ij,,11

A Measure of Mutual Information

    We would like to formulate a measure for the mutual information between two symbols, (x,y). Solomon Kullback wrote in 1958 a ij

    book on the study of logarithmic measures of information and their application to the testing of statistical hypotheses such as determining if two independent, random samples, were drawn from the same population, or if the samples are conditionally independent, etc.

Let H (i=1,2) be the hypothesis that X is from a population with a i

    probability measure . Applying Bayes theorem: i

    P(H)fi(x)iP(H|x) for i=1,2 iP(H)f(x)P(H)f(x)1122

    University of South Carolina 8 Juan E. Vargas- CSCE

CSCE 822 Data Mining & Knowledge Discovery

    Expanding P(H|x) for i=1,2 solving for f1 and f2 and simplifying i

f(x)P(H|x)P(H)112 f(x)P(H|x)P(H)221

taking the log we obtain

f(x)P(H|x)P(H)111logloglog f(x)P(H|x)P(H)222

    The right side of the equation is a measure of the difference between the odds in favor of H1 after the observation X=x and before the observation. Kullback defined this expression as the information in X=x for discriminating in favor of H1 against H2.

    The mean information is the integral of the expression, which is written as

    P(H|x)P(H)11I(1:2)logdlog for 112P(H|x)P(H)22

    Generalizing for k-dimensional Euclidean spaces of two dimensions with elements {X,Y}, the mutual information

    between {X,Y} is given by

    f(x,y)I(X:Y)f(x,y)logdxdy ,,g(x)h(y)

    University of South Carolina 9 Juan E. Vargas- CSCE

    CSCE 822 Data Mining & Knowledge Discovery

We can think of the pair {X,Y} as the signals that a transmitter X

    sends to a receiver Y. At the transmitter, p(x) conveys the priors i

    for each signal being sent, while at the receiver, p(x|y) is the ij

    probability that x was sent given that ywas received. Therefore ij

    the gain in information has to involve the ratio of the final and

    initial ignorance, or p(x|y) / p(x). iji


    xNLet {X} =[x,x,…,x] and ?1i12n1i


    {Y} =[y,y,…,y] yN ?j212mj1

    We can re-write the mutual information I(X:Y) for the discrete case as:

    p(x,y)ijI(X:Y)p(x,y)log ??ijP(x)P(y)ijij

    Using p(x,y)p(x|y)p(y)p(y|x)p(x), we can also write I(X:Y) as

    p(x|y)ijI(X:Y)p(x,y)log ??ijP(x)iji

    We can also write I(X:Y) as expressions involving entropy:

I(X:Y) = H(X) + H(Y) H(X,Y)

    University of South Carolina 10 Juan E. Vargas- CSCE

Report this document

For any questions or suggestions please email