CSCE 822 Information as a Measure


CSCE 822 Data Mining & Knowledge Discovery

    Information as a Measure¹

    Several books were written in the late 1950s to formally define the concept of information as a measure of surprise (or the lack thereof), or as the uncertainty of outcomes. These books were inspired by earlier work by Shannon and Wiener, who independently arrived at the same expression for average information.

Let X be a random variable associated with a sample space having n mutually exclusive events, such that

    $$E = \{e_1, e_2, e_3, \ldots, e_n\}, \qquad e_k \in E, \; k = 1, 2, 3, \ldots, n$$

    $$P = \{p_1, p_2, p_3, \ldots, p_n\}, \qquad \sum_{k=1}^{n} p_k = 1$$

    Let E(X) be some function such that, if experiments are conducted many times, the averages of X will approach E(X). Shannon and Wiener suggested the expression below to quantify the average uncertainty (or chaos, or disorder, or entropy) associated with a complete sample space:


    $$H(X) = -\sum_{i=1}^{n} p_i \log p_i$$
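As a quick sketch of the formula above (the function name `entropy` is mine, not the notes'), the sum can be computed directly; terms with zero probability are skipped since p log p tends to 0:

```python
import math

def entropy(probs):
    """Shannon entropy H(X) = -sum_i p_i * log2(p_i), in bits.

    Terms with p_i == 0 are skipped, since p * log(p) -> 0 as p -> 0.
    """
    return -sum(p * math.log2(p) for p in probs if p > 0)

# Four equally likely outcomes carry 2 bits of uncertainty per draw.
print(entropy([0.25, 0.25, 0.25, 0.25]))  # 2.0
```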

     ¹ A measure is a rather precise mathematical definition (involving such things as σ-algebras), which makes it difficult to understand for non-mathematicians such as myself. All we need to know here is that this and other definitions form the basis for much of what is known as Mathematical Analysis. For example, every definition of an integral is based on a particular measure. The study of measures and their application to integration is known as Measure Theory.

    University of South Carolina 1 Juan E. Vargas- CSCE

    For each event e_k there is a value, or quantity, x_k, such that

    $$x_k = I(e_k) = -\log p_k$$


The term −log(p_k) is called the amount of self-information associated with the event e_k. The unit of information, called a bit (binary unit), is equivalent to the amount of information associated with selecting one event from the set. The average amount of information, called Entropy, is defined for a sample space of equally probable events E_k as:


    $$H(X) = I(E) = -\sum_{k=1}^{n} p_k \log p_k$$

    If we have a fair coin, such that p(H) = p(T) = 1/2, then

    $$H(E_k) = H(E_1, E_2) = -\tfrac{1}{2}\log_2\tfrac{1}{2} - \tfrac{1}{2}\log_2\tfrac{1}{2} = \log_2 2 = 1 \text{ bit}$$

    Note that I(E_1) = I(E_2) = −log_2(1/2) = 1 bit. Extending this example,

    if we have a sample with 2^N equally probable events E_k (k = 1, 2, …, 2^N), then

    $$I(E_k) = -\log_2 p_k = -\log_2(2^{-N}) = N \text{ bits}$$
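The claim above is easy to check numerically. In this sketch (the helper name `self_information` is my own), each of the 2^N equiprobable events has probability 2^−N and therefore carries exactly N bits:

```python
import math

def self_information(p):
    """Self-information -log2(p) of an event with probability p, in bits."""
    return -math.log2(p)

# With 2**N equally likely events, each has p = 2**-N and carries N bits.
for N in (1, 3, 8):
    print(N, self_information(2.0 ** -N))
```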


    $$E_a = [A_1, A_2], \quad P = [1/256,\ 255/256] \;\Rightarrow\; I(A) \approx 0.0369 \text{ bit}$$

    $$E_b = [B_1, B_2], \quad P = [1/2,\ 1/2] \;\Rightarrow\; I(B) = 1 \text{ bit}$$

This suggests that it is easier to guess the value of A_k than to guess the value of B_k.
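The two values above can be reproduced with the entropy sum from earlier (the function name `entropy` is mine, assuming base-2 logarithms since the results are in bits):

```python
import math

def entropy(probs):
    """Shannon entropy in bits, skipping zero-probability terms."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

# Heavily biased source A versus a fair source B.
H_A = entropy([1 / 256, 255 / 256])
H_B = entropy([0.5, 0.5])
print(round(H_A, 4), H_B)  # 0.0369 1.0
```

The biased source is almost always A_2, so observing it resolves very little uncertainty; the fair source resolves a full bit each time.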



    The measure H complies with the following axioms:

    1. Continuity: If the probabilities of events change, the entropy associated with the system changes accordingly.

    2. Symmetry: H is invariant to the order of events, i.e., H(p_1, p_2, …, p_n) = H(p_2, p_1, …, p_n).

    3. Extremal Value: The value of H is largest when all events are equally likely, because it is then most uncertain which event could occur.

    4. Additivity: H_2 = H_1 + p_m H_m when the m-th event is a composition of other events.
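Two of these axioms lend themselves to a quick numerical check. The sketch below (my own illustration, not from the notes) verifies Additivity by splitting an event of probability 1/2 into two events of probability 1/4, and Extremal Value by comparing a uniform and a skewed distribution:

```python
import math

def entropy(probs):
    """Shannon entropy in bits, skipping zero-probability terms."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

# Additivity: split one event of probability 1/2 into two halves of 1/4.
# The entropy grows by exactly p_m * H_m = 0.5 * H(1/2, 1/2) = 0.5 bit.
H1 = entropy([0.5, 0.5])           # 1.0 bit
H2 = entropy([0.5, 0.25, 0.25])    # 1.5 bits
print(H2 == H1 + 0.5 * entropy([0.5, 0.5]))   # True

# Extremal Value: the uniform distribution maximizes H.
print(entropy([0.5, 0.5]) > entropy([0.9, 0.1]))  # True
```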

    The Shannon-Wiener formulation for Entropy gained popularity due to its simplicity and its axiomatic properties. To illustrate, consider an earlier definition of information, due to R. A. Fisher, who essentially defined it as an average second moment in a sample distribution with density f(x) and mean m:

    $$I = \int \left[\frac{\partial \ln f(x)}{\partial m}\right]^2 f(x)\, dx$$

    Thus, for example, expressing the normal distribution

    $$f(x) = \frac{1}{\sigma\sqrt{2\pi}} \exp\left[-\frac{(x-m)^2}{2\sigma^2}\right]$$



    in logarithmic form, differentiating with respect to the mean, and taking the integral:

    $$\ln f(x) = -\ln\sqrt{2\pi} - \ln\sigma - \frac{(x-m)^2}{2\sigma^2}$$

    $$\frac{\partial \ln f}{\partial m} = \frac{x-m}{\sigma^2}$$

    $$I = \int \left(\frac{x-m}{\sigma^2}\right)^2 \frac{1}{\sigma\sqrt{2\pi}} \exp\left[-\frac{(x-m)^2}{2\sigma^2}\right] dx = \frac{1}{\sigma^2}$$
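Fisher's integral for the normal density evaluates to 1/σ², and this can be confirmed numerically. The sketch below (function name and integration bounds are my own choices) applies a simple midpoint rule to the integrand:

```python
import math

def fisher_info_normal(m, sigma, lo=-40.0, hi=40.0, n=100000):
    """Midpoint-rule estimate of I = integral of ((x-m)/sigma^2)^2 * f(x) dx
    for a normal density f with mean m and standard deviation sigma.
    The exact value is 1 / sigma**2.
    """
    h = (hi - lo) / n
    total = 0.0
    for i in range(n):
        x = lo + (i + 0.5) * h
        f = math.exp(-((x - m) ** 2) / (2 * sigma ** 2)) / (sigma * math.sqrt(2 * math.pi))
        total += ((x - m) / sigma ** 2) ** 2 * f * h
    return total

print(fisher_info_normal(0.0, 2.0))  # ≈ 0.25 = 1/sigma**2
```

A tighter variance thus means more Fisher information: the data pin down the mean more sharply.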