Stat I professor vinod notes for chapter 3

By Carmen Chavez,2014-05-07 11:00

Stat I, Prof. Vinod, Class Notes: Frequency Distribution, Frequency Histogram, Frequency Polygon and Ogives

1) What is a frequency distribution?

    It is a summary technique for organizing data into classes. It yields a table, from which one calculates frequencies (f_j), relative frequencies (f_j/n) and cumulative frequencies, where j indexes the classes.

2) How to construct a frequency Distribution? Need to construct a table. (See Table 1 below).

First find the smallest value Xmin and the largest value Xmax in the data. Class intervals need a starting lower limit of the first class interval (StartLo). It should be smaller than or equal to Xmin. (In your midterm exam I might specify StartLo.)

    Rule 1) StartLo ≤ Xmin (the symbol ≤ means less than or equal to).

    Rule 2) StartLo should be a number where it is intuitively natural to start an interval (e.g. a round number). This is not a hard and fast rule. It can happen that StartLo = 1.2, say.

    How many classes should we make? Let k denote the number of classes. Let j denote the class interval number. In Table 1 we have j=0 as the first “dummy” class interval. It is called dummy

    because it has no real observations in it. It is included for the purpose of finding the midpoint

    where to join the frequency polygon on the horizontal axis.

Now j=1 is the first honest-to-goodness class interval (i.e., interval which has some real

    observations in it). This interval starts at the starting-lower-limit-of-the-first-class-interval (StartLo)

    defined above, j=2 is the next interval and so on till j=k as the last honest-to-goodness interval.

    Finally j=k+1 is the last “dummy” interval.

Rule 3) k = the number of classes chosen by the investigator. This usually ranges between 3 and 20. If the data series is short, with only say 20 observations, k=3 or 4 is adequate. The more data there are, the more classes may be needed, pushing the k chosen by the researcher closer to 20.

    Rule 4) Let CIW denote “class interval width.” It must be at least (maximum value minus the minimum value) / (number of classes), that is,

    CIW ≥ (Xmax − Xmin) / k

    EXAMPLE: Original unclassified data: 50 98 82 23 46 40 63 52 92 54. We have n=10 observations here. Assume that we are asked to make exactly k=3 class intervals, j=1 to j=3.

    We must begin by sorting the data from the smallest to the largest:

     Xmin=23 40 46 50 52 54 63 82 92 Xmax=98

    Rule 1 says StartLo ≤ Xmin, which means StartLo ≤ 23. From Rule 2 we choose the round number 20. Now by Rule 4, the class interval width must be at least 25 by the formula:

    CIW ≥ (Xmax − Xmin) / k = (98 − 23) / 3 = 75/3 = 25

    This says that the Width of class intervals should be at least 25. Let us try a round number larger than 25, say 30. We choose CIW=30. It turns out that if we had chosen CIW=25 the last upper

    limit would become 95 leaving the data point 98 an orphan and we will have to revise our scheme

    by increasing the CIW.

Recall that StartLo=20 and width CIW=30. So the lower limit of first honest to goodness class

    interval is StartLo=20 and upper limit is simply StartLo plus the width 30 leading to 50. The next

    interval is simply upper limit of previous interval plus CIW=30 and so on.
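The arithmetic of Rules 1 to 4 for this example can be sketched in R, the software used later in these notes. This is only an illustration: the variable names (min_ciw, start_lo, ciw, lower, upper) are labels chosen here and do not come from the notes.

```r
# Data from the example in these notes
x <- c(50, 98, 82, 23, 46, 40, 63, 52, 92, 54)
k <- 3                                   # number of genuine classes requested

# Rule 4: the width must be at least (Xmax - Xmin) / k
min_ciw <- (max(x) - min(x)) / k

start_lo <- 20                           # round number <= Xmin (Rules 1 and 2)
ciw <- 30                                # round number above the minimum width

# Lower limits of classes j = 1..k, and the matching upper limits
lower <- start_lo + ciw * (0:(k - 1))
upper <- lower + ciw

print(min_ciw)
print(data.frame(j = 1:k, lower = lower, upper = upper))
```

The last upper limit, start_lo + k * ciw, is the number to compare against Xmax when checking that no data point is left out.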

An Ambiguity Solved By Convention:

    Upper limit of each class interval is the lower limit plus the width. There is ambiguity with respect

    to the upper limits, but it is resolved by convention. We always let the real upper limit be a notch

    below what is alleged to be the upper limit. For example, the convention says that the upper limit

    50 is really 49.999999999999999999999999, but not quite 50 even if it is shown to be 50.

    So the measurement 50 belongs to the next class 50 to 80. The reason for this convention is that it

    saves the Govt. in printing costs and makes the tables more readable (less cluttered).

Table 1

    Sequence no. of      Lower               Upper               Mid     Tally    Frequency   Relative
    the class j          limit               limit               point   marks    f_j         freq. = f_j/n
    0 (dummy interval)   −10 = (20 − CIW)    20                  5                0
    1                    20                  50                  35      III      3           0.3
    2                    50                  80                  65      IIII     4           0.4
    3 = k                80                  110                 95      III      3           0.3
    Dummy interval       110                 140 = (110 + CIW)   125              0
    Totals                                                                        n=10        1.0
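As a rough check on Table 1, the frequencies and relative frequencies can be reproduced with base R. This sketch assumes the convention described above (each class includes its lower limit but not its upper limit), which is what right = FALSE asks of cut(). The names breaks, classes and freq_table are illustrative.

```r
x <- c(50, 98, 82, 23, 46, 40, 63, 52, 92, 54)
breaks <- c(20, 50, 80, 110)             # limits of the three genuine classes

# right = FALSE makes each class [lower, upper), so 50 falls in 50 to 80
classes <- cut(x, breaks = breaks, right = FALSE)
f <- as.vector(table(classes))           # frequencies
rel <- f / length(x)                     # relative frequencies f/n

freq_table <- data.frame(
  j = seq_along(f),
  lower = head(breaks, -1),
  upper = tail(breaks, -1),
  f = f,
  rel_freq = rel
)
print(freq_table)
```

Note that cut() and hist() both default to right = TRUE, which would place the measurement 50 in the class 20 to 50 instead.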

This classification has been successful in the sense that we are asked to make exactly 3 classes and

    we have exactly 3 meaningful classes. The scheme is meaningful if it satisfies two tests: (i) There

    should be no orphan points and (ii) There should be no orphan intervals in the sense defined below.

Orphan Points Problem:

    If there are points in the unclassified data, which are not allocated to any interval whatsoever, then

    we call them orphan points. Then we say that the classification is not meaningful or has failed. A

    good check is to make sure that Xmin is allocated to the genuine interval with j=1 and Xmax is

    allocated to the last genuine interval with j=k.

An example of orphan points: if we choose StartLo=0 and CIW=30, then the intervals would be 0 to 30, 30 to 60 and 60 to 90. Now the two numbers 92 and 98 are orphaned, as they belong to no class. This is not acceptable. We have to go back to the drawing board and fix things.
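The orphan points check can be sketched as a small R function. The helper orphan_points() is hypothetical (it is not part of R or of these notes); it simply lists the data values that fall below StartLo or at or above the last upper limit.

```r
x <- c(50, 98, 82, 23, 46, 40, 63, 52, 92, 54)

# Hypothetical helper: returns the data points that fall in no genuine class
orphan_points <- function(x, start_lo, ciw, k) {
  end_hi <- start_lo + k * ciw           # upper limit of the last genuine class
  sort(x[x < start_lo | x >= end_hi])
}

print(orphan_points(x, start_lo = 0, ciw = 30, k = 3))    # scheme 0, 30, 60, 90
print(orphan_points(x, start_lo = 20, ciw = 25, k = 3))   # scheme 20, 45, 70, 95
print(orphan_points(x, start_lo = 20, ciw = 30, k = 3))   # scheme of Table 1
```

An empty result means the scheme passes the first test; a non-empty result lists the points that force a revision.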

Orphan Intervals Problem:

Whether we have 3 meaningful classes or not is decided by looking at the genuine class intervals

    when j=1 and j=k (Not the dummy intervals). Here they are all meaningful in the sense that there is

    at least one observation (tally mark) in these classes. So we have no orphan intervals problem,

    things are OK. An orphan interval means that there are no observations (Tally marks) in the first or

    the last interval, that is, when j=1 or j=k interval. If there are no observations in the intermediate

    intervals, (j=2 to j=k-1) that may be a true nature of the data (that is the way the cookie crumbles)

    and not an artifact of our classification scheme, hence that situation is not defined as orphan

    intervals problem. Remember that the two dummy class intervals are always and by definition

    orphans and do not pose any problem.

    An example of an orphan interval: if we choose StartLo=20 and CIW=40, then the intervals would be 20 to 60, 60 to 100 and 100 to 140. Now the last interval is orphaned, as no observation belongs in it. This is not acceptable. You were asked to make 3 intervals and you have effectively made two intervals, 20 to 60 and 60 to 100, which contain the entire data. We have to go back to the drawing board and fix things.
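A similar sketch covers the orphan intervals test. The helper orphan_intervals() is again hypothetical; it counts the observations in each genuine class and reports whether the first (j=1) or last (j=k) class is empty. Intermediate classes are deliberately not checked, following the definition above.

```r
x <- c(50, 98, 82, 23, 46, 40, 63, 52, 92, 54)

# Hypothetical helper: TRUE for the first or last genuine class if it is empty
orphan_intervals <- function(x, start_lo, ciw, k) {
  breaks <- start_lo + ciw * (0:k)
  # Points outside the limits become NA in cut() and are not counted by table()
  f <- as.vector(table(cut(x, breaks = breaks, right = FALSE)))
  c(first = f[1] == 0, last = f[k] == 0)
}

print(orphan_intervals(x, start_lo = 20, ciw = 40, k = 3))   # 20, 60, 100, 140
print(orphan_intervals(x, start_lo = 20, ciw = 30, k = 3))   # scheme of Table 1
```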

Trial and Error in Classification is needed if classification fails the first time around. The

    solution to the orphans problem is to go back (to the drawing board) and choose a different StartLo

    (the lower limit of the starting interval) and or a different “class interval width” (CIW) and re-do

    the tally marks and entire classification.

    The theory requires that CIW ≥ (Xmax − Xmin) / k. We can choose any CIW which satisfies this inequality. For example, the minimum width of 25 found above could be replaced by a larger value (we chose 30) and the inequality would still be satisfied.

    Common solutions to the orphans problem:

    If j=1 is an orphan interval StartLo may have been wrong.

    If j=k is an orphan interval, CIW is too large and needs to be reduced.

    If there are orphan points, we increase the chosen class interval width (CIW)

Frequency Distribution Graphics by a histogram and frequency polygon

    Assume that the trial and error is complete, no interval is an orphan and no point is an orphan.

    Only now we have classified the data and constructed the frequency distribution. The word

    distribution suggests that we are distributing the n items into k classes. The table represents the

    frequency distribution. Now we are ready for frequency histogram and frequency polygon which

    are graphical representations of the frequency distribution.

A histogram is a graphical image of the Frequency Distribution or Relative Frequency Distribution

    with measurements on the horizontal axis and frequency on the vertical. (It looks a bit like NYC

    skyline) We represent frequency by pillars (bars). The height of a pillar is proportional to the

    frequency in the particular class interval and the width of the pillar on the horizontal axis starts at

    the lower limit of the interval and ends at the upper limit. We draw as many pillars as are intervals.

    The dummy intervals will have zero heights since they have zero frequency. Hence it is not

    customary to show the dummy intervals for histograms. (See Table 1). In business and economics

    applications the pillars (bars) are usually attached.

    has a nice Java applet which teaches the effect of changing width on a histogram

Two equivalent definitions of a midpoint:

     Midpoint = (upper limit + lower limit) / 2
     Midpoint = lower limit + (width / 2)

    Both definitions work.
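A quick way to confirm that the two definitions agree is to compute both for the intervals of Table 1, dummy intervals included. The vector names here are illustrative.

```r
lower <- c(-10, 20, 50, 80, 110)   # includes both dummy intervals from Table 1
upper <- c(20, 50, 80, 110, 140)
ciw <- 30

mid_a <- (upper + lower) / 2       # first definition
mid_b <- lower + ciw / 2           # second definition

print(mid_a)
print(all(mid_a == mid_b))
```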

    For the software called R, the following input will draw the frequency histogram and polygon:

    x=c(23, 40, 46, 51, 52, 54, 63, 82, 92, 98)
    #note: I changed 50 to 51 to get a cleaner software illustration.
    #(hist() closes intervals on the right by default, so 50 would be counted in 20 to 50)
    h=hist(x, breaks=c(-10, 20, 50, 80, 110, 140),
      main="Histogram and Polygon 23, 40, 46, 51, 52, 54, 63, 82, 92, 98",
      xlab="measurements", axes=FALSE)
    tik=seq(-10,140,by=15) #define location of tick marks
    axis(1,tik,tik) #first tik for location, second is for labels
    #axis(1,...) is for x axis and axis(2,...) is for y axis
    axis(2, 0:4, 0:4)
    # now join consecutive midpoints to get polygon
    lines(h$mids, h$counts) #midpoints and counts as stored by hist()





For a FREQUENCY POLYGON we need two dummy class intervals at the two ends!

    The lower limit of the first dummy interval on the left side = (starting value) MINUS (width).

    In Table 1, it is 20 MINUS (CIW=30) = 20 − 30 = −10, or MINUS 10.

     The upper limit of first dummy interval is just the lower limit of the first regular (non-dummy)

    interval, also called the starting value of the classification process.

    The upper limit of the 2nd dummy interval on the right side =(upper limit of last interval) PLUS

    CIW (width). In the following example, it is 110+30=140. Of course, the lower limit of this 2nd

    dummy interval is simply the upper limit of the last genuine (non-dummy) interval.

In order to draw the frequency polygon, find the midpoints of the two dummy intervals. Join the midpoints of all intervals consecutively at the tops of the pillars to form a polygon.

A FREQUENCY POLYGON is usually drawn right on top of the freq. histogram by joining the midpoints at the tops of consecutive pillars. (Take care to include the dummy intervals before drawing the freq. polygon, and determine the midpoints of the dummy intervals.) Note that the pillars at the dummy intervals have zero height, representing the fact that there are no observations there. So the frequency polygon line starts at the midpoint of the left side dummy interval and ends at the midpoint of the right side dummy interval.
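The corner points of the frequency polygon can be worked out from StartLo, CIW and the frequencies of the genuine classes alone. The sketch below assumes the Table 1 scheme; the data frame name poly is illustrative.

```r
start_lo <- 20
ciw <- 30
f <- c(3, 4, 3)                       # frequencies of the genuine classes (Table 1)
k <- length(f)

# Lower limits for j = 0 (left dummy) through j = k + 1 (right dummy)
lower <- start_lo + ciw * (-1:k)
mid <- lower + ciw / 2                # midpoints of all k + 2 intervals

# Dummy intervals contribute zero frequency at both ends
poly <- data.frame(x = mid, y = c(0, f, 0))
print(poly)
```

Passing poly$x and poly$y to lines() after a histogram has been drawn would overlay the polygon, starting and ending on the horizontal axis.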

Good Graph: Any valid graph should have a Title, Both axes should be properly labeled, there

    should be Legends for all curves and a source should be indicated.

Stem and Leaf display is a hybrid graphical method similar to a histogram, but the data remain visible.

     STEM = simply the first digit of the number. For example, if the number is 78, the first digit is 7 and the second is 8.
     LEAF = 2nd digit, here 8.

    Just list each first digit, then a colon, and then the second digits of all numbers (heart rates) with that first digit.

    For example, if heart rates are 45, 56, 44, 70, 72, 60, 61, 47, 53, 48, then the stem and leaf display is:

     4: 5, 4, 7, 8
     5: 6, 3
     6: 0, 1
     7: 0, 2

    has a nice description. Sometimes the stem can be based on first two digits and sometimes the leaf

    may involve dropping the last digit.
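For two-digit data such as the heart rates above, the stems and leaves can be separated with integer division and remainder. This is a minimal sketch with illustrative names (stems, leaves, display); base R also has a built-in stem() function, whose printed layout differs from the one used in these notes.

```r
heart <- c(45, 56, 44, 70, 72, 60, 61, 47, 53, 48)

stems <- heart %/% 10                 # first digit
leaves <- heart %% 10                 # second digit

# Group the leaves by stem, keeping the original order of the data
display <- split(leaves, stems)
for (s in names(display)) {
  cat(s, ": ", paste(display[[s]], collapse = ", "), "\n", sep = "")
}
```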

    Baseball example: (Leaf is going in two directions for comparison of two players)

         BabeRuth   stem   BarryBonds
        0 4 3 2 6 |  0   |
                1 |  1   | 6 9
            9 5 2 |  2   | 5 4 5
              5 4 |  3   | 3 4 7 3 7 4 9
    1 6 7 6 9 6 1 |  4   | 6 2 0 9 6
            4 9 4 |  5   |
                0 |  6   |
                  |  7   | 3

    What do you conclude? Babe was 1914 to 1935, Barry was 1986 to 2003

     Relative frequency distribution.

    Relative freq. in class j = (frequency number in class j) / (total no. of observations).

    These must add up to 1, as they do in the table below. Relative freq. is interpreted as the probability of being in that class interval. Lord Keynes gave the prob. interpretation in the 1920s in a book called Treatise on


    Sequence no.   Lower    Upper    Mid-Point     Frequency   Relative
    of class j     Limit    limit                              Frequency
    0 (dummy)      −10      20       10/2 = 5      0           0
    1              20       50       35            3           0.3
    2              50       80       65            4           0.4
    3              80       110      95            3           0.3
    4 (dummy)      110      140      125           0           0
    Total                                          10

Cumulative frequency (top to bottom) = sum of the frequencies in the current class and all previous classes. It goes with the upper limits of class intervals. The interpretation of the cumulative freq. for j=1 (top to bottom case) is simply that there are 3 measurements in the data set that are less than the upper limit 50 of the j=1 interval. A graph of this, with the upper limits of the intervals (measurements) on the horizontal axis and cumulative frequency on the vertical axis, is called the “Less Than Ogive.”

Cumulative frequency (bottom to top) = sum of the frequencies in the current class and all subsequent classes below it. This number goes with the lower limits of class intervals. What does it mean? The interpretation of the cumulative freq. for j=2 (bottom to top case) is simply that there are 7 measurements in the data set that are greater than or equal to the lower limit 50 of the j=2 interval (by the convention above, the measurement 50 itself belongs to this class). A graph of this, with the lower limits of the intervals (measurements) on the horizontal axis and the bottom to top cumulative frequencies on the vertical axis, is called the “Greater Than Ogive.” Be sure to label the axes: measurements for the horizontal axis and cumulative frequency for the vertical axis.

See the Table below. The scale on the vertical axis for the cumulative frequencies goes from zero to

    n (=10). Hence it is obvious that the graph of Ogives should be drawn separately from the graph of

    histogram or polygon. However the two Ogives can and should be drawn on the same graph

    because the Median of the Classified Data (grouped data) can be determined graphically as the

    measurement where the “Less Than Ogive” intersects the “Greater Than Ogive.”

    Sequence no.   Lower Limit          Upper limit         Frequency   Cumulative Freq    Cumulative Freq
    of class j     (horizontal axis     (horizontal axis                (top to bottom)    (bottom to top)
                   for “Greater than    for “Less than                  Less than Ogive    Greater than Ogive
                   ogive”)              ogive”)
    0 (dummy)      −10                  20                  0           0                  10
    1              20                   50                  3           3                  10
    2              50                   80                  4           7                  7
    3              80                   110                 3           10                 3
    4 (dummy)      110                  140                 0           10                 0
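Both cumulative columns, and the median read off from the crossing of the two ogives, can be sketched in R. The approach below is one possible way to do it, under the assumption that the ogives are drawn as straight lines between class limits; the names cum_less, cum_greater and median_grouped are illustrative.

```r
lower <- c(20, 50, 80)
upper <- c(50, 80, 110)
f <- c(3, 4, 3)
n <- sum(f)

cum_less <- cumsum(f)                 # top to bottom, goes with upper limits
cum_greater <- rev(cumsum(rev(f)))    # bottom to top, goes with lower limits

# Points of the less-than ogive, starting from zero at StartLo
less_x <- c(lower[1], upper)
less_y <- c(0, cum_less)

# At every class limit the two ogives add up to n, so with straight-line
# segments they cross where the less-than ogive equals n/2.
# approx() interpolates linearly between the plotted points.
median_grouped <- approx(less_y, less_x, xout = n / 2)$y

print(cum_less)
print(cum_greater)
print(median_grouped)
```

For the Table 1 scheme the crossing falls inside the 50 to 80 class, which agrees with reading the intersection from the graph.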


A Warning:

See "How to Lie With Statistics" by the author, Darrell Huff. He wrote that "the secret language of

    statistics, so appealing in a fact-minded culture, is employed to sensationalize, inflate, confuse and


    He preached to readers that they were part of the chain of accountability, and needed to look for

    bias in the origin of the statistics or their treatment, and to ask the kinds of questions that poked

    holes in shoddy or dishonest work. "
