Stat I, Prof. Vinod, Class Notes: Frequency Distribution,
Frequency Histogram, Frequency Polygon and Ogives
1) What is a frequency distribution?
It is a summary technique for organizing data into classes. It yields a table, from which one
calculates frequencies (f_j), relative frequencies f_j/n and cumulative frequencies.
2) How to construct a frequency Distribution? Need to construct a table. (See Table 1 below).
First find the smallest value Xmin and the largest value Xmax in the data. Class intervals need a
starting-lower-limit-of-the-first-class-interval (StartLo). It should be SMALLER than or equal to
Xmin: (In your midterm EXAM I might specify StartLo).
Rule 1) StartLo ≤ Xmin (the symbol ≤ means less than or equal to).
Rule 2) StartLo should be a number where it is intuitively natural to start an interval (e.g. a round
number). This is not a hard and fast rule. It can happen that StartLo=1.2, say.
How many classes should we make? Let k denote the number of classes. Let j denote the class interval number. In Table 1 we have j=0 as the first “dummy” class interval. It is called dummy
because it has no real observations in it. It is included for the purpose of finding the midpoint
where to join the frequency polygon on the horizontal axis.
Now j=1 is the first honest-to-goodness class interval (i.e., interval which has some real
observations in it). This interval starts at the starting-lower-limit-of-the-first-class-interval (StartLo)
defined above, j=2 is the next interval and so on till j=k as the last honest-to-goodness interval.
Finally j=k+1 is the last “dummy” interval.
Rule 3) k= The number of classes chosen by the investigator. This usually ranges between 3 and
20. If the data series is short with only say 20 observations, k=3 or 4 is adequate. More the data,
more the classes might be needed making the k chosen by the researcher to be closer to 20.
Rule 4) Let CIW denote “class interval width.” Then
CIW ≥ (Xmax - Xmin)/k, that is, (maximum value minus the minimum value)/(number of classes).
EXAMPLE: Original Unclassified Data 50 98 82 23
46 40 63 52 92 54. We have n=10 observations here. Assume that
we are asked to make exactly k=3 class intervals j=1 to j=3.
We must begin by sorting the data from the smallest to the largest as:
Xmin=23 40 46 50 52 54 63 82 92 Xmax=98
Rule 1 says StartLo ≤ Xmin, which means StartLo ≤ 23. From Rule 2 we choose a round number,
20. Now by Rule 4, the class interval width must be at least 25 by the formula:
CIW ≥ (Xmax - Xmin)/k = (98 - 23)/3 = 75/3 = 25
This says that the Width of class intervals should be at least 25. Let us try a round number larger than 25, say 30. We choose CIW=30. It turns out that if we had chosen CIW=25 the last upper
limit would become 95 leaving the data point 98 an orphan and we will have to revise our scheme
by increasing the CIW.
Recall that StartLo=20 and width CIW=30. So the lower limit of first honest to goodness class
interval is StartLo=20 and upper limit is simply StartLo plus the width 30 leading to 50. The next
interval is simply upper limit of previous interval plus CIW=30 and so on.
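The arithmetic above can be checked with a minimal R sketch. The variable names (min_ciw, start_lo, ciw, limits) are chosen here for illustration and are not part of the notes.

```r
# Data from the worked example in these notes
x <- c(50, 98, 82, 23, 46, 40, 63, 52, 92, 54)
k <- 3                                   # number of classes requested

min_ciw  <- (max(x) - min(x)) / k        # Rule 4: smallest admissible width
start_lo <- 20                           # round number at or below Xmin (Rules 1 and 2)
ciw      <- 30                           # round number at or above min_ciw

# Limits of the k genuine intervals
limits <- start_lo + ciw * (0:k)

print(min_ciw)   # 25
print(limits)    # 20 50 80 110
```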
An Ambiguity Solved By Convention:
Upper limit of each class interval is the lower limit plus the width. There is ambiguity with respect
to the upper limits, but it is resolved by convention. We always let the real upper limit be a notch
below what is alleged to be the upper limit. For example, the convention says that the upper limit
50 is really 49.999999999999999999999999, but not quite 50 even if it is shown to be 50.
So the measurement 50 belongs to the next class 50 to 80. The reason for this convention is that it
saves the Govt. in printing costs and makes the tables more readable (less cluttered).
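As a hedged illustration of this convention, the base R function cut() with right = FALSE treats every interval as including its lower limit and excluding its upper limit. The names limits, classes and class_of_50 are illustrative only.

```r
x <- c(50, 98, 82, 23, 46, 40, 63, 52, 92, 54)
limits <- c(20, 50, 80, 110)

# right = FALSE: each interval includes its lower limit and excludes its
# upper limit, which matches the convention described in these notes
classes <- cut(x, breaks = limits, right = FALSE)

class_of_50 <- as.character(classes[x == 50])
print(class_of_50)   # "[50,80)"
```

So the measurement 50 lands in the class 50 to 80, not in 20 to 50.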
Table 1

Class j       Lower limit   Upper limit   Midpoint   Tally marks   Frequency f_j   Relative freq. = f_j/n
0 (dummy)     -10           20            5                        0
1             20            50            35         III           3               0.3
2             50            80            65         IIII          4               0.4
3 = k         80            110           95         III           3               0.3
k+1 (dummy)   110           140           125                      0
Totals                                                             n=10            1.0
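The genuine rows of the table can be reproduced with a short R sketch. It assumes the left-closed interval convention described above, and the column names of freq_table are made up for this illustration.

```r
x <- c(50, 98, 82, 23, 46, 40, 63, 52, 92, 54)
start_lo <- 20
ciw <- 30
k <- 3
limits <- start_lo + ciw * (0:k)

lower <- limits[1:k]
upper <- limits[2:(k + 1)]

# Count the observations in each genuine interval (lower limit included)
f <- as.vector(table(cut(x, breaks = limits, right = FALSE)))

freq_table <- data.frame(
  j = 1:k,
  lower = lower,
  upper = upper,
  midpoint = (lower + upper) / 2,
  f = f,
  rel_freq = f / length(x)
)
print(freq_table)
```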
This classification has been successful in the sense that we are asked to make exactly 3 classes and
we have exactly 3 meaningful classes. The scheme is meaningful if it satisfies two tests: (i) There
should be no orphan points and (ii) There should be no orphan intervals in the sense defined below.
Orphan Points Problem:
If there are points in the unclassified data, which are not allocated to any interval whatsoever, then
we call them orphan points. Then we say that the classification is not meaningful or has failed. A
good check is to make sure that Xmin is allocated to the genuine interval with j=1 and Xmax is
allocated to the last genuine interval with j=k.
An example of orphan points: if we choose StartLo=0 and CIW=30, then the intervals would be
0 to 30, 30 to 60 and 60 to 90. Now two numbers, 98 and 92, are orphaned as they belong to no
class. This is not acceptable. We have to go back to the drawing board and fix things.
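A check for orphan points might look like the sketch below. The helper orphan_points() is hypothetical (it does not come from the notes) and assumes k genuine intervals of equal width starting at StartLo.

```r
# Hypothetical helper: returns the data points that fall outside the
# k genuine intervals, which together cover [start_lo, start_lo + k * ciw)
orphan_points <- function(x, start_lo, ciw, k) {
  upper_end <- start_lo + k * ciw
  sort(x[x < start_lo | x >= upper_end])
}

x <- c(50, 98, 82, 23, 46, 40, 63, 52, 92, 54)
print(orphan_points(x, start_lo = 0,  ciw = 30, k = 3))   # 92 98
print(orphan_points(x, start_lo = 20, ciw = 25, k = 3))   # 98
print(orphan_points(x, start_lo = 20, ciw = 30, k = 3))   # numeric(0)
```

The second call is the CIW=25 scheme rejected earlier in the notes, where 98 is left without a class.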
Orphan Intervals Problem:
Whether we have 3 meaningful classes or not is decided by looking at the genuine class intervals
when j=1 and j=k (Not the dummy intervals). Here they are all meaningful in the sense that there is
at least one observation (tally mark) in these classes. So we have no orphan intervals problem,
things are OK. An orphan interval means that there are no observations (Tally marks) in the first or
the last interval, that is, when j=1 or j=k interval. If there are no observations in the intermediate
intervals, (j=2 to j=k-1) that may be a true nature of the data (that is the way the cookie crumbles)
and not an artifact of our classification scheme, hence that situation is not defined as orphan
intervals problem. Remember that the two dummy class intervals are always and by definition
orphans and do not pose any problem.
An example of an orphan interval: if we choose StartLo=20 and CIW=40, then the intervals would be 20 to 60, 60 to 100 and 100 to 140. Now the last interval is orphaned as no observation belongs in it. This
is not acceptable. You were asked to make 3 intervals and you have effectively made two intervals,
20 to 60 and 60 to 100, which contain the entire data. We have to go back to the drawing board and fix things.
Trial and Error in Classification is needed if classification fails the first time around. The
solution to the orphans problem is to go back (to the drawing board) and choose a different StartLo
(the lower limit of the starting interval) and or a different “class interval width” (CIW) and re-do
the tally marks and entire classification.
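The orphan interval test can be sketched in R in the same spirit. The helper has_orphan_interval() is hypothetical and only inspects the first and last genuine intervals, as the definition above requires.

```r
# Hypothetical helper: TRUE when the first or the last genuine interval
# (j = 1 or j = k) contains no observations
has_orphan_interval <- function(x, start_lo, ciw, k) {
  limits <- start_lo + ciw * (0:k)
  f <- table(cut(x, breaks = limits, right = FALSE))
  f[[1]] == 0 || f[[k]] == 0
}

x <- c(50, 98, 82, 23, 46, 40, 63, 52, 92, 54)
print(has_orphan_interval(x, start_lo = 20, ciw = 40, k = 3))   # TRUE
print(has_orphan_interval(x, start_lo = 20, ciw = 30, k = 3))   # FALSE
```

A trial-and-error loop would change StartLo or CIW and re-run such checks until a scheme passes.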
Since the theory requires that CIW ≥ (Xmax - Xmin)/k,
we can choose any CIW which satisfies this inequality. For example, any CIW larger than the
minimum of 25 (such as our CIW=30) still satisfies the inequality.
Common solutions to the orphans problem:
If j=1 is an orphan interval StartLo may have been wrong.
If j=k is an orphan interval, CIW is too large and needs to be reduced.
If there are orphan points, we increase the chosen class interval width (CIW)
Frequency Distribution Graphics by a histogram and frequency polygon
Assume that the trial and error is complete, no interval is an orphan and no point is an orphan.
Only now we have classified the data and constructed the frequency distribution. The word
distribution suggests that we are distributing the n items into k classes. The table represents the
frequency distribution. Now we are ready for frequency histogram and frequency polygon which
are graphical representations of the frequency distribution.
A histogram is a graphical image of the Frequency Distribution or Relative Frequency Distribution
with measurements on the horizontal axis and frequency on the vertical. (It looks a bit like NYC
skyline) We represent frequency by pillars (bars). The height of a pillar is proportional to the
frequency in the particular class interval and the width of the pillar on the horizontal axis starts at
the lower limit of the interval and ends at the upper limit. We draw as many pillars as are intervals.
The dummy intervals will have zero heights since they have zero frequency. Hence it is not
customary to show the dummy intervals for histograms. (See Table 1). In business and economics
applications the pillars (bars) are usually attached.
has a nice Java applet which teaches the effect of changing width on a histogram
Two equivalent definitions of a midpoint:
midpoint = (upper limit + lower limit)/2
midpoint = lower limit + (width/2)
Both definitions work.
For the software called R, the following input will draw the frequency histogram and polygon:
x=c(23, 40, 46, 51, 52, 54, 63, 82, 92, 98)
#note: I changed 50 to 51 to get a cleaner software illustration.
h=hist(x, breaks=c(-10, 20, 50, 80, 110, 140),
main="Histogram and Polygon 23, 40, 46, 51, 52, 54, 63, 82, 92, 98",
xlab="measurements", axes=FALSE)
tik=seq(-10,140,by=15)#define location of tick marks
axis(1,tik,tik)#first tik for location, second is for labels
#axis(1...) is for x axis and axis(2,...) is for y axis
axis(2, 0:4, 0:4)
# now join consecutive midpoints to get polygon
# (assumed completion: the notes stop at the comment above; lines() on the
# midpoints and counts returned by hist() is one way to draw the polygon)
lines(h$mids, h$counts)
For a FREQUENCY POLYGON we need two dummy class intervals at the two ends!
The lower limit of the first dummy interval on the left side = (starting value) MINUS (width).
In Table 1, it is 20 MINUS (CIW=30) = 20 - 30 = -10 or MINUS 10.
The upper limit of first dummy interval is just the lower limit of the first regular (non-dummy)
interval, also called the starting value of the classification process.
The upper limit of the 2nd dummy interval on the right side =(upper limit of last interval) PLUS
CIW (width). In the following example, it is 110+30=140. Of course, the lower limit of this 2nd
dummy interval is simply the upper limit of the last genuine (non-dummy) interval.
In order to draw the frequency polygon, find the midpoints of the two dummy intervals.
Join the midpoints of all intervals consecutively at the tops of the pillars to form a polygon.
A FREQUENCY POLYGON is usually drawn right on top of the freq. histogram by joining the
midpoints at the tops of consecutive pillars. (Take Care to include dummy intervals before drawing
the freq. polygon and determine the midpoints of dummy intervals). Note that the pillars at the
dummy intervals have zero height, representing the fact that there are no observations there. So,
the frequency polygon line starts at the midpoint of the left side dummy interval and ends at the
mid point of the right side dummy interval.
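The corner points of the polygon for Table 1 can be computed as in this minimal sketch. The names polygon_x and polygon_y are illustrative, and the frequencies are typed in from Table 1 rather than recomputed.

```r
start_lo <- 20
ciw <- 30
k <- 3
f <- c(3, 4, 3)                      # frequencies of the genuine intervals

# Lower limits for j = 0 (left dummy) through j = k + 1 (right dummy)
lower <- start_lo + ciw * (-1:k)

polygon_x <- lower + ciw / 2         # midpoint = lower limit + width / 2
polygon_y <- c(0, f, 0)              # dummy intervals have zero frequency

print(polygon_x)   # 5 35 65 95 125
print(polygon_y)   # 0 3 4 3 0
```

With a histogram already on screen, lines(polygon_x, polygon_y) would draw the polygon on top of it.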
Good Graph: Any valid graph should have a Title, Both axes should be properly labeled, there
should be Legends for all curves and a source should be indicated.
Stem and Leaf display is a hybrid graphical method similar to a histogram, but
the data remain visible.
STEM = simply the first digit of the number. For example, if the number is 78, the first digit is 7 and the second is 8.
LEAF = the 2nd digit, here 8.
List each first digit, then a colon, and then the second digits of all numbers (heart rates) with that first digit.
For example, if the heart rates are 45, 56, 44, 70, 72, 60, 61, 47, 53, 48,
then the stem and leaf display is:
4: 5, 4, 7, 8
5: 6, 3
6: 0, 1
7: 0, 2
has a nice description. Sometimes the stem can be based on first two digits and sometimes the leaf
may involve dropping the last digit.
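For two-digit data such as the heart rates above, the display can be rebuilt with integer division and remainder. This is only a sketch with illustrative names. Base R also has a stem() function, but its printed layout differs from the one in these notes.

```r
heart_rates <- c(45, 56, 44, 70, 72, 60, 61, 47, 53, 48)

stems  <- heart_rates %/% 10         # first digit
leaves <- heart_rates %% 10          # second digit

# Group the leaves by stem, keeping the original data order within each stem
display <- split(leaves, stems)
for (s in names(display)) {
  cat(s, ": ", paste(display[[s]], collapse = ", "), "\n", sep = "")
}
```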
Baseball example: (Leaf is going in two directions for comparison of two players)
    Babe Ruth | stem | Barry Bonds
    0 4 3 2 6 |  0   |
            1 |  1   | 6 9
        9 5 2 |  2   | 5 4 5
          5 4 |  3   | 3 4 7 3 7 4 9
1 6 7 6 9 6 1 |  4   | 6 2 0 9 6
        4 9 4 |  5   |
What do you conclude? Babe was 1914 to 1935, Barry was 1986 to 2003
Relative frequency Distribution.
Relative freq. in class j = (frequency number in class j) / (total no. of observations).
These must add up to 1 as they do. Relative freq. is interpreted as the probability of being in that
class interval. Lord Keynes gave the probability interpretation in the 1920s in a book called A Treatise on Probability.
Class j      Lower limit   Upper limit   Midpoint    Frequency   Relative frequency
0 (dummy)    -10           20            10/2 = 5    0           0
1            20            50            35          3           0.3
2            50            80            65          4           0.4
3            80            110           95          3           0.3
4 (dummy)    110           140           125         0           0
Cumulative frequency (top to bottom) = sum of the frequencies in the current class and all previous classes. It
goes with the upper limits of class intervals. The interpretation of the cumulative frequency for j=1 (top to
bottom case) is simply that there are 3 measurements in the data set that are less than the upper limit 50 of the j=1 interval. A graph of this, with the upper limits of intervals
(measurements) on the horizontal axis and cumulative frequency on the vertical axis, is called the “Less
Than Ogive.” The scale on the vertical axis goes from zero to n.
Cumulative frequency (bottom to top) = sum of the frequencies in the current class and all
subsequent classes below it. This number goes with the lower limits of class intervals. What does it mean? The interpretation of the cumulative frequency for j=2 (bottom to top case) is simply that there are 7
measurements in the data set that are greater than or equal to the lower limit 50 of the j=2 interval. A graph of this, with the lower limits of intervals (measurements) on the horizontal axis and bottom
to top cumulative frequencies on the vertical axis, is called the “Greater Than Ogive.” Be sure to label the
axes: measurements on the horizontal axis and cumulative frequency on the vertical axis.
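Both kinds of cumulative frequency follow from the class frequencies with cumsum(), as in this small R sketch. The names cum_less and cum_greater are illustrative.

```r
f     <- c(3, 4, 3)                  # frequencies for j = 1, 2, 3
lower <- c(20, 50, 80)
upper <- c(50, 80, 110)

cum_less    <- cumsum(f)             # top to bottom, goes with upper limits
cum_greater <- rev(cumsum(rev(f)))   # bottom to top, goes with lower limits

print(data.frame(upper, cum_less))       # cum_less:    3  7 10
print(data.frame(lower, cum_greater))    # cum_greater: 10 7  3
```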
See the Table below. The scale on the vertical axis for the cumulative frequencies goes from zero to
n (=10). Hence it is obvious that the graph of Ogives should be drawn separately from the graph of
histogram or polygon. However the two Ogives can and should be drawn on the same graph
because the Median of the Classified Data (grouped data) can be determined graphically as the
measurement where the “Less Than Ogive” intersects the “Greater Than Ogive.”
Lower limit = horizontal axis for the “Greater than ogive”; Upper limit = horizontal axis for the “Less than ogive.”

Class j      Lower limit   Upper limit   Frequency   Cumulative freq.        Cumulative freq.
                                                     (top to bottom),        (bottom to top),
                                                     Less than Ogive         Greater than Ogive
0 (dummy)    -10           20            0           0                       10
1            20            50            3           3                       10
2            50            80            4           7                       7
3            80            110           3           10                      3
4 (dummy)    110           140           0           10                      0
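The graphical median described above can also be computed numerically. This is a sketch that assumes the ogive points are joined by straight lines. Under that assumption the two ogives add up to n at every measurement, so they cross where both equal n/2. The name median_grouped is illustrative.

```r
n <- 10
limits      <- c(20, 50, 80, 110)    # class limits on the horizontal axis
cum_less    <- c(0, 3, 7, 10)        # "Less Than Ogive" heights at the limits
cum_greater <- n - cum_less          # "Greater Than Ogive" heights: 10 7 3 0

# The ogives cross at height n / 2; invert the less-than ogive there
# by linear interpolation
median_grouped <- approx(x = cum_less, y = limits, xout = n / 2)$y
print(median_grouped)   # 65
```

This agrees with reading the intersection off the graph: between 50 and 80 the “Less Than Ogive” rises from 3 to 7 and reaches 5 halfway, at 65.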
See "How to Lie With Statistics" by Darrell Huff. He wrote that "the secret language of
statistics, so appealing in a fact-minded culture, is employed to sensationalize, inflate, confuse and
oversimplify." He preached to readers that they were part of the chain of accountability, and needed to look for
bias in the origin of the statistics or their treatment, and to ask the kinds of questions that poked
holes in shoddy or dishonest work.