Advanced Probability and Statistics Module 3
Greetings Statfolk. As I read chapters 4 - 6, I found a couple of mistakes in the book. On page 131 the variance formula is wrong: it should be multiplication in the numerator rather than addition. Please make this change in pencil in your books. Also, in Figure 5.3 on page 151, the caption claims that the distribution does not change shape, but it obviously does. The book is correct only under a certain condition. The book does a poor job explaining this, so I’ll try to do better in one of the
problems below. Finally, to clarify some notation: the book writes Σ x_i, with no limits on the sigma. This isn’t that bad, but technically it should be written Σ (from i = 1 to n) of x_i. Explanation: there are n pieces of data labeled x_1 through x_n; the subscripts make this clear. i is an iterator that runs from 1 to n to define clearly where the summation is to begin and end. Incidentally, I created the above equations with Microsoft Equation Editor, which you might have to install from your disc if you want to use it (see the Help menu under "install equation editor"). However, you can get around this. In the Insert menu choose Symbol, then in the Font box choose Symbol and you’ll find the capital sigma symbol: Σ. You could then write x-bar = (1/n) Σ x_i, assuming that you have already defined the use of Σ to mean the sum on i from 1 to n. To create a subscript in Word, press Control and the = sign. Then type the subscript. These keystrokes toggle the subscripting on and off. (To create a superscript, press Control-Shift-Equal.) Putting identifiers in equations in italics is a good idea.
You’ll need to read through chapters 4, 5, 6. There are a lot of tables and graphs, so it won’t be too bad.
The first two columns of the spreadsheet are a repeat of the Stooge data from Module 2. As you recall, we measured the variability of the data via the interquartile range. This time we’ll do it with the standard deviation. Create a spreadsheet like the one below. You can copy and paste the first two columns, but you’ll have to figure out what formulae to put in the other cells. You will know your spreadsheet is correct when it matches up with what is shown here. You can get subscripts and superscripts in Excel by highlighting and going to Format, Cells, Font. Calculate x-bar using the Average function. The third, fourth, and fifth columns should refer back to the mean. (Make sure to use dollar signs in the Excel formula where necessary.) Use the Sum function to get the totals at the bottom (or use the sigma button on the toolbar). Calculate s^2 by dividing one of the totals by the appropriate number. Use s^2 to compute s. Then use the built-in Var and Stdev functions in Excel to check your s^2 and s values. To get help with any function you can use the function wizard. Click on fx in the function bar to see a list of functions. You can select Statistical as a category and then type V to jump down to functions starting with a V. If you select Var, it will guide you through using that function. You can also click on "Help With This Function" to see examples. Note: With all four of the functions we’re using you can enter just one argument, the list of numbers, rather than entering each number separately. This list can be entered by clicking on the first number, holding, and letting go on the last number. For example, to find the mean of numbers in cells G100 through G130 you would enter =average(G100:G130). You don’t even have to type G100:G130. Instead, after typing the left parenthesis, just click on G100 and release on G130.

 i     x_i   x_i - x-bar   |x_i - x-bar|   (x_i - x-bar)^2
 1      34       0.33           0.33             0.11
 2      13     -20.67          20.67           427.11
 3      23     -10.67          10.67           113.78
 4      25      -8.67           8.67            75.11
 5      36       2.33           2.33             5.44
 6      45      11.33          11.33           128.44
 7      19     -14.67          14.67           215.11
 8      34       0.33           0.33             0.11
 9      25      -8.67           8.67            75.11
10      33      -0.67           0.67             0.44
11      17     -16.67          16.67           277.78
12       9     -24.67          24.67           608.44
13      27      -6.67           6.67            44.44
14      53      19.33          19.33           373.78
15      89      55.33          55.33          3061.78
16      30      -3.67           3.67            13.44
17      39       5.33           5.33            28.44
18      20     -13.67          13.67           186.78
19      49      15.33          15.33           235.11
20      35       1.33           1.33             1.78
21      54      20.33          20.33           413.44
22      40       6.33           6.33            40.11
23      28      -5.67           5.67            32.11
24      31      -2.67           2.67             7.11
Totals           0.00         275.33          6365.33

x-bar = 33.67     s^2 = 276.75     s = 16.64
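If you’d rather double-check your spreadsheet outside of Excel, here is a minimal sketch of the same computations in Python (the variable names are my own; the data are the 24 Stooge counts from the first table column):

```python
# The 24 Stooge head-bonk counts from the table above.
data = [34, 13, 23, 25, 36, 45, 19, 34, 25, 33, 17, 9,
        27, 53, 89, 30, 39, 20, 49, 35, 54, 40, 28, 31]

n = len(data)
x_bar = sum(data) / n                     # what the Average function gives
devs = [x - x_bar for x in data]          # third column: deviations
abs_devs = [abs(d) for d in devs]         # fourth column
sq_devs = [d * d for d in devs]           # fifth column

s_squared = sum(sq_devs) / (n - 1)        # variance: total divided by n - 1
s = s_squared ** 0.5                      # standard deviation

print(round(x_bar, 2))             # 33.67
print(round(abs(sum(devs)), 2))    # 0.0  (the deviations cancel)
print(round(sum(abs_devs), 2))     # 275.33
print(round(s_squared, 2))         # 276.75
print(round(s, 2))                 # 16.64
```

These match the totals and the s^2 and s cells in the spreadsheet.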
1. a. What are the numbers in the third column called? (There are two different names for them.) b. Interpret the positive or negative signs of these numbers. c. Interpret the magnitudes of these numbers. d. Notice that they sum to zero. This is no fluke. Give a formal mathematical proof that this is always the case for any data. It’s a very short proof, but it will help to know the following fact about summations: the sum from 1 to n of a constant means to add up the constant n times, which yields n times the constant.
2. One reason for squaring the deviations before summing them is to make them positive. As you just proved, simply summing the deviations always yields zero and, therefore, gives no information about how spread out the data is. By averaging the sum of the squares of the deviations and taking the square root, we get the standard deviation. The fourth column wasn’t necessary for computing the variance or standard deviation, but look how close the total for column 4 is to s^2. Using absolute values is a perfectly valid way of measuring variability in a data set, but it is not commonly used. Explain why the two numbers are not exactly the same. That is, why don’t squaring and square-rooting undo each other to give the total in column 4?
3. a. What are the units for our standard deviation and variance? b. What is the interpretation of the standard deviation? c. Delete (temporarily) all the numbers in the row corresponding to the outlier. What is the new s value? Hints: Refer to Excel’s value for s, or modify your own to divide by one less; the new s should be 12-ish. This shows that, without the outlier, the data are almost 30% less spread out (as measured by standard deviation), even though the mean doesn’t change much. d. If our original Stooge data were distributed normally (which they’re not), about how many of the 24 episodes would show Curley or Larry getting smacked in the head between 17 and 50 times? Explain.
4. a. What’s the difference between σ (lower case sigma) and s? b. The corresponding symbols for the mean are μ and x-bar. What do you suppose μ is for?
5. In explaining why we divide by n – 1 rather than n when computing s, the book discusses degrees of freedom. How many degrees of freedom are there when it comes to: a. the angles in a triangle? b. the heights of 50 people, knowing the average height is 5'9"? c. the standard broad jump distances of n PE students where the average jump distance is known?
6. Give an example of an experiment in which binary data is collected. Make up some numbers and state what n, X, Y, and p
equal. Compute the mean, which should be the same as p.
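Here is one concocted example in Python (all the numbers are made up, as the problem asks) showing that for binary data the ordinary mean is just p:

```python
# A made-up binary experiment: flip a thumbtack 20 times and record
# 1 if it lands point-up, 0 otherwise.
outcomes = [1, 0, 1, 1, 0, 1, 0, 0, 1, 1, 1, 0, 1, 0, 1, 1, 0, 1, 1, 0]

n = len(outcomes)         # number of trials: 20
X = sum(outcomes)         # number of successes (1s): 12
Y = n - X                 # number of failures (0s): 8
p = X / n                 # proportion of successes: 0.6

mean = sum(outcomes) / n  # the ordinary mean of the 0s and 1s
print(mean == p)          # True: the mean of binary data is p
```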
7. Show that the formula given for variance of binary data on page 127 is equivalent to how it was computed in the spreadsheet: the sum of the squares of the deviations divided by (n – 1). Hints: write x-bar in terms of X; expand a binomial; Σ(a_i ± b_i) = Σa_i ± Σb_i; X is a constant, hence it comes out of summations.
8. Using similar reasoning as in the last proof, it can be shown that the formula on page 128 gives the variance of binary data as a function of p. Thus, variance = f(p) = np(1 – p)/(n – 1). Suppose you’re estimating the number of people in Urbana who have been to Europe. You pick many people at random and ask them (1 for yes, 0 for no). If everyone has been to Europe or if nobody has, there should be no variability at all. a. Use the function above to show that this is indeed the case. b. My intuition tells me that the most variability would occur if exactly half of the people went and half never went. What is the variance in this case? c. Prove that my intuition is correct using calculus. Hints: Note that n is a constant, so if we graphed the function, it would be a parabola opening down, since f is proportional to p(1 – p). Expand and take a derivative term by term. Remember the power rule: if y = x^k, then y' = kx^(k – 1) for constant k. The derivative is a formula for the slopes of tangent lines on the graph of the original function. The maximum value of our function must occur where the tangent line is horizontal. Thus, we want to know where the slope is zero. So, set your derivative equal to zero and solve for p.
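Before doing the calculus in part c, you can eyeball the claim numerically. A Python sketch, with n = 50 as an arbitrary sample size of my own choosing:

```python
# Variance of binary data as a function of p, from the formula above.
# n = 50 is an arbitrary choice for illustration.
n = 50

def f(p):
    return n * p * (1 - p) / (n - 1)

print(f(0.0), f(1.0))   # 0.0 0.0 : no variability if all answers agree

# Scan p across [0, 1]; the largest variance occurs at p = 0.5.
grid = [k / 100 for k in range(101)]
p_best = max(grid, key=f)
print(p_best)           # 0.5
```

This is only a numeric check, of course; the calculus proof is still yours to do.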
9. Let’s see what formula Excel uses for standard deviation. Use the Help as described before problem #1 to see it. Notice that there are two summations in the formula Excel uses, compared to one in the formula we use. The formulae are equivalent, however, which is an optional proof for you. At least explain in words in what way the summations are different.
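For reference, here is a numeric spot-check (not the optional proof) that a two-summation formula of the kind older Excel Help shows, s = sqrt((n·Σx² – (Σx)²)/(n(n – 1))), agrees with ours. Treat that formula as an assumption on my part and confirm it against your own Help screen:

```python
# Compare our one-summation formula for s with the two-summation form.
from math import sqrt, isclose

data = [34, 13, 23, 25, 36, 45, 19, 34]   # any list of numbers works
n = len(data)
x_bar = sum(data) / n

# Ours: sum of squared deviations, divided by n - 1, square-rooted.
ours = sqrt(sum((x - x_bar) ** 2 for x in data) / (n - 1))

# Two-summation form: uses sum(x^2) and (sum x)^2, no mean needed.
two_sums = sqrt((n * sum(x * x for x in data) - sum(data) ** 2)
                / (n * (n - 1)))

print(isclose(ours, two_sums))   # True
```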
10. Play around with a graphing calculator and find a function that has a graph that looks like a histogram for a normal distribution (a nice bell curve). Hints: You’ll need a vertical axis of symmetry; for simplicity let this be the y-axis. You know some properties of functions when it comes to symmetry. You also want a function with a maximum at x = 0 and limits of zero as x → ±∞. The domain must be all real numbers. There is no periodicity, so forget trig functions. It’s easy to prove that a polynomial won’t work either.
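So as not to spoil the search, here is a small Python harness (entirely my own invention) that tests any candidate function against the properties in the hints; the two candidates shown fail for the reasons noted.

```python
# Check a candidate f against the bell-curve properties above:
# symmetry about the y-axis, a maximum at x = 0, and vanishing tails.
from math import cos

def looks_bell_shaped(f):
    symmetric = all(abs(f(x) - f(-x)) < 1e-12 for x in (0.5, 1, 2, 3))
    peak_at_zero = all(f(0) > f(x) for x in (0.5, 1, 2, 3))
    vanishing_tails = abs(f(50)) < 1e-6 and abs(f(-50)) < 1e-6
    return symmetric and peak_at_zero and vanishing_tails

print(looks_bell_shaped(cos))                 # False: periodic, tails never die
print(looks_bell_shaped(lambda x: 1 - x * x)) # False: plunges to -infinity
```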
11. A normal distribution can have any mean and any (positive) standard deviation. Suppose two normal distributions have the same mean but one has a bigger standard deviation. Briefly describe how their graphs differ.
12. A standard normal distribution has a mean of zero, a standard deviation of 1, and long tails that diminish to zero for large
positive and negative values, like the graph you hopefully found in #10. A very cool property of the standard normal’s
graph is that all the area between it and the x-axis is exactly 1, and areas are equivalent to probabilities. It is possible to
transform any normal distribution into a standard normal one and use the table on the front and back covers of the book to
find percentiles and probabilities. Suppose the Acme Soy Burger Factory cranks out thousands of soy burgers a day. On their packaging it is claimed that each patty has a mass of 70 g, but processing is not perfect. The masses of the patties are
actually distributed normally with a mean of 70 g and a standard deviation of 3 g. a. Within what range of masses will
about 68% of the burgers lie? Hint: see bottom of page 115. b. 2.5% of the burgers will be more massive than ___ . Hint:
see page 115 and use symmetry. c. What percent of the patties are from 64 to 73 grams? Hint: see page 115 and use
symmetry. d. If a burger were selected at random, what is the probability that its mass is less than 61 g ? In order to
answer questions concerning masses that aren’t multiples of the standard deviation, we’ll have to transform the data to
make it standard normal: all masses will have to be decreased by 70 g (to shift the curve from a peak at 70 to a peak at 0); then we divide by the standard deviation of 3 g so that the new standard deviation will be 1. An example of this is on page 123. If you’ve got a 70.4 g patty, it should be just over the 50th percentile mark. Let’s figure it out. First subtract the mean: 70.4 g – 70 g = +0.4 g (the number of grams this burger is above the mean). Then divide by the stan. dev.: 0.4 g / 3 g ≈ +0.13. This is called a Z statistic, and it tells us that our burger is 0.13 standard deviations above the mean. Looking this value up on the front cover gives a probability of 0.552, meaning that this particular patty is in the 55th percentile for mass. It also means that the probability for randomly selecting a burger less than 70.4 g is about 55%. There’s really
nothing special about a normal curve being standard, but if we didn’t have some standard, we’d need a table for every
possible normal distribution (infinitely many combinations of mean & standard deviation). e. What is the probability that a random burger will be less than 68.9 g? f. greater than 66.5 g? g. between 65 g and 71 g? You may
wonder how the table in your book was compiled. As with trig tables and log tables, the answer involves calculus. There is a function that describes the standard normal curve, and the integral of this function from one point A to another point B on the x-axis is the area under the curve, which corresponds to the probability of a random choice being between the values A and B. h. Use the table, rather than calculus, to find the area under the standard normal curve from x = -0.35 to x = 1.16.
With Excel you don’t need a table, and you don’t even have to use a Z value. Example: =NORMDIST(67,70,3,TRUE)
will evaluate to 0.1587, the probability that a randomly chosen value will be less than 67 when chosen from a normal distribution of mean 70 and standard deviation 3. i. Use the Normdist function to find the probability that a random patty is less than 63.2807 g. (It should be almost 13%.) Suppose now that the probability of a burger being under a certain
mass is 63.8%. Here’s how we find that mass. We use the table in reverse: look for 0.638 in the probability column. The closest number we find is 0.637, which corresponds to a Z value of 0.35. Now we have to do an inverse transformation on 0.35 to determine the corresponding mass: we multiply by the stan. dev. and add the mean, yielding about 71 g. This is the max mass (the cut-off point) for a probability of 63.8%. With Excel we could enter =NORMINV(0.638,70,3), which yields the same answer but with more precision. j. Use the table and Excel to find the mass m for which the probability of a random burger being less than m is 96.25%.
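If you have Python handy (3.8 or later), the standard library’s statistics.NormalDist does everything the table, NORMDIST, and NORMINV do. A sketch using the burger numbers above:

```python
# The Acme burger distribution: mean 70 g, standard deviation 3 g.
from statistics import NormalDist

burgers = NormalDist(mu=70, sigma=3)

# P(mass < 67 g), same as Excel's =NORMDIST(67,70,3,TRUE)
print(round(burgers.cdf(67), 4))         # 0.1587

# The 70.4 g patty: its Z statistic and its percentile
z = (70.4 - 70) / 3
print(round(z, 2))                       # 0.13
print(round(burgers.cdf(70.4), 3))       # 0.553 (the table gave 0.552
                                         # because Z was rounded to 0.13)

# The reverse question, same as Excel's =NORMINV(0.638,70,3)
print(round(burgers.inv_cdf(0.638), 1))  # 71.1
```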
13. As you will recall from chemistry, the higher the hydronium ion concentration, the more acidic a solution is. The acidity is determined by the strength of the acid (how readily it dissociates into H+ ions and anions) and its concentration. If you looked at thousands of solutions and measured hydronium concentration directly, your data would range by many orders of magnitude. Let C = [H3O+] in units of mol/L. You might find values of C ranging from less than 10^-14 to greater than 1. This range spans 14 orders of magnitude! If some of the C values are 2.3×10^-4, 5.5×10^-4, 7.2×10^-4, these values might all get lumped together in one interval in a stem-and-leaf diagram or in a histogram, due to the enormous range of the data. To show this kind of detail you might use intervals of width 10^-4 mol/L, but this would not separate out C values on the order of 10^-14. To do this you’d need intervals of width around 10^-14 mol/L. a. About how many intervals would be required to display all of your data at this tiny width? Obviously this is not feasible. You could use fewer intervals, of course, but then you lose detail. Or, you could use widths of varying sizes, which, in a sense, is what we’ll do via a data transformation. Logs will make the data more manageable. b. If you take the negative log of each C value, about how much does the data range now? Let’s call the transformed data T values. c. What are these T values known as in chemistry? d. If your T intervals are all one unit in width, does this correspond to C intervals of equal width? If not, which T intervals correspond to wider C intervals? Another nice application of logs comes about when dealing with exponential relationships. Suppose y = (3.2)^x is a relationship that holds between two quantities, x and y. If you’re trying to discover this relationship via scientific means, you might make measurements, plot many ordered pairs, and notice that an exponential curve might fit these data well. But, due to limited data and measurement error, it might not be obvious whether the data is best described by an exponential, polynomial, or some other function. Moreover, even if you can decide on an exponential function, it will not necessarily be apparent exactly what the base is. Not to fear: logarithms will come to the rescue! If you suspect an exponential relationship, instead of graphing y vs. x, graph log y vs. x. e. Explain how doing this will allow you to determine whether the original data really are related exponentially, and, if so, how it allows you to find the base. Suppose now y = x^3.2. This is not exponential; it’s a power function. f. If you suspect data are related by a power function, explain how you could use logs and graph transformed quantities to determine if this is the case and, if so, how to find the power.
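The straightening trick in parts e and f can be checked numerically. A Python sketch (the sample x values are arbitrary):

```python
# If y = b**x, then log10(y) vs. x is a line with slope log10(b);
# if y = x**k, then log10(y) vs. log10(x) is a line with slope k.
from math import log10

xs = [1.0, 2.0, 3.0, 4.0, 5.0]

# Exponential case, y = 3.2**x: recover the base from the slope.
ys_exp = [3.2 ** x for x in xs]
slope_exp = (log10(ys_exp[-1]) - log10(ys_exp[0])) / (xs[-1] - xs[0])
print(round(10 ** slope_exp, 4))   # 3.2, the base

# Power case, y = x**3.2: recover the power from the log-log slope.
ys_pow = [x ** 3.2 for x in xs]
slope_pow = ((log10(ys_pow[-1]) - log10(ys_pow[0]))
             / (log10(xs[-1]) - log10(xs[0])))
print(round(slope_pow, 4))         # 3.2, the power
```

With real (noisy) measurements you would fit a line through all the transformed points rather than using just the endpoints as done here.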
14. a. If you added 10 head bonks to the count in each Stooge episode and made a new histogram with intervals of the same widths as the original, how would the new histogram differ? b. Use Excel to double all the noggin busters and create a new histogram, still with interval widths of 10 units. The shape of this histogram will appear a little different than the original one. c. Make one more histogram, this time with the doubled data separated into intervals of width 20 units. This one should look just like the original.
15. Transforming data via square roots can, like logs, make data sets that span many orders of magnitude more manageable. See the pic on page 152. a. At the high end, gaps between data are [ compressed / expanded ]. For data values between zero and one, these gaps are [ compressed / expanded ]. b. Which compresses and expands the gaps to a greater extent, logs or roots? Explain briefly. c. There could be a disadvantage to using logs with continuous, positive data when some of the values are very small. Why?
16. The table below contains some data I concocted. It shows the number of bullet holes in Snoopy’s flying doghouse after each combat mission against his archenemy, the Red Baron. Fortunately, Snoopy himself never gets hit, and he repairs his doghouse after each mission. As you can see from the data, some missions are much more successful than others.
Aerial Combat Date (October):  1   2    4   9  10  11  12   18  19  23  25  27   28  29    31
Bullet Holes in Doghouse:      6   1  115  27   2  45  19  348  75   3  17  99  502  16  1850
Copy and paste the data into Excel vertically. Do this by copying, right clicking in the spreadsheet, choosing Paste
Special, and choosing Transpose. Then sort the data so it is in nondecreasing order. a. How many orders of magnitude do the data values span? b. How is the data skewed, to the right or left? c. If you were to create a stem-and-leaf with stems of 0, 1, 2, 3, … , with stems representing tens of bullet holes, about how many stems would you need? This is obviously impractical. d. Prepare a diagram using the stems 0, 0, 1, 1, 2, 2, … , 18, 18, with stem units in 100’s of holes. Each row
in your diagram is an increase of 50, but there are still an awful lot of rows. Even worse, much of the data gets bunched up in the first couple of rows, while many rows are left blank. It’s time for a transformation! Create a spreadsheet like the one below. Make it so the exponent, n, and the base, b, can be changed. At the bottom include a row for column averages and standard deviations.

Aerial Combat   Bullet Holes in    x^n          Log(x) in
Date            Doghouse, x        (n = 0.5)    base b (b = 10)
 2                    1              1.000         0.000
10                    2              1.414         0.301
23                    3              1.732         0.477
 1                    6              2.449         0.778
29                   16              4.000         1.204
25                   17              4.123         1.230
12                   19              4.359         1.279
 9                   27              5.196         1.431
11                   45              6.708         1.653
19                   75              8.660         1.875
27                   99              9.950         1.996
 4                  115             10.724         2.061
18                  348             18.655         2.542
28                  502             22.405         2.701
31                 1850             43.012         3.267
mean               208.33             9.63          1.52
stan. dev.         476.36            11.13          0.92

Note that with the original data, the standard deviation is so large that just one standard deviation either side of the mean incorporates negative numbers, which are meaningless. The square root and log base 10 transformations shown have smaller means and, proportionally, even smaller standard deviations, compared to the original data. (The mean in the Log column is more than 100 times smaller than the original mean, but its standard deviation is over 500 times smaller.) e. How many standard deviations are there from zero to the mean in each of the columns shown? Hint: you may want to make your spreadsheet compute this automatically. f. None of the columns shown is a normal distribution, but which is "more normal" than the other two, and why? g. As n gets smaller, what happens to the number of standard deviations between zero and the mean? h. What happens to the range? i. The more orders of magnitude spanned by the original data, the [ bigger / smaller ] n would have to be to "tame" the data. j. As b changes, what happens to the number of standard deviations between zero and the mean? Cool, huh? k. What happens to the range? The more orders of magnitude spanned by the original data, the [ bigger / smaller ] b would have to be to "tame" the data. Let’s work with the log-transformed data in base 10 for a moment. Notice that there were 115/19 ≈ 6.05 times as many bullet holes on Oct. 4 as on Oct. 12. Look at the corresponding Log entries. Subtracting, we get 2.061 – 1.279 = 0.782. The inverse transformation of this is the antilog of 0.782, that is, 10^0.782 ≈ 6.05. Last part to this question: Give a very short proof that this works in general (for any base and for any two data entries).
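The ratio claim at the end of the problem is easy to verify numerically before you prove it (the data are the bullet-hole counts above):

```python
# Difference of logs, then the antilog, recovers the original ratio.
from math import log10

holes = [6, 1, 115, 27, 2, 45, 19, 348, 75, 3, 17, 99, 502, 16, 1850]

diff = log10(115) - log10(19)   # Oct. 4 minus Oct. 12, in log land
print(round(diff, 3))           # 0.782
print(round(10 ** diff, 2))     # 6.05, i.e. 115/19

# And the original-column stats, matching the spreadsheet's bottom row:
n = len(holes)
mean = sum(holes) / n
s = (sum((x - mean) ** 2 for x in holes) / (n - 1)) ** 0.5
print(round(mean, 2), round(s, 2))   # 208.33 476.36
```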
17. In a normal distribution the mean, median, and mode are all the same. Describe a "non-normal" distribution that has this property.
18. Suppose you live in a middle class neighborhood of 110 families. Family income is distributed approximately normally, with a mean of 50 and a standard deviation of 2.5 (both in units of thousands of dollars). One day Bill Gates decides to move into a house down the block. a. Qualitatively, how are the mean and median affected by this? b. How are the standard deviation and interquartile range affected? c. After Gates moves in, which pair of descriptors is more informative: mean and standard deviation, or median and interquartile range? Why? d. If Gates hadn’t moved in but every family in the neighborhood got a 10 grand raise, how would this affect all four of the aforementioned descriptors? e. With Gates there, what would the trimmed mean be? f. This time, no Gates, but the families making just above 50 grand get a small raise, and the families just under 50 get a small pay cut. How does this affect the shape of the curve? g. Given the original information, about how many families make between $47,500 and $52,500? h. What is the 75th percentile for income? i. How much must a family’s income be in order to be in the top 10%? j. What is the probability that a randomly chosen family makes between 35 and 40 grand? Last part of the question: What range of incomes, centered at the mean, corresponds to a 50% likelihood for a random pick? In other words, where should you start and stop measuring area under the curve such that your starting and stopping points are symmetric with respect to the mean and the area is half of the total area? (It’s a fairly small range.)
19. Check out the graph on page 179. Give ballpark estimates of the following (no calculations) with a very brief reason: a. the probability that a random element of the population has a value greater than the mean; b. the interquartile range divided by the standard deviation; c. the portion of the area under the curve that is not shaded.
20. Check out Display 6.7 on page 179. a. What is the minimum value among the original data? b. That minimum is changed to what in the second row? c. As the original minimum is changed to larger and larger values, how does it affect the four descriptors? d. Which pair of descriptors is more sensitive to outliers? e. Which pair tends to ignore extreme values?
21. A large standard deviation on test score data could mean that people’s scores varied greatly. If this is the case, then the
standard deviation is a useful measure. However, there could be another explanation for a large standard deviation.
Explain how most of the people could have gotten B’s on the test, yet the standard deviation is still high. What would be
a more useful descriptor in this case?
22. Here’s a simple, interesting puzzle. Say you have a list with an even number of data. a. You split the data into two
equally sized groups, average each group, and then average the averages. How does this value compare to the average of
the entire list (same or different)? b. What if you split the list into groups of unequal size? c. Explain how the latter is
like a weighted average. d. You split the data into two equally sized groups, find the standard deviation of each group,
and then average those standard deviations. How does this value compare to the standard deviation of the entire list (same
or different)? e. What if the two groups are the same size and contain identical data?