960:211:02 Spring 2004

    Introductory Statistics I

    Sampling Distributions and the Central Limit Theorem

I. INTRODUCTION

    In order to “bridge the gap” between probability and statistical inference, we need to study the concept of sampling distributions. Before we start deriving the key concepts, let us begin by relaxing one key element of probabilistic analysis: (total) knowledge of the population. In statistical inference, we do not know much, if anything, about the population of interest. We must use samples as a means of analysis. For example, a natural way to estimate the population mean is by using the sample mean x̄. However, is it clear that this value of the sample mean is any “good”? It is crucial to note that there are many possible (random) samples that could be taken, so our value of x̄ depends on the sample that was selected. Naturally, this is where intuition ends and mathematics begins. We must develop some way of quantifying how useful such an estimator is. So, in general, statistical inference is used to draw conclusions about populations of interest by means of sample data. Such conclusions, though, must always be evaluated, and this is where probability is used. The link between statistical inference and probability is the concept of sampling distributions. Before proceeding, let us state a few key definitions:

    • Statistic: A function of sample data (e.g., the sample mean, the sample standard deviation, the sample proportion). Since data is obtained through random sampling, a statistic is a random variable. Remember, the value of a sample statistic varies with each possible sample, so a sample statistic is, in fact, a random variable.

    • Sampling Distribution: The probability distribution of a sample statistic (derived with all samples from the population having the same size).

    • Estimator: Usually a sample statistic that is used to estimate the corresponding population parameter (e.g., the sample mean estimates the population mean, the sample standard deviation estimates the population standard deviation, and the sample proportion estimates the population proportion).

II. SAMPLING DISTRIBUTIONS FROM DISCRETE POPULATIONS & UNBIASEDNESS

    To motivate the previous terms, let us study an example. Suppose that we are interested in the following population: {0, 1, 2, 3, 4}. If we are interested in taking a sample of size n = 2 without replacement from this population, there are ₅C₂ = 10 possible samples. If we are able to explicitly study each sample, we could derive the theoretical sampling distribution of any statistic (e.g., the sample mean, sample variance, sample standard deviation). Table 1 illustrates this.

    Sample Elements   Prob. of Sample   X̄     S²    S
    {0, 1}            1/10              0.5    0.5   0.7071
    {0, 2}            1/10              1.0    2.0   1.4142
    {0, 3}            1/10              1.5    4.5   2.1213
    {0, 4}            1/10              2.0    8.0   2.8284
    {1, 2}            1/10              1.5    0.5   0.7071
    {1, 3}            1/10              2.0    2.0   1.4142
    {1, 4}            1/10              2.5    4.5   2.1213
    {2, 3}            1/10              2.5    0.5   0.7071
    {2, 4}            1/10              3.0    2.0   1.4142
    {3, 4}            1/10              3.5    0.5   0.7071

    Table 1: All ten possible samples of size n = 2 along with the associated sample mean (X̄), sample variance (S²), and sample standard deviation (S).
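
    These notes do not show how Table 1 was computed, so here is a minimal sketch that reproduces it. It uses Python (standard library only); the choice of language and the variable names are ours, purely for illustration. Note that the statistics module's variance and stdev use the n − 1 divisor, matching S² and S above.

        from itertools import combinations
        from statistics import mean, variance, stdev

        population = [0, 1, 2, 3, 4]

        # Each of the C(5, 2) = 10 samples of size n = 2 (without replacement)
        # is equally likely, so each has probability 1/10.
        for sample in combinations(population, 2):
            print(sample, "1/10", mean(sample), variance(sample), round(stdev(sample), 4))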


    It is important to note that all samples are equally likely (in this case each with probability 1/10). This is very common in practice and is always going to occur with simple random sampling (with or without replacement). The other methods of sampling (cf. Chapter 1) that we have discussed may not necessarily lead to equally likely samples, but for our purposes, we will confine ourselves to simple random sampling or assume that simple random sampling was used.

    Using the calculations in Table 1, we can now tabulate the sampling distribution for each statistic examined. The results appear in Table 2.

    (a) Sampling distribution of the sample mean X̄:

        X̄       0.5    1.0    1.5    2.0    2.5    3.0    3.5
        P(X̄)    1/10   1/10   2/10   2/10   2/10   1/10   1/10

    (b) Sampling distribution of the sample variance S²:

        S²       0.5    2.0    4.5    8.0
        P(S²)    4/10   3/10   2/10   1/10

    (c) Sampling distribution of the sample standard deviation S:

        S        0.7071   1.4142   2.1213   2.8284
        P(S)     4/10     3/10     2/10     1/10

    Table 2: (a) Sampling distribution of the sample mean, (b) Sampling distribution of the sample variance, (c) Sampling distribution of the sample standard deviation.

    Again, it is crucial to note that statistics are random variables. Therefore, just like any random variable, we can apply certain operations such as expectation and variance. Let us begin by computing the expected value, variance, and standard deviation of X̄.

    E(X̄) = Σ x̄·P(x̄) = (0.5)(1/10) + (1.0)(1/10) + ··· + (3.5)(1/10) = 2.0

    V(X̄) = Σ (x̄ − μ_X̄)²·P(x̄) = (0.5 − 2.0)²(1/10) + (1.0 − 2.0)²(1/10) + ··· + (3.5 − 2.0)²(1/10) = 0.75

    SD(X̄) = √V(X̄) = √0.75 ≈ 0.8660
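
    The same arithmetic can be checked in a few lines of code. The sketch below (again Python, now with NumPy; the array names are ours) encodes the sampling distribution from Table 2(a) and computes the three quantities above.

        import numpy as np

        # Sampling distribution of the sample mean, from Table 2(a)
        xbar  = np.array([0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5])
        probs = np.array([1, 1, 2, 2, 2, 1, 1]) / 10

        e_xbar  = np.sum(xbar * probs)                   # E(X̄)  = 2.0
        v_xbar  = np.sum((xbar - e_xbar) ** 2 * probs)   # V(X̄)  = 0.75
        sd_xbar = np.sqrt(v_xbar)                        # SD(X̄) ≈ 0.8660

        print(e_xbar, v_xbar, sd_xbar)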

    The usual interpretation of these quantities applies (i.e., if sampling were done repeatedly [i.e., infinitely], the mean of the sample means would approach 2.0, the variance of the sample means would approach 0.75, and so on). Of slightly more importance is how these values relate to the population parameters themselves. A simple (!) computation will show that the population {0, 1, 2, 3, 4} has mean μ = 2.0, which is also the expected value of the sample mean. This result is so important that it has a special name:

    • Unbiasedness: The property of an estimator having its expected value equal to the population parameter it intends to estimate.

    Unbiasedness is a very important property in estimation. In general terms, when a statistic is unbiased, its possible values are well-centered around the population parameter it estimates (as opposed to generally over-estimating it or under-estimating it). And no, your instructor did not “rig” this example so the numbers would work out the way they did. The fact that the sample mean is an unbiased estimator of the population mean is no accident. It is, in fact, always true. And this is precisely why the sample mean is often preferred over the sample median when estimating the population mean, despite the fact that the sample mean is wholly non-robust. The sample median can be unbiased, but this is not always true. There are situations, however, when even though the sample median may be slightly biased (or, if you prefer the double negative, “not unbiased”), it may be preferable. Such cases include sample data that appear to have very heavy tails or some skewness. Such conditions often arise from the presence of unusually low or high values (i.e., outliers), which again speaks to the strong robustness of the sample median.


    The proof of the general unbiasedness of the sample mean is not terribly difficult, and so it will be demonstrated here.

    PROPOSITION: The sample mean is always an unbiased estimator of the population mean; that is, E(X̄) = μ always.

    Proof: The proof of this result essentially follows from the manipulation of the definition of the sample mean and the application of certain properties of expectation. Each step is immediately justified.

    E(X̄) = E( (1/n) Σᵢ₌₁ⁿ Xᵢ )    (replacing X̄ with its computational form)

         = (1/n) E( Σᵢ₌₁ⁿ Xᵢ )     (the constant 1/n can be factored out of the expectation)

         = (1/n) Σᵢ₌₁ⁿ E(Xᵢ)       (expectation is a linear operator, so it can be interchanged with summation)

         = (1/n) Σᵢ₌₁ⁿ μ           (by definition, E(Xᵢ) = μ, for any i)

         = nμ/n                    (summing the constant μ n times [since there are n data points] gives nμ)

         = μ                       (take a wild guess)

    If you do not understand the proof, do not worry, for it is not central to this course. Your instructor has the tendency to be ultra-meticulous (some prefer the term “anal”), but I hope you grasp the main idea here: the sample mean is the quintessential unbiased estimator of the population mean, hence its extreme popularity in practice.
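
    If the algebra feels abstract, a quick simulation makes the same point empirically. The sketch below (Python/NumPy; the number of replications and the seed are arbitrary choices of ours, not part of these notes) repeatedly draws samples of size n = 2 without replacement from the population {0, 1, 2, 3, 4} and shows that the average of the resulting sample means settles near the population mean μ = 2.0.

        import numpy as np

        rng = np.random.default_rng(0)
        population = np.array([0, 1, 2, 3, 4])   # population mean μ = 2.0

        # Draw many samples of size n = 2 without replacement; record each sample mean.
        sample_means = [rng.choice(population, size=2, replace=False).mean()
                        for _ in range(100_000)]

        print(np.mean(sample_means))   # ≈ 2.0, as unbiasedness predicts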

III. SAMPLING DISTRIBUTIONS FROM CONTINUOUS POPULATIONS

    With all that behind us, we should realize that the population we considered was extremely small, with only five elements. While this is terribly uncommon in practice, it is excellent for pedagogical purposes. Furthermore, the population (and all possible samples) was discrete, so we were able to construct tables of possible values of sample statistics and certain probabilities. However, as was the case with continuous random variables, such tabular constructions are impossible if we sample from a continuous population. Continuous populations are theoretically infinite, and thus so is the number of possible samples that can be taken from them. However, computer simulation is a powerful illustrative tool. We will demonstrate the derivation of a sampling distribution of the sample mean from various continuous populations using computer simulation. First, though, we state some results regarding the sampling distribution of the sample mean, which will be numerically verified as we progress. Note that for some random variable X, we denote the mean of X by E(X) = μ, and the standard deviation of X by SD(X) = σ.

    • E(X̄) = μ     (by unbiasedness)

    • SD(X̄) = σ/√n
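
    The second result is not derived in these notes, but a quick sketch shows where the √n comes from, under the assumption that the n observations are independent draws (e.g., sampling with replacement or from an effectively infinite population): V(X̄) = V( (1/n) Σᵢ₌₁ⁿ Xᵢ ) = (1/n²) Σᵢ₌₁ⁿ V(Xᵢ) = (1/n²)(nσ²) = σ²/n, and taking the square root gives SD(X̄) = σ/√n.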

Consider a normal population with mean μ = 0 and standard deviation σ = 3. Figure 1 shows the population. Suppose we are interested in deriving the sampling distribution of the sample mean using samples of size n = 2. We use a computer simulation to sample from such a normal distribution and compute the sample mean for each sample. Figure 2 shows the histogram of the distribution of the sample mean. To assess normality, a normal quantile plot (i.e., Q-Q plot) is also shown.


    Figure 1: The population (X), which is normal with mean μ = 0 and standard deviation σ = 3 (density plotted against X).

    Figure 2: [Left] Histogram of the distribution of the sample mean (n = 2), [Right] Normal Q-Q plot of the sample means.

    The histogram indicates that the distribution seems normal. The Q-Q plot shows a nearly straight line, which confirms that the distribution of sample means is, in fact, normal. It is clear that the mean of the distribution of sample means is 0 (due to unbiasedness). The simulation reports that the standard deviation of the distribution of the sample means is 2.0562. Theoretically, this should be σ/√n = 3/√2 ≈ 2.1213. These values are close enough. The longer the simulation is run, the closer the two numbers would be.
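
    The notes do not say which software produced these simulations, so the following is only a minimal stand-in written in Python/NumPy (replication count and seed are our own choices). It draws many samples of size n = 2 from a normal population with μ = 0 and σ = 3, computes the sample means, and compares their standard deviation with σ/√n. A histogram and a Q-Q plot of sample_means (e.g., via matplotlib and scipy.stats.probplot) would reproduce the shape of Figure 2.

        import numpy as np

        rng = np.random.default_rng(1)
        mu, sigma, n, reps = 0, 3, 2, 10_000

        # Each row is one sample of size n; take the mean of each row.
        sample_means = rng.normal(mu, sigma, size=(reps, n)).mean(axis=1)

        print(sample_means.mean())        # ≈ μ = 0
        print(sample_means.std(ddof=1))   # ≈ σ/√n = 3/√2 ≈ 2.1213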

Now, suppose we take samples of size n = 15. Figure 3 shows the results.

    Figure 3: [Left] Histogram of the distribution of the sample mean (n = 15), [Right] Normal Q-Q plot of the sample means.


    Again, the histogram and the Q-Q plot indicate normality. The simulation reports that the standard deviation of the distribution of the sample means is 0.7355. Theoretically, this should be σ/√n = 3/√15 ≈ 0.7746. Again, this is close enough for illustrative purposes. Note also that the standard deviation in the n = 15 case is smaller than in the n = 2 case.

Now, suppose we take samples of size n = 50. Figure 4 shows the results.

    Figure 4: [Left] Histogram of the distribution of the sample mean (n = 50), [Right] Normal Q-Q plot of the sample means.

    Again, the histogram and the Q-Q plot indicate normality. The simulation reports that the standard deviation of the distribution of the sample means is 0.4139. Theoretically, this should be σ/√n = 3/√50 ≈ 0.4243. Note that this standard deviation is smaller than in the n = 15 case and the n = 2 case.

Now, consider a uniform population with parameters a = 0 and b = 5. Figure 5 shows the population. Suppose we are interested in deriving the sampling distribution of the sample mean using samples of size n = 2. We use a computer simulation to sample from such a uniform distribution and compute the sample mean for each sample. Figure 6 shows the histogram of the distribution of the sample mean. To assess normality (to be discussed later), a normal quantile plot (i.e., Q-Q plot) is also shown.

    Figure 5: The population (X), which is uniform with parameters a = 0 and b = 5.

    The histogram indicates that the distribution seems non-normal. The Q-Q plot shows a curvature, which confirms that the distribution of sample means is non-normal. It is clear that the mean of the distribution of sample means is 2.5 (due to unbiasedness). The simulation reports that the standard deviation of the distribution of the sample means is 1.0162. Theoretically, this should be σ/√n = √[(5 − 0)²/12] / √2 ≈ 1.0206.
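
    The uniform version of the simulation sketch differs from the normal one only in the sampling line; again, this is our own Python/NumPy illustration, not the original simulation code.

        import numpy as np

        rng = np.random.default_rng(2)
        a, b, n, reps = 0, 5, 2, 10_000
        sigma = np.sqrt((b - a) ** 2 / 12)        # std. dev. of a Uniform(a, b) population

        sample_means = rng.uniform(a, b, size=(reps, n)).mean(axis=1)

        print(sample_means.mean())        # ≈ (a + b)/2 = 2.5
        print(sample_means.std(ddof=1))   # ≈ σ/√n ≈ 1.0206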


    Figure 6: [Left] Histogram of the distribution of the sample mean (n = 2), [Right] Normal Q-Q plot of the sample means.

Figures 7 and 8 show the same analysis using samples of size n = 15 and n = 50.

    Figure 7: [Left] Histogram of the distribution of the sample mean (n = 15), [Right] Normal Q-Q plot of the sample means.

    Figure 8: [Left] Histogram of the distribution of the sample mean (n = 50), [Right] Normal Q-Q plot of the sample means.

    It should be evident that as the sample size n is increased, the histogram looks increasingly normal, and the Q-Q plots echo this assertion, showing increasingly linear relationships between the sample quantiles and the theoretical normal quantiles. The significance of this is forthcoming. In the n = 15 case, the standard deviation of the distribution of the sample means is reported to be 0.3792 (theoretically, σ/√n = √[(5 − 0)²/12] / √15 ≈ 0.3727). In the n = 50 case, the standard deviation is reported to be 0.2114 (theoretically, √[(5 − 0)²/12] / √50 ≈ 0.2042).
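
    To tie the pieces together, here is one last illustrative sketch (Python/NumPy/SciPy, again our own script rather than the original) that repeats the uniform simulation for n = 2, 15, and 50. For each n it prints the simulated and theoretical standard deviations of the sample mean, along with the correlation coefficient from the normal Q-Q plot, which moves toward 1 as the distribution of sample means becomes more nearly normal.

        import numpy as np
        from scipy import stats

        rng = np.random.default_rng(3)
        a, b, reps = 0, 5, 10_000
        sigma = np.sqrt((b - a) ** 2 / 12)

        for n in (2, 15, 50):
            sample_means = rng.uniform(a, b, size=(reps, n)).mean(axis=1)
            # probplot returns the Q-Q points plus a least-squares fit (slope, intercept, r).
            (_, _), (_, _, r) = stats.probplot(sample_means, dist="norm")
            print(n, sample_means.std(ddof=1), sigma / np.sqrt(n), r)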

