Before dropping so-called insignificant variables from a regression, we should carefully consider (1) what we mean by “insignificant”, (2) why the variable was included in the first place, and (3) what potential costs arise if the variable is excluded.
To practitioners, insignificance arises if the p-value is greater than some “critical” level, like .05. In hypothesis testing, this critical level is the significance level, the probability of a Type I error: it is the risk of falsely rejecting the null hypothesis (Ho: the coefficient on the variable is zero). Lower significance levels mean lower probabilities of committing Type I errors. In a capital murder trial, for example, the judge typically instructs the jury to consider the evidence in such a way as to minimize this risk, that is, to minimize the risk of convicting an innocent defendant (the null is that the defendant is not guilty). In statistical tests, the 5% level sets a limit on the amount of risk we are willing to take. If the null is true, we’d reject it incorrectly in about five out of one hundred random samples, i.e., five percent of the time. If we set the level lower, say at one percent, then we are making it even harder to reject the null. That is, we are making it harder to commit a Type I error.
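The five-in-one-hundred claim can be checked with a small simulation. The sketch below is illustrative only: the function name, sample size, and number of trials are assumptions, and it treats the error standard deviation as known so that a normal test can stand in for the usual t-test. The dependent variable Z is generated independently of X, so the null of a zero coefficient is true in every sample and every rejection is a Type I error.

```python
import random
from statistics import NormalDist

def simulate_type1_rates(trials=5000, n=50, seed=0):
    """Share of samples in which a true null (coefficient = 0) is rejected."""
    rng = random.Random(seed)
    std_normal = NormalDist()
    rejections = {0.05: 0, 0.01: 0}
    for _ in range(trials):
        x = [rng.gauss(0, 1) for _ in range(n)]
        # The null is true by construction: z does not depend on x.
        z = [rng.gauss(0, 1) for _ in range(n)]
        x_bar = sum(x) / n
        z_bar = sum(z) / n
        sxx = sum((xi - x_bar) ** 2 for xi in x)
        sxz = sum((xi - x_bar) * (zi - z_bar) for xi, zi in zip(x, z))
        slope = sxz / sxx
        # Assumption: the error standard deviation (1.0) is known, so the
        # test statistic is standard normal under the null.
        std_err = 1.0 / sxx ** 0.5
        stat = slope / std_err
        p_value = 2 * (1 - std_normal.cdf(abs(stat)))
        for alpha in rejections:
            if p_value < alpha:
                rejections[alpha] += 1
    return {alpha: count / trials for alpha, count in rejections.items()}

rates = simulate_type1_rates()
print(rates)  # rejection shares at the 5% and 1% levels
```

The two shares should land near 0.05 and 0.01 respectively, and the 1% share can never exceed the 5% share, since any sample rejected at 1% is also rejected at 5%.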
So, the first question is, what significance level should we select? There is no single right answer to that, but suppose it is 5%. In a standard t-test, would we then conclude that the variable has no explanatory power in the model if the p-value is 0.05001? Probably not. Rather, we’d conclude that the probability of making a Type I error (rejecting the null of zero when we shouldn’t) is slightly higher than we’d otherwise like. But in no way does the decision to reject or not mean that the variable is somehow uninformative.
That’s lesson number one. Now consider why the variable is in there to begin with.
Suppose I have a scientific experiment for which the “theory” states that X and Y cause Z. I collect a sample of data on all three and regress Z on X and Y jointly. Suppose the estimated coefficient on X is deemed insignificant because its p-value is 0.0501. Theoretically, this can’t happen (it’s like rejecting gravity because the apple that fell from the tree onto Isaac Newton’s head didn’t fall fast enough). X is supposed to cause Z. But in random samples, there is measurement error and coefficients are necessarily measured imprecisely. Noisy data cause standard errors to rise and t-statistics to fall. So, should we reject X? Or should we say that the contribution X has in explaining Z is measured imprecisely?
That’s lesson number two. X was included because it should be there. Even rejecting the null that X’s coefficient is zero has risk, in this case a 0.0501 risk of Type I error. That this number is slightly greater than 0.05 is not a legitimate reason to exclude X. Which finally brings us to our
third consideration: what potential costs there are to excluding a so-called “insignificant” variable. Suppose we go ahead and exclude X. In the statistics literature, X is formally an erroneously excluded (omitted) variable and so the equation governing Z is now misspecified. Forget that though. Think instead of where X is. X now resides in the error term (it is now part of the unobserved, albeit random, part of Z). The error term in any regression is the composite of all excluded unobserved influences on Z. We usually assume this term is zero on average and uncorrelated with any of the included regressors. Now, it includes X too. Well, if X is correlated with Y, and there is no reason to think otherwise, then excluding X will make the estimate of the coefficient on Y biased. And that’s lesson number three.
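Lessons two and three can be seen together in a simulation. The sketch below is a hypothetical setup, not an analysis of any real data: the true coefficients (0.5 on X, 1.0 on Y), the 0.8 correlation between X and Y, the noise level, and all function names are assumptions chosen for illustration. It uses 1.96 as an approximate critical value in place of the exact t value. In each sample it fits Z on both X and Y, records whether X looks “insignificant”, and then refits Z on Y alone.

```python
import random

def centered(values):
    mean = sum(values) / len(values)
    return [v - mean for v in values]

def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))

def simulate_dropping_x(trials=4000, n=40, beta_x=0.5, beta_y=1.0,
                        noise_sd=3.0, seed=1):
    rng = random.Random(seed)
    long_by, short_by, x_insignificant = [], [], 0
    for _ in range(trials):
        y = [rng.gauss(0, 1) for _ in range(n)]
        # X has unit variance and correlation 0.8 with Y (assumed values).
        x = [0.8 * yi + 0.6 * rng.gauss(0, 1) for yi in y]
        # X truly matters (beta_x != 0), but the data are noisy.
        z = [beta_x * xi + beta_y * yi + rng.gauss(0, noise_sd)
             for xi, yi in zip(x, y)]
        xc, yc, zc = centered(x), centered(y), centered(z)
        sxx, syy, sxy = dot(xc, xc), dot(yc, yc), dot(xc, yc)
        sxz, syz = dot(xc, zc), dot(yc, zc)
        det = sxx * syy - sxy ** 2
        # Long regression: Z on X and Y with an intercept (normal equations).
        bx = (syy * sxz - sxy * syz) / det
        by = (sxx * syz - sxy * sxz) / det
        rss = sum((zi - bx * xi - by * yi) ** 2
                  for xi, yi, zi in zip(xc, yc, zc))
        se_bx = (rss / (n - 3) * syy / det) ** 0.5
        # 1.96 is a normal approximation to the t critical value.
        if abs(bx / se_bx) < 1.96:
            x_insignificant += 1
        long_by.append(by)
        # Short regression: Z on Y only, with X pushed into the error term.
        short_by.append(syz / syy)
    return {
        "mean_by_with_x": sum(long_by) / trials,
        "mean_by_without_x": sum(short_by) / trials,
        "share_x_insignificant": x_insignificant / trials,
    }

result = simulate_dropping_x()
print(result)
```

Under these assumptions X fails the significance test in most samples even though it truly belongs in the model. Keeping it leaves the average estimate on Y near its true value of 1.0, while dropping it pulls the average toward 1.0 + 0.5 × 0.8 = 1.4, which is the bias described above.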
In sum, first, the decision to call a variable insignificant is a subjective choice, and practitioners falsely think p-values are somehow objectively applied. They are not. Secondly, so-called insignificance is a statement about precision in estimation and not about causality. So there is absolutely no reason to drop variables merely because their coefficients are deemed insignificant. Thirdly, dropped variables become part of the error term, and to the extent these dropped variables are correlated with any included regressors, the coefficients on those regressors will be estimated with bias. This is a much more serious problem than leaving in regressors with imprecisely estimated coefficients.