
Multicollinearity in Multiple Linear Regression using Ordinary Least Squares

Prepared by Robert L. Andrews

The collinearity diagnostic measures (statistics) provide information that allows the analyst to detect when the regression independent variables are intercorrelated to a degree that may adversely affect the regression output. Interrelatedness of the independent variables creates what is termed an ill-conditioned X'X matrix. The process of inverting this matrix and calculating the regression coefficient estimates becomes unstable, increasing the likelihood of unreasonable estimates. Multicollinearity or collinearity measures of interest include:

• Bivariate Correlations measure the magnitude of the linear relationship between two variables. If two variables that are included as independent variables in a multiple regression analysis are highly correlated (positively or negatively), then these variables clearly violate the assumption of independence, making the Ordinary Least Squares (OLS) process unstable for estimating the regression coefficients. However, bivariate correlations alone may not detect linear relations among more than two variables, so one should also consider measures that involve multiple variables.

• Tolerance (a measure calculated for each independent variable) is 1 − R², where R² comes from a regression that uses the specific independent variable as the dependent variable, Y, and all of the other original independent variables as the predictors. Tolerance represents the proportion of that variable's variability that is not explained by the other independent variables in the regression model. When tolerance is close to 0, most of the variability of the variable can be explained by the other independent variables. Hence there is high multicollinearity due to that variable's linear relationship with the other independent variables. One effect is an increased variance of the OLS regression coefficient estimate for that variable.
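As a minimal sketch (not from the original document), tolerance can be computed by regressing each predictor on the remaining predictors and taking 1 − R². The data, variable names, and `tolerance` helper below are all hypothetical, with `x3` deliberately built as a near-linear combination of `x1` and `x2`:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
x3 = 2.0 * x1 + 0.5 * x2 + rng.normal(scale=0.01, size=n)  # nearly collinear with x1, x2
x4 = rng.normal(size=n)                                    # unrelated predictor
X = np.column_stack([x1, x2, x3, x4])

def tolerance(X, j):
    """1 - R^2 from regressing column j on the remaining columns (with intercept)."""
    y = X[:, j]
    others = np.delete(X, j, axis=1)
    A = np.column_stack([np.ones(len(y)), others])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ beta
    ss_res = resid @ resid
    ss_tot = (y - y.mean()) @ (y - y.mean())
    return ss_res / ss_tot  # residual (unexplained) share = 1 - R^2

tols = [tolerance(X, j) for j in range(X.shape[1])]
```

With this construction the tolerance of `x3` comes out near 0 (almost all of its variability is explained by `x1` and `x2`), while the tolerance of the unrelated `x4` stays near 1.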

• Variance Inflation Factor, VIF (a measure calculated for each variable), is simply the reciprocal of tolerance, 1/Tolerance. It measures the degree to which the interrelatedness of the variable with the other predictor variables inflates the variance of the estimated regression coefficient for that variable. Hence the square root of the VIF is the degree to which the collinearity has increased the standard error of that variable's coefficient. A high VIF value indicates high collinearity of that variable with the other independent variables and instability of the regression coefficient estimation process. There are no statistical tests for multicollinearity based on the tolerance or VIF measures. VIF = 1 is ideal, and many authors use VIF = 10 as a suggested upper limit for indicating a definite multicollinearity problem for an individual variable (VIF = 10 inflates the standard error by a factor of about 3.16). Some would consider VIF = 4 (doubling the standard error) as a minimum for indicating a possible multicollinearity problem.

Guideline for Interpreting VIF Values

Acceptable < 4 < Gray Area, Possible Collinearity Issue < 10 < Serious Collinearity Issue
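These relationships and the guideline above can be expressed directly; the helper names below are hypothetical, but the arithmetic (VIF as the reciprocal of tolerance, standard-error inflation as its square root) follows the definitions in the text:

```python
import math

def vif(tolerance):
    """Variance Inflation Factor: the reciprocal of tolerance."""
    return 1.0 / tolerance

def se_inflation(vif_value):
    """Factor by which collinearity inflates the coefficient's standard error."""
    return math.sqrt(vif_value)

def interpret_vif(vif_value):
    """Apply the guideline above: < 4 acceptable, 4-10 gray area, > 10 serious."""
    if vif_value < 4:
        return "acceptable"
    if vif_value < 10:
        return "gray area, possible collinearity issue"
    return "serious collinearity issue"
```

For example, VIF = 10 inflates the standard error by √10 ≈ 3.16, and VIF = 4 exactly doubles it, matching the thresholds quoted above.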

• Condition Index values are calculated from the eigenvalues of a rescaled crossproduct X'X matrix. Unlike the tolerance and VIF measures, which are for individual variables, condition index values are for individual dimensions/components/factors and measure the amount of variability each one accounts for in the rescaled crossproduct X'X matrix. The rescaled crossproduct X'X matrix is obtained by dividing each original value by the square root of the sum of squared original values for that column of the original matrix, including the column for the intercept. This yields an X'X matrix with ones on the main diagonal. Eigenvalues close to 0 indicate dimensions that explain little variability. A wide spread in the eigenvalues indicates an ill-conditioned crossproduct matrix, meaning there is a problem with multicollinearity. A condition index is calculated for each dimension/component/factor by taking the square root of the ratio of the largest eigenvalue to the eigenvalue for that dimension. A common rule of thumb is that a condition index over 15 indicates a possible multicollinearity problem, and a condition index over 30 suggests a serious multicollinearity problem. Since each dimension is a linear combination of the original variables, the analyst using OLS regression cannot merely exclude the problematic dimension. Hence a guide is needed to determine which variables are associated with the problematic dimension.

• Regression Coefficient Variance Decomposition Proportions provide a breakdown, or decomposition, of the variance associated with each regression coefficient, including the intercept. The breakdown is by the individual dimensions/components/factors and is reported as the proportion of the total variance for that coefficient associated with the respective dimension. The VIF measures are based on the fact that interrelatedness of a variable with the other predictor variables inflates the variance of the estimated regression coefficient for that variable. Since the only cure available to the analyst using standard OLS is the selection of the independent variables included in the model, some measure must be provided to pinpoint the variables that contribute to the instability of the estimation process associated with inverting the crossproduct X'X matrix. Since interrelatedness must involve more than one variable, one looks at the dimensions with a high condition index to see whether the proportion of variance is high for two or more variables.

Hence Belsley, Kuh and Welsch propose that degrading collinearity exists when one observes at least one dimension with both

1. a high condition index (a generally accepted guide is a value greater than 30), and

2. high variance decomposition proportions for two or more estimated regression coefficients (a generally accepted guide is a value greater than .5).
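The proportions can be computed via the singular value decomposition of the column-rescaled design matrix: the variance of coefficient j is proportional to the sum over dimensions k of v_jk²/s_k², and each term's share of that sum is the decomposition proportion. The following is an illustrative sketch on hypothetical data, not the original authors' code:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 200
x1 = rng.normal(size=n)
x2 = 3.0 * x1 + rng.normal(scale=0.01, size=n)  # nearly collinear with x1
X = np.column_stack([np.ones(n), x1, x2])       # intercept, x1, x2

# Rescale columns to unit length, then take the SVD.
Xs = X / np.sqrt((X ** 2).sum(axis=0))
U, s, Vt = np.linalg.svd(Xs, full_matrices=False)

cond_index = s.max() / s                      # one condition index per dimension
phi = (Vt.T ** 2) / s ** 2                    # phi[j, k]: coefficient j, dimension k
props = phi / phi.sum(axis=1, keepdims=True)  # variance decomposition proportions

worst = np.argmax(cond_index)                 # dimension with the highest index
# Degrading collinearity per Belsley, Kuh & Welsch: a high condition index AND
# proportions above .5 for two or more coefficients on that same dimension.
flagged = cond_index[worst] > 30 and (props[:, worst] > 0.5).sum() >= 2
```

Here the high-index dimension carries most of the coefficient variance for both `x1` and `x2`, so the Belsley-Kuh-Welsch criterion flags the pair, while the intercept is left out of the diagnosis.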

No clear prescription exists for the best way to eliminate a multicollinearity problem. One can perform a factor analysis using a principal component solution on the set of original variables and calculate factor scores for the respective dimensions. A linear regression can then be performed on the non-problematic dimensions (those with low condition index values). However, interpreting the regression coefficients becomes very difficult, since each dimension/component/factor is a linear combination of all of the variables.

Generally the analyst looks to eliminate the variables that are problematic. The analyst has multiple objectives in building a good model to describe the linear relationship between a dependent variable and a set of independent variables, and eliminating collinearity problems is just one of those objectives. Hence the analyst may want to try multiple models to see which one seems best, given these multiple objectives. Statistical packages provide automated selection methods such as Forward, Backward and Stepwise that can be helpful in addition to trying specific models that make sense to the analyst.
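The principal-component approach mentioned above can be sketched as follows; this is a hypothetical illustration (the data, the eigenvalue cutoff, and the variable names are all assumptions), keeping only the well-conditioned components as regressors:

```python
import numpy as np

rng = np.random.default_rng(3)
n = 200
x1 = rng.normal(size=n)
x2 = 3.0 * x1 + rng.normal(scale=0.01, size=n)   # nearly collinear predictors
y = 1.0 + 2.0 * x1 + rng.normal(scale=0.5, size=n)

X = np.column_stack([x1, x2])
Xc = X - X.mean(axis=0)                          # center the predictors

# Principal components of the predictors, largest eigenvalue first.
eigvals, eigvecs = np.linalg.eigh(Xc.T @ Xc / (n - 1))
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

keep = eigvals > 1e-3 * eigvals[0]               # drop near-zero (problematic) dimensions
scores = Xc @ eigvecs[:, keep]                   # component scores for kept dimensions

# OLS of y on the retained component scores (with an intercept).
A = np.column_stack([np.ones(n), scores])
beta, *_ = np.linalg.lstsq(A, y, rcond=None)
resid = y - A @ beta
r2 = 1.0 - resid @ resid / ((y - y.mean()) @ (y - y.mean()))
```

The near-zero component (the unstable direction shared by `x1` and `x2`) is discarded, and the regression on the single remaining component still fits well; the cost, as noted above, is that the retained coefficient describes a linear combination of all the original variables rather than any one of them.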

Information sources: Multivariate Data Analysis (7th edition) by Hair, Black, Babin, Anderson & Tatham.

Regression Diagnostics: Identifying Influential Data and Sources of Collinearity (1980) by Belsley, Kuh & Welsch.
