
    Multicollinearity in Multiple Linear Regression using Ordinary Least Squares

    Prepared by Robert L. Andrews

    The collinearity diagnostic measures (statistics) provide information that allows the analyst to detect when the regression independent variables are intercorrelated to a degree that may adversely affect the regression output. Interrelatedness of the independent variables creates what is termed an ill-conditioned X’X matrix. The process of inverting the matrix and calculating the regression coefficient estimates becomes unstable, increasing the likelihood of unreasonable estimates. Multicollinearity or collinearity measures of interest include:

    - Bivariate Correlations measure the magnitude of the linear relationship between two

    variables. If two variables that are included as independent variables in a multiple regression analysis are highly correlated (positively or negatively), then these variables clearly violate the assumption of independence, making the Ordinary Least Squares (OLS) process unstable for estimating the regression coefficients. However, bivariate correlations alone may not detect linear relations involving more than two variables, so one should also consider measures for multiple variables.

    - Tolerance (a measure calculated for each independent variable) is 1 − R², where R² is obtained by regressing that independent variable (as the dependent, Y) on all of the other original independent variables (as X). Tolerance represents the proportion of the variable’s variability that is not explained by the other independent variables in the regression model. When tolerance is close to 0, most of the variability for the variable can be explained by other independent variables. Hence there is high multicollinearity due to that variable’s linear relationship with other independents. One effect is the increased variance of the OLS regression coefficient estimate for that variable.
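The tolerance calculation described above can be sketched numerically. The data below are illustrative, not from the original text (x3 is deliberately constructed to be nearly a linear combination of x1 and x2, so its tolerance should be near 0); this is a minimal sketch assuming NumPy is available.

```python
import numpy as np

# Illustrative data: x3 is nearly a linear combination of x1 and x2
rng = np.random.default_rng(0)
n = 200
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
x3 = x1 + x2 + rng.normal(scale=0.05, size=n)
X = np.column_stack([x1, x2, x3])

def tolerance(X, j):
    """Tolerance of column j: the 1 - R^2 from regressing X[:, j]
    on the remaining columns (with an intercept)."""
    y = X[:, j]
    A = np.column_stack([np.ones(len(y)), np.delete(X, j, axis=1)])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ beta
    sst = (y - y.mean()) @ (y - y.mean())
    return (resid @ resid) / sst  # residual SS / total SS = 1 - R^2

for j in range(X.shape[1]):
    tol = tolerance(X, j)
    print(f"x{j + 1}: tolerance = {tol:.4f}, VIF = {1 / tol:.1f}")
```

Here x3’s tolerance comes out close to 0, flagging the built-in collinearity, while the same function also yields the VIF of the next section as its reciprocal.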


    - Variance Inflation Factor, VIF (a measure calculated for each variable), is simply the reciprocal of tolerance, 1/Tolerance. It measures the degree to which the interrelatedness of the variable with other predictor variables inflates the variance of the estimated regression coefficient for that variable. Hence the square root of the VIF is the degree to which the collinearity has increased the standard error for that variable’s coefficient. A high VIF value indicates high collinearity of that variable with other independents and instability of the regression coefficient estimation process. There are no statistical tests for multicollinearity using the tolerance or VIF measures. VIF = 1 is ideal, and many authors use VIF = 10 as a suggested upper limit for indicating a definite multicollinearity problem for an individual variable (VIF = 10 inflates the standard error by 3.16). Some would consider VIF = 4 (doubling the standard error) as a minimum for indicating a possible multicollinearity problem.

    Guideline for Interpreting VIF Values

    Acceptable < 4 < Gray Area, Possible Collinearity Issue < 10 < Serious Collinearity Issue
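As a check on the square-root-of-VIF interpretation above, the following sketch (illustrative data, not from the original text) compares a coefficient’s actual OLS standard error to the standard error it would have if its predictor were uncorrelated with the others; the ratio matches the square root of the VIF.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 500
x1 = rng.normal(size=n)
x2 = 0.9 * x1 + np.sqrt(1 - 0.9**2) * rng.normal(size=n)  # correlated with x1
X = np.column_stack([np.ones(n), x1, x2])
y = 1 + 2 * x1 - x2 + rng.normal(size=n)

# Actual OLS standard error of the x1 coefficient
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
resid = y - X @ beta
sigma2 = resid @ resid / (n - X.shape[1])
cov = sigma2 * np.linalg.inv(X.T @ X)
se_x1 = np.sqrt(cov[1, 1])

# VIF of x1 from the auxiliary regression of x1 on x2 (with intercept)
A = np.column_stack([np.ones(n), x2])
b, *_ = np.linalg.lstsq(A, x1, rcond=None)
r = x1 - A @ b
r2 = 1 - (r @ r) / ((x1 - x1.mean()) @ (x1 - x1.mean()))
vif = 1 / (1 - r2)

# Standard error x1 would have with an uncorrelated design
se_ideal = np.sqrt(sigma2 / ((x1 - x1.mean()) @ (x1 - x1.mean())))
print(round(se_x1 / se_ideal, 3), round(np.sqrt(vif), 3))
```

The two printed values agree, which is the algebraic identity behind the "VIF = 10 inflates the standard error by 3.16" remark.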

    - Condition Index values are calculated from the eigenvalues of a rescaled crossproduct X’X matrix. Unlike tolerance and VIF measures, which are for individual variables, condition index values are for individual dimensions/components/factors and reflect the amount of variability each one accounts for in the rescaled crossproduct X’X matrix. The rescaled crossproduct X’X matrix values are obtained by dividing each original value by the square root of the sum of squared original values for that column in the original matrix, including those for the intercept. This yields an X’X matrix with ones on the main diagonal. Eigenvalues close to 0 indicate dimensions that explain little variability. A wide spread in eigenvalues indicates an ill-conditioned crossproduct matrix, meaning there is a problem with multicollinearity. A condition index is calculated for each dimension/component/factor by taking the square root of the ratio of the largest eigenvalue to the eigenvalue for that dimension. A common rule of thumb is that a condition index over 15 indicates a possible multicollinearity problem and a condition index over 30 suggests a serious multicollinearity problem. Since each dimension is a linear combination of the original variables, the analyst using OLS regression is not able to merely exclude the problematic dimension. Hence a guide is needed to determine which variables are associated with the problematic dimension.
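The condition index computation described above can be sketched as follows. The two-variable design is illustrative (x2 is nearly equal to x1), and the intercept column is included in the rescaling, as the text specifies.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 100
x1 = rng.normal(size=n)
x2 = x1 + rng.normal(scale=0.01, size=n)   # nearly collinear with x1
X = np.column_stack([np.ones(n), x1, x2])  # intercept column included

# Rescale each column to unit length so X'X has ones on the diagonal
Xs = X / np.sqrt((X**2).sum(axis=0))
eig = np.linalg.eigvalsh(Xs.T @ Xs)

# Condition index per dimension: sqrt(largest eigenvalue / eigenvalue)
cond_index = np.sqrt(eig.max() / eig)
print(np.round(np.sort(cond_index), 1))
```

With a nearly collinear pair, the largest condition index here far exceeds the rule-of-thumb threshold of 30.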

    - Regression Coefficient Variance Decomposition Proportions provide a breakdown or decomposition of the variance associated with each regression coefficient, including the intercept. This breakdown is according to the individual dimensions/components/factors and is reported as the proportion of the total variance for that coefficient that is associated with the respective dimension. The VIF measures are based on the fact that interrelatedness of a variable with other predictor variables inflates the variance of the estimated regression coefficient for that variable. Since the only cure available to the analyst using standard OLS is the selection of independent variables included in the model, some measure(s) must be provided to better pinpoint the variables that contribute to the instability of the estimation process associated with inverting the crossproduct X’X matrix. Since interrelatedness must involve more than one variable, one looks to the dimensions with a high index value to see whether the proportion of variance is high for two or more variables.

    Hence Belsley, Kuh and Welsch propose that degrading collinearity exists when one observes at least one dimension with both

    1. a high condition index (a generally accepted guide is a value greater than 30), and

    2. high variance decomposition proportions for two or more estimated regression coefficients (a generally accepted guide is a value greater than .5).
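The combined check can be sketched using the singular value decomposition of the rescaled design matrix, following the Belsley–Kuh–Welsch construction; the data are illustrative (x2 nearly equal to x1), not from the original text.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 100
x1 = rng.normal(size=n)
x2 = x1 + rng.normal(scale=0.01, size=n)   # nearly collinear pair
X = np.column_stack([np.ones(n), x1, x2])
Xs = X / np.sqrt((X**2).sum(axis=0))       # unit-length columns

# Xs = U diag(mu) V'; Var(b_j) is proportional to sum_k V[j,k]^2 / mu_k^2
U, mu, Vt = np.linalg.svd(Xs, full_matrices=False)
phi = Vt.T**2 / mu**2                      # rows: coefficients, cols: dimensions
props = phi / phi.sum(axis=1, keepdims=True)
cond_index = mu.max() / mu                 # same as sqrt of eigenvalue ratio

k = cond_index.argmax()                    # most problematic dimension
flagged = props[:, k] > 0.5                # coefficients tied to that dimension
print(round(cond_index[k], 1), np.round(props[:, k], 3))
```

The problematic dimension has a condition index over 30, and both x1’s and x2’s coefficients draw more than half their variance from it, satisfying both parts of the criterion.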

    No clear prescription exists for the best way to eliminate a multicollinearity problem. One can perform a factor analysis using a principal component solution on the set of original variables and calculate factor scores for the respective dimensions. Then a linear regression can be performed on the non-problematic dimensions (those with low condition index values). However, interpreting the regression coefficients becomes very difficult, since each dimension/component/factor is a linear combination of all of the variables. Generally the analyst looks to eliminate variables that are problematic. The analyst has multiple objectives in building a good model to describe the linear relationship between a dependent variable and a set of independent variables. Eliminating collinearity problems is just one of those objectives. Hence the analyst may want to try multiple models to see which one seems best, given these multiple objectives. Statistical packages provide automated selection methods such as Forward, Backward & Stepwise that can be helpful, in addition to trying specific models that make sense to the analyst.
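The principal-component workaround mentioned above can be illustrated with a short sketch (illustrative data and threshold, not from the original text): the high-condition-index dimension is dropped and the regression is run on the remaining component scores. Note that the retained component mixes both variables, which is what makes its coefficient hard to interpret.

```python
import numpy as np

rng = np.random.default_rng(4)
n = 200
x1 = rng.normal(size=n)
x2 = x1 + rng.normal(scale=0.02, size=n)   # nearly collinear pair
y = 3 * x1 + rng.normal(size=n)
X = np.column_stack([x1, x2])

# Principal components of the centered predictors
Xc = X - X.mean(axis=0)
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
scores = Xc @ Vt.T                          # component scores
keep = (s.max() / s) < 30                   # retain low-condition-index dimensions

# Regress y on the retained component(s) plus an intercept
Z = np.column_stack([np.ones(n), scores[:, keep]])
g, *_ = np.linalg.lstsq(Z, y, rcond=None)
print(keep, np.round(g, 2))
```

Here only the first component survives the threshold, so the unstable near-null dimension never enters the inversion.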

Information sources:

    Multivariate Data Analysis (7th edition) by Hair, Black, Babin, Anderson & Tatham.

    Regression Diagnostics: Identifying Influential Data and Sources of Collinearity (1980) by Belsley, Kuh & Welsch.
