Regression models are kind of like functions (y = ax + b) that can predict the value of a variable. The simplest models only take one variable as input, e.g. one that tries to predict your monthly electricity bill based on your household size. Many models use more than one variable. For example, a muffin baker might try to predict their monthly sales based on the price and size of their muffins.
When a model involves multiple input variables, some of those variables may express roughly the same thing. That muffin baker may have included various variables related to size, like radius, volume and weight, which – assuming that different-sized muffins are similarly shaped – are likely to be highly correlated with each other.
This is called multicollinearity, and it is something that you don’t want too much of, as it results in models that are less accurate. It also becomes less clear how much each variable contributes to the outcome of predictions: because various variables are almost interchangeable, their relative values may vary dramatically and might as well be arbitrary!
The variance inflation factor (VIF) is a diagnostic measure that can be used to identify variables that have a strong linear relationship with one or more other variables. High VIF values are thought to be problematic, and consequently most software engineering studies that use regression rid their models of “problematic” variables that have VIF values above a certain predefined threshold.
This paper discusses some of the problems that may arise when researchers blindly act on rules of thumb about “bad” VIF values.
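To make the definition concrete: the VIF of a variable is 1 / (1 − R²), where R² comes from regressing that variable on all the other input variables. The sketch below computes this with NumPy for made-up muffin data. The `vif` helper, the variable names, and all the numbers are illustrative assumptions, not something taken from the paper.

```python
import numpy as np

def vif(X):
    """Return the VIF of each column of X (rows are observations)."""
    X = np.asarray(X, dtype=float)
    n, k = X.shape
    result = []
    for i in range(k):
        target = X[:, i]
        others = np.delete(X, i, axis=1)
        # Regress column i on all remaining columns, with an intercept.
        A = np.column_stack([np.ones(n), others])
        coef, *_ = np.linalg.lstsq(A, target, rcond=None)
        residuals = target - A @ coef
        r2 = 1.0 - residuals.var() / target.var()
        result.append(1.0 / (1.0 - r2))
    return np.array(result)

# Hypothetical muffin data: three size-related variables plus price.
rng = np.random.default_rng(0)
radius = rng.uniform(3.0, 5.0, size=200)               # cm
volume = radius**3 + rng.normal(0.0, 2.0, size=200)    # roughly cubic in radius
weight = 0.5 * volume + rng.normal(0.0, 2.0, size=200)
price = rng.uniform(1.0, 4.0, size=200)                # unrelated to size

vifs = vif(np.column_stack([radius, volume, weight, price]))
for name, value in zip(["radius", "volume", "weight", "price"], vifs):
    print(f"{name}: VIF = {value:.1f}")
```

With data like this, the three size-related variables should come out with VIFs well above 10, while price should stay close to 1.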
The introduction of this article suggested that multicollinearity reduces the stability of your regression coefficients. While true, it is far from the only factor that can do so.
More data in the form of large sample sizes can often reduce the variance of coefficients, which gives you more confidence that the weights assigned to each variable are accurate.
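One way to see why: in ordinary least squares, the variance of the ith coefficient can be written as VIF × σ² / Σ(xᵢ − x̄ᵢ)², and the sum in the denominator grows as observations are added. The simulation below is a minimal sketch of that effect, not the paper's own example. It keeps the collinearity between two inputs roughly constant and only changes the sample size. The function name, the true coefficients, and the correlation of 0.95 are all assumptions chosen for illustration.

```python
import numpy as np

def coefficient_std_error(n, rho, seed=0):
    """Fit y = b0 + b1*x1 + b2*x2 by OLS; return (std. error of b1, VIF of x1)."""
    rng = np.random.default_rng(seed)
    x1 = rng.normal(size=n)
    # x2 is built to have a population correlation of rho with x1.
    x2 = rho * x1 + np.sqrt(1.0 - rho**2) * rng.normal(size=n)
    y = 1.0 + 2.0 * x1 + 0.5 * x2 + rng.normal(size=n)

    X = np.column_stack([np.ones(n), x1, x2])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    residuals = y - X @ beta
    sigma2 = residuals @ residuals / (n - X.shape[1])  # residual variance
    cov = sigma2 * np.linalg.inv(X.T @ X)              # covariance of the estimates
    se_b1 = np.sqrt(cov[1, 1])

    # With only two inputs, the VIF of x1 is 1 / (1 - r^2).
    r = np.corrcoef(x1, x2)[0, 1]
    return se_b1, 1.0 / (1.0 - r**2)

small = coefficient_std_error(n=50, rho=0.95)
large = coefficient_std_error(n=5000, rho=0.95)
print(f"n=50:   SE(b1) = {small[0]:.3f}, VIF = {small[1]:.1f}")
print(f"n=5000: SE(b1) = {large[0]:.3f}, VIF = {large[1]:.1f}")
```

Both runs should report a VIF in the neighbourhood of 10, yet the standard error of the coefficient should be far smaller for the larger sample.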
The paper illustrates this using the following example:
The baseline model looks pretty alright: the ith regression coefficient is statistically significant, a VIF of 1.25 is well below the commonly used threshold values (4 and 10), and the sample size also seems adequate. The R²y is not great, but few researchers would really make a problem out of it.
The result for “Comparison 1” shows a VIF of 20.00, which many researchers would consider to be greatly upsetting. It would be strange to question these results, however, as every other number in this row is an improvement over the baseline: the sample size is much larger, the variance of the coefficient is lower, and the t-value suggests that we should be more confident about the value of the ith coefficient than in the baseline version.
The third row is a more extreme version, with an even higher VIF – but it also shows further improved accuracy and confidence intervals.
In other words: context matters. It is inappropriate to automatically question results only because the VIF is greater than 4 or 10, when t-values and confidence intervals are fine.
Many researchers work with a null hypothesis. A hypothesis can be accepted or rejected based not only on the results of some experiment, but also on some predefined threshold value, e.g. the level of statistical significance. The criteria for rejection of a hypothesis are not the same as those for a non-rejection.
In a similar vein, the VIF should only really be important when the effect of multicollinearity on the ith regression coefficient is the subject of the study. If a regression coefficient is statistically significant, it is just that: statistically significant – even with (or despite) a lot of multicollinearity.
When faced with inflation of the variance of a regression coefficient due to multicollinearity, most researchers deal with it by attempting to reduce multicollinearity.
A commonly used solution is to eliminate one or more independent variables that are highly correlated with the other independent variables. The consequence of such a “solution” is that the model often no longer corresponds with the theory that one was going to test. In other words, it’s not theoretically well motivated.
Having said that, there are cases in which elimination or combination of highly correlated variables is acceptable because they can be theoretically motivated. One such example was discussed in the introduction: the radius, volume, and weight of muffins all essentially measure the same thing, so there is no need to include all three of them in the model.
Ridge regression is another method, but my statistics textbook doesn’t cover it and I haven’t ever seen it used in a software engineering paper, so I’m just going to pretend it doesn’t exist.
A high VIF is not necessarily problematic on its own
Only eliminate or combine highly collinear variables when it makes sense