When the regression work-horse looks sick

Regression, in particular simple bivariate and multiple regression (and, to a much lesser extent, multivariate regression, which is a “multivariate general linear model” procedure), is the work-horse of many researchers. For some, it is a horse worked to the bone when other statistical (or even non-statistical) procedures would have done a better job! Many other statistical procedures are also based on linear regression models, often without us realising it; ANOVA, for example, can be expressed as a regression model.
Linear regression models, in all their many forms, are at the core of many statistical analyses. Often the results don’t look right, at which point it is necessary to hunt for the culprits within the data and the analytical procedures. Let’s look specifically at simple regression (one predictor) and multiple regression (several predictors), though many of these suggestions also apply to other procedures based on linear regression.
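The remark that ANOVA can be written as a regression can be illustrated with a minimal sketch. It assumes two groups with made-up scores and a hand-rolled helper, `simple_ols`, which is hypothetical and not part of any library. Regressing the outcome on a 0/1 dummy for group membership gives an intercept equal to the mean of the reference group and a slope equal to the difference between the two group means, which is the quantity a two-group one-way ANOVA tests.

```python
from statistics import mean


def simple_ols(x, y):
    """Return (intercept, slope) of the least-squares line y = a + b*x."""
    mx, my = mean(x), mean(y)
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return my - slope * mx, slope


# Made-up scores for two groups (illustration only, not real data).
group_a = [4.0, 5.0, 6.0, 5.0]
group_b = [7.0, 8.0, 9.0, 8.0]

# Dummy code group membership: 0 = group A (reference), 1 = group B.
d = [0] * len(group_a) + [1] * len(group_b)
y = group_a + group_b

intercept, slope = simple_ols(d, y)
mean_a, mean_b = mean(group_a), mean(group_b)

print("intercept:", intercept, "mean of group A:", mean_a)
print("slope:", slope, "difference in means:", mean_b - mean_a)
```

With more than two groups the same idea would need one dummy per non-reference group and a multiple regression, which this sketch does not cover.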

The first step is to check the data for violations of the assumptions of linear regression. While such violations can lead to obvious problems with the results, they can also lead to less obvious ones, such as bias. In fact, if our data meet these assumptions, the estimated coefficients (parameters) of the regression equation are most likely unbiased.
The key assumptions of linear regression are listed below and should be checked carefully before the final interpretation of the results:
  1. Both the outcome and the predictor variables should be measured on at least an interval scale, although binary predictors can also be included if they are dummy coded.
  2. Unbounded data: If a rating scale (e.g. 10-point) is used, check that nothing constrains the variability of the outcome, e.g. responses that vary only between 1 and 5, which is indicative of a poor choice of scale. An alternative to bounded rating scales is unbounded scales, which allow respondents to express their feelings without imposing a specific “bounded” scale.
  3. Normality of the residuals, and error terms that are uncorrelated with the predictors (residuals are expected to correlate with the outcome itself, so that is not a violation)
  4. Independent observations
  5. Predictors must not have zero variance
  6. Predictors should be uncorrelated with “external variables”
  7. A linear relationship should exist between the outcome and each predictor variable
  8. Homoscedasticity: the residuals should have a constant variance across the range of the predictors (no heteroscedasticity)
  9. Independence of error terms
  10. No multicollinearity (for multiple regression)
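A few of the assumptions above can be checked numerically. What follows is only a rough sketch using the Python standard library, not a substitute for the diagnostics in a statistics package. The data are made up and deliberately built so that the error alternates in sign and grows with the predictor, and all helper names (`pearson_r`, `fit_simple`, `durbin_watson`, `vif_two_predictors`) are hypothetical. It covers zero variance, heteroscedasticity (via the correlation between fitted values and absolute residuals), independence of errors (via the Durbin-Watson statistic) and multicollinearity (via the variance inflation factor for the two-predictor case). Normality of the residuals is not covered here and would usually be judged from a Q-Q plot or a formal test.

```python
from statistics import mean, pvariance


def pearson_r(a, b):
    """Pearson correlation between two equal-length sequences."""
    ma, mb = mean(a), mean(b)
    sab = sum((p - ma) * (q - mb) for p, q in zip(a, b))
    saa = sum((p - ma) ** 2 for p in a)
    sbb = sum((q - mb) ** 2 for q in b)
    return sab / (saa * sbb) ** 0.5


def fit_simple(x, y):
    """Least-squares fit of y = a + b*x; returns (a, b, fitted, residuals)."""
    mx, my = mean(x), mean(y)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    sxx = sum((xi - mx) ** 2 for xi in x)
    b = sxy / sxx
    a = my - b * mx
    fitted = [a + b * xi for xi in x]
    residuals = [yi - fi for yi, fi in zip(y, fitted)]
    return a, b, fitted, residuals


def durbin_watson(residuals):
    """Durbin-Watson statistic (0 to 4); values near 2 suggest no
    first-order autocorrelation in the residuals."""
    num = sum(
        (residuals[i] - residuals[i - 1]) ** 2 for i in range(1, len(residuals))
    )
    den = sum(e ** 2 for e in residuals)
    return num / den


def vif_two_predictors(x1, x2):
    """Variance inflation factor when there are exactly two predictors."""
    r = pearson_r(x1, x2)
    return 1.0 / (1.0 - r ** 2)


# Made-up data: the error alternates in sign and grows with x on purpose.
x = [float(i) for i in range(1, 11)]
y = [2.0 + 3.0 * xi + (-1) ** i * 0.2 * xi for i, xi in enumerate(x, start=1)]
x2 = [1.1, 1.9, 3.2, 3.8, 5.1, 6.2, 6.8, 8.1, 9.2, 9.9]  # near-copy of x

_, _, fitted, residuals = fit_simple(x, y)

# Assumption 5: the predictor must vary.
has_variance = pvariance(x) > 0
# Assumption 8: a strong correlation between fitted values and absolute
# residuals hints at heteroscedasticity.
spread_r = pearson_r(fitted, [abs(e) for e in residuals])
# Assumption 9: values far from 2 hint at autocorrelated errors.
dw = durbin_watson(residuals)
# Assumption 10: a large VIF hints at multicollinearity.
vif = vif_two_predictors(x, x2)

print("predictor varies:", has_variance)
print("corr(fitted, |residual|):", round(spread_r, 3))
print("Durbin-Watson:", round(dw, 3))
print("VIF for x and x2:", round(vif, 1))
```

On these constructed data the spread of the residuals grows with the fitted values, the Durbin-Watson statistic sits well above 2 (negative autocorrelation, because the errors alternate in sign) and the VIF is far above the commonly quoted rule-of-thumb threshold of 10. Real data will rarely be this clear-cut, so the numbers should be read alongside residual plots.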
Other problems which may be to blame:
  1. Misspecification of the regression model: While some misspecifications can be detected by checking the data assumptions (such as whether a linear relationship exists), others are harder to detect. Make sure the model is neither over- nor under-specified. An over-specified model includes redundant predictors (which multicollinearity may indicate), and an under-specified model omits important predictors (be careful of “omitted-variable bias”).
  2. Check whether any extraneous variables have an effect on the outcome variable and should be controlled for as covariates or as “blocking factors”.
  3. Are there any interaction (moderation) effects among the predictors, or any mediation effects, that have not been accounted for?
  4. Check for some of the more obvious problems, such as the handling of missing data and the input specifications in your statistics program.
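The omitted-variable bias mentioned in the first point can be made concrete with a small sketch. It assumes made-up, noise-free data in which the outcome is exactly 1 + 2*x1 + 3*x2 and the two predictors are correlated; the helpers `fit_simple` and `residuals` are hypothetical. Leaving x2 out of the model pushes the slope of x1 away from its true value of 2 by 3 times the slope of x2 on x1. The partial slope of x1 is recovered here by first removing x2 from both x1 and the outcome (the Frisch-Waugh-Lovell approach), which avoids the need for a matrix library.

```python
from statistics import mean


def fit_simple(x, y):
    """Return (intercept, slope) of the least-squares line y = a + b*x."""
    mx, my = mean(x), mean(y)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    sxx = sum((xi - mx) ** 2 for xi in x)
    slope = sxy / sxx
    return my - slope * mx, slope


def residuals(x, y):
    """Residuals of y after a simple regression on x."""
    a, b = fit_simple(x, y)
    return [yi - (a + b * xi) for xi, yi in zip(x, y)]


# Made-up predictors that are positively correlated with each other.
x1 = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
x2 = [2.0, 1.0, 4.0, 3.0, 6.0, 5.0]
# Noise-free outcome, so the true coefficients are known: 2 for x1, 3 for x2.
y = [1.0 + 2.0 * a + 3.0 * b for a, b in zip(x1, x2)]

# Under-specified model: x2 is omitted.
_, short_slope = fit_simple(x1, y)

# Slope of the omitted predictor on the included one.
_, delta = fit_simple(x1, x2)

# Full model, via partialling x2 out of both x1 and y.
_, full_slope = fit_simple(residuals(x2, x1), residuals(x2, y))

print("slope of x1 with x2 omitted:", round(short_slope, 4))
print("expected biased slope (2 + 3 * delta):", round(2.0 + 3.0 * delta, 4))
print("slope of x1 with x2 controlled for:", round(full_slope, 4))
```

The same residual-on-residual step is one way to think about what “controlling for” a covariate does in the second point above.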

The above list is by no means comprehensive; it only covers what I believe are among the most important reasons why the regression horse sometimes does not look too well.


Related Posts:
Review the individual data assumptions discussed across several posts.