As a follow-up to the post about univariate outliers, here are a few ways to identify bivariate and multivariate outliers:
First, do the univariate outlier checks. With those findings in mind (and with no immediate remedial action), follow some or all of these bivariate or multivariate outlier checks, depending on the type of analysis you are planning.
For one-way ANOVA, we can use the GLM (univariate) procedure to save standardised or studentised residuals. Then do a normal probability plot of these residual values: if the residuals have a normal distribution, they will fall along the diagonal straight line. Any serious deviations from this diagonal line indicate possible outlier cases.
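Outside SPSS, the same check can be sketched in Python (assuming NumPy and SciPy, with made-up group data): standardise the residuals and compute the normal probability plot coordinates with `scipy.stats.probplot`, which also returns the correlation `r` between the coordinates — close to 1 when the residuals are roughly normal.

```python
import numpy as np
from scipy import stats

# Hypothetical one-way ANOVA data: three groups with different means
rng = np.random.default_rng(42)
groups = [rng.normal(loc=m, scale=1.0, size=20) for m in (10, 12, 15)]

# Residuals = each observation minus its own group mean
residuals = np.concatenate([g - g.mean() for g in groups])

# Standardise (z-scores) and get normal probability plot coordinates;
# r close to 1 suggests the residuals are approximately normal
z = (residuals - residuals.mean()) / residuals.std(ddof=1)
(osm, osr), (slope, intercept, r) = stats.probplot(z, dist="norm")
print(f"probability-plot correlation r = {r:.3f}")
```

With matplotlib, plotting `osr` against `osm` plus the fitted line gives the usual normal probability plot; points drifting off the diagonal are the candidate outliers.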
Correlation: In addition to univariate outlier detection, a scatter plot is an easy way to spot outliers visually. A large difference between Pearson’s correlation and Spearman’s rho may also indicate the presence of serious outliers.
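The Pearson-versus-Spearman comparison is easy to demonstrate (a sketch with made-up data and one planted outlier; Spearman works on ranks, so a single wild value barely moves it, while Pearson collapses):

```python
import numpy as np
from scipy.stats import pearsonr, spearmanr

# Hypothetical data: a perfect linear relationship with one planted outlier
x = np.arange(30, dtype=float)
y = x.copy()
y[-1] = -100.0  # the outlier

pearson_r, _ = pearsonr(x, y)
spearman_rho, _ = spearmanr(x, y)

# A large gap between the two coefficients is a red flag for outliers
print(f"Pearson r = {pearson_r:.2f}, Spearman rho = {spearman_rho:.2f}")
```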
Regression: Save the standardized or studentized residuals (z-scores). Any z-score beyond e.g. ±1.96 could flag an outlier, or to be less conservative, you may want to use ±2.58 or ±3.29. The recommendation is to use a 95% confidence level (alpha = .05), so any values higher than 1.96 (or lower than -1.96) can be regarded as outliers, and therefore candidates for remedial action.
- An absolute z-score greater than 1.96 is significant at p < .05: the 95% confidence level (which cuts off 2.5% at the lower and upper limits)
- An absolute z-score greater than 2.58 is significant at p < .01: the 99% confidence level (which cuts off 0.5% at the lower and upper limits)
- An absolute z-score greater than 3.29 is significant at p < .001: the 99.9% confidence level (which cuts off 0.05% at the lower and upper limits)
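These cut-offs translate directly into code. A minimal sketch in plain NumPy, with made-up residuals (in practice you would use the standardized residuals saved by the regression procedure):

```python
import numpy as np

# Hypothetical saved residuals; the last one is a planted outlier
residuals = np.array([0.2, -0.1, 0.3, -0.4, 0.1,
                      0.0, -0.2, 0.25, -0.15, 6.0])

# Standardize to z-scores
z = (residuals - residuals.mean()) / residuals.std()

# Flag cases beyond the chosen cut-off (1.96 for p < .05;
# swap in 2.58 or 3.29 for the stricter levels)
outliers = np.flatnonzero(np.abs(z) > 1.96)
print("flagged case indices:", outliers)
```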
Chi-square uses categorical data, and as we are not computing means and standard deviations, there is no need to be concerned about outliers.
Talk the talk: When we do a significance test at p < .01, we say we are testing “at the 99% confidence level”… and “at the 1% significance level”.
Once we have more than two variables in our equation, bivariate outlier detection becomes inadequate: bivariate data can be displayed in easy-to-understand two-dimensional plots, while multivariate data needs multidimensional plots that are a bit confusing to most of us. Therefore, a few multivariate outlier detection procedures are available. Among them is the Mahalanobis distance. Other procedures, such as Cook’s D and the Leverage values, are also helpful for identifying multivariate outliers. Each of these is available in software such as SPSS, and each has its own heuristics.
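The Mahalanobis distance idea can be sketched with NumPy and SciPy (made-up bivariate data with one planted outlier; a common heuristic compares the squared distance against a chi-square critical value with degrees of freedom equal to the number of variables):

```python
import numpy as np
from scipy.stats import chi2

# Hypothetical bivariate data with one planted multivariate outlier
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 2))
X[0] = [8.0, -8.0]

# Squared Mahalanobis distance: (x - mu)' S^-1 (x - mu)
diff = X - X.mean(axis=0)
inv_cov = np.linalg.inv(np.cov(X, rowvar=False))
d2 = np.einsum("ij,jk,ik->i", diff, inv_cov, diff)

# Common heuristic: flag cases beyond the chi-square critical value
# at p < .001 with df = number of variables (here 2)
cutoff = chi2.ppf(0.999, df=2)
flagged = np.flatnonzero(d2 > cutoff)
print("flagged case indices:", flagged)
```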
Note that in addition to the Mahalanobis D, Cook’s D, and Leverage values, we can (and should) also look at the “influence statistics”, which include the standardized DFBETA, standardized DFFIT, standardized/studentized residuals, and the COVRATIO.
Who’s Cook, you may ask. He is R. Dennis Cook, who in 1977 introduced a way to estimate the influence of a data point when performing OLS regression.
…and who is Mahalanobis? He is Prasanta Chandra Mahalanobis, a Cambridge-educated Indian (Bengali) scientist and applied statistician who lived between 1893 and 1972.
…and who is Leverage? Well, it is nobody. Leverage is a statistical term for identifying observations whose predictor values lie too far from the average of the predictors.