Outlier cases – univariate outliers

Posted on February 14, 2022 by Introspective-Mode in Assumptions, Data Cleaning, Data Management, Outliers

Discussing the causes, impact, identification and remedial treatment of outliers is a lengthy subject. I will keep this post short by focusing only on a few ways to identify univariate outliers. Also refer to the post entitled: Outlier cases – bivariate and multivariate outliers.

Remember that in bivariate and multivariate analysis the focus should not be on univariate outliers. It is still advisable to check for them, but don’t take immediate remedial action.

First and foremost, do the obvious by looking at a few visuals such as histograms, stem-and-leaf plots, box-and-whisker plots, normal probability plots, detrended Q-Q plots, etc. These graphs often make extreme outliers easy to spot.
Specifically for categorical variables, inspecting the frequency distribution of each variable with a box-and-whisker plot will show outliers.
Specifically for continuous variables, create standardised z-scores for each variable (in bivariate regression, investigate the residuals instead, e.g. in standardised form such as the “studentized residuals”). Note that using z-scores assumes a normal distribution. A general heuristic is that if more than 1% of all cases have absolute z-scores greater than 2.58 (or just 2.5), then we have an outlier problem, because a normal distribution places only about 1% of cases beyond ±2.58. If any absolute z-scores exceed 3.29 (or just 3), then we have serious outliers (the most likely candidates for remedial action).
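The z-score heuristic above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names, the returned keys and the data are all hypothetical, and the sample SD (n − 1) is used for standardising, which the heuristic itself does not prescribe.

```python
from statistics import mean, stdev

def z_scores(values):
    """Standardise values using the sample mean and sample SD (n - 1)."""
    m = mean(values)
    s = stdev(values)
    return [(v - m) / s for v in values]

def outlier_summary(values, moderate=2.58, severe=3.29):
    """Apply the two rules of thumb: share beyond |z| > 2.58, any beyond |z| > 3.29."""
    zs = z_scores(values)
    share_moderate = sum(abs(z) > moderate for z in zs) / len(zs)
    severe_cases = [v for v, z in zip(values, zs) if abs(z) > severe]
    return {
        "share_beyond_moderate": share_moderate,
        "outlier_problem": share_moderate > 0.01,  # more than 1% of cases
        "severe_cases": severe_cases,
    }

# Hypothetical data: 29 ordinary scores plus one extreme value (120).
scores = [55, 58, 60, 62, 65] * 5 + [57, 59, 61, 63] + [120]
summary = outlier_summary(scores)
print(summary)
```

Keep in mind that in small samples a single extreme value inflates the SD it is measured against, so very large z-scores may simply be impossible to reach; the visuals remain worth checking.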

Here’s a practical example: if our rule is to remove all cases with z-scores outside ±2.5, and the SD is 9 and the mean is 60, then 9 × 2.5 = 22.5. Add this to the mean: 60 + 22.5 = 82.5. So remove all cases with a score larger than 82.5. Do the same for the bottom end of the scale: 60 − 22.5 = 37.5, so remove all cases with a score smaller than 37.5. You may do this at different stringency levels, i.e. 1.96, 2.58 or 3.29 (or 1SD, 2SD, 3SD). I bet you knew this!
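The same arithmetic as a small Python sketch, in case you want to repeat it for several stringency levels. The function name and the list of scores are hypothetical; only the mean of 60, the SD of 9 and the z thresholds come from the example above.

```python
def cutoffs(mean, sd, z=2.5):
    """Return the (lower, upper) raw-score limits for a given z threshold."""
    margin = sd * z
    return mean - margin, mean + margin

# The worked example: mean 60, SD 9, rule of +/-2.5.
lower, upper = cutoffs(mean=60, sd=9, z=2.5)

# Hypothetical raw scores, split into kept and flagged cases.
scores = [35, 48, 60, 71, 82.5, 90]
kept = [s for s in scores if lower <= s <= upper]
flagged = [s for s in scores if s < lower or s > upper]

# The same limits at other stringency levels.
levels = {z: cutoffs(60, 9, z) for z in (1.96, 2.58, 3.29)}

print(lower, upper)
print(kept, flagged)
print(levels)
```

A score exactly on the limit (82.5 here) is kept, because the rule only removes scores outside the limits.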

Broadly, we have a few strategies to deal with univariate outliers including the following:

Remove the outlier cases (listwise or pairwise),
Transform the data (e.g. select an appropriate transformation such as a logarithmic, square root, reciprocal or reverse score transformation),
Change the score – either an easy change or a more complex “change-score-strategy”,
Just investigate to determine the scope of outliers and keep the findings in the back of your mind for later action or non-action. This is very applicable if you do bivariate or multivariate procedures.
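Two of these strategies can be sketched in Python as follows. This is a rough illustration, not a recommendation: the natural log stands in for "transform the data", and pulling extreme scores back to the cutoff is just one assumed version of an "easy change" to the score. The function names, the z = 2 threshold and the data are hypothetical.

```python
import math
from statistics import mean, stdev

def log_transform(values):
    """Natural log transformation; only valid for strictly positive values."""
    return [math.log(v) for v in values]

def clip_to_cutoff(values, z=2.0):
    """Replace scores beyond mean +/- z * SD with the cutoff value itself."""
    m, s = mean(values), stdev(values)
    lower, upper = m - z * s, m + z * s
    return [min(max(v, lower), upper) for v in values]

# Hypothetical right-skewed data with one extreme case.
values = [20, 22, 25, 27, 30, 32, 35, 400]

logged = log_transform(values)    # compresses the long right tail
clipped = clip_to_cutoff(values)  # only the extreme case is changed

print([round(v, 2) for v in logged])
print(clipped)
```

Note that the cutoff in `clip_to_cutoff` is computed from the data including the outlier, so it is itself inflated by the extreme case. That is one more reason to be careful with these strategies.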

Be careful with any of the above strategies, except the last one. My recommendation is to always check univariate outliers but don’t do anything yet if you are planning to do bivariate or multivariate analysis. While a data point may be a serious univariate outlier, it may not be an outlier in a bivariate or multivariate analysis – and the reverse is also true.
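To make the last point concrete, here is a small sketch of a case that looks unremarkable on each variable separately but sits far from the bivariate pattern. The data are hypothetical, and the squared Mahalanobis distance is my choice of bivariate measure for illustration (computed here from the standardised scores and their correlation); it is not a method described in this post.

```python
from statistics import mean, stdev

def standardise(values):
    """z-scores based on the sample mean and sample SD."""
    m, s = mean(values), stdev(values)
    return [(v - m) / s for v in values]

def mahalanobis_sq(zx, zy):
    """Squared Mahalanobis distance for two standardised variables."""
    n = len(zx)
    r = sum(a * b for a, b in zip(zx, zy)) / (n - 1)  # Pearson correlation
    return [(a * a - 2 * r * a * b + b * b) / (1 - r * r) for a, b in zip(zx, zy)]

# Hypothetical data: y follows x exactly, except for one case (x = 16, y = 5).
x = list(range(1, 21))
y = list(range(1, 21))
y[15] = 5

zx, zy = standardise(x), standardise(y)
d2 = mahalanobis_sq(zx, zy)

print(round(zx[15], 2), round(zy[15], 2))  # both well inside +/-2.5
print(round(d2[15], 2), round(max(d2), 2))  # yet the largest bivariate distance
```

Neither z-score for this case comes anywhere near the univariate thresholds, yet it has by far the largest distance from the joint pattern of the two variables.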

_________________________________________

Further Reading:

Osborne and Overbay (2004): The power of outliers (and why researchers should ALWAYS check for them)

_________________________________________
