DataClassroom

What's an outlier?

We’ve all heard it said: “It’s an outlier”. But what does that mean, and what should we do about it - if anything?

Rules for identifying outliers

There are various mathematical methods for identifying which points should be considered “outliers”, and some conventions as to how they should be shown. One of these is the 1.5 IQR Rule (see Wikipedia). It takes the Interquartile Range (IQR), which is the distance between the 25th and 75th percentiles, and calls a point an outlier if it lies more than 1.5 x IQR above the 75th percentile or more than 1.5 x IQR below the 25th percentile, like the uppermost point here:

The dotted line shows the 75th percentile plus 1.5 IQR limit
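
If you want to see the arithmetic behind that dotted line, here is a minimal sketch in Python. It is not how DataClassroom computes the limits; the function names and the heights are made up for illustration, and the "inclusive" quartile method is just one of several conventions, so other tools may give slightly different quartiles for the same data.

```python
import statistics


def iqr_fences(values, k=1.5):
    """Return the (lower, upper) limits of the k x IQR rule."""
    # Quartile conventions differ between tools; "inclusive" is an
    # assumption made for this sketch, not a universal definition.
    q1, _median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


def iqr_outliers(values, k=1.5):
    """Return the points that fall outside the k x IQR limits."""
    lower, upper = iqr_fences(values, k)
    return [v for v in values if v < lower or v > upper]


# Made-up heights in cm, with one unusually tall person.
heights_cm = [150, 152, 155, 157, 158, 160, 161, 163, 165, 198]

print(iqr_fences(heights_cm))    # (145.0, 173.0)
print(iqr_outliers(heights_cm))  # [198]
```

Here the quartiles are 155.5 and 162.5, so the IQR is 7 and the limits sit 10.5 beyond each quartile. The rule only flags the point; it says nothing about whether the point is wrong.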

Why 1.5 IQR?

It’s important to remember that this rule is arbitrary, even though it is very widely used. It was John Tukey, the inventor of the boxplot himself, who came up with it. He has been quoted as saying, when asked “why 1.5?”, that “2 was too big, and 1 was too small”.

Another rule that is sometimes used is to consider points that lie further than 2 (or 3) standard deviations from the mean as outliers.
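
As a rough sketch of this second rule, again in Python with a hypothetical function name and made-up heights, the same data can be checked against both the 2 and the 3 standard deviation versions. The use of the sample standard deviation here is an assumption; the rule as usually stated does not say which one to use.

```python
import statistics


def sd_outliers(values, k=2.0):
    """Return the points further than k standard deviations from the mean."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)  # sample standard deviation (an assumption)
    return [v for v in values if abs(v - mean) > k * sd]


# Made-up heights in cm, with one unusually tall person.
heights_cm = [150, 152, 155, 157, 158, 160, 161, 163, 165, 198]

print(sd_outliers(heights_cm, k=2))  # [198]
print(sd_outliers(heights_cm, k=3))  # []
```

With these numbers the mean is 161.9 and the standard deviation is about 13.5, so the tallest person is an outlier under the 2 standard deviation rule but not under the 3 standard deviation rule. Note too that the extreme point itself pulls the mean up and inflates the standard deviation.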

So, you can choose arbitrarily between various arbitrary rules! How confusing is that?

Look at your data!

Frankly, the concept of the calculated outlier point evolved at a time when it was hard work (remember pencil and paper?) to plot all your data points. Back then it was much easier to summarize them visually with a boxplot. As we have written earlier, with tools like DataClassroom available, you can now easily get much better insights by plotting all the points and looking at them.

And yes, for those who are now thinking “what do I need an arbitrary 1.5 IQR rule for when I can just see the outliers myself”, I couldn’t agree more.

But whether you use a rule, or just look at the points, you might see some outliers, and it will be up to you to decide how to treat them.

What to do with outliers

Outliers are a prime example of an area where data literacy is important. It is one thing to know how to perform a calculation to find outliers, and something completely different to know how you should use the result. Or indeed, if you should use the result at all.

Some possible reasons for outliers:

  • Random variation in perfectly valid data points. In any randomly sampled population, there will be outliers - someone unusually tall, for example.

  • A data point may be an outlier for a good reason. When measuring fluency in English, someone who recently moved to the US from France might get an unusually low score on the test.

  • An outlier might also indicate an invalid data point. In a set of temperature readings, some might have been taken with a thermometer that was broken.

So to know what to do, you have to both look at your data and think about it.

When in doubt, do nothing!

What you shouldn’t do is just delete or exclude the outlier points, unless you can be absolutely certain that they are in fact invalid, and you know why.

My default assumption would be that these are data points to be included like any other.

You might, though, be able to work out that a point is invalid, for example because the value is clearly impossible. If it’s someone’s body temperature and it reads zero, that’s probably an issue with the thermometer. Or a typo. Or - are they breathing?

Sometimes, it may depend on what you are going to use the data for. In the English test score example above, you might want to ignore the French person’s score if you are evaluating how good the previous year’s English tuition had been. But if you just want to know what the general level of fluency in that population is, then you’ll want to include the outliers. They are people too!

Or, an outlier might need to trigger another reaction, quite separate from the data analysis you are in the middle of. In drug trials, there can be rare but severe side effects that need to be reported. They may show up as outliers in test results.

More…

You can have a look in our online User Guide article on how to show outliers.

Wikipedia also has more detail on outliers here.