Outliers


Sometimes a set of data may contain one or more values that are a long way away from the other values in the set. We call these values outliers, and we need to be very careful about the way we deal with them.


Let's have a look at an example where an outlier occurs.

Example: Skipping


Christo's teacher has set her class a challenge to see who can improve their skipping the most over a two-week period. She records the number of times each member of the class can jump over a rope in five minutes at the beginning and at the end of the two-week period.

Here are the results:

Name      Before   After   After - Before
Claire        60      88               28
Angelyn       63      92               29
Christo       66      17              -49
Greg          61      90               29
Steve         70      95               25
Josh          59      86               27

Hmmm... everyone seems to have improved except for Christo. Why did Christo get worse?
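
If you'd like to check the last column of the table, here is a minimal Python sketch that recomputes each difference as After minus Before. The variable names (`results`, `differences`) are just illustrative choices, not part of the original lesson.

```python
# Skipping counts from the table: name -> (before, after)
results = {
    "Claire": (60, 88),
    "Angelyn": (63, 92),
    "Christo": (66, 17),
    "Greg": (61, 90),
    "Steve": (70, 95),
    "Josh": (59, 86),
}

# Improvement for each student is After - Before
differences = {name: after - before for name, (before, after) in results.items()}

# Print each name with a signed difference, e.g. +28 or -49
for name, diff in differences.items():
    print(f"{name:8s} {diff:+d}")
```

Christo is the only student with a negative difference, since 17 - 66 = -49.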

Here are the differences plotted on a number line:

[Figure: the six differences plotted on a number line]

If we calculate the mean improvement, using Christo's difference of \(17 - 66 = -49\), we get
\( \text{mean} = \dfrac{28 + 29 + (-49) + 29 + 25 + 27}{6} = \dfrac{89}{6} \approx 14.8, \)
which is still an improvement, but doesn't reflect the fact that most people improved by \(25\) or more.

So, what's going on? Christo's improvement (or lack thereof) is an outlier. Sometimes the best thing to do with outliers is to discard them. Let's see what happens without Christo's result:

If we calculate the mean improvement, we get
\( \text{mean} = \dfrac{28 + 29 + 29 + 25 +27 }{5} = 27.6, \)
which seems to reflect the improvement of everyone else much better than the preceding mean.
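
The two means can be compared side by side with a short Python sketch. It assumes Christo's difference is 17 - 66 = -49, taken from the Before and After columns of the table, and uses `mean` from Python's standard `statistics` module. The names `differences` and `without_christo` are illustrative.

```python
from statistics import mean

# After - Before for Claire, Angelyn, Christo, Greg, Steve, Josh
# (Christo's value assumed to be 17 - 66 = -49)
differences = [28, 29, -49, 29, 25, 27]

# The same list with Christo's result left out
without_christo = [28, 29, 29, 25, 27]

mean_all = mean(differences)
mean_without = mean(without_christo)

print("Mean with Christo:   ", round(mean_all, 1))
print("Mean without Christo:", round(mean_without, 1))
```

Leaving out a single value nearly doubles the mean here, which is why the choice to discard an outlier needs a good justification.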

Is this the right thing to do? Can we just throw away data values that make things look bad?

Dealing with Outliers

We can't just throw away data values without a good reason. Otherwise, we can be accused of fudging our results. Sometimes it is quite reasonable to have values that are much higher or much lower than other values. For example,

  • Dimensions can be smaller or larger than other dimensions: e.g. people can be heavier or lighter, shorter or taller.
  • People can have bad days.
  • Plants grow better if they get enough sunlight, nutrients and water.

There may be a good reason for the strange data values that we haven't accounted for.

Let's see if we can find a reason for Christo's bad performance.

Skipping Example (continued)

It turns out that Christo decided it would be a good idea to see if he could juggle his soft toys while he skipped on the second day. He'd jump over the rope, throw his toys up into the air, jump over the rope, catch them, and so on. Consequently, it took him a lot longer to complete each jump, and he couldn't complete anywhere near the same number of jumps in 5 minutes.

So, Christo's result was rubbish: he wasn't doing the same task as everyone else, so his result deserved to be thrown away.

In some cases, however, it really isn't a good idea to discard outliers. We need to consider each situation individually before making our decision. We also need to be able to justify our decisions when we write our report.

Effects of Outliers on the Mean, Median and Mode

In the example, we saw that the presence of outliers can have a huge effect on the mean. What about the median and mode?

The median for our data set

  • With Christo was 27.5
  • Without Christo was 28
So, Christo's result didn't affect the median much.

The mode for our data set

  • With Christo was 29
  • Without Christo was 29
So, Christo's result didn't change the mode at all.

The median and mode stayed close to most of the data values. These measures give a better indication of the typical value in a data set that includes outliers. The mean is not so reliable.
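
As a rough check of the median and mode values above, here is a small Python sketch using `median` and `mode` from the standard `statistics` module. It assumes Christo's difference is 17 - 66 = -49, from the Before and After columns. The list names are illustrative.

```python
from statistics import median, mode

# After - Before for all six students (Christo's value assumed to be -49)
differences = [28, 29, -49, 29, 25, 27]

# The same list with Christo's result left out
without_christo = [28, 29, 29, 25, 27]

# median() sorts the values and takes the middle one
# (or the average of the two middle ones for an even count)
median_all = median(differences)
median_without = median(without_christo)

# mode() returns the most common value
mode_all = mode(differences)
mode_without = mode(without_christo)

print("Median with / without Christo:", median_all, median_without)
print("Mode with / without Christo:  ", mode_all, mode_without)
```

The median only looks at the middle of the sorted list and the mode only looks at the most common value, so one extreme value at the end of the list hardly affects them.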

Description

This chapter series is on Data and is suitable for students in Year 10 or higher. Topics include:

  • Accuracy and Precision
  • Calculating Means From Frequency Tables
  • Correlation
  • Cumulative Tables and Graphs
  • Discrete and Continuous Data
  • Finding the Mean
  • Finding the Median
  • Finding the Mode
  • Formulas for Standard Deviation
  • Grouped Frequency Distribution
  • Normal Distribution
  • Outliers
  • Quartiles
  • Quincunx
  • Quincunx Explained
  • Range (Statistics)
  • Skewed Data
  • Standard Deviation and Variance
  • Standard Normal Table
  • Univariate and Bivariate Data
  • What is Data

 



Audience

Year 10 or higher students, some chapters suitable for students in Year 8 or higher

Learning Objectives

Learn about topics related to "Data"

Author: Subject Coach
Added on: 28th Sep 2018
