I was working on a classification problem using machine learning, and while analyzing the training data I noticed a few data points that didn’t fit the distribution. These points made an otherwise Gaussian distribution look non-Gaussian.

To improve the accuracy of the classifier, I needed to eliminate them. For that purpose, I used Tukey’s method, which uses the interquartile range to find and eliminate outliers.

What is a quartile, and how do we find the interquartile range?

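If we sort the data in ascending order and divide it into 4 sections, each containing 25% of the data, then each section is called a quartile. The first 25% of the data is called the 1st quartile, 25% – 50% is the 2nd quartile, 50% – 75% is the 3rd quartile and the last 25% is the 4th quartile. The same names are also used for the boundary values between sections: Q1 is the value below which 25% of the data lies, Q2 is the median, and Q3 is the value below which 75% of the data lies. These boundary values are what we compute.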

In Python, we can calculate quartiles as follows:


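import numpy as np

def GetQuartiles(arr):
    # Sort so the lower and upper halves can be sliced by index.
    arr = np.sort(arr)
    mid = len(arr) // 2  # integer division, so mid is a valid slice index
    # Q1 is the median of the lower half.
    Q1 = np.median(arr[:mid])
    # Q3 is the median of the upper half; for an odd length, skip the middle element.
    if len(arr) % 2 == 0:
        Q3 = np.median(arr[mid:])
    else:
        Q3 = np.median(arr[mid+1:])
    return Q1, Q3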

For the dataset shown below, the quartiles are:

[Figure: tukey01]

The distance between the 3rd quartile and the 1st quartile is called the interquartile range (IQR).

[Figure: tukey02]
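To make the arithmetic concrete, here is a minimal sketch that computes Q1, Q3 and the IQR for a small made-up sample. The helper name `median_of_halves` and the sample values are my own, for illustration only. It also shows NumPy's built-in `np.percentile`, which by default interpolates linearly between data points and so can give slightly different quartiles than the median-of-halves approach.

```python
import numpy as np

# Made-up sample, for illustration only.
sample = np.array([7, 1, 3, 9, 5, 2, 8, 4, 6])

def median_of_halves(values):
    """Q1 and Q3 as medians of the lower and upper halves.

    For an odd number of values, the middle element is left out of both halves.
    """
    data = np.sort(values)
    mid = len(data) // 2
    upper_start = mid if len(data) % 2 == 0 else mid + 1
    return float(np.median(data[:mid])), float(np.median(data[upper_start:]))

q1, q3 = median_of_halves(sample)
iqr = q3 - q1
print(q1, q3, iqr)  # 2.5 7.5 5.0

# np.percentile uses linear interpolation by default, a different convention.
p25, p75 = (float(v) for v in np.percentile(sample, [25, 75]))
print(p25, p75, p75 - p25)  # 3.0 7.0 4.0
```

Both conventions are in common use. What matters is picking one and applying it consistently to the whole dataset.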

How do we detect outliers using the IQR?

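Any value that lies below (1st quartile – IQR) or above (3rd quartile + IQR) is considered an outlier. You can also multiply the IQR by a small factor (called bias in the code below) to include or exclude more data points: a larger bias widens the limits and keeps more points, a smaller one removes more. Tukey’s conventional factor is 1.5; I use 1 as the default here.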


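def EliminateOutliers(arr, bias=1):
    q1, q3 = GetQuartiles(arr)
    iqr = q3 - q1
    # Limits: scale the IQR by bias to widen or narrow them.
    lowerLimit = q1 - bias * iqr
    upperLimit = q3 + bias * iqr
    # Keep only values strictly inside the limits.
    # A list is returned because filter() is lazy in Python 3.
    return [x for x in arr if lowerLimit < x < upperLimit]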

For bias = 1, the outliers are highlighted below.

[Figure: tukey03]
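To see how the bias changes the result, here is a small self-contained sketch. It is not the exact code from this post: the function name `tukey_filter`, the parameter `k` and the sample readings are made up for illustration, and it uses `np.percentile` with a boolean mask instead of the median-of-halves quartiles.

```python
import numpy as np

def tukey_filter(values, k=1.0):
    """Keep values strictly inside (Q1 - k*IQR, Q3 + k*IQR)."""
    data = np.asarray(values, dtype=float)
    q1, q3 = np.percentile(data, [25, 75])
    iqr = q3 - q1
    # Boolean mask: True for values inside both limits.
    mask = (data > q1 - k * iqr) & (data < q3 + k * iqr)
    return data[mask]

# Made-up readings: 40 is far from the rest, 15.5 is borderline.
readings = [10, 12, 11, 13, 12, 11, 15.5, 10, 13, 40]

kept_k1 = tukey_filter(readings, k=1.0)    # limits (9, 15): drops 15.5 and 40
kept_k15 = tukey_filter(readings, k=1.5)   # limits (8, 16): drops only 40
print(len(kept_k1), len(kept_k15))  # 8 9
```

Here Q1 is 11 and Q3 is 13, so the IQR is 2. With a factor of 1 the borderline value 15.5 is treated as an outlier; with Tukey's conventional 1.5 it is kept.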

This is a simple method for eliminating outliers from the data, and it can help you train a better classifier.