How to Detect and Remove Outliers in Machine Learning

Data

In data science, data is either categorical or continuous. More broadly, anything that can be recorded is data.

Outliers: When we collect or buy data, it may contain values that lie far below or far above the range of the other values. We call such a value an outlier.

Outliers and Their Treatment

Outliers

Outliers are data points that deviate significantly from the rest of the dataset. They can skew results and affect the performance of machine learning models. Properly identifying and treating outliers is essential for maintaining the integrity of data analysis.

How to Detect?

Looking at Graphs

  • Box Plot: This graph shows how your data spreads out. Points that are far outside the main range are likely outliers.
  • Scatter Plot: It shows data points scattered on a graph. If a point is far away from the others, it might be an outlier.
  • Histogram: This groups data into bars to show its spread. If a bar sits on its own, far away from the rest, it could point to an outlier.
  • When to use: Great for quickly spotting issues in small datasets, but might miss things in bigger or complex data.

Math-Based Methods

  • Z-Score: This measures how far a data point is from the average, in units of standard deviation. If it's more than 3 standard deviations above the mean (or below -3), it's probably an outlier. Works best when the data is roughly normally distributed.
  • Interquartile Range (IQR): Split your sorted data into four parts and look at the middle half (Q1 to Q3), where IQR = Q3 − Q1. Points below Q1 − 1.5×IQR or above Q3 + 1.5×IQR are outliers. This works even if your data isn't normally distributed.
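Both rules can be sketched in a few lines of NumPy. This is only a minimal illustration: the sample values are made up, and the thresholds (3 for the Z-score, 1.5 for the IQR) are the usual conventions mentioned above.

```python
import numpy as np

# Hypothetical sample: 20 ordinary readings plus one extreme value (95).
data = np.array([21, 22, 23, 22, 24, 25, 23, 22, 21, 24,
                 22, 23, 24, 21, 25, 23, 22, 24, 23, 22, 95], dtype=float)

# Z-score: distance from the mean, measured in standard deviations.
z_scores = (data - data.mean()) / data.std()
z_outliers = data[np.abs(z_scores) > 3]

# IQR: spread of the middle half of the data (Q1 to Q3).
q1, q3 = np.percentile(data, [25, 75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
iqr_outliers = data[(data < lower) | (data > upper)]

print("Z-score outliers:", z_outliers)
print("IQR bounds:", lower, upper)
print("IQR outliers:", iqr_outliers)
```

With this sample, both rules flag only the value 95. Keep in mind that on very small samples the Z-score rule may never fire, because a single extreme value also inflates the standard deviation it is compared against.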

Treatment of Outliers

  • Removal: Completely eliminating outlier data points from the dataset. This approach is straightforward but may lead to loss of valuable information, especially if the outliers are genuine observations rather than errors.
  • Capping: Setting upper and lower limits for outlier values, effectively reducing their impact without removing them entirely.
  • Transformation: Applying mathematical transformations (for example, a logarithm) to reduce skewness and lessen the effect of outliers on statistical analyses.
  • Imputation: Replacing outlier values with more typical values, such as the mean or median of the dataset.
  • Modeling Separately: In some cases, it may be beneficial to create a separate model for outliers or treat them as a distinct category within the analysis.
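The first four treatments can be sketched with pandas as follows. The sample values are hypothetical, the IQR fences are just one possible way to decide what counts as an outlier, and the log transform is only one of several possible transformations.

```python
import numpy as np
import pandas as pd

# Hypothetical sample with one extreme value (95).
s = pd.Series([21, 22, 23, 22, 24, 25, 23, 22, 21, 24, 95], dtype=float)

# IQR fences used here to decide which values count as outliers.
q1, q3 = s.quantile(0.25), s.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
is_outlier = (s < lower) | (s > upper)

# Removal: drop the outlier rows.
removed = s[~is_outlier]

# Capping: pull values outside the fences back to the fences.
capped = s.clip(lower=lower, upper=upper)

# Transformation: log(1 + x) compresses large values.
logged = np.log1p(s)

# Imputation: replace outliers with the median of the remaining values.
imputed = s.mask(is_outlier, s[~is_outlier].median())

print(removed.tolist())
print(capped.tolist())
print(imputed.tolist())
```

Which option fits depends on the data: capping and imputation keep the row count unchanged, while removal shrinks the dataset. Modeling outliers separately is not shown here because it depends on the model being used.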

Example

I have taken a dataset from Kaggle called weather-anomalies-1964-2013, which contains the complete weather data, and I show how it looks after removing all the outliers.

To access the dataset and install Python libraries, we can use tools like Google Colab or Anaconda's Jupyter Notebook.

Step 1. Import all the required libraries and the dataset.

In this step I import the required Python libraries and read our CSV file using Pandas.

CSV file
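A minimal sketch of this step is given here. To keep it self-contained, it reads a tiny made-up CSV from a string instead of the downloaded file, and the column names are assumptions used only for illustration, not necessarily those of the Kaggle file.

```python
import io

import pandas as pd

# Hypothetical stand-in for the downloaded CSV; the column names are assumptions.
csv_text = """station_name,max_temp,min_temp
A,31.0,18.5
B,29.5,17.0
C,30.2,19.1
D,58.9,16.4
"""

# With the real file, pass its path instead of the StringIO object.
df = pd.read_csv(io.StringIO(csv_text))
print(df.head())
```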

Step 2. Now we will look at the general information about the file.

Since we have taken this data from Kaggle, it is already fairly clean, so we can skip most of the usual cleaning and setup steps and work on the data directly.

Kaggle
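Even with a dataset that looks clean, a quick inspection can confirm it. The sketch below uses a small hypothetical DataFrame (the column names and values are made up) to show the kind of calls that fit this step.

```python
import pandas as pd

# Hypothetical frame standing in for the loaded CSV; column names are assumptions.
df = pd.DataFrame({
    "station_name": ["A", "B", "C", "D"],
    "max_temp": [31.0, 29.5, None, 58.9],
    "min_temp": [18.5, 17.0, 19.1, 16.4],
})

df.info()                           # column names, dtypes and non-null counts
missing = df.isna().sum()           # missing values per column
duplicates = df.duplicated().sum()  # number of fully duplicated rows

print(missing)
print("duplicate rows:", duplicates)
```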

Step 3. In the same way, we will look at its summary statistics, such as the mean and median.

PRINT
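As a rough sketch, the summary statistics can be printed like this. The column name and values are hypothetical.

```python
import pandas as pd

# Hypothetical column with one extreme reading (58.9).
df = pd.DataFrame({"max_temp": [31.0, 29.5, 30.2, 58.9, 30.8]})

summary = df.describe()            # count, mean, std, min, quartiles, max
mean = df["max_temp"].mean()
median = df["max_temp"].median()

print(summary)
print("mean:", mean, "median:", median)
```

A mean that sits well above or below the median, as it does here, is an early hint that extreme values are pulling the average.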

Step 4. We will plot our data, and from the histogram and box plot we will see which columns contain outliers.

Outlier data
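One possible way to draw these plots is with pandas and matplotlib, as sketched below. The DataFrame and its column names are hypothetical; with the real data, the same calls would be applied to its numeric columns.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical frame; "max_temp" contains one extreme reading (58.9).
df = pd.DataFrame({
    "max_temp": [31.0, 29.5, 30.2, 58.9, 30.8, 29.9],
    "min_temp": [18.5, 17.0, 19.1, 16.4, 18.0, 17.6],
})

fig, (ax_hist, ax_box) = plt.subplots(1, 2, figsize=(10, 4))

# Histogram: an isolated bar far from the rest hints at an outlier.
df["max_temp"].plot.hist(bins=10, ax=ax_hist, title="max_temp histogram")

# Box plot: points drawn beyond the whiskers are candidate outliers.
df.boxplot(column=["max_temp", "min_temp"], ax=ax_box)
ax_box.set_title("Box plot per column")

fig.tight_layout()
```

In Colab or Jupyter the figure appears below the cell. In a plain script, add plt.show() at the end.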

Step 5. Now we will apply the outlier detection methods, such as IQR and Z-score, to the data, since our data is numeric.

IQR
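A minimal sketch of this step, applied to every numeric column of a DataFrame at once. The frame, its column names, and its values are hypothetical, and the cut-offs (1.5 for the IQR, 3 for the Z-score) are the usual conventions.

```python
import pandas as pd

# Hypothetical frame; "max_temp" has one extreme reading (95).
df = pd.DataFrame({
    "station_name": list("ABCDEFGHIJKLMNOPQRSTU"),
    "max_temp": [21, 22, 23, 22, 24, 25, 23, 22, 21, 24,
                 22, 23, 24, 21, 25, 23, 22, 24, 23, 22, 95],
})

numeric = df.select_dtypes(include="number")  # only numeric columns can be scored

# IQR fences, computed per column.
q1 = numeric.quantile(0.25)
q3 = numeric.quantile(0.75)
iqr = q3 - q1
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr
iqr_mask = numeric.lt(lower, axis="columns") | numeric.gt(upper, axis="columns")

# Z-scores per column (population standard deviation, ddof=0).
z_scores = (numeric - numeric.mean()) / numeric.std(ddof=0)
z_mask = z_scores.abs() > 3

print(df[iqr_mask["max_temp"]])
print(df[z_mask["max_temp"]])
```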

Step 6. We will print the filtered data, along with the calculated values.

Calculated value

Step 7. After filtering the data, we print a summary to see how many outliers we have removed, and then run a check to confirm that we do not have any outlier data left.

Outlier data left
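The filtering, printing, and final check from the last two steps can be sketched together. The data is hypothetical, and iqr_fences is a helper name made up for this illustration, not a pandas function.

```python
import pandas as pd

# Hypothetical column with one extreme reading (95.0).
df = pd.DataFrame({"max_temp": [21.0, 22.0, 23.0, 22.0, 24.0, 25.0,
                                23.0, 22.0, 21.0, 24.0, 95.0]})


def iqr_fences(series):
    """Return the (lower, upper) IQR fences for a numeric Series."""
    q1, q3 = series.quantile(0.25), series.quantile(0.75)
    iqr = q3 - q1
    return q1 - 1.5 * iqr, q3 + 1.5 * iqr


# Filter: keep only rows inside the fences (bounds included).
lower, upper = iqr_fences(df["max_temp"])
filtered = df[df["max_temp"].between(lower, upper)]
print(f"lower={lower}, upper={upper}")
print(filtered)

# Count what was removed.
removed_count = len(df) - len(filtered)
print("rows removed:", removed_count)

# Check again: recompute the fences on the filtered data and count what falls outside.
new_lower, new_upper = iqr_fences(filtered["max_temp"])
remaining = int((~filtered["max_temp"].between(new_lower, new_upper)).sum())
print("outliers left:", remaining)
```

Because the quartiles shift once rows are removed, a second pass can sometimes flag new points on real data. Whether to repeat the filtering in that case is a judgment call.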

Conclusion

Effectively managing outliers is critical for ensuring accurate data analysis and improving model performance in machine learning. The choice of detection and treatment methods should consider the nature of the data, the context of analysis, and specific project goals. By applying appropriate techniques, analysts can enhance data quality and derive more meaningful insights from their datasets.
