Introduction
These two data issues are quite different from each other. One can completely distort a statistical analysis, while the other plays no role in model training: whether it is present in the dataset or not, it has no impact on the predictions of the trained model.
Unraveling Outliers
An outlier is a data point that deviates significantly from the rest of the dataset, whether due to natural variation or some kind of measurement error. It can distort statistical measures and lead to inaccurate predictions.
Identification Techniques
To identify outliers in a dataset, we have two options; depending on the complexity and size of the data, we can prefer one over the other.
- Visual Inspection: In this technique, we use box plots, scatter plots, and histograms to visualize the outliers; we identify an outlier by checking which point in the plot lies far from the central tendency.
- Statistical Inspection: Quantitative measures such as z-scores, the interquartile range (IQR), and Mahalanobis distance offer objective criteria for identifying outliers based on their deviation from the mean or median, as sketched below.
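For instance, a minimal z-score check might look like the following. The synthetic data, the planted value of 200, and the threshold of 3 are all illustrative assumptions, not fixed rules.
import numpy as np
import pandas as pd
rng = np.random.default_rng(0)
s = pd.Series(rng.normal(loc=50, scale=5, size=100))
s.iloc[0] = 200  # plant an obvious outlier
# z-score: distance from the mean in units of standard deviation
z = (s - s.mean()) / s.std()
print(s[z.abs() > 3])  # flag points more than 3 standard deviations away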
Handling Strategies
In data cleaning, various techniques are employed to handle outliers effectively:
- Trimming: Excluding extreme values based on predefined thresholds or percentiles, thereby removing outliers from subsequent analyses.
- Winsorization: Capping extreme values by replacing them with less extreme ones, typically at the tails of the data distribution, mitigating their impact without discarding them entirely.
- Transformation: Applying mathematical transformations such as logarithmic or square root transformations to normalize the distribution and reduce the influence of outliers on statistical analyses.
- Model-based approaches: Using robust statistical models or algorithms, such as robust regression or tree-based methods, which are inherently less sensitive to outliers, thereby minimizing their impact on predictive performance.
By leveraging these diverse strategies, analysts can effectively manage outliers and ensure the integrity and reliability of their data analyses; two of these strategies are sketched below.
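As a rough illustration, this sketch winsorizes a pandas Series by capping it at the 5th and 95th percentiles, then applies a log transformation. The percentile cutoffs and the sample values are illustrative assumptions, not universal defaults.
import numpy as np
import pandas as pd
s = pd.Series([60000, 75000, 90000, 120000, 200000, 1500000])
# Winsorization: cap values at the 5th and 95th percentiles
lower, upper = s.quantile(0.05), s.quantile(0.95)
winsorized = s.clip(lower=lower, upper=upper)
# Transformation: a log transform compresses the long right tail
log_transformed = np.log1p(s)
print(winsorized)
print(log_transformed)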
Silencing the Noise
Unwanted properties or columns that play no role in predicting the desired result are treated as noise. It is our choice whether to remove them or keep them, although they can occasionally give you some additional information.
Outliers and duplicate entries can also be treated as noise, which is why we study the processes of handling noise and outliers together. These kinds of noise have a large impact on a model's predictions, so it is very important to remove them to get accurate results.
To handle simple noise, we can just delete the whole column from the dataset, as shown below.
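For example, if a column carries no predictive signal, it can be dropped in one line; 'Ticket_ID' here is a hypothetical irrelevant column, not one from the dataset used later.
df = df.drop(columns=['Ticket_ID'])  # 'Ticket_ID' is a hypothetical noise column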
Example
Here we will take the same dataset that we already used to demonstrate the process of handling missing values in the last article.
As usual, first import the libraries.
import pandas as pd
import numpy as np
Now, use the dataset whose missing values were handled in the last article.
data = {
'Name': ['John', 'Jane', 'Alice', 'Bob', 'Chris', 'Unknown'] * 1000,
'Age': [35, 28, 42, 50, 45, 42] * 1000,
'Salary': [60000, 75000, 90000, 120000, 200000, 1500000] * 1000,
'Department_ID': [101, 102, 103, 104, 105, 106] * 1000
}
Convert this data into a DataFrame.
df = pd.DataFrame(data)
To analyze the DataFrame, just have a look at its head.
df.head(7)
Now, introduce noise into the Age column.
df.loc[df.sample(frac=0.05).index, 'Age'] = 200  # set a random 5% of Age values to an impossible 200
When you analyze the DataFrame, you will find that we have outlier values in the Salary column. Let's start the process with the IQR method.
Q1 = df['Salary'].quantile(0.25)  # first quartile
Q3 = df['Salary'].quantile(0.75)  # third quartile
IQR = Q3 - Q1  # interquartile range
df = df[(df['Salary'] > (Q1 - 1.5 * IQR)) & (df['Salary'] < (Q3 + 1.5 * IQR))]  # keep rows inside the 1.5 * IQR fences
The IQR method is one of the most widely used approaches for handling outlier values.
Now, to handle the noise in the Age column, we introduce a realistic range for age, taking 100 as the maximum acceptable value.
df = df[df['Age'] < 100]
Now reset the index of the cleaned dataset.
df.reset_index(drop=True, inplace=True)
Now, just have a look at the cleaned dataset.
print("Cleaned Dataset:")
print(df)
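As an optional sanity check (an addition of ours, not part of the original walkthrough), you can confirm that the noisy ages and extreme salaries are gone:
print(df['Age'].max())     # should now be below 100
print(df['Salary'].max())  # should now fall within the IQR fences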
Note: This example only explains the process of data cleaning; for further practice, please use different datasets.
Conclusion
Outliers and noise, while seemingly similar, pose distinct challenges in data cleaning. Outliers are genuine data points that deviate significantly from the norm, potentially skewing statistical analyses. We can identify them visually or statistically and address them through trimming, winsorization, transformation, or employing robust models. In contrast, noise represents irrelevant data with no bearing on the target variable. While removable, careful consideration is necessary, as noise may hold unforeseen insights. By effectively handling both outliers and noise, we ensure clean, reliable data that fuels accurate and robust statistical analyses and machine learning models. Remember, the choice of techniques hinges on the specific data and the intended analysis.