Introduction
The lifeblood of any successful AI model is clean, high-quality data. However, raw data often arrives tangled in inconsistencies and duplicates, which can lead to misleading results and hurt your model's performance. Fear not, data wranglers! This article equips you with the knowledge and techniques to tackle both problems during the data-cleaning process.
Identifying Inconsistent Data
Inconsistent data arises from discrepancies in formats, units of measurement, or coding conventions, and it can lead to erroneous interpretations and inaccurate model predictions. A common example is mixed date formats ("YYYY-MM-DD" vs. "MM/DD/YYYY"). Inconsistencies can be identified through pattern matching, cross-validation, and domain knowledge of the dataset. Once identified, we can proceed to handle them.
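As a quick illustration of the pattern-matching approach, here is a minimal sketch (with made-up values) that flags date strings not matching the ISO "YYYY-MM-DD" shape:

```python
import pandas as pd

# Hypothetical column containing mixed date formats
dates = pd.Series(["2024-01-15", "01/15/2024", "2024-02-03"])

# Pattern matching: flag entries that do not match the ISO "YYYY-MM-DD" shape
iso_pattern = r"^\d{4}-\d{2}-\d{2}$"
inconsistent = dates[~dates.str.match(iso_pattern)]
print(inconsistent.tolist())
```

The flagged entries can then be reviewed and standardized.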
Inconsistency Handling Techniques
- Formatting: Standardize date formats, units, and special characters using dedicated functions or libraries.
- Encoding: Encode categorical variables using techniques like one-hot encoding or label encoding.
- Imputation: Fill in missing values with estimates using methods like mean/median imputation or K-Nearest Neighbors (KNN).
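The three techniques above can be sketched in a few lines of pandas. This is a minimal illustration on a made-up frame; the column names are our own, not from any particular dataset:

```python
import pandas as pd

# Hypothetical toy frame with a mixed date format, a categorical column,
# and a missing value
df = pd.DataFrame({
    "joined": ["2024-01-15", "01/15/2024"],
    "city": ["Delhi", "Mumbai"],
    "age": [25.0, None],
})

# Formatting: parse each mixed-format date string into a datetime
# (pandas >= 2.0 also offers pd.to_datetime(..., format="mixed"))
df["joined"] = df["joined"].apply(pd.to_datetime)

# Encoding: one-hot encode the categorical column
df = pd.get_dummies(df, columns=["city"])

# Imputation: fill the missing age with the column median
df["age"] = df["age"].fillna(df["age"].median())
```

For KNN-based imputation, scikit-learn's `KNNImputer` is a common choice.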
Identifying and Handling Duplicates
When identical data points appear multiple times in a dataset, they are treated as duplicates. Duplicates affect the model's predictions: repeated identical points effectively over-weight those records, biasing what the model learns.
There are two ways to handle duplicates in a dataset:
- Deletion: Remove identified duplicates while ensuring you don't discard valuable data points.
- Merging: If duplicates have slightly different values (e.g., typos), consider merging them into a single entry with corrected information.
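Both strategies can be sketched with pandas. In this made-up example (hypothetical names and IDs), a typo prevents exact-duplicate deletion from catching the repeated person, so we merge on the shared ID instead:

```python
import pandas as pd

# Hypothetical records: row 1 repeats row 0's person with a typo in the name
df = pd.DataFrame({
    "id": [2, 2, 3],
    "name": ["Prabhat Kumar", "Prabhat Kumr", "Avinash"],
    "salary": [60000, 60000, 70000],
})

# Deletion: drop rows that are exact duplicates
# (none here, since the typo makes the two rows differ)
df = df.drop_duplicates()

# Merging: collapse rows that share an ID into a single entry,
# keeping the first-seen value of each column
merged = df.groupby("id", as_index=False).first()
print(merged)
```

In practice, merging usually also involves deciding which value to keep per column (first seen, most frequent, manually corrected, etc.).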
Example
Here we will walk through both data-cleaning steps with an example.
Firstly, import the libraries; we will require only pandas.
import pandas as pd
Now let's create a dataset containing inconsistent and duplicate data.
data = {
'ID': [1, 2, 3, 4, 5, 2, 6],
'Name': ['Alaukik Nandan', 'Prabhat Kumar', 'Avinash', 'Adarsh Mishra', 'Lekhika ', 'Prabhat Kumar', 'Keshav'],
'Age': [26, 25, 'Unknown', 24, 24, 25, 19],
'Salary': [50000, 60000, 70000, 'N/A', 90000, 60000, 100000]
}
Convert the data into a data frame.
df = pd.DataFrame(data)
Now, let's take a look at the original data frame before we start handling the data issues.
print("Original Dataset with Inconsistencies and Duplicates:")
print(df)
print()
Now convert the Age and Salary columns to numeric, replacing all non-numeric values with NaN.
df['Age'] = pd.to_numeric(df['Age'], errors='coerce')
df['Salary'] = pd.to_numeric(df['Salary'], errors='coerce')
Now, fill in the missing values with the median of each column.
median_age = df['Age'].median()
df['Age'] = df['Age'].fillna(median_age)
median_salary = df['Salary'].median()
df['Salary'] = df['Salary'].fillna(median_salary)
Handle the duplicate values. As discussed above, duplicates can either be dropped or merged; here, we will drop the exact duplicate rows from the data frame.
df.drop_duplicates(inplace=True)
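As a side note, `drop_duplicates` accepts parameters that change which rows count as duplicates and which copy survives. A small sketch with made-up data (not the frame from the example above):

```python
import pandas as pd

# Hypothetical frame where ID 2 appears twice
df = pd.DataFrame({"ID": [1, 2, 2], "Name": ["A", "B", "B"]})

# `subset` compares only the listed columns;
# `keep="last"` retains the last occurrence instead of the first
deduped = df.drop_duplicates(subset=["ID"], keep="last")
print(deduped)
```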
Let's look at the cleaned data set.
print("Cleaned Dataset without Inconsistencies and Duplicates:")
print(df)
Conclusion
While raw data fuels AI models, inconsistencies and duplicates can act like hidden landmines, leading to misleading results. This article equipped you with techniques to identify and disarm these data demons. Techniques like standardization for consistent formats and units, imputation for missing values, and duplicate removal ensure your model interprets the data accurately. Remember, clean data is the cornerstone of powerful AI. By investing in data cleaning upfront, you build a strong foundation for your models to deliver optimal results.