Data cleansing is a vital step in the data preparation process to ensure that the data used for analysis, reporting, or machine learning is accurate, consistent, and reliable. Poor-quality data can lead to incorrect conclusions, flawed models, and inefficient decision-making. This article explores common data cleansing techniques in SQL Server with examples derived from the AdventureWorks2022 database.
1. Removing Duplicates
Duplicates in datasets can skew results, inflate counts, or cause redundancy. SQL Server provides ways to filter duplicates from query results using DISTINCT or GROUP BY, and to delete them permanently using window functions such as ROW_NUMBER().
Example. Identify and remove duplicate email addresses in the Person.EmailAddress table. The first query below lists each distinct address; the CTE (Common Table Expression) with ROW_NUMBER() that follows removes the duplicate rows permanently.
USE AdventureWorks2022;
GO

-- List each email address once, no matter how many rows contain it.
SELECT DISTINCT
    EmailAddress
FROM
    Person.EmailAddress;
-- Number the rows within each group of identical addresses, then delete
-- everything after the first; deleting through the CTE removes the
-- underlying rows from Person.EmailAddress.
WITH CTE AS (
    SELECT
        EmailAddressID,
        EmailAddress,
        ROW_NUMBER() OVER (PARTITION BY EmailAddress ORDER BY EmailAddressID) AS RowNum
    FROM
        Person.EmailAddress
)
DELETE FROM CTE
WHERE RowNum > 1;
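GROUP BY with HAVING, the other technique mentioned above, offers a quick audit of which addresses are duplicated and how often before anything is deleted; a minimal sketch:

-- Audit: list each duplicated address and its number of occurrences.
SELECT
    EmailAddress,
    COUNT(*) AS Occurrences
FROM
    Person.EmailAddress
GROUP BY
    EmailAddress
HAVING
    COUNT(*) > 1;

Running this before and after the DELETE is a simple way to confirm that only the redundant rows were removed.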
2. Handling Missing Values
Missing values can impact analysis and decision-making. Use the IS NULL predicate together with expressions such as COALESCE and CASE to identify and handle missing data.
Example. Replace missing PhoneNumber values in the Person.PersonPhone table with a default value, or remove the rows that lack one.
USE AdventureWorks2022;
GO

-- Substitute a placeholder wherever PhoneNumber is NULL.
SELECT
    BusinessEntityID,
    PhoneNumber,
    COALESCE(PhoneNumber, 'Unknown') AS CleanedPhoneNumber
FROM
    Person.PersonPhone;

-- Alternatively, remove the rows with missing phone numbers outright.
DELETE FROM
    Person.PersonPhone
WHERE
    PhoneNumber IS NULL;
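A CASE expression, also mentioned above, is useful when you want to flag missing values rather than replace or delete them; a minimal sketch, with the 'Missing' and 'Present' labels chosen purely for illustration:

-- Classify each row by whether its phone number is present.
SELECT
    BusinessEntityID,
    PhoneNumber,
    CASE
        WHEN PhoneNumber IS NULL THEN 'Missing'
        ELSE 'Present'
    END AS PhoneNumberStatus
FROM
    Person.PersonPhone;

This keeps every row available for review, which is often preferable to deleting data during exploratory cleanup.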
3. Correcting Data Inconsistencies
Inconsistencies like variations in case or formatting can lead to errors in joins or grouping. SQL Server string functions like UPPER, LOWER, or REPLACE can help standardize data.
Example. Standardize FirstName values in the Person.Person table to uppercase. Replace incorrect substrings in email addresses.
USE AdventureWorks2022;
GO

-- Force a single case so comparisons and grouping on FirstName are consistent.
UPDATE Person.Person
SET FirstName = UPPER(FirstName);

-- Standardize the email domain across all addresses.
UPDATE Person.EmailAddress
SET EmailAddress = REPLACE(EmailAddress, '@adventure-works.com', '@adventureworks.com');
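Since these UPDATE statements modify data in place, it is worth previewing the rows they would touch before running them; a minimal sketch for the domain replacement:

-- Show current and proposed values side by side before committing the UPDATE.
SELECT
    EmailAddressID,
    EmailAddress AS CurrentValue,
    REPLACE(EmailAddress, '@adventure-works.com', '@adventureworks.com') AS ProposedValue
FROM
    Person.EmailAddress
WHERE
    EmailAddress LIKE '%@adventure-works.com';

Wrapping the UPDATE in an explicit transaction and checking @@ROWCOUNT before committing adds a further safety net.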
4. Standardizing Data Formats
Standardized data formats ensure consistency and compatibility across systems. Functions like CAST and CONVERT are often used for this purpose.
Example. Convert ModifiedDate in the Sales.SalesOrderHeader table to a specific format. Here, style 101 renders the date as MM/DD/YYYY.
USE AdventureWorks2022;
GO

-- Style 101 formats the datetime as MM/DD/YYYY.
SELECT
    SalesOrderID,
    CONVERT(VARCHAR(10), ModifiedDate, 101) AS FormattedDate
FROM
    Sales.SalesOrderHeader;
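CAST, the other function mentioned above, is handy when you only need to strip the time portion, and FORMAT (available since SQL Server 2012) offers .NET-style format strings as an alternative; a minimal sketch of both:

-- CAST truncates the time portion; FORMAT applies a format string.
SELECT
    SalesOrderID,
    CAST(ModifiedDate AS DATE) AS DateOnly,
    FORMAT(ModifiedDate, 'MM/dd/yyyy') AS FormattedDate
FROM
    Sales.SalesOrderHeader;

FORMAT is convenient for reports but noticeably more CPU-intensive than CONVERT on large result sets.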
5. Removing Outliers
Outliers can distort statistical analyses and trends. Use statistical functions and filtering to identify and exclude them.
Example. Exclude orders whose TotalDue falls outside the 5th-to-95th percentile range in the Sales.SalesOrderHeader table.
USE AdventureWorks2022;
GO
-- PERCENTILE_CONT is a window function in SQL Server, so compute the
-- percentile bounds once in a CTE and filter against them.
WITH Percentiles AS (
    SELECT DISTINCT
        PERCENTILE_CONT(0.05) WITHIN GROUP (ORDER BY TotalDue) OVER () AS P05,
        PERCENTILE_CONT(0.95) WITHIN GROUP (ORDER BY TotalDue) OVER () AS P95
    FROM Sales.SalesOrderHeader
)
SELECT soh.*
FROM Sales.SalesOrderHeader AS soh
CROSS JOIN Percentiles AS p
WHERE soh.TotalDue BETWEEN p.P05 AND p.P95;
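Percentile trimming is one approach; another common heuristic, sketched below, flags values more than three standard deviations from the mean. The three-sigma threshold is an assumption to tune for your data:

-- Flag orders whose TotalDue lies more than three standard deviations
-- from the mean (the factor of 3 is a tunable assumption).
WITH Stats AS (
    SELECT
        AVG(TotalDue) AS MeanDue,
        STDEV(TotalDue) AS StdDue
    FROM Sales.SalesOrderHeader
)
SELECT soh.SalesOrderID, soh.TotalDue
FROM Sales.SalesOrderHeader AS soh
CROSS JOIN Stats AS s
WHERE ABS(soh.TotalDue - s.MeanDue) > 3 * s.StdDue;

Review flagged rows before deleting anything; extreme orders are sometimes legitimate.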
6. Validating Data
Data validation ensures data integrity by applying constraints or rules. SQL Server constraints like NOT NULL, UNIQUE, and CHECK are essential for enforcing data quality.
Example. Enforce data integrity when creating a new table.
USE AdventureWorks2022;
GO

CREATE TABLE Sales.Promotions (
    PromotionID INT PRIMARY KEY,
    PromotionName NVARCHAR(100) NOT NULL,
    -- Discounts are percentages, so keep them between 0 and 100.
    DiscountPercentage DECIMAL(5, 2) CHECK (DiscountPercentage BETWEEN 0 AND 100),
    StartDate DATE NOT NULL,
    EndDate DATE NOT NULL,
    -- A promotion must end after it starts.
    CONSTRAINT CK_Promotions_EndDate CHECK (EndDate > StartDate)
);
Alternatively, validate existing data using conditional queries.
-- Surface any existing rows that violate the intended discount range.
SELECT *
FROM Sales.Promotions
WHERE DiscountPercentage < 0
    OR DiscountPercentage > 100;
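To confirm the constraints actually fire, you can attempt a deliberately invalid insert; the values below are purely illustrative:

-- Fails with a CHECK constraint violation: 150.00 is outside the 0-100 range.
INSERT INTO Sales.Promotions (PromotionID, PromotionName, DiscountPercentage, StartDate, EndDate)
VALUES (1, N'Test Promotion', 150.00, '2024-01-01', '2024-06-30');

SQL Server rejects the row and reports the name of the violated constraint, which is one reason well-named constraints are easier to debug.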
Conclusion
Data cleansing is an ongoing process and a crucial component of the data lifecycle. By removing duplicates, handling missing values, correcting inconsistencies, standardizing formats, removing outliers, and validating data, you can significantly improve the quality of your data. These techniques, demonstrated here with the AdventureWorks2022 database, apply equally to real-world datasets. Incorporated into your regular data workflows, they give your analysis, reports, and machine learning models a solid foundation of clean data and accurate, actionable insights.