Data Cleansing in SQL Server

Data cleansing is a vital step in the data preparation process to ensure that the data used for analysis, reporting, or machine learning is accurate, consistent, and reliable. Poor-quality data can lead to incorrect conclusions, flawed models, and inefficient decision-making. This article explores common data cleansing techniques in SQL Server with examples derived from the AdventureWorks2022 database.

1. Removing Duplicates

Duplicates in datasets can skew results, inflate counts, and add redundancy. SQL Server can de-duplicate query results with DISTINCT or GROUP BY, and delete duplicate rows from a table with ROW_NUMBER() inside a CTE.

Example. Identify and remove duplicate rows in the Person.EmailAddress table.

If you want to remove duplicates permanently, you can use a CTE (Common Table Expression) with ROW_NUMBER().

USE AdventureWorks2022;
GO

SELECT 
    EmailAddress, 
    COUNT(*) AS Occurrences
FROM 
    Person.EmailAddress
GROUP BY 
    EmailAddress
HAVING 
    COUNT(*) > 1;

WITH CTE AS (
    SELECT 
        EmailAddressID, 
        EmailAddress,
        ROW_NUMBER() OVER (PARTITION BY EmailAddress ORDER BY EmailAddressID) AS RowNum
    FROM 
        Person.EmailAddress
)
DELETE FROM CTE 
WHERE RowNum > 1;
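Before running the DELETE, it is worth previewing exactly which rows will be removed; the same ROW_NUMBER() logic can feed a read-only SELECT instead:

WITH CTE AS (
    SELECT 
        EmailAddressID, 
        EmailAddress,
        ROW_NUMBER() OVER (PARTITION BY EmailAddress ORDER BY EmailAddressID) AS RowNum
    FROM 
        Person.EmailAddress
)
SELECT *
FROM CTE
WHERE RowNum > 1;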

2. Handling Missing Values

Missing values can impact analysis and decision-making. Use SQL functions like IS NULL, COALESCE, or CASE to identify and handle missing data.

Example. Replace missing PhoneNumber values in the Person.PersonPhone table with a default value, or remove rows with missing values.

USE AdventureWorks2022;
GO

SELECT 
    BusinessEntityID, 
    PhoneNumber,
    COALESCE(PhoneNumber, 'Unknown') AS CleanedPhoneNumber
FROM 
    Person.PersonPhone;

DELETE FROM 
    Person.PersonPhone
WHERE 
    PhoneNumber IS NULL;
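A CASE expression, mentioned above but not shown, can classify rows by completeness instead of replacing or deleting them. A sketch (the empty-string check is an illustrative addition, since blank strings are a common form of missing data that IS NULL does not catch):

SELECT 
    BusinessEntityID, 
    PhoneNumber,
    CASE
        WHEN PhoneNumber IS NULL THEN 'Missing'
        WHEN LTRIM(RTRIM(PhoneNumber)) = '' THEN 'Blank'
        ELSE 'Present'
    END AS PhoneNumberStatus
FROM 
    Person.PersonPhone;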

3. Correcting Data Inconsistencies

Inconsistencies like variations in case or formatting can lead to errors in joins or grouping. SQL Server string functions like UPPER, LOWER, or REPLACE can help standardize data.

Example. Standardize FirstName values in the Person.Person table to uppercase. Replace incorrect substrings in email addresses.

USE AdventureWorks2022;
GO

UPDATE Person.Person
SET FirstName = UPPER(FirstName);

UPDATE Person.EmailAddress
SET EmailAddress = REPLACE(EmailAddress, '@adventure-works.com', '@adventureworks.com');
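LOWER, mentioned above, is often applied to email addresses, which are conventionally treated as case-insensitive; lowercasing them up front keeps joins and grouping consistent under a case-sensitive collation. A minimal sketch:

UPDATE Person.EmailAddress
SET EmailAddress = LOWER(EmailAddress);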

4. Standardizing Data Formats

Standardized data formats ensure consistency and compatibility across systems. Functions like CAST and CONVERT are often used for this purpose.

Example. Convert ModifiedDate in the Sales.SalesOrderHeader table to a specific format. Here, style code 101 formats the date as MM/DD/YYYY.

USE AdventureWorks2022;
GO

SELECT SalesOrderID, 
       CONVERT(VARCHAR(10), ModifiedDate, 101) AS FormattedDate
FROM Sales.SalesOrderHeader;
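CAST covers the common case where no style code is needed, and TRY_CONVERT (available since SQL Server 2012) returns NULL instead of raising an error when a value cannot be converted, which is useful when cleansing string columns that should hold dates. A sketch of both (the literal date strings are illustrative):

SELECT SalesOrderID, 
       CAST(ModifiedDate AS DATE) AS DateOnly  -- drops the time portion
FROM Sales.SalesOrderHeader;

SELECT TRY_CONVERT(DATE, '2024-02-30') AS InvalidDate,  -- returns NULL
       TRY_CONVERT(DATE, '2024-02-28') AS ValidDate;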

5. Removing Outliers

Outliers can distort statistical analyses and trends. Use statistical functions and filtering to identify and exclude them.

Example. Remove outliers based on TotalDue in the Sales.SalesOrderHeader table.

USE AdventureWorks2022;
GO

WITH Percentiles AS (
    SELECT DISTINCT
        PERCENTILE_CONT(0.05) WITHIN GROUP (ORDER BY TotalDue) OVER () AS P05,
        PERCENTILE_CONT(0.95) WITHIN GROUP (ORDER BY TotalDue) OVER () AS P95
    FROM Sales.SalesOrderHeader
)
SELECT soh.*
FROM Sales.SalesOrderHeader AS soh
CROSS JOIN Percentiles AS p
WHERE soh.TotalDue BETWEEN p.P05 AND p.P95;
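Where the data is roughly normal, a simpler alternative is the three-standard-deviations rule of thumb using AVG and STDEV; a sketch:

WITH Stats AS (
    SELECT AVG(TotalDue) AS MeanDue,
           STDEV(TotalDue) AS StdDue
    FROM Sales.SalesOrderHeader
)
SELECT soh.*
FROM Sales.SalesOrderHeader AS soh
CROSS JOIN Stats AS s
WHERE soh.TotalDue BETWEEN s.MeanDue - 3 * s.StdDue
                       AND s.MeanDue + 3 * s.StdDue;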

6. Validating Data

Data validation ensures data integrity by applying constraints or rules. SQL Server constraints like NOT NULL, UNIQUE, and CHECK are essential for enforcing data quality.

Example. Enforce data integrity when creating a new table.

USE AdventureWorks2022;
GO

CREATE TABLE Sales.Promotions (
    PromotionID INT PRIMARY KEY,
    PromotionName NVARCHAR(100) NOT NULL,
    DiscountPercentage DECIMAL(5, 2) CHECK (DiscountPercentage BETWEEN 0 AND 100),
    StartDate DATE NOT NULL,
    EndDate DATE NOT NULL,
    CONSTRAINT CK_Promotions_EndDate CHECK (EndDate > StartDate)
);

Alternatively, validate existing data using conditional queries.

SELECT * 
FROM Sales.Promotions
WHERE DiscountPercentage < 0 
   OR DiscountPercentage > 100;
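Constraints can also be retrofitted onto existing tables with ALTER TABLE; with the default WITH CHECK option, SQL Server validates existing rows as the constraint is created and fails if any violate it. A sketch, assuming a table created without the range check (the constraint name here is illustrative):

ALTER TABLE Sales.Promotions WITH CHECK
ADD CONSTRAINT CK_Promotions_DiscountRange
    CHECK (DiscountPercentage BETWEEN 0 AND 100);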

Conclusion

Data cleansing is an ongoing process and a crucial component of the data lifecycle. By removing duplicates, handling missing values, correcting inconsistencies, standardizing formats, removing outliers, and validating data, you can significantly improve data quality. The techniques demonstrated here with the AdventureWorks2022 database apply equally to real-world datasets, giving your analysis, reports, and machine learning models a solid foundation of clean data.

