Problem Statement
We are sometimes in the situation where our database table neither has a primary key nor is unique and the table contains duplicate data. Now we want to delete this duplicate data from this database table.
Example
Suppose I have a table named #DeleteData. The definition of the table and a test data script is given below.
CREATE TABLE #DeleteData
(
Id INT,
Name VARCHAR(20)
)
INSERT INTO #DeleteData VALUES (1,'Jignesh')
INSERT INTO #DeleteData VALUES (1,'Jignesh')
INSERT INTO #DeleteData VALUES (1,'Jignesh')
INSERT INTO #DeleteData VALUES (2,'Tejas')
INSERT INTO #DeleteData VALUES (2,'Tejas')
INSERT INTO #DeleteData VALUES (3,'Rakesh')
This table contains duplicate data.
Now I want to delete duplicate data from the database table.
Solution
There are many ways to delete the duplicate data from the database table. In this article I will explain it one by one.
1. Using CTE (Common Table Expression)
Using a Common Table Expression, we can remove the duplicate record. We can generate CTE with one additional column that says the Row Number is nothing but a serial number that is partitioned by the id. After creating the CTE we can fire a DELETE statement with a row number greater than 1.
;WITH duplicateData AS
(
SELECT id,name, ROW_NUMBER() OVER(PARTITION BY id ORDER BY id) AS rowno FROM #DeleteData
)
DELETE duplicateData WHERE rowno >1
2. Using Temporary Table
This is the very simplest way to delete a duplicate record from the table. In this method, I am inserting distinct data into the temporary table, truncating the original table and re-inserting the data into the original table from the temporary table.
SELECT * INTO #temp FROM #DeleteData
TRUNCATE TABLE #DeleteData
INSERT INTO #DeleteData
SELECT ID, NAME FROM #temp
GROUP BY ID, NAME
3. By Adding Identity Column
In this method, I temporarily added an identity column in the database table. This will generate a sequential number for each row of the database table. Now I keep the lowest row number for each duplicate record and delete all other records.
ALTER TABLE #DeleteData ADD rowNo int identity(1,1)
DELETE #DeleteData
WHERE rowno not in(select min(rowno) from #DeleteData group by id,Name)
ALTER TABLE #DeleteData DROP COLUMN rowNo
4. Using Cursor
The cursor is used to manipulate data in a set on a row-by-row basis. Here the basic idea is to take a unique record from the table, keep it and delete all other records. Using the Group by clause, we can find a unique row of the table and we can perform the loop on this unique record using a cursor. Within the loop we can keep the first and delete all other records.
DECLARE @id int, @Name VARCHAR(20), @Count INT
DECLARE DuplicateCursor CURSOR
FOR SELECT id, Name, Count(1) as c FROM #DeleteData GROUP BY id, Name
OPEN DuplicateCursor
FETCH NEXT FROM DuplicateCursor INTO @id, @Name, @Count
WHILE @@FETCH_STATUS = 0
BEGIN
DELETE TOP (@Count-1) FROM #DeleteData
WHERE id = @id AND Name = @Name
FETCH NEXT FROM DuplicateCursor INTO @id, @Name, @Count
END
5. Using %%physloc%%
%%physloc%% is an undocumented and unsupported feature in SQL Server 2008. This is the same as
%%lockres%% in SQL Server 2005. This is a mechanism for identifying the physical address of a row.
Here the basic idea is to keep the smallest physical address row and delete all other rows.
DELETE #DeleteData
WHERE #DeleteData.%%physloc%%
NOT IN (SELECT MIN(b.%%physloc%%)
FROM #DeleteData b
GROUP BY b.Id, b.Name);
Final output
Conclusion
We can remove duplicate data (rows) from SQL Server using the above described methods.