Finding Duplicates Among Multiple Columns in SQL

Hemant Srivastava
3y
195k
0
4

Article

Introduction

Here I show how to find duplicates and their frequency among multiple columns using the GROUP BY clause in SQL.

Suppose you have data in which you need to find the duplicates in a single column (for example to find common names in a list of names) or in multiple columns (for example to find all the persons who have the same name and age but reside at a different address). In these situations, we could do that easily by applying "GROUP BY" and "HAVING" clauses. Let us see how to use these SQL clauses.

Finding Duplicates in a single column

Let us consider we need to find the street numbers that are repeated in the preceding table. From the preceding table, we can see that street numbers: 604 and 538 are the repeated ones.

SELECT STREET_NUM
FROM ADDRESS_TABLE
GROUP BY STREET_NUM
HAVING COUNT(*) > 1

Output

Finding Duplicates combinations spanning multiple columns

Let us consider we need to find the addresses that are identical except only by their City. For this, we select the collection of such fields for which we seek the duplicates. In this case, we form a collection having all the address fields except City because we want identical addresses that differ only by their city field. So the collection will be.

"STREET_NUM,PREFIX,STREET_NAME,SUFFIX,STREET_TYPE,STATE,POSTAL_CODE,COUNTRY"

SELECT
      STREET_NUM,
      PREFIX,
      STREET_NAME,
      SUFFIX,
      STREET_TYPE,
      STATE,

POSTAL_CODE,

COUNTRY
FROM ADDRESS_TABLE

GROUP BY
      STREET_NUM,
      PREFIX,
      STREET_NAME,
      SUFFIX,
      STREET_TYPE,
      STATE,

POSTAL_CODE,

COUNTRY
HAVING COUNT(*) > 1

The above query will give the output.

Note

The fields in the SELECT clause must be matched with fields in the GROUP BY clause otherwise we will get an error like "Column 'ADDRESS_TABLE.CITY' is invalid in the select list because it is not contained in either an aggregate function or the GROUP BY clause." for the following SQL query.

SELECT STREET_NUM, PREFIX,STREET_NAME,SUFFIX,STREET_TYPE,CITY,STATE, POSTAL_CODE,COUNTRY

FROM ADDRESS_TABLE
GROUP BY STREET_NUM, PREFIX,STREET_NAME,SUFFIX,STREET_TYPE,STATE, POSTAL_CODE,COUNTRY

HAVING COUNT(*) > 1

Finding Duplicates combinations spanning in multiple columns and their frequency Let us consider we need to find the addresses that are identical except only by their City and as well as their frequency. In the following query, we just add COUNT(*) which gives the count of the group of columns that we put in the GROUP BY clause.

SELECT COUNT(*)AS DUPLICATES_COUNT,

      STREET_NUM,
      PREFIX,
      STREET_NAME,
      SUFFIX,
      STREET_TYPE,
      STATE,

POSTAL_CODE,

COUNTRY
FROM ADDRESS_TABLE

GROUP BY

      STREET_NUM,
      PREFIX,
      STREET_NAME,
      SUFFIX,
      STREET_TYPE,
      STATE,

POSTAL_CODE,

COUNTRY
HAVING COUNT(*) > 1

The above query will give the output.

So, we summarize the following points

We can find the occurrence of one/many fields in the data if we group them by the GROUP BY clause and qualify them by the HAVING clause further.
The fields in the SELECT clause must be matched with fields in the GROUP BY clause.

Conclusion

In this article, we learned about how to find Duplicates Among Multiple Columns in SQL.