Electricmonk

Ferry Boender

Programmer, DevOpper, Open Source enthusiast.

Blog

SQL: Find duplicate rows

Tuesday, July 24th, 2007

Amazingly enough, I’ve never had the need to find duplicate rows in a table before. But in this case, I somehow forgot to add a key to a particular table in a database and, naturally, duplicates where accidentally inserted. Of course you’ll want to put a key on the table, but first you’ll have to remove the duplicate rows. As is usually the case, it’s pretty easy using SQL.

Suppose you have the following data in a table:

firstname    lastname    Age
----------------------------
Jane         Doe         21
John         Doe         25
John         Smith       43
Jane         Doe         55
Jane         Smith       22
John         Doe         19
Jane         Doe         21

The superkey in the above table is (for example) <firstname, lastname>. To find duplicates for this superkey, you can use the following query:

SELECT 
	COUNT(*), 
	firstname, 
	lastname
FROM Address
GROUP BY firstname, lastname
HAVING COUNT(*) > 1

This will give the following results:

COUNT(*)    firstname    lastname
---------------------------------
2           John         Doe
3           Jane         Doe

This means the combination ‘John’ + ‘Doe’ appears two times and ‘Jane’ + ‘Doe’ appears thrice. People with a little SQL experience will find this straightforward, but I’ll explain anyway:

We select a “COUNT(*)” in order to count all the rows, plus we select firstname and lastname fields so we can see which are actually the duplicates. We use the fields firstname and lastname on the GROUP BY clause because that’s our superkey. GROUP BY will tell aggregate functions (COUNT(), SUM(), etc) in the SELECT clause that it should count grouped by certain fields, not just all the results returned. Basically, it counts all the unique values of a concatenation of firstname + lastname. The HAVING COUNT(*) > 1 is purely to weed out stuff that’s not duplicate.

I’m not quite sure how to automatically delete duplicates while leaving one in place though ;-)

Update: I do now! Here’s how to delete duplicates in place: SQL: Remove duplicte rows

The text of all posts on this blog, unless specificly mentioned otherwise, are licensed under this license.