SQL: Remove duplicate rows

Tuesday, January 22nd, 2008

In the post SQL: Find duplicate rows I explained how you can find duplicate rows in a table. However, if you want to delete them, you run into a small problem: you have to delete all the duplicate rows except one. Lots of solutions are floating around on the Internet, but most either don't work at all or don't work with MySQL.
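
For reference, a find-duplicates query of that sort boils down to a GROUP BY with a HAVING clause; for the example table 'Foo' used below it would look roughly like this:

SELECT field1, field2, COUNT(*) AS cnt
  FROM Foo
  GROUP BY field1, field2
  HAVING COUNT(*) > 1;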

Temporary table

One solution that works with MySQL is to create a temporary table holding the data without duplicates, fill it using the SELECT described in the post mentioned above (without the HAVING clause), and then drop the original table and recreate it from the temporary one.

Suppose we have a table ‘Foo’ with three columns:

Foo:
id int(8) auto_increment
field1 varchar(32)
field2 varchar(32)

‘field1’ and ‘field2’ don’t have a unique constraint, so they might contain duplicates. To remove these duplicates, we do:

-- Collect one row per unique (field1, field2) combination, keeping the lowest id.
CREATE TEMPORARY TABLE tmp
  SELECT MIN(id) AS id, field1, field2
  FROM Foo
  GROUP BY Foo.field1, Foo.field2;

-- Replace the original table with the de-duplicated copy.
DROP TABLE Foo;

CREATE TABLE Foo
  SELECT * FROM tmp;

This fills the 'tmp' table with the unique rows from table 'Foo', using the lowest 'id' value for each set of duplicates. So if we start with these rows:

id  field1  field2
------------------
1   hi      bye 
2   hey     see ya
3   hello   goodbye
4   hi      bye

We will end up with:

id  field1  field2
------------------
1   hi      bye 
2   hey     see ya
3   hello   goodbye
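
Note that CREATE TABLE ... SELECT does not copy the indexes, the primary key or the auto_increment attribute from the old 'Foo'. If you need those back, something along these lines should do it (adjust the column definition to match your actual schema):

ALTER TABLE Foo
  MODIFY id int(8) NOT NULL AUTO_INCREMENT,
  ADD PRIMARY KEY (id);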

In place

The above solution works well, but for some reason I don’t like it. I don’t like making temporary tables. I want to do it with a single query, modifying the table in place. You might call it pigheaded, as the solution presented below doesn’t work as well on large datasets as the one above, but still, I found a way.

Another way to remove duplicates in MySQL is by using the following query:

DELETE
  Foo
FROM
  Foo, Foo t2
WHERE
  Foo.field1 = t2.field1 AND
  Foo.field2 = t2.field2 AND
  Foo.id < t2.id;

There we go. Explanation: this joins 'Foo' against itself (a Cartesian join), keeping only the combinations of rows where the two fields 'field1' and 'field2' are the same. So if we start with this data:

1 A B  <- duplicate
2 A B  <- duplicate
3 A C

we get the Cartesian product (all possible combinations) for the 'id' column for every unique set of the 'field1' and 'field2' columns:

1 A B 1 A B
1 A B 2 A B
2 A B 1 A B
2 A B 2 A B
3 A C 3 A C

Next we say 'AND Foo.id < t2.id'. This leaves us with the following row, which is the only one where the left id is smaller than the right id:

1 A B 2 A B

It then DELETEs that row from Foo, that is, the left-hand row '1 A B' (remember that 'Foo' and 't2' are the same table, so each line above is really a combination of two rows from 'Foo'), leaving:

2 A B
3 A C

There we go. As written, this keeps the row with the largest 'id' of each set of duplicates. If you want to keep the smallest 'id' for each unique set of 'field1' and 'field2' instead, simply use 'AND Foo.id > t2.id'. It works the same way.
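
Spelled out in full, that variant (which keeps the lowest 'id', matching the temporary table solution above) looks like this:

DELETE
  Foo
FROM
  Foo, Foo t2
WHERE
  Foo.field1 = t2.field1 AND
  Foo.field2 = t2.field2 AND
  Foo.id > t2.id;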

Remember that this is incredibly slow for large tables due to the Cartesian join, unless you've got indexes on 'field1' and 'field2' and don't have a lot of duplicates. Otherwise, use the temporary table solution above.
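
If the table doesn't have such an index yet, a combined index on the two columns is one way to add it (the index name here is just an example):

ALTER TABLE Foo
  ADD INDEX idx_field1_field2 (field1, field2);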
