SQL: Remove duplicate rows
Tuesday, January 22nd, 2008
In the post SQL: Find duplicate rows I explained how you can find duplicate rows in a table. However, if you want to delete them, you’ll run into a little problem as you have to delete all the duplicate rows except for one. Lots of solutions are floating around on the Internet, but most either don’t work at all or don’t work with MySQL.
One solution that works with MySQL is to create a temporary table to hold the data without duplicates and using the SELECT described in the post mentioned above (without the HAVING clause) and then dropping the original table and recreating it.
Suppose we have a table ‘Foo’ with three columns:
Foo: id int(8) auto_increment field1 int(8) field2 int(8)
‘field1’ and ‘field2’ don’t have a unique constraint, so they might contain duplicates. To remove these duplicates, we do:
CREATE TEMPORARY TABLE tmp SELECT min(id) as min, field1, field2 FROM Foo GROUP BY Foo.field1, Foo.field2; DROP TABLE Foo; CREATE TABLE Foo SELECT * FROM tmp;
This will fill the ‘tmp’ table with unique rows from table ‘Foo’, and uses the lowest ‘id’ value for each row. So if we have two rows:
id field1 field2 ------------------ 1 hi bye 2 hey see ya 3 hello goodbye 4 hi bye
We will end up with:
id field1 field2 ------------------ 1 hi bye 2 hey see ya 3 hello goodbye
The above solution works well, but for some reason I don’t like it. I don’t like making temporary tables. I want to do it with a single query, modifying the table in place. You might call it pigheaded, as the solution presented below doesn’t work as well on large datasets as the one above, but still, I found a way.
Another way to remove duplicates in MySQL is by using the following query:
DELETE Foo FROM Foo, Foo t2 WHERE Foo.field1=t2.field1 AND Foo.field2=t2.field2 AND Foo.id < t2.id
There we go. Explanation: This first does a Cartesion Join on every row where the two fields 'field1' and 'field2' are the same. So if we start with this data:
1 A B <- duplicate 2 A B <- duplicate 3 A C
we get the Cartesian product (all possible combinations) for the 'id' column for every unique set of the 'field1' and 'field2' columns:
1 A B 1 A B 1 A B 2 A B 2 A B 1 A B 2 A B 2 A B 3 A C 3 A C
Next we say 'AND Foo.id < t2.id'. This will leave us with following row, which is the only one where the left id is smaller than the right id:
1 A B 2 A B
It then DELETE's that row from Foo (that is, the row '1 A B' - remember that the two rows with left id '2' are actually one and the same row), leaving:
2 A B 3 A C
There we go. If you want to keep the smallest 'id' for each unique set of 'field1' and 'field2', simple use 'AND Foo.id > t2.id'. It works the same way.
Remember that this is incredibly slow for large tables due to the Cartesion Join, unless you've got indexes on 'field1' and 'field2' and don't have a lot of duplicates. Otherwise, use the temporary table solution above.