Comment your MySQL schema
Many people may not now, but you can comment your MySQL schema:

Many people may not now, but you can comment your MySQL schema:
Maatkit is a suite of command-line tools for MySQL. It contains some rather nifty things for query analyses, replication, and other stuff. Some of the more interesting highlights:
Found via databasejournal.com, which has two articles on Maatkit:
The Wonders of Maatkit for MySQL and
Even more Maatkit for MySQL.
I finally found a decent replacement for the MySQLcc database browser:
SQuirreL SQL Client is a graphical Java program that will allow you to view the structure of a JDBC compliant database, browse the data in tables, issue SQL commands etc
It's Java, so it's slow, but it does everything I want, and more:
It has tons of options you can tweak, and it's got plugins if you want to extend it. It supports just about every relational (and some non-relational) database out there.
Awesome.
SQL's aggregate functions (COUNT, MIN, MAX, etc) in combination with GROUP BY is great for generating statistics from a database. It's quite easy to retrieve a count of the rows in a database grouped by week number or month. When there are no data points in the table for a particular week or month, however, you will see gaps in the statistics. Take this example:
We create a table that will have an entry for every download done from a site. Also stored is whether the download was done by a registered user or not:
CREATE TABLE downloads ( id integer primary key auto_increment, date datetime, registered BOOLEAN ); INSERT INTO downloads VALUES (NULL, '2009-01-01', FALSE); INSERT INTO downloads VALUES (NULL, '2009-01-01', TRUE); INSERT INTO downloads VALUES (NULL, '2009-01-01', FALSE); INSERT INTO downloads VALUES (NULL, '2009-01-02', FALSE); INSERT INTO downloads VALUES (NULL, '2009-01-02', FALSE); INSERT INTO downloads VALUES (NULL, '2009-01-03', TRUE); INSERT INTO downloads VALUES (NULL, '2009-01-05', FALSE); INSERT INTO downloads VALUES (NULL, '2009-01-05', FALSE);
The table data shows us there were three downloads on the first of January, two on the second, one on the third, etc.
Now we can gather the total number of downloads per day with the following aggregation query:
SELECT DATE(date) AS day, COUNT(id) AS downloads FROM downloads GROUP BY DAY(date); +------------+-----------+ | day | downloads | +------------+-----------+ | 2009-01-01 | 3 | | 2009-01-02 | 2 | | 2009-01-03 | 1 | | 2009-01-05 | 2 | +------------+-----------+
As you can see from the results, there are no downloads for the fourth of January, and this gap is reflected in the aggregated result. So if you wanted to use this data to generate a chart, you'd have to fill in the gaps somehow. Doing this using a script isn't that hard, but that has the disadvantage of having create a new script (not to mention a new data table) which then needs to run periodically. It can also quickly become a pain in the ass when data also needs to be presented grouped by week, month, year, etc.
A simple solution to this is to create a filler table that contains zero-valued data (actually no data at all) for every data point you'd want to show. In our case we'll need a table with an entry for every day that our downloads span:
CREATE TABLE dates (
date datetime
);
INSERT INTO dates VALUES ('2009-01-01');
INSERT INTO dates VALUES ('2009-01-02');
INSERT INTO dates VALUES ('2009-01-03');
INSERT INTO dates VALUES ('2009-01-04');
INSERT INTO dates VALUES ('2009-01-05');
[etcetera]
INSERT INTO dates VALUES ('2009-01-14');
INSERT INTO dates VALUES ('2009-01-15');
We can now perform a LEFT JOIN of the downloads table on our dates table, and we can be sure that no dates will be missing. Any dates for which there is no entry in the downloads table will get a row filled with NULLs. We can filter the dates we which to see by putting a WHERE clause on our dates table.
As you can see, the gaps are nicely filled with 0 values now.
Care must be taken when filtering data from our table with the actual table. Suppose we want to see the number of downloads per day, but only those who were downloaded by registered persons. Normally, we'd do:
SELECT DATE(date) AS day, COUNT(id) AS downloads FROM downloads WHERE downloads.registered = 0 AND downloads.date BETWEEN '2009-01-01' AND '2009-01-06' GROUP BY DAY(date) ; +------------+-----------+ | day | downloads | +------------+-----------+ | 2009-01-01 | 2 | | 2009-01-02 | 2 | | 2009-01-05 | 2 | +------------+-----------+But with our new filler table, we can't do that, or the empty days will get filtered out again. So instead, we must put the filters not in the WHERE clause, but in the LEFT JOIN clause like so:
SELECT DATE(dates.date) AS day, COUNT(downloads.id) AS downloads FROM dates LEFT JOIN downloads ON downloads.date = dates.date AND downloads.registered = 0 WHERE dates.date BETWEEN '2009-01-01' AND '2009-01-06' GROUP BY DAY(dates.date) +------------+-----------+ | day | downloads | +------------+-----------+ | 2009-01-01 | 2 | | 2009-01-02 | 2 | | 2009-01-03 | 0 | | 2009-01-04 | 0 | | 2009-01-05 | 2 | | 2009-01-06 | 0 | +------------+-----------+And presto! We have gap-fillers once again. Now it's easy to change the way our data is grouped. For example, if you want to group by week instead of day:
SELECT WEEK(dates.date, 3) AS week, COUNT(downloads.id) AS downloads FROM dates LEFT JOIN downloads ON downloads.date = dates.date AND downloads.registered = 0 WHERE WEEK(dates.date, 3) BETWEEN 1 AND 3 GROUP BY WEEK(dates.date, 3) +--------+-----------+ | week | downloads | +--------+-----------+ | 1 | 4 | | 2 | 2 | | 3 | 0 | +--------+-----------+(Note: week 1 ends on Saturday January 3)
UPDATE:: Make sure you always perform the COUNT() on a field from the actual data (the table you're LEFT JOINing), or you will get counts of 1 for data that actually has no rows!
The other day I had to gather some statistics from a database, and the statistics needed to be grouped and filtered by weeknumbers. I ran into something a bit unexpected.
Let's say we have the follow table definition:
CREATE TABLE test ( start DATETIME NOT NULL );
We insert the following dates for testing purposes:
INSERT INTO test VALUES ('2008-12-27');
INSERT INTO test VALUES ('2008-12-28');
INSERT INTO test VALUES ('2008-12-29');
INSERT INTO test VALUES ('2008-12-30');
INSERT INTO test VALUES ('2008-12-31');
INSERT INTO test VALUES ('2009-01-01');
INSERT INTO test VALUES ('2009-01-02');
INSERT INTO test VALUES ('2009-01-03');
INSERT INTO test VALUES ('2009-01-04');
INSERT INTO test VALUES ('2009-01-05');
Those dates span the last week of last year, and the first week of the new year. Now, let's see what happens when we select the weeknumber from this data using MySQL's WEEK() function:
mysql> SELECT start, WEEK(start) FROM test; +---------------------+-------------+ | start | WEEK(start) | +---------------------+-------------+ | 2008-12-27 00:00:00 | 51 | | 2008-12-28 00:00:00 | 52 | | 2008-12-29 00:00:00 | 52 | | 2008-12-30 00:00:00 | 52 | | 2008-12-31 00:00:00 | 52 | | 2009-01-01 00:00:00 | 0 | | 2009-01-02 00:00:00 | 0 | | 2009-01-03 00:00:00 | 0 | | 2009-01-04 00:00:00 | 1 | | 2009-01-05 00:00:00 | 1 | +---------------------+-------------+ 10 rows in set (0.00 sec)
As you can see, we get four different weeks for a timespam of only ten days! Apparently, MySQL counts the first days of the year that do not belong to week 1 as week 0. This was certainly not what I expected, as I'm used to calendars that display the last days of the previous year and the first days of the new year as week 1.
MySQL's default WEEK() function could have caused serious data skew for me in this case. Fortunately, using the WEEK() function was too slow (as it had to calculate the weeknumber for each row in the result due to a WHERE WEEK(column) BETWEEN x AND y clause in my query), so we calculated the weeknumber using Unix timestamps ourselves. That's when we found the error. When using the WHERE clause mentioned, and the above data, we could have gotten:
mysql> SELECT * FROM test WHERE WEEK(start) BETWEEN 1 AND 52; +---------------------+ | start | +---------------------+ | 2008-12-27 00:00:00 | | 2008-12-28 00:00:00 | | 2008-12-29 00:00:00 | | 2008-12-30 00:00:00 | | 2008-12-31 00:00:00 | | 2009-01-04 00:00:00 | | 2009-01-05 00:00:00 | +---------------------+ 7 rows in set (0.00 sec)
Only 7 rows are returned, while we should have gotten 10.
It turns out that MySQL's WEEK() function can operate in eight different ways. From the manual:
WEEK(date[,mode])
This function returns the week number for date. The two-argument form of WEEK() allows you to specify whether the week starts on Sunday or Monday and whether the return value should be in the range from 0 to 53 or from 1 to 53. If the mode argument is omitted, the value of the default_week_format system variable is used.
The following table describes how the mode argument works.
+--------------------------------------------------------------------+ | Mode | First day of week | Range | Week 1 is the first week ... | |------+-------------------+-------+---------------------------------| | 0 | Sunday | 0-53 | with a Sunday in this year | |------+-------------------+-------+---------------------------------| | 1 | Monday | 0-53 | with more than 3 days this year | |------+-------------------+-------+---------------------------------| | 2 | Sunday | 1-53 | with a Sunday in this year | |------+-------------------+-------+---------------------------------| | 3 | Monday | 1-53 | with more than 3 days this year | |------+-------------------+-------+---------------------------------| | 4 | Sunday | 0-53 | with more than 3 days this year | |------+-------------------+-------+---------------------------------| | 5 | Monday | 0-53 | with a Monday in this year | |------+-------------------+-------+---------------------------------| | 6 | Sunday | 1-53 | with more than 3 days this year | |------+-------------------+-------+---------------------------------| | 7 | Monday | 1-53 | with a Monday in this year | +--------------------------------------------------------------------+
So what we really wanted was the WEEK(column, 3); mode:
mysql> SELECT start, WEEK(start, 3) FROM test; +---------------------+----------------+ | start | WEEK(start, 3) | +---------------------+----------------+ | 2008-12-27 00:00:00 | 52 | | 2008-12-28 00:00:00 | 52 | | 2008-12-29 00:00:00 | 1 | | 2008-12-30 00:00:00 | 1 | | 2008-12-31 00:00:00 | 1 | | 2009-01-01 00:00:00 | 1 | | 2009-01-02 00:00:00 | 1 | | 2009-01-03 00:00:00 | 1 | | 2009-01-04 00:00:00 | 1 | | 2009-01-05 00:00:00 | 2 | +---------------------+----------------+ 10 rows in set (0.00 sec)
So, take care when using MySQL's WEEK() function, and always make sure to read the manual on all the functions you use at least once.
I've written a tool called MyQryReplayer:
MyQryReplayer is a tool which can read the MySQL query log and replay an entire session's worth of queries against a database (SELECT queries only by default). While doing so, it records the time each query took to run, and any queries that failed including their error messages. MyQryReplayer can be used to inspect query performance, and to check a log of queries against a database for possible errors (when upgrading to a new version of MySQL for example).
Get version 0.1 here.
In the post SQL: Find duplicate rows I explained how you can find duplicate rows in a table. However, if you want to delete them, you'll run into a little problem as you have to delete all the duplicate rows except for one. Lots of solutions are floating around on the Internet, but most either don't work at all or don't work with MySQL.
Temporary table
One solution that works with MySQL is to create a temporary table to hold the data without duplicates and using the SELECT described in the post mentioned above (without the HAVING clause) and then dropping the original table and recreating it.
Suppose we have a table 'Foo' with three columns:
Foo: id int(8) auto_increment field1 int(8) field2 int(8)
'field1' and 'field2' don't have a unique constraint, so they might contain duplicates. To remove these duplicates, we do:
CREATE TEMPORARY TABLE tmp SELECT min(id) as min, field1, field2 FROM Foo GROUP BY Foo.field1, Foo.field2; DROP TABLE Foo; CREATE TABLE Foo SELECT * FROM tmp;
This will fill the 'tmp' table with unique rows from table 'Foo', and uses the lowest 'id' value for each row. So if we have two rows:
id field1 field2 ------------------ 1 hi bye 2 hey see ya 3 hello goodbye 4 hi bye
We will end up with:
id field1 field2 ------------------ 1 hi bye 2 hey see ya 3 hello goodbye
In place
The above solution works well, but for some reason I don't like it. I don't like making temporary tables. I want to do it with a single query, modifying the table in place. You might call it pigheaded, as the solution presented below doesn't work as well on large datasets as the one above, but still, I found a way.
Another way to remove duplicates in MySQL is by using the following query:
DELETE Foo FROM Foo, Foo t2 WHERE Foo.field1=t2.field1 AND Foo.field2=t2.field2 AND Foo.id < t2.id
There we go. Explanation: This first does a Cartesion Join on every row where the two fields 'field1' and 'field2' are the same. So if we start with this data:
1 A B <- duplicate 2 A B <- duplicate 3 A C
we get the Cartesian product (all possible combinations) for the 'id' column for every unique set of the 'field1' and 'field2' columns:
1 A B 1 A B 1 A B 2 A B 2 A B 1 A B 2 A B 2 A B 3 A C 3 A C
Next we say 'AND Foo.id < t2.id'. This will leave us with following row, which is the only one where the left id is smaller than the right id:
1 A B 2 A B
It then DELETE's that row from Foo (that is, the row '1 A B' - remember that the two rows with left id '2' are actually one and the same row), leaving:
2 A B 3 A C
There we go. If you want to keep the smallest 'id' for each unique set of 'field1' and 'field2', simple use 'AND Foo.id > t2.id'. It works the same way.
Remember that this is incredibly slow for large tables due to the Cartesion Join, unless you've got indexes on 'field1' and 'field2' and don't have a lot of duplicates. Otherwise, use the temporary table solution above.
Something every PHP developer should be reading:
The Unexpected SQL Injection – When Escaping Is Not Enough
The conclusions:
Like my rule #1 of what I like to call Defensive Coding: Don't be implicit, be explicit. In other words, don't try to escape things you don't want in your strings, simply only leave everything you do want in your strings. A column name in a ORDER BY clause should only consist of A-Z, a-z and 0-9. Anything else in the string invalidates that string.
Today, I was quite mystified by a very slow running query. There was a table named 'bar' with about 3000 rows. I wanted to list all the rows that had a duplicate value for a certain field ('foo'), and only those rows. The solution was to build a query that selected the rows where the value of the 'foo' field was in the results of a subquery that selected 'foo' for duplicate values of foo. The query finally looked something like this:
SELECT foo FROM bar WHERE foo IN ( SELECT foo FROM bar GROUP BY foo HAVING COUNT(foo) > 1 ) |
The inner query (explained in this post) was very fast, and returned only two rows. The outer query, when I ran it like this:
SELECT foo FROM bar WHERE foo IN (1, 2); |
also ran very fast. However the combination of the two was extremely slow. I thought this was weird, since there were only two results in the inner query. A colleague of mine and me took a look at the EXPLAIN of the query, and found out it was actually doing a full join of 3000×3000 rows. The use of HAVING threw me off because it appeared in the inner join. But HAVING is always applied very late in the execution process, just before the results are sent to the client. This means MySQL doesn't even look at the HAVING to optimize queries. From the manual:
The HAVING clause is applied nearly last, just before items are sent to the client, with no optimization.
Putting an index on the 'foo' column solved the speed problem, though it's still not as fast as it could be because it's still doing a JOIN on 'foo' with itself, only this time only doing 2×3000 rows.
I've been using Vim for years now, but there's still new stuff to learn. Check out this page for the PDF version of the slides of a talk given by Andrei Zmievski on editing PHP with Vim. His configuration files are also available.
Here's my favourite list of tips:
Add the following text to your ~/.vim/ftplugin/php.vim file:
set formatoptions+=tcqlro let php_sql_query=1 let php_htmlInStrings=1 let php_folding = 1
This will: