contact
----------------------------

Blog <-

Archive for the ‘sql’ Category

RSS   RSS feed for this category

Comment your MySQL schema

Many people may not now, but you can comment your MySQL schema:

SQL: good comments conventions

Maatkit: Tools for MySQL

Maatkit is a suite of command-line tools for MySQL. It contains some rather nifty things for query analyses, replication, and other stuff. Some of the more interesting highlights:

Found via databasejournal.com, which has two articles on Maatkit:

The Wonders of Maatkit for MySQL and
Even more Maatkit for MySQL.

SQuirreL SQL database browser

I finally found a decent replacement for the MySQLcc database browser:

SQuirreL SQL

SQuirreL SQL Client is a graphical Java program that will allow you to view the structure of a JDBC compliant database, browse the data in tables, issue SQL commands etc

It's Java, so it's slow, but it does everything I want, and more:

  1. Syntax highlighting
  2. Multiple query tabs
  3. Multiple queries in the same tab (select the query and press ctrl-enter to run it)
  4. Export results

It has tons of options you can tweak, and it's got plugins if you want to extend it. It supports just about every relational (and some non-relational) database out there.

Awesome.

Filling gaps in data when using aggregates

SQL's aggregate functions (COUNT, MIN, MAX, etc) in combination with GROUP BY is great for generating statistics from a database. It's quite easy to retrieve a count of the rows in a database grouped by week number or month. When there are no data points in the table for a particular week or month, however, you will see gaps in the statistics. Take this example:

We create a table that will have an entry for every download done from a site. Also stored is whether the download was done by a registered user or not:

CREATE TABLE downloads (
  id integer primary key auto_increment,
  date datetime,
  registered BOOLEAN
);

INSERT INTO downloads VALUES (NULL, '2009-01-01', FALSE);
INSERT INTO downloads VALUES (NULL, '2009-01-01', TRUE);
INSERT INTO downloads VALUES (NULL, '2009-01-01', FALSE);
INSERT INTO downloads VALUES (NULL, '2009-01-02', FALSE);
INSERT INTO downloads VALUES (NULL, '2009-01-02', FALSE);
INSERT INTO downloads VALUES (NULL, '2009-01-03', TRUE);
INSERT INTO downloads VALUES (NULL, '2009-01-05', FALSE);
INSERT INTO downloads VALUES (NULL, '2009-01-05', FALSE);

The table data shows us there were three downloads on the first of January, two on the second, one on the third, etc.

Now we can gather the total number of downloads per day with the following aggregation query:

SELECT 
  DATE(date) AS day, 
  COUNT(id) AS downloads
FROM downloads 
GROUP BY DAY(date);

+------------+-----------+
| day        | downloads |
+------------+-----------+
| 2009-01-01 |         3 |
| 2009-01-02 |         2 |
| 2009-01-03 |         1 |
| 2009-01-05 |         2 |
+------------+-----------+

As you can see from the results, there are no downloads for the fourth of January, and this gap is reflected in the aggregated result. So if you wanted to use this data to generate a chart, you'd have to fill in the gaps somehow. Doing this using a script isn't that hard, but that has the disadvantage of having create a new script (not to mention a new data table) which then needs to run periodically. It can also quickly become a pain in the ass when data also needs to be presented grouped by week, month, year, etc.

A simple solution to this is to create a filler table that contains zero-valued data (actually no data at all) for every data point you'd want to show. In our case we'll need a table with an entry for every day that our downloads span:

CREATE TABLE dates (
  date datetime
);

INSERT INTO dates VALUES ('2009-01-01');
INSERT INTO dates VALUES ('2009-01-02');
INSERT INTO dates VALUES ('2009-01-03');
INSERT INTO dates VALUES ('2009-01-04');
INSERT INTO dates VALUES ('2009-01-05');
[etcetera]
INSERT INTO dates VALUES ('2009-01-14');
INSERT INTO dates VALUES ('2009-01-15');

We can now perform a LEFT JOIN of the downloads table on our dates table, and we can be sure that no dates will be missing. Any dates for which there is no entry in the downloads table will get a row filled with NULLs. We can filter the dates we which to see by putting a WHERE clause on our dates table.

As you can see, the gaps are nicely filled with 0 values now.

Care must be taken when filtering data from our table with the actual table. Suppose we want to see the number of downloads per day, but only those who were downloaded by registered persons. Normally, we'd do:

SELECT 
  DATE(date) AS day, 
  COUNT(id) AS downloads 
FROM downloads 
WHERE 
  downloads.registered = 0 AND
  downloads.date BETWEEN '2009-01-01' AND '2009-01-06'
GROUP BY DAY(date) ;

+------------+-----------+
| day        | downloads |
+------------+-----------+
| 2009-01-01 |         2 |
| 2009-01-02 |         2 |
| 2009-01-05 |         2 |
+------------+-----------+

But with our new filler table, we can't do that, or the empty days will get filtered out again. So instead, we must put the filters not in the WHERE clause, but in the LEFT JOIN clause like so:

SELECT
  DATE(dates.date) AS day,
  COUNT(downloads.id) AS downloads
FROM dates
LEFT JOIN downloads ON downloads.date = dates.date AND downloads.registered = 0
WHERE 
  dates.date BETWEEN '2009-01-01' AND '2009-01-06'
GROUP BY DAY(dates.date)

+------------+-----------+
| day        | downloads |
+------------+-----------+
| 2009-01-01 |         2 |
| 2009-01-02 |         2 |
| 2009-01-03 |         0 |
| 2009-01-04 |         0 |
| 2009-01-05 |         2 |
| 2009-01-06 |         0 |
+------------+-----------+

And presto! We have gap-fillers once again. Now it's easy to change the way our data is grouped. For example, if you want to group by week instead of day:

SELECT
  WEEK(dates.date, 3) AS week,
  COUNT(downloads.id) AS downloads
FROM dates
LEFT JOIN downloads ON downloads.date = dates.date AND downloads.registered = 0
WHERE WEEK(dates.date, 3) BETWEEN 1 AND 3
GROUP BY WEEK(dates.date, 3)

+--------+-----------+
| week   | downloads |
+--------+-----------+
|      1 |         4 |
|      2 |         2 |
|      3 |         0 |
+--------+-----------+

(Note: week 1 ends on Saturday January 3)

UPDATE:: Make sure you always perform the COUNT() on a field from the actual data (the table you're LEFT JOINing), or you will get counts of 1 for data that actually has no rows!

MySQL WEEK() function weirdness

The other day I had to gather some statistics from a database, and the statistics needed to be grouped and filtered by weeknumbers. I ran into something a bit unexpected.

Let's say we have the follow table definition:

CREATE TABLE test (
  start DATETIME NOT NULL
);

We insert the following dates for testing purposes:

INSERT INTO test VALUES ('2008-12-27');
INSERT INTO test VALUES ('2008-12-28');
INSERT INTO test VALUES ('2008-12-29');
INSERT INTO test VALUES ('2008-12-30');
INSERT INTO test VALUES ('2008-12-31');
INSERT INTO test VALUES ('2009-01-01');
INSERT INTO test VALUES ('2009-01-02');
INSERT INTO test VALUES ('2009-01-03');
INSERT INTO test VALUES ('2009-01-04');
INSERT INTO test VALUES ('2009-01-05');

Those dates span the last week of last year, and the first week of the new year. Now, let's see what happens when we select the weeknumber from this data using MySQL's WEEK() function:

mysql> SELECT start, WEEK(start) FROM test;
+---------------------+-------------+
| start               | WEEK(start) |
+---------------------+-------------+
| 2008-12-27 00:00:00 |          51 | 
| 2008-12-28 00:00:00 |          52 | 
| 2008-12-29 00:00:00 |          52 | 
| 2008-12-30 00:00:00 |          52 | 
| 2008-12-31 00:00:00 |          52 | 
| 2009-01-01 00:00:00 |           0 | 
| 2009-01-02 00:00:00 |           0 | 
| 2009-01-03 00:00:00 |           0 | 
| 2009-01-04 00:00:00 |           1 | 
| 2009-01-05 00:00:00 |           1 | 
+---------------------+-------------+
10 rows in set (0.00 sec)

As you can see, we get four different weeks for a timespam of only ten days! Apparently, MySQL counts the first days of the year that do not belong to week 1 as week 0. This was certainly not what I expected, as I'm used to calendars that display the last days of the previous year and the first days of the new year as week 1.

MySQL's default WEEK() function could have caused serious data skew for me in this case. Fortunately, using the WEEK() function was too slow (as it had to calculate the weeknumber for each row in the result due to a WHERE WEEK(column) BETWEEN x AND y clause in my query), so we calculated the weeknumber using Unix timestamps ourselves. That's when we found the error. When using the WHERE clause mentioned, and the above data, we could have gotten:

mysql> SELECT * FROM test WHERE WEEK(start) BETWEEN 1 AND 52;
+---------------------+
| start               |
+---------------------+
| 2008-12-27 00:00:00 | 
| 2008-12-28 00:00:00 | 
| 2008-12-29 00:00:00 | 
| 2008-12-30 00:00:00 | 
| 2008-12-31 00:00:00 | 
| 2009-01-04 00:00:00 | 
| 2009-01-05 00:00:00 | 
+---------------------+
7 rows in set (0.00 sec)

Only 7 rows are returned, while we should have gotten 10.

It turns out that MySQL's WEEK() function can operate in eight different ways. From the manual:

WEEK(date[,mode])

This function returns the week number for date. The two-argument form of WEEK() allows you to specify whether the week starts on Sunday or Monday and whether the return value should be in the range from 0 to 53 or from 1 to 53. If the mode argument is omitted, the value of the default_week_format system variable is used.

The following table describes how the mode argument works.

+--------------------------------------------------------------------+
| Mode | First day of week | Range | Week 1 is the first week ...    |
|------+-------------------+-------+---------------------------------|
| 0    | Sunday            | 0-53  | with a Sunday in this year      |
|------+-------------------+-------+---------------------------------|
| 1    | Monday            | 0-53  | with more than 3 days this year |
|------+-------------------+-------+---------------------------------|
| 2    | Sunday            | 1-53  | with a Sunday in this year      |
|------+-------------------+-------+---------------------------------|
| 3    | Monday            | 1-53  | with more than 3 days this year |
|------+-------------------+-------+---------------------------------|
| 4    | Sunday            | 0-53  | with more than 3 days this year |
|------+-------------------+-------+---------------------------------|
| 5    | Monday            | 0-53  | with a Monday in this year      |
|------+-------------------+-------+---------------------------------|
| 6    | Sunday            | 1-53  | with more than 3 days this year |
|------+-------------------+-------+---------------------------------|
| 7    | Monday            | 1-53  | with a Monday in this year      |
+--------------------------------------------------------------------+

So what we really wanted was the WEEK(column, 3); mode:

mysql> SELECT start, WEEK(start, 3) FROM test;
+---------------------+----------------+
| start               | WEEK(start, 3) |
+---------------------+----------------+
| 2008-12-27 00:00:00 |             52 | 
| 2008-12-28 00:00:00 |             52 | 
| 2008-12-29 00:00:00 |              1 | 
| 2008-12-30 00:00:00 |              1 | 
| 2008-12-31 00:00:00 |              1 | 
| 2009-01-01 00:00:00 |              1 | 
| 2009-01-02 00:00:00 |              1 | 
| 2009-01-03 00:00:00 |              1 | 
| 2009-01-04 00:00:00 |              1 | 
| 2009-01-05 00:00:00 |              2 | 
+---------------------+----------------+
10 rows in set (0.00 sec)

So, take care when using MySQL's WEEK() function, and always make sure to read the manual on all the functions you use at least once.

MyQryReplayer

I've written a tool called MyQryReplayer:

MyQryReplayer is a tool which can read the MySQL query log and replay an entire session's worth of queries against a database (SELECT queries only by default). While doing so, it records the time each query took to run, and any queries that failed including their error messages. MyQryReplayer can be used to inspect query performance, and to check a log of queries against a database for possible errors (when upgrading to a new version of MySQL for example).

Get version 0.1 here.

SQL: Remove duplicate rows

In the post SQL: Find duplicate rows I explained how you can find duplicate rows in a table. However, if you want to delete them, you'll run into a little problem as you have to delete all the duplicate rows except for one. Lots of solutions are floating around on the Internet, but most either don't work at all or don't work with MySQL.

Temporary table

One solution that works with MySQL is to create a temporary table to hold the data without duplicates and using the SELECT described in the post mentioned above (without the HAVING clause) and then dropping the original table and recreating it.

Suppose we have a table 'Foo' with three columns:

Foo:
id int(8) auto_increment
field1 int(8)
field2 int(8)

'field1' and 'field2' don't have a unique constraint, so they might contain duplicates. To remove these duplicates, we do:

CREATE TEMPORARY TABLE tmp
  SELECT min(id) as min, field1, field2
  FROM Foo
  GROUP BY Foo.field1, Foo.field2;

DROP TABLE Foo;

CREATE TABLE Foo
  SELECT * FROM tmp;

This will fill the 'tmp' table with unique rows from table 'Foo', and uses the lowest 'id' value for each row. So if we have two rows:

id  field1  field2
------------------
1   hi      bye 
2   hey     see ya
3   hello   goodbye
4   hi      bye

We will end up with:

id  field1  field2
------------------
1   hi      bye 
2   hey     see ya
3   hello   goodbye

In place

The above solution works well, but for some reason I don't like it. I don't like making temporary tables. I want to do it with a single query, modifying the table in place. You might call it pigheaded, as the solution presented below doesn't work as well on large datasets as the one above, but still, I found a way.

Another way to remove duplicates in MySQL is by using the following query:

DELETE 
  Foo 
FROM 
  Foo, Foo t2 
WHERE 
  Foo.field1=t2.field1 AND 
  Foo.field2=t2.field2 AND 
  Foo.id < t2.id

There we go. Explanation: This first does a Cartesion Join on every row where the two fields 'field1' and 'field2' are the same. So if we start with this data:

1 A B  <- duplicate
2 A B  <- duplicate
3 A C

we get the Cartesian product (all possible combinations) for the 'id' column for every unique set of the 'field1' and 'field2' columns:

1 A B 1 A B
1 A B 2 A B
2 A B 1 A B
2 A B 2 A B
3 A C 3 A C

Next we say 'AND Foo.id < t2.id'. This will leave us with following row, which is the only one where the left id is smaller than the right id:

1 A B 2 A B

It then DELETE's that row from Foo (that is, the row '1 A B' - remember that the two rows with left id '2' are actually one and the same row), leaving:

2 A B
3 A C

There we go. If you want to keep the smallest 'id' for each unique set of 'field1' and 'field2', simple use 'AND Foo.id > t2.id'. It works the same way.

Remember that this is incredibly slow for large tables due to the Cartesion Join, unless you've got indexes on 'field1' and 'field2' and don't have a lot of duplicates. Otherwise, use the temporary table solution above.

The Unexpected SQL Injection

Something every PHP developer should be reading:

The Unexpected SQL Injection – When Escaping Is Not Enough

The conclusions:

  • Write properly quoted SQL:
    1. Single quotes around values (string literals and numbers)
    2. Backtick quotes around identifiers (databases, tables, columns, aliases)
  • Properly escape the strings and numbers:
    1. mysql_real_escape_string() for all values (string literals and numbers)
    2. intval() for all number values and the numeric parameters of LIMIT
    3. Escape wildcard/regexp metacharacters (addcslashes('%_') for LIKE, and you better avoid REGEXP/RLIKE)
    4. If identifiers (columns, tables or databases) or keywords (such as ASC and DESC) are referenced in the script parameters, make sure (and force) their values are chosen only as one of an explicit set of options
    5. No matter what validation steps you take when processing the user input in your scripts, always do the escaping steps before issuing the query. Validation is not a substitute for escaping!

Like my rule #1 of what I like to call Defensive Coding: Don't be implicit, be explicit. In other words, don't try to escape things you don't want in your strings, simply only leave everything you do want in your strings. A column name in a ORDER BY clause should only consist of A-Z, a-z and 0-9. Anything else in the string invalidates that string.

Very slow subquery due to HAVING

Today, I was quite mystified by a very slow running query. There was a table named 'bar' with about 3000 rows. I wanted to list all the rows that had a duplicate value for a certain field ('foo'), and only those rows. The solution was to build a query that selected the rows where the value of the 'foo' field was in the results of a subquery that selected 'foo' for duplicate values of foo. The query finally looked something like this:

SELECT 
foo
FROM bar
WHERE foo IN (
  SELECT 
  foo
  FROM bar  
  GROUP BY foo
  HAVING COUNT(foo) > 1
)

The inner query (explained in this post) was very fast, and returned only two rows. The outer query, when I ran it like this:

SELECT foo FROM bar WHERE foo IN (1, 2);

also ran very fast. However the combination of the two was extremely slow. I thought this was weird, since there were only two results in the inner query. A colleague of mine and me took a look at the EXPLAIN of the query, and found out it was actually doing a full join of 3000×3000 rows. The use of HAVING threw me off because it appeared in the inner join. But HAVING is always applied very late in the execution process, just before the results are sent to the client. This means MySQL doesn't even look at the HAVING to optimize queries. From the manual:

The HAVING clause is applied nearly last, just before items are sent to the client, with no optimization.

Putting an index on the 'foo' column solved the speed problem, though it's still not as fast as it could be because it's still doing a JOIN on 'foo' with itself, only this time only doing 2×3000 rows.

Vim and PHP: tips

I've been using Vim for years now, but there's still new stuff to learn. Check out this page for the PDF version of the slides of a talk given by Andrei Zmievski on editing PHP with Vim. His configuration files are also available.

Here's my favourite list of tips:

Add the following text to your ~/.vim/ftplugin/php.vim file:

set formatoptions+=tcqlro
let php_sql_query=1
let php_htmlInStrings=1
let php_folding = 1

This will:

  • Turn on automatic text formatting for PHP so that, for instance, Vim automatically inserts a '*' if you press enter inside a /* */ comment.
  • Makes Vim highlight SQL queries in strings.
  • Makes VIm highlight HTML in strings.
  • Allows folding on PHP classes and functions. (With the cursor on the first line of a function, press z-c to hide the function. z-o to show it again. (Close and Open the fold)