Electricmonk

Ferry Boender

Programmer, DevOpper, Open Source enthusiast.

Blog

Filling gaps in data when using aggregates

Tuesday, May 12th, 2009

SQL's aggregate functions (COUNT, MIN, MAX, etc) in combination with GROUP BY is great for generating statistics from a database. It's quite easy to retrieve a count of the rows in a database grouped by week number or month. When there are no data points in the table for a particular week or month, however, you will see gaps in the statistics. Take this example:

We create a table that will have an entry for every download done from a site. Also stored is whether the download was done by a registered user or not:

CREATE TABLE downloads (
  id integer primary key auto_increment,
  date datetime,
  registered BOOLEAN
);

INSERT INTO downloads VALUES (NULL, '2009-01-01', FALSE);
INSERT INTO downloads VALUES (NULL, '2009-01-01', TRUE);
INSERT INTO downloads VALUES (NULL, '2009-01-01', FALSE);
INSERT INTO downloads VALUES (NULL, '2009-01-02', FALSE);
INSERT INTO downloads VALUES (NULL, '2009-01-02', FALSE);
INSERT INTO downloads VALUES (NULL, '2009-01-03', TRUE);
INSERT INTO downloads VALUES (NULL, '2009-01-05', FALSE);
INSERT INTO downloads VALUES (NULL, '2009-01-05', FALSE);

The table data shows us there were three downloads on the first of January, two on the second, one on the third, etc.

Now we can gather the total number of downloads per day with the following aggregation query:

SELECT 
  DATE(date) AS day, 
  COUNT(id) AS downloads
FROM downloads 
GROUP BY DAY(date);

+------------+-----------+
| day        | downloads |
+------------+-----------+
| 2009-01-01 |         3 |
| 2009-01-02 |         2 |
| 2009-01-03 |         1 |
| 2009-01-05 |         2 |
+------------+-----------+

As you can see from the results, there are no downloads for the fourth of January, and this gap is reflected in the aggregated result. So if you wanted to use this data to generate a chart, you'd have to fill in the gaps somehow. Doing this using a script isn't that hard, but that has the disadvantage of having create a new script (not to mention a new data table) which then needs to run periodically. It can also quickly become a pain in the ass when data also needs to be presented grouped by week, month, year, etc.

A simple solution to this is to create a filler table that contains zero-valued data (actually no data at all) for every data point you'd want to show. In our case we'll need a table with an entry for every day that our downloads span:

CREATE TABLE dates (
  date datetime
);

INSERT INTO dates VALUES ('2009-01-01');
INSERT INTO dates VALUES ('2009-01-02');
INSERT INTO dates VALUES ('2009-01-03');
INSERT INTO dates VALUES ('2009-01-04');
INSERT INTO dates VALUES ('2009-01-05');
[etcetera]
INSERT INTO dates VALUES ('2009-01-14');
INSERT INTO dates VALUES ('2009-01-15');

We can now perform a LEFT JOIN of the downloads table on our dates table, and we can be sure that no dates will be missing. Any dates for which there is no entry in the downloads table will get a row filled with NULLs. We can filter the dates we which to see by putting a WHERE clause on our dates table.

SELECT
  DATE(dates.date) AS day,
  COUNT(downloads.id) AS downloads
FROM dates
LEFT JOIN downloads ON downloads.date = dates.date
WHERE dates.date BETWEEN '2009-01-01' AND '2009-01-06'
GROUP BY DAY(dates.date)

+------------+-----------+
| day        | downloads |
+------------+-----------+
| 2009-01-01 |         3 |
| 2009-01-02 |         2 |
| 2009-01-03 |         1 |
| 2009-01-04 |         0 |
| 2009-01-05 |         2 |
| 2009-01-06 |         0 |
+------------+-----------+

As you can see, the gaps are nicely filled with 0 values now.

Care must be taken when filtering data from our table with the actual table. Suppose we want to see the number of downloads per day, but only those who were downloaded by registered persons. Normally, we'd do:

SELECT 
  DATE(date) AS day, 
  COUNT(id) AS downloads 
FROM downloads 
WHERE 
  downloads.registered = 0 AND
  downloads.date BETWEEN '2009-01-01' AND '2009-01-06'
GROUP BY DAY(date) ;

+------------+-----------+
| day        | downloads |
+------------+-----------+
| 2009-01-01 |         2 |
| 2009-01-02 |         2 |
| 2009-01-05 |         2 |
+------------+-----------+

But with our new filler table, we can't do that, or the empty days will get filtered out again. So instead, we must put the filters not in the WHERE clause, but in the LEFT JOIN clause like so:

SELECT
  DATE(dates.date) AS day,
  COUNT(downloads.id) AS downloads
FROM dates
LEFT JOIN downloads ON downloads.date = dates.date AND downloads.registered = 0
WHERE 
  dates.date BETWEEN '2009-01-01' AND '2009-01-06'
GROUP BY DAY(dates.date)

+------------+-----------+
| day        | downloads |
+------------+-----------+
| 2009-01-01 |         2 |
| 2009-01-02 |         2 |
| 2009-01-03 |         0 |
| 2009-01-04 |         0 |
| 2009-01-05 |         2 |
| 2009-01-06 |         0 |
+------------+-----------+

And presto! We have gap-fillers once again. Now it's easy to change the way our data is grouped. For example, if you want to group by week instead of day:

SELECT
  WEEK(dates.date, 3) AS week,
  COUNT(downloads.id) AS downloads
FROM dates
LEFT JOIN downloads ON downloads.date = dates.date AND downloads.registered = 0
WHERE WEEK(dates.date, 3) BETWEEN 1 AND 3
GROUP BY WEEK(dates.date, 3)

+--------+-----------+
| week   | downloads |
+--------+-----------+
|      1 |         4 |
|      2 |         2 |
|      3 |         0 |
+--------+-----------+

(Note: week 1 ends on Saturday January 3)

UPDATE:: Make sure you always perform the COUNT() on a field from the actual data (the table you're LEFT JOINing), or you will get counts of 1 for data that actually has no rows!

The text of all posts on this blog, unless specificly mentioned otherwise, are licensed under this license.