Log <-

Archive for May, 2009

Encodings in Python

There's a whole slew of information regarding all kinds of encoding issues out there on the big bad Internet. Some deal with how unicode works, some with what UTF-8 is and how it relates to other encodings and some with how to transform from one encoding to another. All that theory is nice, but I've found a rather worrying lack of practical, understandable and contextual information on dealing with encodings in Python, leading me to think I'd never be able to properly deal with encodings in Python.

So I took the plunge, and tried to find some stuff out. Here's what I came up with. All of this might be terribly wrong though. Encodings are a complicated subject if you ask me, so feel free to correct me if I'm wrong.

NOTICE: This article uses special HTML entities in various places to show output. Depending on your browser, the encodings it supports and the font you are using and its capabilities of showing UTF-8 characters, you may or may not be able to properly see these characters. In these cases a description of the character is given between parenthesis right after the character.

Read the rest of this entry »

Gift certificates

I don't understand gift certificates. I mean, the idea is quite good: A piece of paper that represents a certain value, and which you can then trade for goods of some kind. Much better than dragging all that gold around all the time. So in that regard, gift certificates are an awesome idea. Except that we already have this thingy which is made of paper (most of the time) and that represents a certain value. It's called "money".

The best thing about money is that you can spend it on everything, everywhere. Whereas most gift certificates are only valid in certain stores. The only reason gift certificates make sense is if the giver wants you to spend it in a certain store. I guess that's the reason gift certificates exist: vendor lock-in. Another brilliant way of controlling how and where we spend our money. Capitalism, yay.

I prefer money.

Filling gaps in data when using aggregates

SQL's aggregate functions (COUNT, MIN, MAX, etc) in combination with GROUP BY is great for generating statistics from a database. It's quite easy to retrieve a count of the rows in a database grouped by week number or month. When there are no data points in the table for a particular week or month, however, you will see gaps in the statistics. Take this example:

We create a table that will have an entry for every download done from a site. Also stored is whether the download was done by a registered user or not:

CREATE TABLE downloads (
  id integer primary key auto_increment,
  date datetime,
  registered BOOLEAN
);

INSERT INTO downloads VALUES (NULL, '2009-01-01', FALSE);
INSERT INTO downloads VALUES (NULL, '2009-01-01', TRUE);
INSERT INTO downloads VALUES (NULL, '2009-01-01', FALSE);
INSERT INTO downloads VALUES (NULL, '2009-01-02', FALSE);
INSERT INTO downloads VALUES (NULL, '2009-01-02', FALSE);
INSERT INTO downloads VALUES (NULL, '2009-01-03', TRUE);
INSERT INTO downloads VALUES (NULL, '2009-01-05', FALSE);
INSERT INTO downloads VALUES (NULL, '2009-01-05', FALSE);

The table data shows us there were three downloads on the first of January, two on the second, one on the third, etc.

Now we can gather the total number of downloads per day with the following aggregation query:

SELECT
  DATE(date) AS day,
  COUNT(id) AS downloads
FROM downloads
GROUP BY DAY(date);

+------------+-----------+
| day        | downloads |
+------------+-----------+
| 2009-01-01 |         3 |
| 2009-01-02 |         2 |
| 2009-01-03 |         1 |
| 2009-01-05 |         2 |
+------------+-----------+

As you can see from the results, there are no downloads for the fourth of January, and this gap is reflected in the aggregated result. So if you wanted to use this data to generate a chart, you'd have to fill in the gaps somehow. Doing this using a script isn't that hard, but that has the disadvantage of having create a new script (not to mention a new data table) which then needs to run periodically. It can also quickly become a pain in the ass when data also needs to be presented grouped by week, month, year, etc.

A simple solution to this is to create a filler table that contains zero-valued data (actually no data at all) for every data point you'd want to show. In our case we'll need a table with an entry for every day that our downloads span:

CREATE TABLE dates (
  date datetime
);

INSERT INTO dates VALUES ('2009-01-01');
INSERT INTO dates VALUES ('2009-01-02');
INSERT INTO dates VALUES ('2009-01-03');
INSERT INTO dates VALUES ('2009-01-04');
INSERT INTO dates VALUES ('2009-01-05');
[etcetera]
INSERT INTO dates VALUES ('2009-01-14');
INSERT INTO dates VALUES ('2009-01-15');

We can now perform a LEFT JOIN of the downloads table on our dates table, and we can be sure that no dates will be missing. Any dates for which there is no entry in the downloads table will get a row filled with NULLs. We can filter the dates we which to see by putting a WHERE clause on our dates table.

As you can see, the gaps are nicely filled with 0 values now.

Care must be taken when filtering data from our table with the actual table. Suppose we want to see the number of downloads per day, but only those who were downloaded by registered persons. Normally, we'd do:

SELECT
  DATE(date) AS day,
  COUNT(id) AS downloads
FROM downloads
WHERE
  downloads.registered = 0 AND
  downloads.date BETWEEN '2009-01-01' AND '2009-01-06'
GROUP BY DAY(date) ;

+------------+-----------+
| day        | downloads |
+------------+-----------+
| 2009-01-01 |         2 |
| 2009-01-02 |         2 |
| 2009-01-05 |         2 |
+------------+-----------+

But with our new filler table, we can't do that, or the empty days will get filtered out again. So instead, we must put the filters not in the WHERE clause, but in the LEFT JOIN clause like so:

SELECT
  DATE(dates.date) AS day,
  COUNT(downloads.id) AS downloads
FROM dates
LEFT JOIN downloads ON downloads.date = dates.date AND downloads.registered = 0
WHERE
  dates.date BETWEEN '2009-01-01' AND '2009-01-06'
GROUP BY DAY(dates.date)

+------------+-----------+
| day        | downloads |
+------------+-----------+
| 2009-01-01 |         2 |
| 2009-01-02 |         2 |
| 2009-01-03 |         0 |
| 2009-01-04 |         0 |
| 2009-01-05 |         2 |
| 2009-01-06 |         0 |
+------------+-----------+

And presto! We have gap-fillers once again. Now it's easy to change the way our data is grouped. For example, if you want to group by week instead of day:

SELECT
  WEEK(dates.date, 3) AS week,
  COUNT(downloads.id) AS downloads
FROM dates
LEFT JOIN downloads ON downloads.date = dates.date AND downloads.registered = 0
WHERE WEEK(dates.date, 3) BETWEEN 1 AND 3
GROUP BY WEEK(dates.date, 3)

+--------+-----------+
| week   | downloads |
+--------+-----------+
|      1 |         4 |
|      2 |         2 |
|      3 |         0 |
+--------+-----------+

(Note: week 1 ends on Saturday January 3)

UPDATE:: Make sure you always perform the COUNT() on a field from the actual data (the table you're LEFT JOINing), or you will get counts of 1 for data that actually has no rows!