Log <-

Archive for the ‘programming’ Category

RSS   RSS feed for this category

Weighted Random Distribution

Wednesday, December 23rd, 2009

Preface

Randomly selecting elements from a set of items is easy. Just generate a random number between 0 and the length of the set minus one, and use that as an index in the set (if it is an array) to get a random entry. The chance that an entry is picked is the same for each entry in the set. This is called even distribution or uniform distribution.

But what if we do not want each entry to appear as much as every other? Suppose we're creating a question-answer game, and we want the questions the user got wrong previously to appear more often than the question he or she got right? This is called a Weighted Random Distribution, or sometimes Weighted Random Choice, and there are multiple methods of implementing such as random picker.

This article explains these various methods of implementing Weighted Random Distribution along with their pros and cons. We use Python as our language of choice, because it has an easy to read syntax, and provides many useful tools which would take many more lines of code in most other languages. Along the way all Python "tricks" will be explained.

(more…)

Excluding results of a 'find' command (inverting tests)

Tuesday, November 10th, 2009

In kind of a follow up to my previous post on using find and sed to search and replace multiple files, I found out something else.

I needed to find and replace something in every file, except for any files which had ".svn" in them. After struggling for a few fruitless minutes with -regex, I stumbled upon this example in the manual page:

find /sbin /usr/sbin -executable \! -readable -print

   Search for files which are executable but not readable.

The \! allows us to invert the tests after it. Perfect! All we need to do is use -regex to do our excluding:

find . -type f \! -regex ".*\.svn.*"

And we can now search and replace in all files except those that have ".svn" in them:

find . -type f \! -regex ".*\.svn.*" -print0 | xargs -0 sed -i "s/foo/bar/"

Neat. Note that, again, -regex is a GNU find only construct.

Templum v0.4.0 released (Simple PHP templating)

Tuesday, November 10th, 2009

I've released Templum v0.4.0

Templum is an extremely lightweight, simple yet powerful and fast templating engine for PHP. It re-uses the power of PHP itself for rendering templates, but provides additional features making it easier to write templating code. Rendering templates using Templum is very fast; it approximates native PHP rendering speed for include() statements.

This release features:

  • Some small bug fixes
  • Documentation updates
  • The ability to include other templates in a template

Download instructions here.

Linux search and replace

Monday, November 9th, 2009

I always kept a small Python script around for searching and replacing in Linux. Turns out that GNU sed has an inline edit mode which I didn't know about:

       -i[SUFFIX], --in-place[=SUFFIX]

              edit files in place (makes backup if extension supplied)

This makes searching and replacing in files as simple as:

find . -name "*.txt" -print0 | xargs -0 sed -i "s/foo/bar/"

This replaces all occurences of "foo" with "bar" in all the .txt files in or below the current directory.

Unfortunately, -i appears to be a GNU extension, so it won't work on *BSD or Solaris, probably.

Simple function caching in Python

Tuesday, September 15th, 2009

Python dynamic nature continues to astound me. I was working on a small library when I noted it did a lot of redundant IO calls. Now I'm not one for premature optimization, but a while ago I was thinking about writing a decorator or something that would wrap around functions and methods, and cache the returns. Turns out it's way easier than I thought, and I whipped this up in a couple of minutes:

#!/usr/bin/python
#
# Public Domain.
 
__cache = {} # Global cache
 
def fncache(fn):
   """
   Function caching decorator. Keeps a cache of the return 
   value of a function and serves from cache on consecutive
   calls to the function. 
 
   Cache keys are computed from a hash of the function 
   name and the parameters (this differentiates between 
   instances through the 'self' param). Only works if 
   parameters have a unique repr() (almost everything).
 
   Example:
 
   >>> @fncache
   ... def greenham(a, b=2, c=3):
   ...   print 'CACHE MISS'
   ...   return('I like turtles')
   ... 
   >>> print greenham(1)           # Cache miss
   CACHE MISS
   I like turtles
   >>> print greenham(1)           # Cache hit
   I like turtles
   >>> print greenham(1, 2, 3)     # Cache miss (even though default params)
   CACHE MISS
   I like turtles
   >>> print greenham(2, 2, ['a']) # Cache miss
   CACHE MISS
   I like turtles
   >>> print greenham(2, 2, ['b']) # Cache miss
   CACHE MISS
   I like turtles
   >>> print greenham(2, 2, ['a']) # Cache hit
   I like turtles
   """
   def new(*args, **kwargs):
      h = hash(repr(fn) + repr(args) + repr(kwargs))
      if not h in __cache:
         __cache[h] = fn(*args, **kwargs)
      return(__cache[h])
   new.__doc__ = "%s %s" % (fn.__doc__, "(cached)")
   return(new)
 
if __name__ == '__main__':
   import doctest
   doctest.testmod()

Save to a file named 'fncache.py' and import it into your program. Then decorate your functions and methods with it, and their output will be cached. Python rocks. Remember, only use for functions that do heavy calculations, file or network IO. Determining the uniqueness of the function call is rather expensive.

HIrcd – minimal IRC server in Python

Monday, September 14th, 2009

I wrote a little IRC server in Python:

HIrcd is a minimal, hacky implementation of an IRC server daemon written in Python in about 400 lines of code, including comments, etc.

It is mostly useful as a testing tool or perhaps for building something like a private proxy on. Do NOT use it in any kind of production code or anything that will ever be connected to by the public.

Direct link to the source code, for those interested.

Simple Revision Control with RCS

Tuesday, June 9th, 2009

I do a lot of stuff in text file on my system. I keep notes and todo's in them, I write drafts using a text editor and even complete user guides and other documentation. I also keep my personal planning in Gnome Planner which stores its data in an XML file, which in essence is also text. Every so often I wish I could revert a draft back to an older version, or perhaps just have little peak at how it was a couple of days ago. I also really wanted to be able to look back at my planning to see where I made mistakes and underestimated the time required for a particular task.

A revision control system would be ideal in these circumstances. However a full-featured revision control system such as Subversion or Git would be a little overkill in this situation. Enter our old and forgotten friend RCS. RCS has been around since at least 1982, and has since been deprecated by CVS, Subversion and the various other centralized and decentralized versioning systems. It is however perfectly suited for keeping revision histories of single or a small collection of files. RCS is easier to understand than just about any other revision control system too.

Let's look at how it works. First, we have to install it. It used to be installed by default on most systems, but that's no longer the case. If you're running a Debian-based distribution (such as Ubuntu), you can simply install it by typing:

aptitude install rcs

We've now got the RCS revision system installed, and we can start using it.

Let's create an empty text file, called 'simplerevisioncontrolwithrcs.txt' in the 'drafts' directory:

~$ cd drafts
~/drafts$ touch simple_revision_control_with_rcs.txt

RCS consists of a couple of simple command-line utilities:

  • 'ci': Check In RCS revisions. This allows you to save the changes made to a file that is managed by RCS.
  • 'co': Check out RCS working copy. This command retrieves a specific version (the latest by default) of a file in a RCS repository.

So let's convert our empty draft to an RCS repository by checking it in. This means RCS will create a file based on the file you want to check in with the appropriate information such as when it has been been changed, what changed, etc.

~/drafts/$ ci simple_revision_control_with_rcs.txt
simple_revision_control_with_rcs.txt,v  <--  simple_revision_control_with_rcs.txt
enter description, terminated with single '.' or end of file:
NOTE: This is NOT the log message!
>> Simple revision control with RCS
>> .
initial revision: 1.1
done

There. Our initial empty file is gone, and in its place is a file under RCS revision control: 'simplerevisioncontrolwithrcs.txt,v'. It has been given a revision number by RCS: revision 1.1.

We can now make a working copy of this file by doing a checkout. This will retrieve the latest modifications from the text file and create a new file with the original name in the current directory. The first thing we should do though, is to turn off locking. We're working on this file ourselves, and it will never be shared, so we don't care about locking at all. Locking can be turned off using the '-U' option of the 'rcs' tool:

~/drafts$ rcs -U ./simple_revision_control_with_rcs.txt,v 
RCS file: ./simple_revision_control_with_rcs.txt,v
done

Now we can do a checkout so we can edit our draft. Let's do this, and add a line to it:

~/drafts$ co simple_revision_control_with_rcs.txt,v 
simple_revision_control_with_rcs.txt,v  -->  simple_revision_control_with_rcs.txt
revision 1.1
done
~/drafts$ echo "Hello world" > simple_revision_control_with_rcs.txt

Now let's commit this change to the repository so RCS knows about it:

~/drafts$ ci simple_revision_control_with_rcs.txt
simple_revision_control_with_rcs.txt,v  <--  simple_revision_control_with_rcs.txt
new revision: 1.2; previous revision: 1.1
enter log message, terminated with single '.' or end of file:
>> Added greeting.
>> .
done

RCS keeps an entire history of all changes made to the files it keeps under revision each time you do a check in. Take a look at the history of the file with the command 'rlog':

~/drafts$ rlog ./simple_revision_control_with_rcs.txt,v
RCS file: ./simple_revision_control_with_rcs.txt,v
Working file: simple_revision_control_with_rcs.txt
head: 1.2
branch:
locks: non-strict
access list:
symbolic names:
keyword substitution: kv
total revisions: 2; selected revisions: 2
description:
Simple revision control with RCS
----------------------------
revision 1.2
date: 2009/06/09 07:48:50;  author: todsah;  state: Exp;  lines: +1 -0
Added greeting.
----------------------------
revision 1.1
date: 2009/06/09 07:46:24;  author: todsah;  state: Exp;
Initial revision
=============================================================================

That lists some basic information on the file, the repository and the two revisions we've made so far. We can now easily check out earlier revisions of the file by specifying a specific revision, or a date:

~/drafts$ co -r1.1 simple_revision_control_with_rcs.txt
simple_revision_control_with_rcs.txt,v  -->  simple_revision_control_with_rcs.txt
revision 1.1
done
~/drafts$ cat simple_revision_control_with_rcs.txt
~/drafts$

RCS has given us back our file as it existed at revision 1.1, which was the empty file. Let's retrieve the version of the file as it was at 14:00 today:

~/drafts$ co -d"14:00" simple_revision_control_with_rcs.txt,v
simple_revision_control_with_rcs.txt,v  -->  simple_revision_control_with_rcs.txt
revision 1.2
writable simple_revision_control_with_rcs.txt exists; remove it? [ny](n): y
done

In conclusion: RCS may appear to be completely dead, but it still has its uses. I know many people out there will think to themselves "Why not just use Git or Subversion"? I understand the attraction of very powerful tools, but I also really believe in keeping things simple, and using the right tool for the job. I don't know much about Git, so perhaps it is ideal for such a simple requirement, but I don't really feel like learning it just to version control a single file.

Encodings in Python

Friday, May 22nd, 2009

There's a whole slew of information regarding all kinds of encoding issues out there on the big bad Internet. Some deal with how unicode works, some with what UTF-8 is and how it relates to other encodings and some with how to transform from one encoding to another. All that theory is nice, but I've found a rather worrying lack of practical, understandable and contextual information on dealing with encodings in Python, leading me to think I'd never be able to properly deal with encodings in Python.

So I took the plunge, and tried to find some stuff out. Here's what I came up with. All of this might be terribly wrong though. Encodings are a complicated subject if you ask me, so feel free to correct me if I'm wrong.

NOTICE: This article uses special HTML entities in various places to show output. Depending on your browser, the encodings it supports and the font you are using and its capabilities of showing UTF-8 characters, you may or may not be able to properly see these characters. In these cases a description of the character is given between parenthesis right after the character.

(more…)

Filling gaps in data when using aggregates

Tuesday, May 12th, 2009

SQL's aggregate functions (COUNT, MIN, MAX, etc) in combination with GROUP BY is great for generating statistics from a database. It's quite easy to retrieve a count of the rows in a database grouped by week number or month. When there are no data points in the table for a particular week or month, however, you will see gaps in the statistics. Take this example:

We create a table that will have an entry for every download done from a site. Also stored is whether the download was done by a registered user or not:

CREATE TABLE downloads (
  id integer primary key auto_increment,
  date datetime,
  registered BOOLEAN
);

INSERT INTO downloads VALUES (NULL, '2009-01-01', FALSE);
INSERT INTO downloads VALUES (NULL, '2009-01-01', TRUE);
INSERT INTO downloads VALUES (NULL, '2009-01-01', FALSE);
INSERT INTO downloads VALUES (NULL, '2009-01-02', FALSE);
INSERT INTO downloads VALUES (NULL, '2009-01-02', FALSE);
INSERT INTO downloads VALUES (NULL, '2009-01-03', TRUE);
INSERT INTO downloads VALUES (NULL, '2009-01-05', FALSE);
INSERT INTO downloads VALUES (NULL, '2009-01-05', FALSE);

The table data shows us there were three downloads on the first of January, two on the second, one on the third, etc.

Now we can gather the total number of downloads per day with the following aggregation query:

SELECT
  DATE(date) AS day,
  COUNT(id) AS downloads
FROM downloads
GROUP BY DAY(date);

+------------+-----------+
| day        | downloads |
+------------+-----------+
| 2009-01-01 |         3 |
| 2009-01-02 |         2 |
| 2009-01-03 |         1 |
| 2009-01-05 |         2 |
+------------+-----------+

As you can see from the results, there are no downloads for the fourth of January, and this gap is reflected in the aggregated result. So if you wanted to use this data to generate a chart, you'd have to fill in the gaps somehow. Doing this using a script isn't that hard, but that has the disadvantage of having create a new script (not to mention a new data table) which then needs to run periodically. It can also quickly become a pain in the ass when data also needs to be presented grouped by week, month, year, etc.

A simple solution to this is to create a filler table that contains zero-valued data (actually no data at all) for every data point you'd want to show. In our case we'll need a table with an entry for every day that our downloads span:

CREATE TABLE dates (
  date datetime
);

INSERT INTO dates VALUES ('2009-01-01');
INSERT INTO dates VALUES ('2009-01-02');
INSERT INTO dates VALUES ('2009-01-03');
INSERT INTO dates VALUES ('2009-01-04');
INSERT INTO dates VALUES ('2009-01-05');
[etcetera]
INSERT INTO dates VALUES ('2009-01-14');
INSERT INTO dates VALUES ('2009-01-15');

We can now perform a LEFT JOIN of the downloads table on our dates table, and we can be sure that no dates will be missing. Any dates for which there is no entry in the downloads table will get a row filled with NULLs. We can filter the dates we which to see by putting a WHERE clause on our dates table.

As you can see, the gaps are nicely filled with 0 values now.

Care must be taken when filtering data from our table with the actual table. Suppose we want to see the number of downloads per day, but only those who were downloaded by registered persons. Normally, we'd do:

SELECT
  DATE(date) AS day,
  COUNT(id) AS downloads
FROM downloads
WHERE
  downloads.registered = 0 AND
  downloads.date BETWEEN '2009-01-01' AND '2009-01-06'
GROUP BY DAY(date) ;

+------------+-----------+
| day        | downloads |
+------------+-----------+
| 2009-01-01 |         2 |
| 2009-01-02 |         2 |
| 2009-01-05 |         2 |
+------------+-----------+

But with our new filler table, we can't do that, or the empty days will get filtered out again. So instead, we must put the filters not in the WHERE clause, but in the LEFT JOIN clause like so:

SELECT
  DATE(dates.date) AS day,
  COUNT(downloads.id) AS downloads
FROM dates
LEFT JOIN downloads ON downloads.date = dates.date AND downloads.registered = 0
WHERE
  dates.date BETWEEN '2009-01-01' AND '2009-01-06'
GROUP BY DAY(dates.date)

+------------+-----------+
| day        | downloads |
+------------+-----------+
| 2009-01-01 |         2 |
| 2009-01-02 |         2 |
| 2009-01-03 |         0 |
| 2009-01-04 |         0 |
| 2009-01-05 |         2 |
| 2009-01-06 |         0 |
+------------+-----------+

And presto! We have gap-fillers once again. Now it's easy to change the way our data is grouped. For example, if you want to group by week instead of day:

SELECT
  WEEK(dates.date, 3) AS week,
  COUNT(downloads.id) AS downloads
FROM dates
LEFT JOIN downloads ON downloads.date = dates.date AND downloads.registered = 0
WHERE WEEK(dates.date, 3) BETWEEN 1 AND 3
GROUP BY WEEK(dates.date, 3)

+--------+-----------+
| week   | downloads |
+--------+-----------+
|      1 |         4 |
|      2 |         2 |
|      3 |         0 |
+--------+-----------+

(Note: week 1 ends on Saturday January 3)

UPDATE:: Make sure you always perform the COUNT() on a field from the actual data (the table you're LEFT JOINing), or you will get counts of 1 for data that actually has no rows!

Templum v0.2.0: Simple PHP Templating

Sunday, April 26th, 2009

I just released v0.2.0 of Templum, a simple templating engine for PHP.

About

From the homepage:

Templum is an extremely lightweight, simple yet powerful and fast templating engine for PHP. It re-uses the power of PHP itself for rendering templates, but provides additional features making it easier to write templating code. Rendering templates using Templum is very fast; it approximates native PHP rendering speed for include() statements.

Changes

Changes in this release:

  • PHP 4 support added (patch by Pierre Jochem).
  • Bugfix (#1): {{$var}} at end-of-line eats newline following it.
  • Various examples added.
  • Added the ability to turn off automatic escaping using htmlentities().
  • Improved the error reporting.
  • The locale can now be changed after creating a Templum instance.
  • Userguide updated.

This release is backwards compatible with the previous version 0.1.0.

Install

You can install Templum v0.2.0 using PEAR:

pear install http://templum.electricmonk.nl/releases/templum-0.2.0.tgz

If you've got a previous version of Templum installed, you must first uninstall that one:

pear uninstall channel://__uri/templum-0.1.0

There's also a non-PEAR tar.gz which also contains examples and the API documentation and Userguide:

templum-src-0.2.0.tar.gz.

More information