Archive for the ‘programming’ Category


Increasing performance of bulk updates of large tables in MySQL

I recently had to perform some bulk updates on semi-large tables (3 to 7 million rows) in MySQL. I ran into various problems that negatively affected the performance of these updates. In this blog post I will outline the problems I ran into and how I solved them. Eventually I managed to increase the performance from about 30 rows/sec to around 7000 rows/sec. The final performance was mostly CPU bound. Since this was on a VPS with only limited CPU power, I expect you can get better performance out of a decently outfitted machine/VPS.

The situation I was dealing with was as follows:

  • About 20 tables, 7 of which were between 3 and 7 million rows.

  • Both MyISAM and InnoDB tables.

  • Updates required on values on every row of those tables.

  • The updates were too complicated to do in SQL, so they required a script.

  • All updates were done on rows that were selected on just their primary key, i.e. WHERE id = …

Here are some of the problems I ran into.

Python’s MySQLdb is slow

I implemented the script in Python, and the first problem I ran into is that the MySQLdb module is slow. It’s especially slow if you’re going to use the cursors. MySQL natively doesn’t support cursors, so these are emulated in Python code. One of the trickiest things is that a simple SELECT * FROM tbl will retrieve all the results and put them in memory on the client. For 7 million rows, this quickly exhausts your memory. Real cursors would fetch the result one-by-one from the database so that you don’t exhaust memory.

The solution here is to not use MySQLdb, but to use the native client bindings available with import _mysql.
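For illustration, here is roughly what reading with the low-level _mysql bindings looks like. Using use_result() instead of store_result() streams rows from the server as you fetch them, so the whole result set doesn't end up in client memory (a minimal sketch; the credentials and table name are made up):

import _mysql

db = _mysql.connect(user='root', passwd='passwd', db='mydb')
db.query('SELECT * FROM tbl')
res = db.use_result()  # stream rows from the server instead of buffering them all
while True:
    rows = res.fetch_row(maxrows=100, how=1)  # how=1 returns rows as dicts
    if not rows:
        break
    for row in rows:
        pass  # process the row

Note that with use_result() you have to fetch all rows before issuing the next query on the same connection.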

LIMIT n,m is slow

Since MySQL doesn’t support cursors, we can’t mix SELECT and UPDATE queries in a loop. Thus we need to read in a bunch of rows into memory and update in a loop afterwards. Since we can’t keep all the rows in memory, we must read in batches. An obvious solution for this would be a loop such as (pseudo-code):

offset = 0
size = 1000
while True:
    rows = query('SELECT * FROM tbl LIMIT :offset, :size')
    for row in rows:
        # do some updates
    if len(rows) < size:
        break
    offset += size

This would use the LIMIT to read the first 1000 rows on the first iteration of the loop, the next 1000 on the second iteration of the loop. The problem is: in MySQL this becomes linearly slower for higher values of the offset. I was already aware of this, but somehow it slipped my mind.

The problem is that the database has to advance an internal pointer forward in the record set, and the further in the table you get, the longer that takes. I saw performance drop from about 5000 rows/sec to about 100 rows/sec, just for selecting the data. I aborted the script after noticing this, but we can assume performance would have crawled to a halt if we kept going.

The solution is to order by the primary key and then select only the rows we haven't processed yet:

size = 1000
last_id = 0
while True:
    rows = query('SELECT * FROM tbl WHERE id > :last_id ORDER BY id LIMIT :size')
    if not rows:
        break
    for row in rows:
        # do some updates
        last_id = row['id']

This requires that you have an index on the id field, or performance will greatly suffer again. More on that later.

At this point, SELECTs were pretty speedy. Including my row data manipulation, I was getting about 40,000 rows/sec. Not bad. But I was not updating the rows yet.

Connection settings

The next thing I did was apply some standard tricks to speed up bulk updates/inserts: disabling some foreign key and uniqueness checks and running batches in a transaction. Since I was working with both MyISAM and InnoDB tables, I just mixed optimizations for both table types:

db.query('SET autocommit=0;')
db.query('SET unique_checks=0; ')
db.query('SET foreign_key_checks=0;')
db.query('LOCK TABLES %s WRITE;' % (tablename))
db.query('START TRANSACTION;')
# SELECT and UPDATE in batches
db.query('COMMIT;')
db.query('UNLOCK TABLES')
db.query('SET foreign_key_checks=1;')
db.query('SET unique_checks=1; ')
db.query('SET autocommit=1;')

I must admit that I’m not sure if this actually increased performance at all. It is entirely possible that this actually hurts performance instead. Please test this for yourselves if you’re going to use it. You should also be aware that some of these options bypass MySQL’s data integrity checks. You may end up with invalid data such as invalid foreign key references, etc.
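If you do use these settings, it may be worth wrapping them in a small helper so the checks are switched back on even if the script dies halfway through. A minimal sketch using the same statements as above (the work callback and table name are placeholders):

def bulk_update(db, tablename, work):
    db.query('SET autocommit=0;')
    db.query('SET unique_checks=0;')
    db.query('SET foreign_key_checks=0;')
    db.query('LOCK TABLES %s WRITE;' % (tablename))
    db.query('START TRANSACTION;')
    try:
        work(db)  # perform the SELECT and UPDATE batches
        db.query('COMMIT;')
    finally:
        # Restore the settings even if `work` raised an exception.
        db.query('UNLOCK TABLES;')
        db.query('SET foreign_key_checks=1;')
        db.query('SET unique_checks=1;')
        db.query('SET autocommit=1;')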

One mistake I did make was that I accidentally included the following in an early version of the script:

db.query('ALTER TABLE %s DISABLE KEYS;' % (tablename))

Such is the deviousness of copy-paste. This option disables the updating of non-unique indexes while it's active. It is an optimization for MyISAM tables that massively improves the performance of mass INSERTs, since the database won't have to update the index on each inserted row (which is very slow). The problem is that this also disables the use of those indexes for data retrieval, as noted in the MySQL manual:

While the nonunique indexes are disabled, they are ignored for statements such as SELECT and EXPLAIN that otherwise would use them.

That means update queries such as UPDATE tbl SET key=value WHERE id=1020033 will become incredibly slow, since they can no longer use indexes.
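A quick way to check whether a query can still use an index is to EXPLAIN the equivalent SELECT and look at the key column. A rough example, assuming a _mysql connection object named db as used elsewhere in this post:

db.query('EXPLAIN SELECT * FROM tbl WHERE id=1020033')
res = db.store_result()
for row in res.fetch_row(maxrows=0, how=1):
    print row['key']  # 'PRIMARY' means the primary key is used; None means a full scan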

MySQL server tuning

I was running this on a stock Ubuntu 12.04 installation. MySQL is basically completely unconfigured out of the box on Debian and Ubuntu. This means that those 16 GBs of memory in your machine will go completely unused unless you tune some parameters. I modified /etc/mysql/my.cnf and added the following settings to improve the speed of queries:

[mysqld]
key_buffer         = 128M
innodb_buffer_pool_size = 3G

The key_buffer setting is a setting for MyISAM tables that determines how much memory may be used to keep indexes in memory. The equivalent setting for InnoDB is innodb_buffer_pool_size, except that the InnoDB setting also includes table data.

In my case the machine had 4 GB of memory. You can read more about these settings in the MySQL documentation.

Don’t forget to restart your MySQL server.
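After restarting, you can verify that the new values are actually in effect. Note that the key_buffer option corresponds to the key_buffer_size server variable. A quick check, again assuming a _mysql connection object named db:

for varname in ('key_buffer_size', 'innodb_buffer_pool_size'):
    db.query("SHOW VARIABLES LIKE '%s'" % (varname))
    res = db.store_result()
    print res.fetch_row(maxrows=0, how=1)  # e.g. ({'Variable_name': 'innodb_buffer_pool_size', 'Value': '3221225472'},)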

Dropping all indexes except primary keys

One of the biggest performance boosts was to drop all indexes from all the tables that needed to be updated, except for the primary key indexes (those on the id fields). It is much faster to just drop the indexes and recreate them when you're done. This is basically the manual way to accomplish what we hoped the ALTER TABLE %s DISABLE KEYS would do, but didn't.

UPDATE: I wrote a better script which is available here.

Here’s a script that dumps SQL commands to drop and recreate indexes for all tables:

#!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
# DANGER WILL ROBINSON, READ THE important notes BELOW
#!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
import _mysql
import sys

mysql_username = 'root'
mysql_passwd = 'passwd'
mysql_host = '127.0.0.1'
dbname = 'mydb'

tables = sys.argv[1:]
indexes = []

db = _mysql.connect(host=mysql_host, user=mysql_username, passwd=mysql_passwd, db=dbname)
db.query('SHOW TABLES')
res = db.store_result()
for row in res.fetch_row(maxrows=0):
    tablename = row[0]
    if not tables or tablename in tables:
        db.query('SHOW INDEXES FROM %s WHERE Key_name != "PRIMARY"' % (tablename))
        res = db.store_result()
        for row_index in res.fetch_row(maxrows=0):
            table, non_unique, key_name, seq_in_index, column_name, \
            collation, cardinality, sub_part, packed, null, index_type, \
            comment, index_comment = row_index
            indexes.append( (key_name, table, column_name) )

for index in indexes:
    key_name, table, column_name = index
    print "DROP INDEX %s ON %s;" % (key_name, table)

for index in indexes:
    key_name, table, column_name = index
    print "CREATE INDEX %s ON %s (%s);" % (key_name, table, column_name)

Output looks like this:

$ ./drop_indexes.py
DROP INDEX idx_username ON users;
DROP INDEX idx_perm ON rights;
CREATE INDEX idx_username ON users (username);
CREATE INDEX idx_perm ON rights (perm);

Some important notes about the above script:

  • The script is not foolproof! If you have non-BTREE indexes, if you have indexes spanning multiple columns, if you have any kind of index that goes beyond a BTREE single column index, please be careful about using this script.

  • You must manually copy-paste the statements into the MySQL client.

  • It does NOT drop the PRIMARY KEY indexes.

Conclusions

In the end, I went from about 30 rows per second to around 8000 rows per second. The key to getting decent performance is to start simple, and slowly expand your script while keeping a close eye on performance. If you see a dip, investigate immediately to mitigate the problem.

A useful way of investigating slow performance is to use tools that unearth evidence of the root of the problem:

  • top can tell you if a process is mostly CPU bound. If you’re seeing high amounts of CPU, check if your queries are using indexes to get the results they need.

  • iostat can tell you if a process is mostly IO bound. If you’re seeing high amounts of I/O on your disk, tune MySQL to make better use of memory to buffer indexes and table data.

  • Use MySQL's EXPLAIN to see whether, and which, indexes are being used. If not, create new indexes.

  • Avoid doing useless work such as updating indexes after every update. This is mostly a matter of knowing what to avoid, but that's what this post was about in the first place.

  • Baby steps! It took me entirely too long to figure out that I was initially seeing bad performance because my SELECT LIMIT n,m was so slow. I was completely convinced my UPDATE statements were the cause of the initial slowdowns I saw. Only when I started commenting out major parts of the code did I see that it was actually the simple SELECT query that was causing problems initially.

That’s it! I hope this was helpful in some way!

Quick-n-dirty HAR (HTTP Archive) viewer

HAR, HTTP Archive, is a JSON-encoded dump of a list of requests and their associated headers, bodies, etc. Here's a partial example containing a single request:

{
  "startedDateTime": "2013-09-16T18:02:04.741Z",
  "time": 51,
  "request": {
    "method": "GET",
    "url": "http://electricmonk.nl/",
    "httpVersion": "HTTP/1.1",
    "headers": [],
    "queryString": [],
    "cookies": [],
    "headersSize": 38,
    "bodySize": 0
  },
  "response": {
    "status": 301,
    "statusText": "Moved Permanently",
    "httpVersion": "HTTP/1.1",
    "headers": [],
    "cookies": [],
    "content": {
      "size": 0,
      "mimeType": "text/html"
    },
    "redirectURL": "",
    "headersSize": 32,
    "bodySize": 0
  },
  "cache": {},
  "timings": {
    "blocked": 0,
  }
},

HAR files can be exported from Chrome's Network analyser developer tool (ctrl-shift-i → Network tab → capture some requests → right-click and select Save as HAR with contents). Additional tip: check the "Preserve Log on Navigation" option – which looks like a recording button – to capture multi-level redirects and such.

As human-readable as JSON is, it's still difficult to get a good overview of the requests. So I wrote a quick Python script that turns the JSON into something that's a little easier on our poor sysadmin's eyes:

[Screenshot: harview output]

It supports colored output, dumping of request and response headers as well as the bodies of POSTs and responses (although this will be very slow). You can filter out uninteresting requests such as images or CSS/JS with the --filter-X options.
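If you're curious what such a viewer boils down to, the core is little more than walking the entries in the HAR's log and printing a summary line per request. A minimal sketch (this is not the actual harview code; it assumes the standard log/entries layout of a HAR file):

import json
import sys

har = json.load(open(sys.argv[1]))
for entry in har['log']['entries']:
    req = entry['request']
    resp = entry['response']
    print '%-6s %3i %6ims  %s' % (req['method'], resp['status'], entry['time'], req['url'])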

You can get it by cloning the Git repository from Bitbucket.

Cheers!

bbcloner: create mirrors of your public and private Bitbucket Git repositories

 

I wrote a small tool that assists in creating mirrors of your public and private Bitbucket Git repositories and wikis. It also synchronizes already existing mirrors. Initial mirror setup requires that you manually enter your username/password. Subsequent synchronization of mirrors is done using Deployment Keys.

You can download a tar.gz, a Debian/Ubuntu package or clone it from the Bitbucket page.

Features

  • Clone / mirror / backup public and private repositories and wikis.
  • No need to store your username and password to update clones.
  • Exclude repositories.
  • No need to run an SSH agent. Uses passwordless private Deployment Keys (which have no write access to your repositories).

Usage

Here's how it works in short. Generate a passwordless SSH key:

$ ssh-keygen
Generating public/private rsa key pair.
Enter file in which to save the key: /home/fboender/.ssh/bbcloner_rsa<ENTER>
Enter passphrase (empty for no passphrase):<ENTER>
Enter same passphrase again: <ENTER>

You should add the generated public key to your repositories as a Deployment Key. The first time you use bbcloner, or whenever you've added new public or private repositories, you have to specify your username/password. BBcloner will retrieve a list of your repositories and create mirrors for any new repositories not yet mirrored:

$ bbcloner -n -u fboender /home/fboender/gitclones/
Password: 
Cloning new repositories
Cloning project_a
Cloning project_a wiki
Cloning project_b

Now you can update the mirrors without using a username/password:

$ bbcloner /home/fboender/gitclones/
Updating existing mirrors
Updating /home/fboender/gitclones/project_a.git
Updating /home/fboender/gitclones/project_a-wiki.git
Updating /home/fboender/gitclones/project_b.git

You can run the above from a cronjob. Specify the -s argument to prevent bbcloner from showing normal output.
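For example, a crontab entry along these lines would update the mirrors every night at 3:00 (the path to bbcloner is an assumption; adjust it to wherever you installed it):

0 3 * * * /usr/bin/bbcloner -s /home/fboender/gitclones/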

The mirrors are full remote git repositories, which means you can clone them:

$ git clone /home/fboender/gitclones/project_a.git/
Cloning into project_a...
done.

Don't push changes to it, or the mirror won't be able to sync. Instead, point the remote origin to your Bitbucket repository:

$ git remote rm origin
$ git remote add origin git@bitbucket.org:fboender/project_a.git
$ git push
remote: bb/acl: fboender is allowed. accepted payload.

Get it

bbcloner is available as a tar.gz, as a Debian/Ubuntu package, or by cloning the Git repository from the Bitbucket page.

More information

For more information, please see the Bitbucket repository.

Subversion svn:ignore property doesn't (seem to) work? [FIXED]

Say you're trying to set the "ignore" property on something in a subversion checkout like this:

svn propset svn:ignore "foo.pyc" .

Next you do a svn status:

M       foo.pyc

It seems it isn't working. To fix this, you must do the following, in this order:

  • Remove the file from subversion and commit
  • svn update all the checkouts of that repository so that the file is gone everywhere!
  • Set the svn:ignore property
  • Now commit the property change, or svn status will still show it (even in the local checkout)!
  • svn update all the checkouts of the repository

So:

host1$ svn rm foo.pyc && svn commit -m "Remove compiled python code"
host2$ svn update
host1$ svn propset svn:ignore "foo.pyc" .
host1$ svn commit -m "Ignore compiled python code"
host2$ svn update

If you get conflicts because you didn't follow these steps exactly:

host2$ svn update
   C foo.pyc
host2$ svn resolve --accept working foo.pyc
host2$ svn rm foo.pyc
host2$ svn update
At revision 123

That should solve it.

If you want all your subversion problems solved, try this.

Input, state and their relationship to bugs

Why are there so many programmers who don't know what state is, much less the impact it has on programming? Recently I was having a discussion online, and some programmers kept misinterpreting everything said by me and other programmers because they had little or no grasp of the concept of state. This is not the first time I've noticed that, for such an important concept, there are a surprising number of programmers who don't understand it and its implications.

The meaning of state

From Wikipedia:

the state of a digital logic circuit or computer program is a technical term for all the stored information, at a given point in time, which the circuit or program has access to. The output of a digital circuit or computer program at any time is completely determined by its current inputs and its state.

Just to be clear, "all the stored information" includes the instruction pointer, CPU registers/caches and so forth. So if at any given point in time your program is in a while loop or conditional branch, that is also part of the state.

"The output of a digital circuit or computer program at any time is completely determined by its current inputs and its state."… Think about that for a moment, because it is important. Everything that happens from that moment on in a computer program is determined by the current inputs and its state. That means, as long as there is no other input, the output of the program is completely predictable at any time. That's a lot (or rather, exactly) like the pseudo-random number generator in most programming languages. If you know the seed, you can predict every random number that will be produced. Each number the pseudo-RNG returns is the result of its input and its current state.

If programs are that predictable, how come there are still so many bugs in them? Well, for one thing, programs are not that predictable at all.

On the Origin of Bugs

Look at the following example:

a = 5
b = 10
c = a + b

Tell me, how many bugs does the program have? If we assume that the syntax is correct and that each variable can hold an integer number, this program contains zero bugs. Now tell me, how many bugs does this program have?

a = int(file('in', 'r').readline())
b = 10;
c = a + b;

How many bugs? One? Three? Do you know? Because I sure don't. Here are a few I can come up with off the top of my head:

  • The file "in" may not exist
  • The file "in" may not be readable
  • The file "in" may disappear between opening and reading
  • The file "in" may be empty
  • The file "in" may not contain something that can be cast to an integer
  • The system may not be able to allocate more file handles
  • The system may not be able to allocate enough memory to read a line from the file

I could go on, but the point should be clear. We changed only one line in our program, and now there are who knows how many bugs in it. Why is that? Because in our first example, the state of the program could never be influenced from outside the program at any time, since there was no input! (Yes, the memory might get corrupted by a random flipped bit or the universe might decide to suspend the laws of physics, but again: let's not get pedantic.)

When we write a piece of code, we write that code with certain assumptions. If I write:

while (i < 10) { /* ... */ };

I'm assuming variable 'i' contains some kind of number, which at the start of the loop is smaller than 10, and which during the loop will not suddenly change into, say, an open file pointer. Now if I'm not a total idiot, and I write my code properly, that last thing will never happen. Given a piece of code, it's generally not difficult to make correct assumptions about that code. But making assumptions about input into that code (in this case the contents of variable 'i'), regardless of where it came from (interactive input, the network, the filesystem, passed as an argument into a function, whatever), is much more difficult and dangerous.

I'm going to make a sweeping generalization: Given that "the output of a digital circuit or computer program at any time is completely determined by its current inputs and its state", and given that a program's current state is determined by its starting state and all the input, and that we can thus disregard the "and its state" part of the first statement... I'm going to claim that

All bugs in a program are a result of the input into that program.

Of course, like I said, that statement is sweeping, and therefore false. But for now, assume it's true and think about its implications for a moment, while I correct my previous statement.

Predictable state

Of course, not all bugs in a program are a result of input into that program. Sometimes us programmers screw up, and make mistakes. This is another source of bugs in our programs. Sometimes we're tired, and we make mistakes in the syntax of our programs. Sometimes we don't pay attention, and a certain library works differently than we thought. Such bugs are easily solved. In fact, many can be automatically detected by compilers or asserts or whatever.

Another class of bugs is the one where we make assumptions about the current state of a part of our program, and write new code in accordance with those assumptions. Those assumptions could be correct, resulting in working code, or they may be incorrect, resulting in bugs.

A coworker of mine once had to write some code for an ATM. Since ATMs are reasonably sensitive, he wasn't allowed to dynamically allocate memory. That is, he could not use malloc and was only allowed to use the memory statically allocated by the compiler. Not allowing dynamically allocated memory gets rid of an entire range of problems, all of which are directly related to unpredictable state. Buffer overflows, out-of-bounds array referencing, bugs in string copying, reading input... all those problems are greatly mitigated by not having dynamically allocated memory.

You always want the state of your program to be predictable. Input makes the state unpredictable. As soon as we get input into our program, we can only hope that that input adheres to the assumptions the code makes. The current state may be valid (though unpredictable), however we cannot make any claims about future states. Only after we have sufficiently validated and sanitized any input in such a way that it adheres to the code's assumptions can we speak about a predictable state again.

This is why unclear code is such a problem. Unclear code makes it hard to reason about code, thus hard to determine whether code might lead to unpredictable states. Basically, we want to make the assumptions of code as transparent as possible.

I hope everything I've said so far makes absolutely basic sense. I hope that at this point in the article you're thinking "Well, duh", because from what I've seen of many programmers.. they don't have a clue about the relation between input, state and bugs.

"So what?"

It is important to grasp this basic concept of state because it helps you to reason about why and when certain programming techniques are a good or bad idea.

  • Functions are a good idea because they make it easier for the programmer to reason about input and its effect on state within that function. Easier reasoning about state equals better assumptions about code, resulting in less bugs.

  • Object Oriented Programming is a good idea because it groups together data and the logic that operates on that data, and therefore allows the programmer to encapsulate state in such a way that it is harder to disrupt that state with outside influences.

  • Using global variables is a bad idea because it screws up all the assumptions code makes about that global variable. Wherever you use a global variable, you cannot reason about the state of your program, since the value of the global variable can change at any time.

  • Allowing publicly settable properties in objects (that includes setter methods, for you Java developers out there) is a bad idea because now any assumptions methods in that object make about the state of those properties can be invalid. Conversely, making the constructor the only part of the object that is allowed to receive input is a good idea (even if it's not always practical).

  • goto is considered evil because it allows for arbitrary changes to the state of a program, and therefore can make it incredibly hard for developers to reason about the current state of the program at any point.

  • Functional programming is a good idea because it avoids mutable data as much as possible, and therefore it is hard to bring the program into an undefined state.

  • Multi-threaded programs where threads write to the same memory are a bad idea because the order of writes is non-deterministic, which causes unpredictable state. This can be avoided using locking, which is itself incredibly hard to do properly, leading to state which is difficult to reason about.

  • Smaller functions are a good idea because, again, it makes it easier to reason about the state and the changes in that state within the function.

  • ...

I could go on, but I hope you get the point. Please note that these points have to be considered in context. Don't take them as absolutes, because that would go entirely against what I'm trying to say. Goto can most certainly be used to create state that is more predictable and easier to reason about. In general, however, it is considered detrimental to predictable state.

Back to input

Let's jump back to input for a moment. The age-old adage goes: "Validate your input!" What does that mean? Quick question: given this function:

def foo(a, b):
    c = a / b
    return c

How many bugs are in that function? We can immediately spot a potential division-by-zero bug and an overflow bug. There are others, but the point is: you don't know! The input into this function is not validated. We're making assumptions about the calling code which may or may not be true. The key here is that the concept of "input" doesn't just apply to what you read from the keyboard, a file or the network. It applies to every line of code. Programmers who do not realize this are what I like to call Pavlovian Programmers... Sure, you can teach them a trick (goto is bad!), but they'll never understand the reasoning behind that trick, and they won't be able to apply it to new concepts. This is why programmers are jumping on the event-driven programming bandwagon, despite what an insanely bad idea it is. Basically, event-driven programming (the kind that binds anonymous functions as callbacks to events. Yes, javascript et al) is a kind of goto, except harder to debug.

Ideally, every function and every constructor would validate its input arguments. (Contracts, anyone?) The stricter you validate input arguments, the less chance there is of winding up in an unpredictable state. Of course I say "ideally", because it is not practical to validate each and every bit of input. A decent compromise would be to validate the input of all constructors that are part of library code. Then again, 99% of your code should be library code. But that's a discussion for another time.
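As a small illustration, here is the division function from above with its assumptions made explicit. This is just a sketch; how strict the validation should be depends entirely on the context:

def foo(a, b):
    # State the assumptions instead of hoping the caller behaves.
    if not isinstance(a, int) or not isinstance(b, int):
        raise TypeError('foo() expects two integers')
    if b == 0:
        raise ValueError('b must not be zero')
    return a / b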

Conclusions

The conclusions of this article are simple:

  • Prevent bugs
  • Write clear code

I'm not trying to be funny here. The conclusions are the same as what you've already heard a thousand times. I just hope that you now view them in a bit of a different light. I believe we can accomplish the above and create much better code if everybody would just realize a few basic concepts:

  • It is vitally important to protect the state at all times.
  • Minimize "outside" influences on the state. This includes input from files, the network, other threads that are running, global variables, etc.
  • The State in a program is unpredictable all the time. Whenever we read input, whenever a function returns or throws an error, our program is in an unpredictable state. The system should, at all times, move from unpredictable and potentially invalid states to a predictable and valid state. This can be accomplished by validating and sanitizing any input into the program.

I leave you with this quote, which I think I once read somewhere else, but cannot find again, so I will attribute to myself:

Programming is the messy art of constantly bringing a program back to a predictable state after receiving a tiny bit of input.

python-libtorrent program doesn't exit

(TL;DR: To the solution)

I was mucking about with the Python bindings for libtorrent, and made something like this:

import time

import libtorrent

fname = 'test.torrent'
session_dir = '.'  # directory to save the downloaded data in

ses = libtorrent.session()
ses.listen_on(6881, 6891)

info = libtorrent.torrent_info(fname)
h = ses.add_torrent({'ti': info, 'save_path': session_dir})
prev_progress = -1
while (not h.is_seed()):
    status = h.status()
    progress = int(round(status.progress * 100))
    if progress != prev_progress:
        print 'Torrenting %s: %i%% done' % (h.name(), progress)
        prev_progress = progress
    time.sleep(1)
 
print "Done torrenting %s" % (h.name())
# ... more code

After running it a few times, I noticed the program would not always terminate. You'd immediately suspect a problem in the while loop condition, but in all cases "Done torrenting Foo" would be printed and then the program would hang.

In celebration of one of the rare occasions that I don't spot a hanging problem in such a simple piece of code right away, I fired up PDB, the Python debugger, which told me:

$ pdb ./tvt 
> /home/fboender/Development/tvtgrab/trunk/src/tvt(9)<module>()
-> import sys
(Pdb) cont
Torrenting Example Torrent v1.0: 100% done
Done torrenting Example Torrent v1.0
The program finished and will be restarted

after which it promptly hung. That last line, "The program finished and will be restarted", that's PDB telling us execution of the program finished. Yet it still hung.

At this point, I was suspecting threads. Since libtorrent is a C++ program, and as the main loop in my code doesn't actually really do anything, it seems libtorrent is doing its thing in the background, and not properly shutting down every now and then. (Although it's more likely I just don't understand what it's doing) It's quite normal for torrent clients to take a while before closing down, especially if there are still peers connected. Most of the time, if I waited long enough, the program would terminate normally. However, sometimes it wouldn't terminate even after an hour, even if no peers were at any point connected to any torrents (the original code does not always load torrents into a session).

Digging through the documentation, I couldn't easily find a method of shutting down the session. I did notice the following:

~session()

The destructor of session will notify all trackers that our torrents have been shut down. If some trackers are down, they will time out. All this before the destructor of session returns. So, it's advised that any kind of interface (such as windows) are closed before destructing the session object. Because it can take a few second for it to finish. The timeout can be set with set_settings().


Seems like libtorrent uses destructors to shut down the session. Adding the following to the end of the code fixed the problem of the script not exiting:

del ses

The del statement in Python calls any destructors (if you're lucky) on that object. Having nearly zero C++ knowledge, I suspect C++ calls destructors automatically at program exit. Python doesn't do that though, so we have to call it manually.
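To make sure the destructor also runs when the download loop blows up with an exception, you can make the cleanup explicit. A sketch based on the code above:

ses = libtorrent.session()
try:
    # ... add the torrent and run the progress loop as above ...
    pass
finally:
    del ses  # drops the last reference, triggering the session destructor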

Update: Calling the destructor does not definitively solve the problem. I am still experiencing problems with hangs when calling the session destructor. I will investigate further and update when a solution has been found.

Update II: Well, I've not been able to solve the problem any other way than upgrading to the latest version of libtorrent. So I guess that'll have to do.

Conque: Terminal emulators in Vim buffers

For the longest time, I've searched for a way to run terminal emulators in Vim buffers.

As a kind of work-around, I created Bexec, which allows you to run the current contents of a buffer through an external program. It then captures the output and inserts/appends it to another buffer.

Although Bexec works reasonably well, and still has its uses, it's not a true terminal emulator in Vim. Today I finally found a Vim plugin that lets you actually run interactive commands / terminals in Vim buffers: Conque.

It requires Vim with Python support built in. Installation is straight-forward if you've got the requirements.

Download the .vmb file, edit it in vim, and issue:

:so %

It will then be installed. Quit Vim, restart it, and you can start using it:

:ConqueTerm bash

Very awesome.

Read less

A programmer once built a vast database containing all the literature, facts, figures, and data in the world. Then he built an advanced querying system that linked that knowledge together, allowing him to wander through the database at will. Satisfied and pleased, he sat down before his computer to enjoy the fruits of his labor.

After three minutes, the programmer had a headache. After three hours, the programmer felt ill. After three days, the programmer destroyed his database. When asked why, he replied: “That system put the world at my fingertips. I could go anywhere, see anything. Because I was no longer limited by external conditions, I had no excuse for not knowing everything there is to know. I could neither sleep nor eat. All I could do was wander through the database. Now I can rest.”

— Geoffrey James, Computer Parables: Enlightenment in the Information Age

I was a major content consumer on the Internet. My Google Reader had over 120 feeds in it. It produced more than 1000 new items every couple of hours. I religiously read Hacker News, Reddit and a variety of other high-volume sources of content. I have directories full of theoretical science papers, articles on a wide range of topics and many, many tech books. I scoured the web for interesting articles to save to my tablet for later reading. I was interested in everything. Programming, Computer Science, Biology, Theoretical Particle Physics, Psychology, rage-comics, and everything else. I could get lost for hours on Wikipedia, jumping from article to article, somehow, without noticing it, ending up at articles titled "Gross–Pitaevskii equation" or "Grand Duchy of Moscow", when all I needed to know was what the abbreviation "SCPD" stood for. (Which, by the way, Wikipedia doesn't have an article for, and means "Service Control Point Definition")

I want to make it clear I wasn't suffering from Information Overload by any definition. I was learning things. I knew things about technology which I hadn't even ever used myself. I can tell you some of the ins and outs of iPhone development. I don't even own an iPhone. I can talk about Distributed Computing, Transactional Memory and why it is and isn't a good idea, without having written more than a simple producer/consumer routine. I'm even vehemently against writing to shared memory in any situation! I can tell you shit about node.js and certain NoSQL databases without even ever having installed – much less dived into – them. Hell, I don't even like Javascript!

The thing is: even though I was learning about stuff, it was superficial knowledge, without the context and the kind of basic information that allow you to draw the conclusions you're reading about for yourself, without the help of some article. I didn't pause to think about conclusions drawn in an article, or to let the information sink in. I read article after article. I wasn't putting the acquired knowledge into practice. The Learning Pyramid may have been discredited, but I'm convinced that we learn more from doing than we do from reading about something.

So what makes reading so attractive that we'd rather read about things than actually do them? I know for a fact that I'm not alone in having this problem. I think – and this might be entirely personal – it's because of a couple of reasons.

One is that it's much easier to read about something than to actually figure things out yourself. I want to experiment with sharding in NoSQL databases? I have to set up virtual machines, set up the software, write scripts to generate testing data, think about how to perform some experiments, and actually run them. Naturally I'd want to collect some data from those experiments; maybe reach a couple of conclusions even. That's a lot of work. It's much easier to just read about it. It's infinitely easier to stumble upon and read an article on "How to Really Get Things Done Using GettingThingsDone2.0 and Reverse Todo Lists" than it is to actually get something done.

The second reason, at least for me, is that it gives me the feeling that I'm learning more about things. In the time it takes me to set up all the stuff above, I could have read who-knows-how-many articles. And it's true in a sense. The information isn't useless per se. I'm learning more shallow knowledge about a lot of different things, versus in-depth knowledge about a few things. It gives me all kinds of cool ideas, things to do, stuff to try out. But I never get around to those things, because I'm always busy reading about something else!

So I have taken drastic measures.

I have removed close to 95% of my feeds from Google Reader. I've blocked access to Reddit and HackerNews so I'm not tempted to read the comments there. I check hackurls.com (an aggregator for Hacker News, Reddit's /r/programming and some other stuff) at most once a day. Anything interesting I see, I send to my tablet (at most two articles a day), which I only read on the train (where I don't have anything better to do anyway). I avoid Wikipedia like the plague.

I distinctly remember being without an Internet connection for about a month almost four years ago. It was the most productive time of my life since the Internet came around. I want to return to the times when the Internet was a resource for solving problems and doing research, not an interactive TV shoveling useless information into my head.

Now if you'll excuse me, I have an algorithm to write and a website to finish.

Read the POSIX standard

Stop reading your local manual pages when programming/scripting stuff, and use the POSIX standard instead:

Online POSIX 2008 (IEEE Std 1003.1-2008) standard

There are four main parts: Base Definitions (XBD), System Interfaces (XSH), Shell & Utilities (XCU), and Rationale (XRATIONALE).

Some Do's and Don'ts:

Finally, read Bash Pitfalls to learn why your shell scripting sucks.

Python UnitTest: AssertRaises pitfall

I ran into a little pitfall with Python's UnitTest module. I was trying to unit test some failure cases where the code I called should raise an exception.

Here's what I did:

def test_file_error(self):
    self.assertRaises(IOError, file('/foo', 'r'))

I mistakenly thought this would work, in that assertRaises would notice the IOError exception and mark the test as passed. Naturally, it doesn't:

ERROR: test_file_error (__main__.SomeTest)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "./test.py", line 10, in test_file_error
    self.assertRaises(IOError, file('/foo', 'r'))
IOError: [Errno 2] No such file or directory: '/foo'

The problem is that I'm a dumbass and I didn't read the documentation carefully enough:


assertRaises(exception, callable, *args, **kwds)
Test that an exception is raised when callable is called with any positional or keyword arguments that are also passed to assertRaises().

If you look carefully, you'll notice that I did not pass in a callable. Instead, I passed in the result of a callable! Here's the correct code:

def test_file_error(self):
    self.assertRaises(IOError, file, '/foo', 'r')

The difference is that this time I pass a callable (file) and the arguments ('/foo' and 'r') that the test case should pass to that callable. self.assertRaises will then call it for me with the specified arguments and catch the IOError. In the first scenario (the wrong code), the call is made before the unit test is actually watching out for it.
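As an aside: on Python 2.7 and later, assertRaises can also be used as a context manager, which side-steps this pitfall entirely because the call happens inside the with block:

def test_file_error(self):
    with self.assertRaises(IOError):
        file('/foo', 'r')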