Increasing performance of bulk updates of large tables in MySQL

Wednesday, November 6th, 2013

I recently had to perform some bulk updates on semi-large tables (3 to 7 million rows) in MySQL. I ran into various problems that negatively affected the performance on these updates. In this blog post I will outline the problems I ran into and how I solved them. Eventually I managed to increase the performance from about 30 rows/sec to around 7000 rows/sec. The final performance was mostly CPU bound. Since this was on a VPS with only limited CPU power, I expect you can get better performance on some decently outfitted machine/VPS.

The situation I was dealing with was as follows:

About 20 tables, 7 of which were between 3 and 7 million rows.
Both MyISAM and InnoDB tables.
Updates required on values on every row of those tables.
The updates where too complicated to do in SQL, so they required a script.
All updates where done on rows that were selected on just their primary key. I.e. WHERE id = …

Here are some of the problems I ran into.

Python’s MySQLdb is slow

I implemented the script in Python, and the first problem I ran into is that the MySQLdb module is slow. It’s especially slow if you’re going to use the cursors. MySQL natively doesn’t support cursors, so these are emulated in Python code. One of the trickiest things is that a simple SELECT * FROM tbl will retrieve all the results and put them in memory on the client. For 7 million rows, this quickly exhausts your memory. Real cursors would fetch the result one-by-one from the database so that you don’t exhaust memory.

The solution here is to not use MySQLdb, but use the native client bindings available with import mysql.

LIMIT n,m is slow

Since MySQL doesn’t support cursors, we can’t mix SELECT and UPDATE queries in a loop. Thus we need to read in a bunch of rows into memory and update in a loop afterwards. Since we can’t keep all the rows in memory, we must read in batches. An obvious solution for this would be a loop such as (pseudo-code):

offset = 0
size = 1000
while True:
    rows = query('SELECT * FROM tbl LIMIT :offset, :size"
    for row in rows:
        # do some updates
    if len(rows) < size:
        break
    offset += size

This would use the LIMIT to read the first 1000 rows on the first iteration of the loop, the next 1000 on the second iteration of the loop. The problem is: in MySQL this becomes linearly slower for higher values of the offset. I was already aware of this, but somehow it slipped my mind.

The problem is that the database has to advance an internal pointer forward in the record set, and the further in the table you get, the longer that takes. I saw performance drop from about 5000 rows/sec to about 100 rows/sec, just for selecting the data. I aborted the script after noticing this, but we can assume performance would have crawled to a halt if we kept going.

The solution is to use the order by the primary key and then select everything we haven’t processed yet:

size = 1000
last_id = 0
while True:
    rows = query('SELECT * FROM tbl WHERE id > :last_id ORDER BY id LIMIT :size')
    if not rows:
        break
    for row in rows:
        # do some updates
        last_id = row['id']

This requires that you have an index on the id field, or performance will greatly suffer again. More on that later.

At this point, +SELECT+s were pretty speedy. Including my row data manipulation, I was getting about 40.000 rows/sec. Not bad. But I was not updating the rows yet.

Connection settings

The next things I did is some standard tricks to speed up bulk updates/inserts by disabling some foreign key checks and running batches in a transaction. Since I was working with both MyISAM and InnoDB tables, I just mixed optimizations for both table types:

db.query('SET autocommit=0;')
db.query('SET unique_checks=0; ')
db.query('SET foreign_key_checks=0;')
db.query('LOCK TABLES %s WRITE;' % (tablename))
db.query('START TRANSACTION;')

# SELECT and UPDATE in batches

db.query('COMMIT;')
db.query('UNLOCK TABLES')
db.query('SET foreign_key_checks=1;')
db.query('SET unique_checks=1; ')
db.query('SET autocommit=1;')

I must admit that I’m not sure if this actually increased performance at all. It is entirely possible that this actually hurts performance instead. Please test this for yourselves if you’re going to use it. You should also be aware that some of these options bypass MySQL’s data integrity checks. You may end up with invalid data such as invalid foreign key references, etc.

One mistake I did make was that I accidentally included the following in an early version of the script:

db.query('ALTER TABLE %s DISABLE KEYS;' % (tablename))

Such is the deviousness of copy-paste. This option will disable the updating of non-unique indexes while it’s active. This is an optimization for MyISAM tables to massively improve performance of mass INSERTs, since the database won’t have to update the index on each inserted row (which is very slow). The problem is that this also disables the use of indexes for data retrieving, as noted in the MySQL manual:

While the nonunique indexes are disabled, they are ignored for statements such as SELECT and EXPLAIN that otherwise would use them.

That means update queries such as UPDATE tbl SET key=value WHERE id=1020033 will become incredibly slow, since they can no longer use indexes.

MySQL server tuning

I was running this on a stock Ubuntu 12.04 installation. MySQL is basically completely unconfigured out of the box on Debian and Ubuntu. This means that those 16 GBs of memory in your machine will go completely unused unless you tune some parameters. I modified /etc/mysql/my.cnf and added the following settings to improve the speed of queries:

[mysqld]
key_buffer         = 128M
innodb_buffer_pool_size = 3G

The key_buffer setting is a setting for MyISAM tables that determines how much memory may be used to keep indexes in memory. The equivalent setting for InnoDB is innodb_buffer_pool_size, except that the InnoDB setting also includes table data.

In my case the machine had 4 GB of memory. You can read more about the settings on these pages:

Don’t forget to restart your MySQL server.

Dropping all indexes except primary keys

One of the biggest performance boosts was to drop all indexes from all the tables that needed to be updates, except for the primary key indexes (those on the id fields). It is much faster to just drop the indexes and recreate them when you’re done. This is basically the manual way to accomplish what we hoped the ALTER TABLE %s DISABLE KEYS would do, but didn’t.

UPDATE: I wrote a better script which is available here.

Here’s a script that dumps SQL commands to drop and recreate indexes for all tables:

#!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
# DANGER WILL ROBINSON, READ THE important notes BELOW
#!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
import _mysql
import sys

mysql_username = 'root'
mysql_passwd = 'passwd'
mysql_host = '127.0.0.1'
dbname = 'mydb'

tables = sys.argv[1:]
indexes = []

db = _mysql.connect(user=mysql_username, passwd=mysql_passwd, db=dbname)
db.query('SHOW TABLES')
res = db.store_result()
for row in res.fetch_row(maxrows=0):
    tablename = row[0]
    if not tables or tablename in tables:
        db.query('SHOW INDEXES FROM %s WHERE Key_name != "PRIMARY"' % (tablename))
        res = db.store_result()
        for row_index in res.fetch_row(maxrows=0):
            table, non_unique, key_name, seq_in_index, column_name, \
            collation, cardinality, sub_part, packed, null, index_type, \
            comment, index_comment = row_index
            indexes.append( (key_name, table, column_name) )

for index in indexes:
    key_name, table, column_name = index
    print "DROP INDEX %s ON %s;" % (key_name, table)

for index in indexes:
    key_name, table, column_name = index
    print "CREATE INDEX %s ON %s (%s);" % (key_name, table, column_name)

Output looks like this:

$ ./drop_indexes.py
DROP INDEX idx_username ON users;
DROP INDEX idx_rights ON rights;
CREATE INDEX idx_username ON users (username);
CREATE INDEX idx_perm ON rights (perm);

Some important notes about the above script:

The script is not foolproof! If you have non-BTREE indexes, if you have indexes spanning multiple columns, if you have any kind of index that goes beyond a BTREE single column index, please be careful about using this script.
You must manually copy-paste the statements into the MySQL client.
It does NOT drop the PRIMARY KEY indexes.

Conclusions

In the end, I went from about 30 rows per second around 8000 rows per second. The key to getting decent performance is too start simple, and slowly expand your script while keeping a close eye on performance. If you see a dip, investigate immediately to mitigate the problem.

Useful ways of investigation slow performance is by using tools to unearth evidence of the root of the problem.

top can tell you if a process is mostly CPU bound. If you’re seeing high amounts of CPU, check if your queries are using indexes to get the results they need.
iostat can tell you if a process is mostly IO bound. If you’re seeing high amounts of I/O on your disk, tune MySQL to make better use of memory to buffer indexes and table data.
Use the EXPLAIN function of MySQL to see if, and which, indexes are being used. If not, create new indexes.
Avoid doing useless work such as updating indexes after every update. This is mostly a matter of knowing what to avoid and what not, but that’s what this post was about in the first place.
Baby steps! It took me entirely too long to figure out that I was initially seeing bad performance because my SELECT LIMIT n,m was being so slow. I was completely convinced my UPDATE statements were the cause of the initial slowdowns I saw. Only when I started commenting out the major parts of the code did I see that it was actually the simple SELECT query that was causing problems initially.

That’s it! I hope this was helpful in some way!