Sunday, August 26, 2012

Archive Random Read Performance in Python

I was curious about the performance of random read access to archive files in Python. I knew ZipFile was okay, and I wanted to see how it would compare to an sqlite3 database; I threw in tarfile (gzipped) for extra curiosity. The motivation for this was the pytz package and its nearly 600 small read-only data files. On one system in the past it was annoying to install and manage this menagerie, and in a fit of premature optimization I figured it ought to be better to read from one archive file anyhow, so I scratched this curiosity itch and found out some things.

In short:
Zip archives are pretty good.
Gzipped tar files are smaller than zip archives, but much slower to read (at least in the Python implementation).
Sqlite3 databases are slightly faster to read than zip files, and a little bit faster still if you don't compress the data.
Sqlite3 databases are much faster at reading the first record.
Repeated reads from an already open zip archive or sqlite3 database are relatively close in performance.

Here's some data:
The test file generated 1000 files with names between 5 and 100 characters long and data between 50 and 5000 bytes long (uniform distributions, random filler data).
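
For illustration, test data of that shape could be generated along these lines. This is a minimal sketch and not the original test script: the function name make_test_files, the seed parameter, and the choice of lowercase letters and digits for file names are all my assumptions here.

```python
import random
import string


def make_test_files(count=1000, seed=0):
    """Return a dict mapping random names to random byte strings.

    Hypothetical helper: names are 5 to 100 characters long and data is
    50 to 5000 bytes long, both drawn from uniform distributions.
    """
    rng = random.Random(seed)
    alphabet = string.ascii_lowercase + string.digits  # assumed name alphabet
    files = {}
    # Loop until we have `count` distinct names (collisions are unlikely).
    while len(files) < count:
        name_len = rng.randint(5, 100)
        name = "".join(rng.choice(alphabet) for _ in range(name_len))
        size = rng.randint(50, 5000)
        # Random filler bytes, which compress very poorly.
        files[name] = bytes(rng.getrandbits(8) for _ in range(size))
    return files
```

The resulting dict can then be written into whichever archive format is being tested.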
I cut off each test at 10 seconds, so the slower methods wound up with fewer reads done. The test data was random filler, which is pretty much incompressible. Every method still compressed the data on write and decompressed it on read, but the resulting archive file sizes were all pretty consistent. On real data (the pytz files) tar.gz can be half the size of zipfile or sqlite3 (which are similar, though sqlite3 is slightly larger).
The reopen=TRUE trials re-open the archive file before each random read, so the time reported is the time to open the file and read one random record. Otherwise the archive is opened once and then used for all the random reads. This makes a huge difference in performance, but sqlite3 has the best archive startup time.
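
The difference between the two kinds of trial can be sketched roughly like this. It is not the original test code: the function names, the table name files, and its path/data columns are assumptions for illustration, and only zip and sqlite3 (storing uncompressed BLOBs) are covered.

```python
import sqlite3
import time
import zipfile


def build_zip(path, files):
    """Write a dict of {name: bytes} into a deflate-compressed zip archive."""
    with zipfile.ZipFile(path, "w", zipfile.ZIP_DEFLATED) as zf:
        for name, data in files.items():
            zf.writestr(name, data)


def build_sqlite(path, files):
    """Write the same dict into a table mapping path -> BLOB (uncompressed)."""
    conn = sqlite3.connect(path)
    conn.execute("CREATE TABLE files (path TEXT PRIMARY KEY, data BLOB)")
    conn.executemany("INSERT INTO files VALUES (?, ?)", files.items())
    conn.commit()
    conn.close()


def open_zip(path):
    """Return (read, close) callables for a zip archive."""
    zf = zipfile.ZipFile(path)
    return zf.read, zf.close


def open_sqlite(path):
    """Return (read, close) callables for the sqlite3 archive."""
    conn = sqlite3.connect(path)

    def read(name):
        row = conn.execute(
            "SELECT data FROM files WHERE path = ?", (name,)
        ).fetchone()
        return bytes(row[0])

    return read, conn.close


def timed_reads(opener, path, names, reopen, limit=10.0):
    """Read names in order until done or `limit` seconds have passed.

    With reopen=True the archive is opened and closed around every read;
    otherwise it is opened once. Returns (reads_done, elapsed_seconds).
    """
    start = time.perf_counter()
    done = 0
    if not reopen:
        read, close = opener(path)
    for name in names:
        if time.perf_counter() - start > limit:
            break  # cut the trial off at the time limit
        if reopen:
            read, close = opener(path)
        read(name)
        if reopen:
            close()
        done += 1
    if not reopen:
        close()
    return done, time.perf_counter() - start
```

Calling timed_reads(open_zip, path, names, reopen=True) and then again with reopen=False on the same list of randomly chosen names gives the two numbers to compare; dividing reads done by elapsed time gives reads per second.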

I have a Mercurial archive of the sources I wrote to test all this here:
http://bolson.org/~bolson/sqlitearchive
There is also a utility to convert a directory of files into a sqlite3 database that maps paths to BLOBs for each file.
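
For anyone who just wants the idea without fetching the sources, a converter of that kind might look roughly like the following. This is a sketch under my own assumptions and not the utility from the repository: the names dir_to_sqlite and read_blob, the table name files, and the use of forward-slash relative paths as keys are all made up for the example. It also assumes the database file lives outside the directory being converted.

```python
import os
import sqlite3


def dir_to_sqlite(root, db_path):
    """Store every file under root as a (relative path, BLOB) row.

    Returns the number of files stored. db_path should be outside root.
    """
    conn = sqlite3.connect(db_path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS files (path TEXT PRIMARY KEY, data BLOB)"
    )
    count = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for filename in filenames:
            full = os.path.join(dirpath, filename)
            # Use forward slashes so keys look the same on every platform.
            rel = os.path.relpath(full, root).replace(os.sep, "/")
            with open(full, "rb") as f:
                conn.execute(
                    "INSERT OR REPLACE INTO files VALUES (?, ?)",
                    (rel, f.read()),
                )
            count += 1
    conn.commit()
    conn.close()
    return count


def read_blob(db_path, rel_path):
    """Return the stored bytes for rel_path, or None if it is not present."""
    conn = sqlite3.connect(db_path)
    try:
        row = conn.execute(
            "SELECT data FROM files WHERE path = ?", (rel_path,)
        ).fetchone()
    finally:
        conn.close()
    return None if row is None else bytes(row[0])
```

For the pytz case, this would mean running dir_to_sqlite once over the directory of data files and then calling read_blob with a key such as "America/Denver" to get a single file's contents back.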