Tuesday, September 18, 2012

Developing Portably, For the Future

If I were to start a new project today, what language and environment would I write it in? (Assuming I'm not targeting an environment that makes the choice for me: Android, iOS, or browser-side JavaScript.)
So, let's say I'm doing some data processing or some web serving: I'm writing a command line tool or daemon. I'd like it to be portable between the Mac and Linux machines I own, and I'd like to not regret my choice of environment if I'm still using this thing 1-5 years from now.
What's in?
C/C++ and Python. I'm almost ready to concede adding JavaScript to this list, but I don't actually like the language, though I'm fine with using it when I need to get things done in browser UIs. I believe these languages will be solidly supported into the future and remain free of death-spiralling suck. The Python 3 transition is going slowly, but I think it could yet turn out okay. I wish Python had static type checking. C++ has had lots of stuff added in the last 10 years that I don't want to use.
What's out?
Java - Oracle seems to be killing it. Damn shame; a few years ago I called it my favorite language.
C# - Microsoft actually made a decent language. I've been using it at work lately, and they did some decent things about pushing it into an open-ish standard. But I don't see any compelling reason to use it, and I'm not convinced I can rely on the Mono framework now and into the future.
PHP, Ruby - I hear nothing but grief about using these. I dislike PHP as both a language and an environment. It's amazing that things like MediaWiki and Drupal are built on it, but I'm disinclined to hack on them because of PHP.
Perl - Been there, done that. I use Python now.
Any functional language - they make hard what is easy in other languages.
What's left?
Go - Google has a cute little language there. It checks off my favorite features: compiled, bounds-checked, and garbage-collected. It's BSD licensed, so no one company can kill it the way Oracle is killing Java, but I feel like it's not ready yet. It's also a kind of weird little language, and there are a few major language features that, if Go had them, would make me much more productive. Another major language revision and some growth in the community library support, and hopefully it'll be better a year from now.

Sunday, August 26, 2012

Archive Random Read Performance in Python

I was curious about the performance of random read access of archive files in Python. I knew ZipFile was okay, and I wondered how it would compare to an sqlite3 database; I threw in tarfile (gzipped) for extra curiosity. The motivation for this was the pytz package and its nearly 600 small read-only data files. In one system in the past it was annoying to install and manage this menagerie, and in a fit of premature optimization I figured it ought to be better to read from one archive file anyhow. So I scratched this curiosity itch and found out some things.

In short:
Zip archives are pretty good.
Gzipped tar files are smaller than zip archives, but much slower to read (at least in the Python implementation).
Sqlite3 databases are slightly faster to read than zip files, and a little faster still if you don't compress the data.
Sqlite3 databases are much faster at reading the first record; repeated reads from an already open zip archive or sqlite3 database are relatively close in performance.
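
To make the comparison concrete, here's a minimal sketch of the two faster read paths. This is my illustration, not the actual test code; the files table schema in the sqlite3 case is an assumption of mine.

import sqlite3
import zipfile

def read_from_zip(path, member):
    # Open the archive and read one member; ZipFile parses the
    # central directory on open, then seeks to the member's data.
    with zipfile.ZipFile(path, 'r') as zf:
        return zf.read(member)

def read_from_sqlite(path, name):
    # Fetch one BLOB by name; assumes a table like
    # CREATE TABLE files (name TEXT PRIMARY KEY, data BLOB)
    conn = sqlite3.connect(path)
    try:
        row = conn.execute('SELECT data FROM files WHERE name = ?',
                           (name,)).fetchone()
        return row[0] if row else None
    finally:
        conn.close()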

Here's some data:
The test generated 1000 files with names between 5 and 100 characters long and data between 50 and 5000 bytes long (uniform distributions, random filler data).
I cut off each test at 10 seconds, so the slower methods wound up with fewer reads done. The test data was random and therefore pretty much incompressible, so everything was trying to compress the data and decompress it on read, but the resulting archive file sizes were all pretty consistent. On real data (the pytz files) tar.gz can be half the size of zipfile or sqlite3 (which are similar, though sqlite is slightly larger).
The reopen=True trials re-open the archive file before each random read, so the time reported is the time to open the file and read one random record. Otherwise the archive is opened once and read many times for all the random reads. This makes a huge difference in performance, but sqlite3 has the best archive startup time.
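
The timing loop was roughly this shape. Again a sketch under my assumptions, not the exact test code; open_archive and read_record stand in for the per-format open and read functions.

import random
import time

def bench(open_archive, read_record, names, seconds=10.0, reopen=False):
    # Read random records until the time budget runs out. With
    # reopen=True the archive is opened fresh for every read, so each
    # read also pays the archive startup cost.
    count = 0
    deadline = time.time() + seconds
    archive = None if reopen else open_archive()
    while time.time() < deadline:
        name = random.choice(names)
        if reopen:
            archive = open_archive()
        read_record(archive, name)
        if reopen:
            archive.close()
        count += 1
    if not reopen:
        archive.close()
    return count

With a fixed time budget like this, the slower methods simply complete fewer reads before the deadline, which is why the cutoff shows up as fewer reads done rather than longer times.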

I have a Mercurial archive of the sources I wrote to test all this here:
http://bolson.org/~bolson/sqlitearchive
There is also a utility to convert a directory of files into an sqlite3 database, mapping each file's path to a BLOB.
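
The core of such a utility can be quite small. Here's a sketch; the files table schema and function name are my assumptions, not necessarily what the real utility uses.

import os
import sqlite3

def dir_to_sqlite(src_dir, db_path):
    # Walk the tree and store each file's bytes as a BLOB, keyed by
    # its path relative to src_dir.
    conn = sqlite3.connect(db_path)
    conn.execute(
        'CREATE TABLE IF NOT EXISTS files (name TEXT PRIMARY KEY, data BLOB)')
    for root, _dirs, fnames in os.walk(src_dir):
        for fname in fnames:
            fpath = os.path.join(root, fname)
            rel = os.path.relpath(fpath, src_dir)
            with open(fpath, 'rb') as f:
                data = f.read()
            conn.execute(
                'INSERT OR REPLACE INTO files (name, data) VALUES (?, ?)',
                (rel, sqlite3.Binary(data)))
    conn.commit()
    conn.close()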