Sunday, August 26, 2012

Archive Random Read Performance in Python

I was curious about the performance of random read access to archive files in Python. I knew ZipFile was okay, and I was curious how it would perform compared to an sqlite3 database, and I threw in tarfile (gzipped) for extra curiosity. The motivation for this was the pytz package and its nearly 600 small read-only data files. In one system in the past it was annoying to install and manage this menagerie, and in a fit of premature optimization I figured it ought to be better to read from one archive file anyhow, so I scratched this curiosity itch and found out some things.

In short:
Zip archives are pretty good.
Gzipped tar files are smaller than zip archives, but much slower to read (at least in the Python implementation).
Sqlite3 databases are slightly faster to read than zip files, and a little bit faster still if you don't compress the data.
Sqlite3 databases are much faster to read the first record.
Repeated reads from an already open zip archive or sqlite3 database are relatively close in performance.

Here's some data:

The test generated 1000 files with names between 5 and 100 characters long and data between 50 and 5000 bytes long (uniform distributions, random filler data).
I cut off each test at 10 seconds, so the slower methods wound up with fewer reads done.
The test data was made up of random data, which is pretty much incompressible. So, everything was trying to compress the data, and decompress it on read, but the resulting archive file sizes were all pretty consistent.
On real data (the pytz files) tar.gz can be half the size of zipfile or sqlite3 (which are similar, though sqlite is slightly larger).
The reopen=TRUE trials re-open the archive file for read before each random read, so the time reported is the time to open the file and read one random record. Otherwise the archive is opened once and then used for all the random reads.
Reopening makes a huge difference in performance, but sqlite3 has the best archive startup time.
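For anyone who wants to poke at this without grabbing the repository, here is a minimal sketch of the kind of comparison described above. It is not the original test code: the function names, the `files (path, data)` table layout, and the fixed read count (instead of a 10 second cutoff) are all assumptions made for illustration, tar.gz is left out, and the sqlite3 side stores the data uncompressed.

```python
import os
import random
import sqlite3
import tempfile
import time
import zipfile


def make_records(count=1000, seed=0):
    """Random names (5-100 chars) mapped to random data (50-5000 bytes)."""
    rng = random.Random(seed)
    records = {}
    while len(records) < count:
        name = "".join(rng.choice("abcdefghijklmnopqrstuvwxyz")
                       for _ in range(rng.randint(5, 100)))
        records[name] = bytes(rng.getrandbits(8)
                              for _ in range(rng.randint(50, 5000)))
    return records


def write_zip(path, records):
    with zipfile.ZipFile(path, "w", zipfile.ZIP_DEFLATED) as zf:
        for name, data in records.items():
            zf.writestr(name, data)


def write_sqlite(path, records):
    # Hypothetical schema: one row per file, uncompressed BLOB.
    conn = sqlite3.connect(path)
    conn.execute("CREATE TABLE files (path TEXT PRIMARY KEY, data BLOB)")
    conn.executemany("INSERT INTO files VALUES (?, ?)", records.items())
    conn.commit()
    conn.close()


def open_zip(path):
    zf = zipfile.ZipFile(path)
    return zf.read, zf.close


def open_sqlite(path):
    conn = sqlite3.connect(path)

    def read(name):
        row = conn.execute(
            "SELECT data FROM files WHERE path = ?", (name,)).fetchone()
        return bytes(row[0])

    return read, conn.close


def seconds_per_read(opener, path, names, reads, reopen):
    """Average time per random read; with reopen=True each read pays for an open."""
    rng = random.Random(1)
    read, close = (None, None) if reopen else opener(path)
    start = time.perf_counter()
    for _ in range(reads):
        name = rng.choice(names)
        if reopen:
            read, close = opener(path)
            read(name)
            close()
        else:
            read(name)
    elapsed = time.perf_counter() - start
    if not reopen:
        close()
    return elapsed / reads


def benchmark(count=1000, reads=200):
    records = make_records(count)
    names = list(records)
    results = {}
    with tempfile.TemporaryDirectory() as tmp:
        zip_path = os.path.join(tmp, "archive.zip")
        db_path = os.path.join(tmp, "archive.sqlite3")
        write_zip(zip_path, records)
        write_sqlite(db_path, records)
        for label, opener, path in (("zip", open_zip, zip_path),
                                    ("sqlite3", open_sqlite, db_path)):
            for reopen in (False, True):
                results[(label, reopen)] = seconds_per_read(
                    opener, path, names, reads, reopen)
    return results
```

Calling `benchmark()` returns a dict keyed by `(archive type, reopen)` with the average seconds per read. Numbers from a sketch like this depend heavily on the machine and on disk caching, so treat them as a rough comparison only.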

I have a mercurial archive of the sources I wrote to test all this here:
http://bolson.org/~bolson/sqlitearchive
There is also a utility to convert a directory of files into a sqlite3 database that maps paths to BLOBs for each file.
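The idea behind that utility can be sketched in a few lines. This is not the code from the repository: the function names `pack_directory` and `read_file` and the `files (path, data)` table are hypothetical, and the data is stored uncompressed.

```python
import os
import sqlite3


def pack_directory(root, db_path):
    """Store every file under root as a (relative path, BLOB) row. Returns the file count."""
    conn = sqlite3.connect(db_path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS files (path TEXT PRIMARY KEY, data BLOB)")
    count = 0
    with conn:  # one transaction for the whole directory
        for dirpath, _dirnames, filenames in os.walk(root):
            for filename in filenames:
                full = os.path.join(dirpath, filename)
                # Use forward slashes so keys look the same on every platform.
                rel = os.path.relpath(full, root).replace(os.sep, "/")
                with open(full, "rb") as f:
                    conn.execute(
                        "INSERT OR REPLACE INTO files (path, data) VALUES (?, ?)",
                        (rel, f.read()))
                count += 1
    conn.close()
    return count


def read_file(db_path, rel_path):
    """Return the stored bytes for rel_path, or None if there is no such row."""
    conn = sqlite3.connect(db_path)
    try:
        row = conn.execute(
            "SELECT data FROM files WHERE path = ?", (rel_path,)).fetchone()
    finally:
        conn.close()
    return None if row is None else bytes(row[0])
```

Keep the database file outside the directory being packed, otherwise this sketch would try to store the database inside itself.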

Monday, September 12, 2011

Git is backwards, sometimes

hg diff -r from -r to
svn diff -r from:to
git diff to..from
Let's say "from" is 1969 and "to" is 2010. I think normally I'd write the earlier thing on the left and the later thing on the right. I assume that's a cultural bias of our writing system.
If I run:
git diff 41af332c2c071d941a0aa90b963e4e499e6b16ed 730c38d1a5c7a35b23245c7f4acb03c292ea0c03
and 41… is earlier than 73… then pretty much what I expect happens. I get a forward diff.
But if I run:
git diff master mybranch
I get a reverse diff, which if applied would undo all the changes in mybranch. I have to run it backwards, `git diff mybranch master` in order to get a forward patch I could use to apply my changes. Why?
I can `svn diff ${repo}/trunk ${repo}/branches/mybranch` and get what I want.
I can `hg diff -r otherbranch -r mybranch` and get what I want.
git is backwards, but worse, it's backwards in this one seemingly random case. WTF.
I see git gaining popularity, and I see git sucking, and I hate that. I hate lousy technology winning. I hated that when it was Windows, or 386, or anything. Git sucks. Use Mercurial or Subversion.

Wednesday, August 31, 2011

Google will never do small projects, and doesn't see why you should either

Google AppEngine announced a new pricing model and the killer is that as soon as an 'app'* costs anything it costs $9 per month. The implication to me is clearly this: never do anything small. This probably won't be a barrier to people building a business on AppEngine, they're probably planning on moving many more dollars per month than $9. But I think it will be a barrier to tinkering and creativity.

In some sense 'never do anything small' could be good advice, but in another sense it is a path to missed opportunity. I have criticized Google's current corporate culture as being unable to do anything small. If they can't service 10000 users on opening day and a million within 6 months they're not interested (and really I think they'd be much more interested in 10 million). I think this means they'll never do anything truly new that would have to go through a phase of being small and experimental and unfinished and creative. Reddit or Twitter or Craigslist won't start there. And Google won't try to. They might acquire the next thing like that, but they won't start it.

I now see this culture polluting AppEngine. Under this new regime, I would not start a new AppEngine product speculatively just to see if something worked. I would not experiment there. Making a business decision, it is still viable, but it's not a place for creativity. Now, I'd almost certainly rather go with an Amazon Web Services EC2 micro instance or mini instance. I have one. It's great. It will host any number of 'apps' for the flat rate per month I'm already paying. Great. I can run any software I want and play there and try things.

(* What is an 'app'? It's a DNS name. reset.appspot.com or www.comicchopshop.com are a couple examples. Now, I can kinda cheat this and build any amount of functionality under that at /foo and /bar, but there are limits to how far it is a good idea and good design to go with that.)

Monday, June 20, 2011

Firmware Update on GoogleIO Samsung Galaxy Tab 10.1"

I went to GoogleIO and they gave me a Samsung Galaxy Tab 10.1"
It's a pretty nice piece of hardware.
Too bad about some of the software.
I have an Android phone to compare to, and in some ways it is better than the tablet. The phone browser has a better bookmark manager.
Today I discovered the lameness of Samsung's firmware update procedure.
1. Go to Settings: About Tablet: Software Update
2. Manually click the button for it to check for updates
3. Make an account with Samsung (okay, sure, have my email and a password I don't care about)
4. Click through a few other things to make it actually download the update
5. Click through a couple more things to actually do the update while it warns you that you will not be able to make phone calls (not a phone, wifi only) and that the 'phone' will reboot.
6. And here's the part that failed for me twice: the tablet should not be plugged into USB. It will simply refuse to apply the firmware update with USB plugged into my computer (maybe logicless power would work). So, with 80% charge I called it good enough and ran the update unplugged. It took a couple minutes for the very slow progress bar to get through the firmware update. Actually I wrote this whole blog post while waiting for it to do that. Hmpf.

Thursday, March 4, 2010

Election Reform, IRV and Politics

I don't like Instant Runoff Voting, but I'm a little sad that Burlington VT repealed IRV election of their mayor. Sure, their second IRV-run election was a flop, where three different counting methods could find three different winners, demonstrating that all of the anti-IRV FUD, dismissed as vaguely possible mathematical oddities, could actually happen in the real world. Still, I'm a little sad.

I want to see 'election reform', even if it's IRV, go forward and spread. I'll yell and scream at every stage I can that if we're going to do it we should do it right and use something better than IRV, but if that's the compromise I get I'll take it. And I'm afraid that because Burlington had a bad experience with 'election reform' (really all IRV's fault, IMFO), all such efforts will be tarnished. Every establishment politician who wants to raise Fear, Uncertainty and Doubt will be able to point at Burlington and make spooky noises about what terrible things could happen.

I've been focusing on redistricting lately, because programming that has been the more interesting puzzle and because the 2010 Census is making it timely, but I still think that getting people voting on rankings and ratings ballots could be the biggest thing to happen to Democracy since the US Constitution. Making that change may have just gotten a little harder because some people did it badly. (Which does not bode well for the current pretty-bad-compromise Health Care bill. :-/ )

Tuesday, April 29, 2008

blogger needs a 'delete post' feature

hmpf.

Monday, December 10, 2007

Al Gore's Speech Accepting the Nobel Prize

In which he compares Global Warming to Hitler, and the People In Charge now who do nothing to the people in charge back then who did nothing or were appeasers.

This nugget is what I think is the most awesome:

"shift the burden of taxation from employment to pollution"

How much of that could we really do? Make income tax lower and more progressive and tax pollution instead? Hell yeah!

I have something to say