*Yeah, yeah. Greek, Latin, who cares?

Monday, February 7, 2011

Binary Search is Better than That

I'm teaching a course on Geographic Information Systems this semester. I whined on facebook about a week ago about the surprising (to me) lack of computer skills among my students. Here, I'm going to whine about a math error in the course textbook.

I'm using Michael Demers' Fundamentals of Geographic Information Systems (4th edition). I like the textbook, but alas there's a rather striking math error that I found while prepping today's lecture.

In discussing computer file structures and searching, Demers gives an example of conducting linear search on a 200,000 record dataset: if each check is assumed to take 1 second*, he says, then the maximum time required is about 28 hours (100,000.5 seconds). The minor error here is that this is the expected time, not the maximum. (Maximum is, of course, 200,000 seconds.) The larger error (to my mind) comes up when he then presents binary search (of a sorted file, naturally). In this case, the log2(n) performance is said to reduce the maximum time to a little over 2 hours.

Now, this may not jump out at you as an obvious error if you're not into things like search algorithms. (For your sake, I hope you're not.) And, I guess I can't expect people to have a feel for logarithmic scales.... But, if your whole point is to emphasize how much faster binary search is than linear search, then this should have seemed a bit long.

How long would one in fact expect the binary search to take? About 18 seconds. Now, that's a result that'll impress the reader: 18 seconds instead of 28 hours!

I don't want to pick on Demers; it's really easy to make some small mistake in the process of creating examples, and not catch it. The people I do want to pick on are the reviewers.

In archaeology, I've noticed a tendency for articles, etc. with above-average quantities of math to make it into print with significant problems in that math. My guess has always been that reviewers are scared of the math and just assume that anyone smart enough to do that math must be right. I sort of figured, though, that people reviewing a GIS textbook would be a little more math-oriented, and that this would have been caught...especially since the same error appears in the third edition! (I can't speak for the first or second editions.) Somebody really should have caught this.

Nonetheless, assuming the author knows what he's doing, how did this error get in there? It looks like the original example was 200 items in the list, producing linear search estimate of 100 comparisons (well, 100.5) and binary search estimate of 7.6 comparisons. Wanting to make the numbers bigger, someone (possibly an editor?) just bumped both by a factor of 1000 (100,000 seconds is 27.8 hours and 7600 seconds is 2.1 hours) and upped the n by the same amount.

Oops! Alas, the whole point is that binary search becomes massively more efficient as the number of items to search increases. Multiplying n by 1000 adds just under 10 to log2(n).

*The one-second-per rate is clearly chosen for pedagogical simplicity, not as an estimate of actual time required.

1 comment:

  1. Yes, this is indeed pathetic. However, I would question the value of presenting binary search in a GIS book to begin with: in order to effectively search on a criterion you need to first sort *by that criterion*. I have a hard time imagining that geographical data would need searching on a simple and predictable criterion. I can easily see, however, people reading the book and thinking that binary search is the best thing since sliced bread (which it may even be, but that's besides the point), and doing sorting first (hello, n*log n) and *then* doing "efficient" searching.

    I also thought that we might want to have a searching competition once we start running low on good problems for sorting :-)

    Elena

    ReplyDelete