*Yeah, yeah. Greek, Latin, who cares?

Monday, September 12, 2011

This Time I'm the Idiot

I got a call from my wife this morning telling me that one of my old web-development clients had called to say they were getting an error message when trying to update part of their website. Since the function in question had been working perfectly for the better part of three years...the likely problem was obvious, particularly since this feature should have been accumulating roughly one entry per week, right?

I logged in to their content-management system*, reproduced the error, opened up the MySQL database, saw that yep, there were 127 entries (the maximum value of a signed TINYINT), changed the id field of the table from a TINYINT (8-bit) to a SMALLINT (16-bit), went back to content-management, and made sure the error was gone. Total time, 2 minutes. Except that it wasn't 2 minutes. It took 20.
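(For the curious, here's a minimal sketch of the numbers and the fix. The table and column names are invented for illustration; the client's actual schema isn't part of the story.)

    # Why 127 is the magic number, and what the fix amounts to.
    # "entries" and "id" are hypothetical names, used only for illustration.

    TINYINT_MAX = 2**7 - 1      # 127: a signed 8-bit id column runs out here
    SMALLINT_MAX = 2**15 - 1    # 32767: a signed 16-bit id has years of headroom

    print(f"TINYINT tops out at {TINYINT_MAX}; SMALLINT tops out at {SMALLINT_MAX}")

    # The repair itself was a one-line schema change, something like:
    fix_sql = "ALTER TABLE entries MODIFY id SMALLINT NOT NULL AUTO_INCREMENT;"
    print(fix_sql)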

Why? Because I had to search back through several years of email to find the login and password for the database. Why not just read that straight from the code? Because it would have taken even longer to find the FTP login info (buried on a laptop with a malfunctioning screen that I have with me and on a new laptop that's sitting at home right now).

So, morals of the story:

1. Unless you really have a storage issue, don't use really small field sizes even where it makes sense (e.g., tables that have fewer than 10 rows and no way for the user to add more)...because you might get in the habit and use them where it doesn't.

2. Put some real planning into where you keep old client info. Is it truly backed up? Will you be able to get your hands on it quickly from almost anywhere?

*Am I the only one bothered by the fact that basically no client ever seems troubled that I retain access to their database, webpage code, and content-management system (any one of which would allow me to do some serious damage)? I've tried to tell clients how to go about changing passwords and database settings when I turned over the finished product, but once I tell them that they'll have to let me back in if they need a bug fix, they always just want to leave things as they are. Which I suppose is fine, but if there's enough staff turnover, in a few years it's possible that no one at the client will know I still have such access.

In general, I think this shows a blind spot many people have about computer security. If you ask someone at a bank, say, who the most "trusted" person (data-security-wise) is, they're likely to say the CEO or President or some such. Of course, it's actually the database administrator or whoever sets up accounts, but he/she is just a peon and doesn't count.

Sunday, August 28, 2011

A Luddite at the BBC

An alleged "Technology Reporter" at the BBC has an absolutely atrocious article about the doom we're all facing from algorithms that are taking over the world.

The first example given is the multi-million-dollar used book on Amazon (which I covered in my last post). The "reporter" couldn't even do enough due-diligence journalism to realize that these weren't Amazon's fancy algorithms, but amateur algorithms (one more so than the other) written by the used-book sellers themselves. Is it important (to those sellers) that they know what they're doing? Sure. But it didn't affect anyone else, because nobody was dumb enough to buy the book.

That's followed by a series of other stupid and/or fear-mongering examples:

Movie-making decisions? If those algorithms go awry, we end up with crappy movies. Is that a new thing?

Google uses secret algorithms to determine which advertisements we see? Does anybody actually pay attention to those ads? If so, are they harmed if they don't get ads that make sense for them? The real concern here is supposed to be data harvesting...which has essentially nothing to do with these supposedly smart algorithms, and is only alluded to in the article.

We've stopped remembering things? We've been offloading memory like that ever since writing was invented. (If you haven't seen it, Episode 4 of James Burke's 1985 series The Day the Universe Changed, also from the BBC, I should note, is excellent on this issue as it relates to the printing press.) On the more specific issue of whether search engines are good or bad, though, I'm with Ta-Nehisi Coates' NY Times op-ed.

Computer-driven trades at the NY Stock Exchange? Doesn't anyone remember 1987? Whether such trades were at fault then or not (it's still not clear), this isn't a new concern, and means of dealing with it have been in development ever since...and unlike a "real" crash due to an asset bubble (e.g., 1929 or 2008), a mistaken crash is a far less serious problem. Note how quick the recovery was in the 2010 'crash'.

Finally, the last line of the article is so stupid I have to quote it:

As algorithms spread their influence beyond machines to shape the raw landscape around them, it might be time to work out exactly how much they know and whether we still have time to tame them.

Algorithms are spreading their influence? Umm, no. We are spreading our use of algorithms. Work out how much they know? By definition, algorithms don't know anything (except in the sense that they embody a "how"). If you want to fear-monger about databases in the hands of incompetents and malefactors, that's a different story. Whether we still have time to tame them? I can't even begin to describe how clueless that is. Maybe we need to tame the people using the algorithms, but tame the algorithms themselves? I don't even know what that could mean.

I think what most pissed me off about the article is the total lack of any awareness that algorithms could ever be good. The algorithms on the computer that helps my car's engine burn fuel more efficiently? The algorithms used to model organic chemistry and speed the discovery or invention of new medicines? The algorithms that run and interpret the data in an MRI machine? The algorithms that keep airplanes from crashing into each other? The algorithms that allow food distributors to keep people in cities like New York and London fed (both cities typically have less than 48 hours of food on hand)? The algorithms that put a huge fraction of human knowledge and entertainment just a few keystrokes away from anyone wealthy enough to have an internet connection?

As an aside, I'd also like to point out that the used-book-price algorithms, at least, are certainly simpler than the ones used by the software the author wrote her article on.

Every technology has its potential downsides. We've been dealing with that since the first tool was invented. The opening scenes of 2001: A Space Odyssey are hilariously inaccurate to an anthropologist (I've always particularly loved the use of tapirs as ancient African prey animals, though that's one of the smallest problems), but are a familiar reminder of how deep an issue this is. Computers are just the latest tool that not everyone is comfortable with. Playing on those fears, whether out of malice or ignorance, is not something I expected of the BBC.

Saturday, April 23, 2011

Somebody isn't as smart as they think they are...

Via Brad DeLong, we are brought the story of competing used-book sellers who both used a pricing algorithm to try to beat the competition (though not in the same way):

...logged on to Amazon to buy the lab an extra copy of Peter Lawrence’s The Making of a Fly – a classic work in developmental biology that we – and most other Drosophila developmental biologists – consult regularly. The book, published in 1992, is out of print. But Amazon listed 17 copies for sale: 15 used from $35.54, and 2 new from $1,730,045.91 (+$3.99 shipping).

It topped out over $23 million before somebody noticed and turned off their algorithm. The sad, and funny, part is that the seller who turned off their algorithm is the one who may have been competent. Their algorithm was setting the price below that of the competition - if they were smart enough to put in a lower limit, then they didn't do anything particularly dumb. (Remembering that other people can be idiots is all too often above and beyond the call of duty.)

The other seller, though, used an algorithm that would set the price above that of the competition...and clearly did not put in a maximum allowed price, leading to the spiraling prices.

Part of what probably happened here is that at least one if not both sellers thought they were the only ones smart enough to employ an algorithm to set prices. Well, they were partly right, I suppose...

If the process had been allowed to go on long enough, it might have been even more fun! Depending on the languages used to write the algorithms, the variable types, etc., and whether or not care was being taken (not to mention the range of values Amazon will accept*), one or both prices might someday have gone to "Overflow Error", "NaN", or even flipped negative!
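To make the mechanics concrete, here's a toy simulation of the feedback loop. The markups and starting prices are my own illustrative guesses, not the sellers' actual numbers; any pair with one factor just under 1 and the other well over 1 spirals the same way.

    # Toy reconstruction of the two repricing algorithms (not the sellers' code).
    # The markups and starting prices are assumptions chosen for illustration.

    undercut = 0.998   # seller A reprices just below seller B
    mark_up = 1.27     # seller B reprices well above seller A

    price_a, price_b = 35.54, 40.00   # plausible, made-up starting prices
    rounds = 0
    while price_b < 23_000_000:       # roughly where the real spiral was noticed
        price_a = round(undercut * price_b, 2)
        price_b = round(mark_up * price_a, 2)
        rounds += 1
    print(f"After {rounds} repricing rounds, seller B is asking ${price_b:,.2f}")

    # And if nobody ever notices? In Python the float price eventually overflows
    # to 'inf'; a fixed-width signed integer type could wrap negative instead.
    while price_b != float("inf"):
        price_b = mark_up * (undercut * price_b)
    print("Left running forever, the asking price becomes:", price_b)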

*We've already learned that Amazon thinks its Marketplace sellers might be selling things for over $20 million. Doesn't seem very likely, does it?

Monday, February 7, 2011

Binary Search is Better than That

I'm teaching a course on Geographic Information Systems this semester. I whined on Facebook about a week ago about the surprising (to me) lack of computer skills among my students. Here, I'm going to whine about a math error in the course textbook.

I'm using Michael DeMers' Fundamentals of Geographic Information Systems (4th edition). I like the textbook, but alas there's a rather striking math error that I found while prepping today's lecture.

In discussing computer file structures and searching, DeMers gives an example of conducting a linear search on a 200,000-record dataset: if each check is assumed to take 1 second*, he says, then the maximum time required is about 28 hours (100,000.5 seconds). The minor error here is that this is the expected time, not the maximum. (The maximum is, of course, 200,000 seconds.) The larger error (to my mind) comes up when he then presents binary search (of a sorted file, naturally). In this case, the log2(n) performance is said to reduce the maximum time to a little over 2 hours.

Now, this may not jump out at you as an obvious error if you're not into things like search algorithms. (For your sake, I hope you're not.) And, I guess I can't expect people to have a feel for logarithmic scales.... But, if your whole point is to emphasize how much faster binary search is than linear search, then this should have seemed a bit long.

How long would one in fact expect the binary search to take? About 18 seconds. Now, that's a result that'll impress the reader: 18 seconds instead of 28 hours!
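The arithmetic, if you want to check it yourself (one comparison per second, 200,000 records):

    # Verifying the numbers for a 200,000-record file at one comparison per second.
    import math

    n = 200_000
    expected_linear = (n + 1) / 2      # average case for a successful linear search
    max_linear = n                     # worst case: the target is the last record checked
    binary = math.ceil(math.log2(n))   # log2(200,000) is about 17.6, so 18 whole comparisons

    print(f"Linear, expected: {expected_linear:,.1f} s ({expected_linear / 3600:.1f} hours)")
    print(f"Linear, maximum:  {max_linear:,} s ({max_linear / 3600:.1f} hours)")
    print(f"Binary:           about {binary} s")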

I don't want to pick on DeMers; it's really easy to make some small mistake in the process of creating examples and not catch it. The people I do want to pick on are the reviewers.

In archaeology, I've noticed a tendency for articles, etc. with above-average quantities of math to make it into print with significant problems in that math. My guess has always been that reviewers are scared of the math and just assume that anyone smart enough to do that math must be right. I sort of figured, though, that people reviewing a GIS textbook would be a little more math-oriented, and that this would have been caught...especially since the same error appears in the third edition! (I can't speak for the first or second editions.) Somebody really should have caught this.

Nonetheless, assuming the author knows what he's doing, how did this error get in there? It looks like the original example used 200 items in the list, producing a linear-search estimate of 100 comparisons (well, 100.5) and a binary-search estimate of 7.6 comparisons. Wanting to make the numbers bigger, someone (possibly an editor?) just bumped both by a factor of 1000 (100,000 seconds is 27.8 hours and 7,600 seconds is 2.1 hours) and upped the n by the same amount.

Oops! Alas, the whole point is that binary search becomes massively more efficient as the number of items to search increases. Multiplying n by 1000 adds just under 10 to log2(n).
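One line of arithmetic makes the point:

    import math

    # Scaling n from 200 up to 200,000 multiplies the linear-search estimate by
    # roughly 1000, but it only *adds* log2(1000) to the binary-search estimate:
    print(math.log2(200))                       # ~7.6 comparisons
    print(math.log2(200_000))                   # ~17.6 comparisons
    print(math.log2(200_000) - math.log2(200))  # ~9.97 = log2(1000)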

*The one-second-per rate is clearly chosen for pedagogical simplicity, not as an estimate of actual time required.