Yeah, yeah. Greek, Latin, who cares?

Tuesday, December 21, 2010

Percentiles

Just a quick whine, while I'm busy prepping for next term:

The AP has an article today worrying that the American education system is failing to prepare students for military service. Now, I strongly suspect that's true...after all, I see how poorly prepared many students are for college.

What I'm annoyed by, though, is the apparent ill-preparedness of the journalist who wrote the damn article. The headline is that 23% of students taking the ASVAB (Armed Services Vocational Aptitude Battery) are scoring too low to be allowed to enlist. The article itself, however, later states that

Recruits must score at least in the 31st percentile on the first stage of the three-hour test to get into the Army or the Marines. Air Force, Navy and Coast Guard recruits must have higher scores.
Does anyone else see the problem here? If you have to beat out 31% (or more) of the other people taking the test, then it literally cannot be possible for less than 31% of the takers to fail the test! The percentage who fail just has to be more-or-less constant (and at least 31%)*.
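To put the same complaint in symbols (using the simplest definition of a percentile, and assuming this year's test-takers resemble the reference population from which the score-to-percentile table was built):

$$P(\text{fail}) \;=\; P(X < x_{31}) \;\approx\; 0.31,$$

where x31 is, by definition, the score below which 31% of the reference population falls. A fail rate below that could only happen if today's takers outperformed the reference population--which would be the opposite of the article's point.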

Now, I'll admit I haven't taken the time to read the study on which the AP article is based, so I'm not casting aspersions on those who carried out the study, or even on those who've expressed concern over its results. But for the love of all that's mathematically possible, can we get journalists who know more about math than my 4-year-old?

*More-or-less constant because the pool of scores from which the percentile-score equivalences are derived is probably multi-year, so the percent who score at or below that value can fluctuate from year to year. At least 31% must fail because that cutoff is described as applying to the first section--if it's possible to fail on the other sections, then some who pass the first may fail overall.

Thursday, September 9, 2010

Energy Schmenergy—what about FOOD?
(Prey Choice, Diet Breadth, and all that jazz)

For those not familiar with the use of optimal foraging models in archaeology, there’s this thing called the prey-choice model, or sometimes the diet-breadth model, that zooarchaeologists like to use in the interpretation of faunal assemblages. Originally developed by ecologists and typically presented in an evolutionary ecology framework, the model basically tells you which food resources an organism should exploit and which it should not if you’re willing to assume the organism is maximizing food-acquisition efficiency. More specifically, it focuses on maximizing the net rate of energetic gain (nice mouthful, huh?).

The prey-choice/diet-breadth model is formulated as an inequality. When true, resource j should be pursued on encounter; otherwise, it should be bypassed:

$$\frac{e_j}{h_j} \;\geq\; \frac{T_s \sum_{i=1}^{j-1} \lambda_i e_i \;-\; s T_s}{T_s \;+\; T_s \sum_{i=1}^{j-1} \lambda_i h_i}$$
The most important thing to bear in mind here is that the resources are ordered by their ei/hi ratios. Resource #1 (i=1) has the highest ei/hi ratio, resource #2 (i=2) has the next highest, etc. With that in mind, what are all these silly letters?

  • ei is the net energetic return of resource i (that is, the energy obtained from consuming the resource minus the amount of energy expended in acquiring, processing (if applicable), and consuming the resource)
  • hi is the average handling time of resource i (that is, the amount of time required to obtain the resource once it has been encountered)
  • Ts is the time spent searching for resources to exploit
  • λi is the encounter rate with resource i (how often per unit time the resource is chanced upon)
  • s is the energetic cost (energy expended per unit time) of searching

The model subtracts the calories spent by the forager in acquiring the resource from the forager’s caloric gain from eating the resource and then divides that by the amount of time involved. This “net rate of energetic gain” (the left side of the inequality, and the value on which the resources are ordered – “ranked”) is compared to the overall net rate of energetic gain that would be expected if the forager only exploited more efficient—higher ranked—resources (the right side of the inequality). If the drop in efficiency caused by pursuing a less efficient resource would be outweighed by the efficiency cost of waiting for a more efficient resource to be found, then that less-efficient resource should be exploited and is part of the optimal diet. (Yes, I know that’s kind of confusing if you’re not familiar with it.)
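For the concretely minded, here’s a toy C# sketch of that comparison. Every number in it is invented for illustration—nothing comes from real faunal or ethnographic data—and the resources are assumed to be pre-sorted by their ei/hi ratios:

void PreyChoiceSketch()
{
    double[] e = { 50000, 12000, 3000 };    // net return per encounter (kcal) -- invented
    double[] h = { 120, 45, 20 };           // handling time per encounter (minutes) -- invented
    double[] lambda = { 0.005, 0.02, 0.1 }; // encounters per minute of search -- invented
    double s = 5.0;                         // energetic cost of search (kcal per minute) -- invented

    // Resource j (0-based here) belongs in the optimal diet when
    // e[j]/h[j] >= (sum of lambda*e over higher ranks - s) / (1 + sum of lambda*h over higher ranks),
    // i.e., when pursuing it on encounter beats holding out for better prey.
    double sumLE = 0.0; // running sum of lambda[i] * e[i] for i < j
    double sumLH = 0.0; // running sum of lambda[i] * h[i] for i < j
    for (int j = 0; j < e.Length; j++)
    {
        double threshold = (sumLE - s) / (1 + sumLH);
        bool pursue = e[j] / h[j] >= threshold;
        Console.WriteLine("Resource " + (j + 1) + ": e/h = " + (e[j] / h[j]) +
                          ", threshold = " + threshold + ", pursue = " + pursue);
        sumLE += lambda[j] * e[j];
        sumLH += lambda[j] * h[j];
    }
}

With these made-up numbers, the first two resources make the cut and the third gets bypassed; lower lambda[0] far enough (make the big prey scarce) and the third resource enters the diet, which is exactly the encounter-rate dependence the next paragraph turns on.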

The key point for most zooarchaeological uses is that resources are in or out of the optimal diet depending on the rate at which the more efficient resources are encountered (that is, how long one must search for the ‘better’ resource and how much energy would be expended in the process are the critical factors). Zooarchaeologists commonly use this to interpret faunal assemblages by looking for the addition (more usually, the increased representation) of what are thought to be lower-ranked (less efficient) resources and interpreting that as indicating a reduction in the availability of higher-ranked resources. Some attention is paid to whether or not there might be some environmental change that resulted in this reduction (or, if the lower-ranked resource did not appear but simply increased in frequency, some such change that increased the availability of the lower-ranked resource). When no evidence of such environmental change is found, the inferred reduction in the availability of the higher-ranked resource is attributed to human agency, usually human population growth and associated overhunting of the most efficient resources. I don’t want to get into the question of whether or not that logic chain is acceptable here...I’ve got a different axe to grind today:

If you stop and think about it, there’s a problem when it comes to hunting of medium to large animals, like most ungulates: the individual hunter almost certainly can’t eat all of the meat him/herself. And even if it were actually possible for the hunter to do so (say, thanks to a big freezer at home in the garage), he/she probably won’t actually eat all the meat. Rather, a lot of it—almost certainly a majority—will be shared with others. But what does this mean for the prey-choice model? Shouldn’t we only be including the meat the forager actually ate when we calculate ei? After all, he/she doesn’t really get any energetic benefit from the meat eaten by others (certainly not directly enough for it to be considered in determining the net rate of energetic return from the resource). But in that case, why is the forager going after these big animals so often, as is so frequently the case in, for example, the Middle Paleolithic? (Sure, the model could be inoperative...but we’re assuming that at least something similar is going on.) There are some fairly easy answers to that question, such as the showing-off hypothesis or reciprocity with others doing the same thing, but we’re supposed to be using the prey-choice model here, which is silent on these topics.

What is to be done? Well, why not think about a slightly different formulation of the prey-choice model, one which fits this sort of behavior better and in fact seems to match up better with the way archaeologists actually apply the model? Instead of maximizing the forager’s personal net energetic return rate, we’ll try maximizing the forager’s meat acquisition rate (we’re restricting ourselves to hunting here). In doing so, we are implicitly (well, I guess it’s explicit now that I’m talking about it) assuming that meat actually consumed by the forager and meat acquired but shared with others have the same value. In cases where personal survival is at stake, this obviously isn’t likely to be the case, but it should be a reasonable approximation in a reciprocity situation and not too unreasonable—I hope—in a prestige situation. If nothing else, it should be a better fit for reciprocity or prestige than calories are!

Math warning!!! (Skip to here if you’re willing to take my word for the math.) This modification of the model involves replacing the net energetic return with the raw meat yield (there is no meat cost, so we’re no longer talking about a “net” value) and removing the subtraction of the energetic cost of search from the right-side numerator, since we are only worried about the time, not the energy, expended in searching for prey. The revised equation looks like this:

$$\frac{y_j}{h_j} \;\geq\; \frac{T_s \sum_{i=1}^{j-1} \lambda_i y_i}{T_s \;+\; T_s \sum_{i=1}^{j-1} \lambda_i h_i}$$
Again, resources are ranked in order from highest to lowest ratios of food yield to handling time (yi/hi), so that all resources i such that i < j are higher-ranked than resource j. yi is the meat yield per engagement (encounter and pursuit), replacing ei, the net energetic gain per engagement. Other terms are as listed previously. One really nice thing about this formulation is that the absence of the energetic-cost-of-search term from the numerator means it can be simplified a lot more easily than the standard version. To do so, we first cancel out the search time terms:

$$\frac{y_j}{h_j} \;\geq\; \frac{\sum_{i=1}^{j-1} \lambda_i y_i}{1 \;+\; \sum_{i=1}^{j-1} \lambda_i h_i}$$
Next, we define some substitutions:

$$\Lambda_j \;=\; \sum_{i=1}^{j-1} \lambda_i$$

defines an overall encounter rate with resources more highly ranked than resource j.

$$\bar{y}_j \;=\; \frac{\sum_{i=1}^{j-1} \lambda_i y_i}{\Lambda_j}$$

defines an encounter-rate-weighted average yield. Each higher-ranked resource’s yield is weighted by how often it is encountered. This can thus be thought of as the average (and thus expected) yield of the next encounter with a higher-ranked resource.

$$\bar{h}_j \;=\; \frac{\sum_{i=1}^{j-1} \lambda_i h_i}{\Lambda_j}$$

does the same thing for handling time. Once we have these terms defined, we can substitute them into the food-yield prey-choice model equation:

$$\frac{y_j}{h_j} \;\geq\; \frac{\Lambda_j \, \bar{y}_j}{1 \;+\; \Lambda_j \, \bar{h}_j}$$
Dividing the top and bottom of the right side by Λj converts this to:

$$\frac{y_j}{h_j} \;\geq\; \frac{\bar{y}_j}{\dfrac{1}{\Lambda_j} \;+\; \bar{h}_j}$$
This formulation makes it much more clear how the prey-choice model works. 1/Λj is simply the average time until the next encounter with a resource ranked higher than resource j. Thus, resource j should be pursued on encounter if its yield to handling time ratio is higher than the ratio of the expected yield of the next-encountered higher-ranked resource to the time required to first encounter and then handle that higher-ranked resource.
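A quick worked example with invented numbers: suppose resources ranked above j are encountered at a combined rate of Λj = 0.5 per hour, with an encounter-weighted expected yield of 40 kg and an expected handling time of 1 hour. The average wait for the next higher-ranked encounter is 1/Λj = 2 hours, so the right side works out to

$$\frac{40}{2 + 1} \;\approx\; 13.3 \text{ kg per hour,}$$

and resource j belongs in the diet only if its own yield-to-handling-time ratio beats roughly 13.3 kg per hour.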

The standard prey-choice model works the same way, but with the complication of the energetic cost of search, the impact of which is hard to wrap one’s head around. As a general comparison: because the food-yield version does not subtract the energy expended during search, it assigns a higher average payoff to bypassing a given resource in favor of later encounters with higher-ranked ones (even if we assume that the consumption issues vis-à-vis energy discussed earlier are not operative), and thus sets higher efficiency thresholds for the inclusion of lower-ranked resources. Meaning: the food-yield version predicts a greater focus on larger resources.
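To see the direction of the effect, take the same invented numbers as the C# sketch above and pretend for the moment that yield and energy are measured on the same scale. For the third resource, the inclusion threshold with the search-cost subtraction is (490 − 5) / 2.5 = 194; without it, 490 / 2.5 = 196. Removing a positive cost from the numerator can only raise the bar, so marginal resources are that much more likely to fall out of the optimal diet.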

More general benefits of the food-yield version of the prey-choice model include not only the conversion to more readily understood (and measured!) characteristics of resources and foragers, but also a renewed emphasis on terms other than encounter rates as explanations for change. Neither yield nor handling time is necessarily a constant attribute of a resource, topics I will return to in the future.

NOTE: This is an informally written “zero-th” draft of something I’ve been messing with for some time. I have a couple of more application-oriented issues (alluded to in that last sentence) in mind that develop from this formulation of the prey-choice model...but I have been unable so far to effectively cram the model (re)development in with the substance of either one of those issues. What I’m mostly looking for here is any feedback on whether or not a formalized version of this would work as a standalone article (that is, much as it appears here, without any fleshed-out applications).

Thursday, September 2, 2010

Got the Academic Job Market Blues? Let's Try a Draft

NOTE: I'm not sure if this is a draft (a Draft draft!) of something to maybe be sent to the publication formerly known as the SAA Bulletin...or just a rant (a daft Draft??)

We all know, whether we admit it to ourselves or not, that the academic job market is not really all that merit-based. Oh, it certainly helps to have lots of good publications, a good teaching record, grant money, an on-going research project, et cetera ad nauseam. But there are no guarantees.

You could be two years out of a top graduate program, with a top post-doc and a year as a Visiting Assistant Professor under your belt, a sole-authored star-treatmented (positively, no less!) Current Anthropology article, a handful of articles in American Antiquity, Journal of Anthropological Archaeology, Journal of Archaeological Science, and regional journals, a book contract, a $200,000 grant, and glowing letters of recommendation from respected members of the field...and still not get a job.

You might not even get an interview for a job that looked like it was written for you; a job that you later learn went to an ABD with one article in press and no teaching experience.

Am I describing a real situation that I (or an acquaintance) have been through? No, but the story remains all too plausible, simply because there is a huge element of randomness involved in the job market. That job that looked like it was written for you? Maybe they said North America, but they really meant U.S. Southeast. Maybe they really wanted someone with a local project that could serve as a fieldschool right away, but your work is three states over. Maybe they don't think lab types are "real archaeologists." Maybe they're a hoity-toity liberal arts college and, however impressed they are with your grad school, can't imagine hiring someone who went to Southwestern Central State U as an undergrad. Maybe they took one look at your C.V. and said, "She's too good; we'd never be able to keep her." Maybe their department hasn't hired a woman in the thirty years they've been in existence, and A) isn't about to start now, or B) is starting to get embarrassed about it--either way, you could be screwed. Maybe one of the search committee members was rejected when they applied for graduate admission to your grad program many years ago and has nursed a grudge ever since. Maybe no one in the department has been on the job market in thirty years and figures there must be something wrong with you since you haven't gotten a job already. Maybe it was Harvard and your record is so good they knew they couldn't get away with not tenuring you when the time came. The possibilities are endless.

The worst thing about the job market in archaeology (I'm sure it's like this in some other fields, too), in my not so humble opinion, is the uncertainty. The uncertainty on the part of the applicant: "Why didn't they consider me? Am I pathetic, or just not a good fit?" and the uncertainty on the part of the search committee: "How strong a candidate do we really have a shot at getting and keeping?" I'm convinced the latter happens a lot, particularly at smaller schools. There have been too many cases where I've had friends with great research, publication, and teaching records (and myself, too, though I don't fit that description) apply for a job at some little crappy school (well, and sometimes a not-so-little, not-so-crappy school)—and for none of us to get so much as a request for letters or a phone interview...and then for the school to end up hiring someone with no record to speak of—presumably because that's what's normal there. It never occurs to the search committee that jobs are so hard to come by that even extremely strong candidates would be thrilled to take the job.

So, what's the solution? I don't think there is one, but I'd like to put forward an only slightly tongue-in-cheek proposal: the SAA Draft. Like the NFL draft, or the NBA draft--but probably not like the MLB draft, since we don't have a farm system in academia.

So, the proposal:

Each year, early in the fall semester, those archaeologists who want a job for the following academic year enter their names in the draft by submitting generic research and teaching statements, CVs, and letters of recommendation. Slightly later, say by the end of October (to allow schools about to lose someone—see below—to replace them), colleges and universities submit job packages to the draft-running organization, presumably the Society for American Archaeology (SAA). The job package would include salary and benefits, start-up costs, ongoing research support, teaching load, and so forth. A committee empanelled by SAA (perhaps elected by the membership, perhaps appointed by the SAA President) would convene over winter break, and rank the job packages. They would be able to take into account not only the information presented by the school, but also the school's and department's reputations, location (in terms of cost-of-living, etc.), prestige, and so forth. The resulting rankings would determine the draft order.

The time between the release of the draft order and the SAA Annual Meeting would give schools an opportunity to conduct any interviews they thought were worth their time and money to help them figure out who they want to draft, much the same way NBA and NFL teams bring in prospective draftees for private workouts. The prospective draftees, themselves, could also work to bring themselves to the notice of their preferred destinations, though they would have to accept the risk of coming off badly.

At the SAA meetings, there could be some time on Thursday and Friday for last-minute interviews and such, but then, on Saturday, the President of the SAA would step to a microphone and intone, "With the first pick in the 2012 SAA draft, the University of _________ selects __________ from the University of ____________. _________ College has ten minutes to make their selection."

Put in the top job package, and you are 100% guaranteed to get the person you want the most. Put in a weaker package, and you might have to settle for someone you were ambivalent about. But if you're sitting there with the last pick, you don't have to wonder, "How good a researcher/teacher can we get?" You can get anyone you want who still hasn't been selected.

From the job-seeker's perspective, merit becomes a little more obviously relevant. The public nature of the system means that you can look at the previous several years' results and see what kind of record is important to the kind of school you want a job at. Each school is still going to have their own particular needs and wants, but they'll have to weigh those in relation to who's out there. Do you pick the best paleoethnobotanist because that's what you feel your department needs, or do you pick the geoarchaeologist who is blowing everyone away? That's going to depend on a lot of school-specific factors, of course. But on the other hand, the geoarchaeologist who's blowing everyone away is going to get a job, even if none of the schools that put together job packages back in October were thinking at the time that they wanted a geoarch person. Some schools will pick on need, some on 'best-available', but the result is likely to be that merit starts to matter more than random craziness.

(Among other things, everyone knows that Stupid University passed up on Rising Star Zooarchaeologist to pick Iffy Lithics Analyst, with the result that Lucky College got R.S. Zooarchaeologist in the biggest draft-day steal since eventual two-time MVP Steve Nash went fifteenth in the 1996 NBA draft. Never underestimate the power of derision.)

Of course, there would have to be some significant rules to keep the system from being abused. There would be a big problem if Pretty Good University got what they thought was a steal with the 14th pick in 2012, but said pick decided his new grant would let him move up and entered the 2013 draft. Worse, suppose the person Pretty Good University drafted didn't take the job? They're left in the lurch, as their second choice may have gotten drafted by Middling College with the 19th pick. The answer, I think, is set-length contracts that the draftee is obligated to sign.

(I'd like to think that anyone entering the draft two or more times in rapid succession would be seen as too big a risk, and that the system would thus be self-correcting, but I'm too cynical. I could be wrong, though, so I’m far from dogmatic on this point.)

I think four-year contracts would be about right. The decision whether to go back in the draft or to stay and try to get tenure at the current institution would be made after the most common time for pre-tenure review (most schools do a halfway-to-tenure review, which sometimes includes a possibility of termination). The school gets a guaranteed four years of work out of the draftee, and the job seeker doesn't have to compete for an entry-level job with too many people who are already assistant professors. The draftee gets to decide whether or not to go back into the draft after a major pre-tenure review and at the time when she would be negotiating her new contract.

The school that drafted you and for whom you have worked 80-hour weeks for three years doesn't want to give you a good raise to get you to stay? Reenter the draft. They're incapable of understanding the value of your research? Reenter the draft. Your colleagues have driven you nuts for three years? Reenter the draft.

"Welcome to the 2013 SAA draft, live from Honolulu, Hawaii. Hot Shit University is on the clock!"

DISCLAIMER: SAA doesn't have and isn't going to get an anti-trust exemption from Congress, so the whole thing would have to be voluntary. Both schools and applicants would be free to continue using the current system, though I'd like to think the draft would tend to relegate such hiring to post-tenure jobs.

NOTE: I like the idea of a reverse draft even better, where the committee ranks the job seekers, who then pick their jobs in order...but I'm already asking for too much.

Wednesday, September 1, 2010

Recursion Pedagogy (or, How to Make Novice Programmers Crash their Machines)

It's fairly common to find programming books that teach recursion using the Fibonacci sequence as the main example. The obvious reasons for this are that the Fibonacci sequence is easy to understand (each value is just the sum of the preceding two values, with the first and second values in the sequence defined to both be 1) and that this simple dependence makes for massively simple recursion calls and tests. The recursive function, in fact, seems almost magical (code fragments usually in C# unless otherwise noted):

int RecursiveFibonacci(int nth)
{
    // The first and second Fibonacci numbers are defined to be 1.
    if (nth < 3)
    {
        return 1;
    }

    // Every later number is the sum of the two before it.
    return RecursiveFibonacci(nth - 2) + RecursiveFibonacci(nth - 1);
}

That's it. That function will theoretically get you any number in the Fibonacci sequence (well, only up to the 32-bit limit, but that's easy enough to change).

Fibonacci numbers, however, are an incredibly bad candidate for recursion!

The function is short and simple, which is good; very easy to understand, which is better; and so massively inefficient that if you naively test it by asking for, say, the 50th Fibonacci number, you're likely to spend some time trying to figure out where the bug is, since it appears to have crashed!

What's actually happening, however, is that you're making an insane number of function calls. The stack itself is in no danger--the recursion never runs more than about fifty frames deep for the 50th number--but the total number of calls is staggering, so the function simply grinds away more or less forever.

Calculating the 50th Fibonacci number by hand requires adding two numbers 48 times. Calculating the 50th Fibonacci number by recursion requires calling that function above 25.2 B-B-B-B-Billion times!! (At eight-plus bytes of stack frame per call, that would be more than 200 gigabytes of memory if all those frames existed at once; since only a handful exist at any moment, the price is paid in time instead.) What's going on here? Well, we're looking at what I call the (w)Rec(k)Fib sequence:

1 1 3 5 9 15 25 41 67 109 177 287 465 753 1219 1973 ...

which looks very much like the Fibonacci sequence:

1 1 2 3 5 8 13 21 34 55 89 144 233 377 610 987 ...

Every value in the (w)Rec(k)Fib sequence is twice its Fibonacci counterpart minus one. But that's not the most meaningful way to come up with it; rather, each value is the sum of the previous two (like Fibonacci) plus one more. So where Fibonacci is F(n) = F(n-1) + F(n-2), (w)Rec(k)Fib is w(n) = w(n-1) + w(n-2) + 1. It's also the number of times the recursive function is called to determine its Fibonacci counterpart. When the function is called for, say, the fourth Fibonacci number, that's one call, to which we add the number of calls (3) for the third Fibonacci number and the number of calls (1) for the second. In other words, we have to call the function far more times (twice minus one) than the value of the number we're seeking. Which means that even with modern processor speeds, we can take an incredibly long time (or give up altogether). Want to calculate the hundredth Fibonacci number? (It's about 3.5E+20.) If you happen to have access to a computer that can manage a billion function calls per second (more actual operations, of course), it'll take around 22,500 years. But you can probably do it by hand in an hour or so (a couple if you're being careful not to miss a carry).
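If you don't believe those counts, they're easy to check: instrument the function with a counter. (CountedFibonacci and callCount are just names invented for this sketch.)

static long callCount = 0;

static int CountedFibonacci(int nth)
{
    callCount++; // one call for this invocation...

    if (nth < 3)
    {
        return 1;
    }

    // ...plus every call made by the two recursive branches
    return CountedFibonacci(nth - 2) + CountedFibonacci(nth - 1);
}

Reset callCount, call CountedFibonacci(16), and you'll find callCount sitting at 1973, the sixteenth (w)Rec(k)Fib value above. Just don't try it with 50.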

So, what's the right way to do it? Well, the simplest is through basic iteration, just like you'd do it by hand:

int IterativeFibonacciNumber(int nth)
{
    if (nth < 3)
    {
        return 1;
    }

    // a and b hold the two most recently computed Fibonacci numbers.
    int a = 1;
    int b = 1;

    for (int n = 3; n <= nth; n++)
    {
        int temp = b;
        b = b + a;  // b becomes the nth Fibonacci number...
        a = temp;   // ...and a becomes the (n-1)th.
    }

    return b;
}

It's not nearly as elegant. Nothing about it looks magical. It's not even all that easy to understand, though better variable names would obviously help. But it'll calculate the fiftieth Fibonacci number in about 4 milliseconds on my 2007-vintage laptop, running inside Visual Studio in Debug mode...which is where most of that 4 milliseconds comes from. How do I know? Switch most of the integer variables to long integers, and it'll calculate the 92nd Fibonacci number (the highest storable in a 64-bit integer) in the same 4 milliseconds.

Never, ever, ever use recursion if there's a straightforward way to accomplish the task without it. And, should you conclude that recursion is the way to go, try to at least get a feel for how the function behaves. If, as in this case, you're looking at O(c^n) behavior (here c is the golden ratio, roughly 1.618), then you almost certainly need to look again...or conclude that the problem is not practically soluble. (Unless the maximum n is small, of course.)

UPDATE: I should add, of course, that the quickest way is actually to open Excel, type 1 in the top left cell, type 1 again in the cell below it, then type =A1+A2 in the cell below that, and drag down on the box at the bottom-right corner of the cell. You'll get up to the 54th Fibonacci number with the full set of digits. Past that, you'll get floating point values with an ever-decreasing set of digits up to the 1476th Fibonacci number (1.307E+308). If you want the 'real' answer or values beyond that, you're going to have to set up your own kilobit-plus integer (or play with some identities that allow easier calculation of large Fibonacci numbers - see Wikipedia).
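For what it's worth, you don't actually have to build that kilobit-plus integer yourself if you're in C#: System.Numerics.BigInteger (part of the .NET Framework since version 4) drops straight into the iterative function. A minimal sketch:

using System;
using System.Numerics;

class BigFib
{
    // Same loop as the iterative version above, with BigInteger
    // standing in for int so the values never overflow.
    static BigInteger Fibonacci(int nth)
    {
        if (nth < 3)
        {
            return BigInteger.One;
        }

        BigInteger a = 1;
        BigInteger b = 1;

        for (int n = 3; n <= nth; n++)
        {
            BigInteger temp = b;
            b = b + a;
            a = temp;
        }

        return b;
    }

    static void Main()
    {
        Console.WriteLine(Fibonacci(100)); // 354224848179261915075
    }
}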