Sunday, June 10, 2012

A Few Thoughts on High-Stakes Testing

One of the Facebook pages I “like,” “Wear Red for Ed,” posted a link to this article by Dan DeWitt, “FCAT pressure helps students and teachers to achieve,” published on the website (at least) of the Tampa Bay Times. WRFE asks, quasi-rhetorically, “Anyone care to comment on this article?” Well, yes; yes, I would.

Not that I haven’t done so, in effect, on numerous occasions in the past (in one way or another, I’ve written about one of the various forms of standardized testing here, here, here, here, here, here, and here), but on the day when my alma mater ruined the good news of giving an honorary degree to one of my heroes, Johnny Clegg, by giving one also to Teach for America pseudo-educator Wendy Kopp, another chance to stand up for real education cannot be allowed to pass by.

OK, I’m no authority on Florida’s specific version of high-stakes testing, but I’ve seen enough in other states to have a reasonable understanding of the issues. First, let’s clarify some terms. There are two kinds of high-stakes testing: one kind—the ACT, the AP, and the SAT, and all their respective variations on the theme—serves a useful purpose, even if the instrument is flawed. As I have said repeatedly in the past, I have a voice in how our departmental scholarship money is spent, and having some idea of how to compare students of fundamentally different backgrounds is very helpful. True, I need to be aware that scores skew towards students who are affluent, male, and white, and who attend good schools. I need to know, too, that cheating is far more pervasive than the testing agencies admit. But as a part of an evaluation of a prospective student’s potential, these tests are probably a net positive.

The other kind of high-stakes testing is more insidious. Students prepare all year for a single test, and their ability to pass on to the next grade level is contingent upon a satisfactory score. Worse yet, teachers and school districts are held accountable, not for whether their students actually learned anything this year, but whether they did well on that exam.

Even apart from the expense, the possibility of corruption, the vindictiveness of those who would see public education perish altogether, the tests (wait for it) are at best a partial indicator of a student’s achievement. Discounting an entire year’s work based on a single test score is like criticizing Michael Jordan for missing a single jump-shot.

More importantly, while scores have “improved” under FCAT, this is a meaningless statistic in all sorts of ways, not least that, in the words of a Miami Herald headline last month, “After FCAT scores plunge, state quickly lowers the passing grade.” Yeah, see, not enough students were passing, so we decided that the test was too hard. Now more students are passing, so we must be doing great.

Moreover, it only stands to reason, DeWitt’s protestations to the contrary notwithstanding, that students will do better on a test when they know what is going to be on it, and what strategy to employ. I recall an incident in my own past: I’m pretty sure I’ve written about this before, so please forgive me if you’ve heard this. I was pretty good at math when I was in high school: good enough that I was one of a handful of students chosen to take some exam distributed by the Mathematical Association of America or some such organization.

I should mention here that there are three fundamental ways of scoring a multiple choice exam: the standard model (your score reflects only how many questions you get right, thereby encouraging guessing), the SAT model (your score is lowered slightly for incorrect answers, meaning that random guessing is discouraged, but if you can narrow the field a little, it’s to your advantage to guess), and the Jeopardy model (you lose as many points for a wrong answer as you get for a right one). So, hypothetically, a student who takes a 50-question test, gets 30 right, 10 wrong, and leaves 10 blank, would get 30, 27, and 20 points, respectively. Of course, this student would be compared to other students whose tests were scored the same way, so a 25 on the Jeopardy model might be “better” than a 30 on the standard model.
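For the curious, all three models reduce to one formula: points for right answers minus some penalty per wrong answer, with blanks scoring zero. Here’s a minimal sketch; the penalty values are illustrative assumptions (the classic SAT deducted ¼ point per wrong answer on five-choice questions, which would actually give 27.5 for the example above, depending on the penalty used):

```python
# The three scoring models differ only in the penalty per wrong answer;
# blank answers score zero under all of them.

def score(right, wrong, penalty_per_wrong=0.0):
    return right * 1.0 - wrong * penalty_per_wrong

right, wrong = 30, 10  # 50-question test, 10 left blank

standard = score(right, wrong, penalty_per_wrong=0.0)   # guessing costs nothing
sat      = score(right, wrong, penalty_per_wrong=0.25)  # mild penalty (classic 1/4 point)
jeopardy = score(right, wrong, penalty_per_wrong=1.0)   # wrong answers cost a full point

print(standard, sat, jeopardy)  # 30.0 27.5 20.0
```

The formula also makes the guessing incentives plain: under the standard model a random guess has positive expected value, under the SAT model it breaks even unless you can eliminate an option or two, and under the Jeopardy model blind guessing is a losing proposition.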

This test was on the Jeopardy model, up to and including having more points at stake for harder questions. I took the test as a junior. I got a negative score. Yes, a negative score. I had more incorrect answers than correct ones, or at least I had more points worth of wrong responses than of right ones. A year later, having endured about ¾ of a school year with the worst (for me) math teacher I ever had and ¼ of a year with the next-worst, I took the test again. I got the highest score in the history of my school’s participation in the program. Did I know any more math as a senior than I did as a junior? Maybe a little. But what I really learned was how to approach the test: not to guess, to skim over the test to see if there were high-point questions I was confident in my ability to handle, to completely ignore low-point questions I wasn’t absolutely sure about, and so on.

This is the lesson of high-stakes testing, and I consciously employ it in reverse in my freshman-level classes at my university. I know students have been taught to look for keywords like “always” or “never” on exams, figuring that they’re keys to test-taking. Few things are “always” or “never” true, so “usually” or “sometimes” are, in Bubbletestland, nearly always (see?) the correct answer. I therefore am careful to include a question or two for which the right response is indeed “always” or “never.” I also like to go 10 or 15 questions in a row with no C’s, or with 6 B’s in a row, or whatever. It generally doesn’t take too long to convince students that understanding the material is actually a superior strategy to trying to out-think the test.

This is a struggle, however, for students who, increasingly, want to be told the “answer,” not a means of arriving at the truth. I periodically get a course evaluation from an irate student who objects to my asking questions “backward” (not “who was Marlowe?”, but “who was the most important English pre-Shakespearean playwright?”), or to my actually expecting both a definition and an example of a term.

Corollary to this is the simple fact that test scores measure only test scores, not skills. It is more than a little troubling that when a year’s worth of classes and a day’s worth of test-taking yield different results, we seem to concede the superiority of the test-providers’ commercial product, especially given that the tests are often both written and graded by people who couldn’t get jobs as teachers. So the fact that test scores are going up, even when it’s true (unlike in Florida), means precisely that: test scores are going up. Let’s not pretend it means anything else.

Moreover, be it noted that Advanced Placement tests are an entirely different phenomenon. How many students take them, or how well they do on them, is completely unrelated to basic skills tests. The same do-well-on-the-exam mentality may prevail, but the students taking AP exams aren’t the ones who are fretting about summer school if they don’t pass the Big Scary Test at the end of the year. It’s a good thing if more students are prospering on the AP, but no one who understands education or educational testing sees much of a correlation between that process on the one hand and the FCAT and similar projects on the other, although of course the same person might embrace both strategies.

Finally, there’s the inane argument that “for these tests to have any meaning at all, good scores must be rewarded and poor ones punished. High stakes are the whole point.” Boy, does that sound better in theory than in practice. Yes, it is reasonable to have some sort of standardized testing, and it is reasonable to have scores on those tests count in some appreciable way. Scores could factor into a student’s class rank, could be used internally to compare teachers, etc. But there’s a huge caveat here: there are enormous variations in student competencies even among what would normally be described as the same population, meaning that measuring teachers or districts by student performance on these tests is even more fraught with peril than measuring students by that imperfect yardstick is.

I have occasionally taught two sections of the same class in the same semester. Not infrequently, one class is great and the other horrible, at least in relative terms. Like other universities, we are subject to our own form of legislative meddling, namely “assessment.” I need to compile statistics about how well how many of my students do on certain prescribed tasks and assignments. Compare the results of the assessments from my 2011 and 2010 Play Analysis classes without providing for context, and you’ll think I suddenly learned how to teach (after 30+ years in the classroom). Those scores sure looked a lot better. Of course, as a group, the 2011 classes comprised better students: more intelligent, more self-motivated, more engaged… and they came to class more often. Funny thing, they did better. There were, of course, some good students in 2010 and some lesser ones in 2011, but as groups there was no comparison.

I know this to be utterly commonplace. I know this, of course, because, unlike Jeb Bush or Dan DeWitt, I am an educator. I know that any assessment of my abilities in the classroom that doesn’t take into account what I’ve got to work with is, by definition, useless at best and counter-productive at worst. I know that a student who performs adequately but only adequately on some assessment tool might be a credit to my pedagogical skills or an indictment of them.

Similarly, a public school teacher who gives a kid a D is probably doing a better job if the kid fails the standardized test than if s/he passes. That teacher has failed to motivate the student whose skill level exceeds his/her grade; conversely, the teacher has accurately measured the commitment, intelligence, focus, etc. of the student who goes down in flames on the standardized test. Yet I know of no organized attempt—I’m sure some competent principals do this on an ad hoc basis—to compare student test scores to how they did in the actual classroom. That’s insane… which means it sounds great to the average state legislator.
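The comparison those hypothetical principals might run is straightforward: check how well a teacher’s grades track the standardized-test results for the same students. A minimal sketch, with entirely invented numbers for illustration; the point is that a D student who also fails the test is a consistent signal about the student, not evidence against the teacher:

```python
# Hypothetical sketch: correlate classroom grades with standardized-test
# scores for the same (made-up) students. A strong correlation suggests
# the teacher's grades already track whatever the test measures.

def pearson(xs, ys):
    # Pearson correlation coefficient, computed from scratch
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx ** 0.5 * vy ** 0.5)

# Invented data: course grade points (A=4 ... F=0) and test scores
grades = [4.0, 3.0, 3.0, 2.0, 1.0, 0.0]
tests  = [92, 85, 78, 70, 55, 40]

r = pearson(grades, tests)
print(round(r, 2))
```

With data like the above, the correlation is very high, which would mean the teacher’s D for the test-failing student was an accurate measurement, exactly the case the paragraph above describes.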

There is a veneer of truth to DeWitt’s observations. But it’s only that. If you really want students who are college-ready, take it from someone who has seen three decades' worth of college freshmen: that test-driven model may seem turbo-charged, but when it comes down to the ability to actually think, trust me, there’s a Honda engine in that Porsche… and I’m not sure it isn’t from a lawnmower.

1 comment:

Catherine said...

Thanks!! Very well written. You articulated a lot of what I have been trying to say.