July 14, 2010

Fat tails and audit trails in Florida test scores

I'm starting the day behind on a bunch of things, thanks to a week at the AFT convention in Seattle and the beauteous handling of bad weather by Delta. I arrived in Tampa about 23 hours after leaving Seattle, and let's leave it at that.

So I'm a bit behind on the background of the evolving controversy over test scores in Florida. NCS Pearson was way, way late in releasing scores, and part of the reason was what Florida DOE officials called glitches in the demographic files Pearson had on students, that is, the data tying test scores to students and then to teachers.

I have a sneaking suspicion that's also behind the controversy that's now developing, as first the urban and then a bunch of other system superintendents complained that the proportion of elementary students making adequate year-to-year progress just didn't fit any sense of reality (it was implausibly low). Head to the St Pete Times for the published stories and blog entries, including new complaints that the organization auditing Pearson's work is a subcontractor of Pearson. But here's why I suspect the demographic files are a good starting point: Florida's "growth" measure is not the mean or median year-over-year growth on some vertical scale, nor is it a regression-based measure of deviation from some version of expected growth. Instead, it is a jerry-built dichotomous variable: did an individual student make a particular growth benchmark in a year, yes or no?
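To make the structure concrete, here's a toy version of that yes/no flag. The cut point is made up; Florida's actual learning-gains rules involve achievement levels and developmental-scale benchmarks I'm not reproducing here.

```python
# Toy version of a dichotomous "made gains" flag. The required_gain value is
# hypothetical, not Florida's actual developmental-scale benchmark.
def made_gains(score_2009, score_2010, required_gain=33):
    """Return 1 if the student cleared the growth benchmark, 0 otherwise."""
    return 1 if (score_2010 - score_2009) >= required_gain else 0

# A school's or district's "growth" number is then just the proportion of 1s.
pairs = [(285, 310), (300, 345), (260, 290)]
print(sum(made_gains(a, b) for a, b in pairs) / len(pairs))
```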

It's been a few years since I looked at the details of this "growth" definition, but any threshold-based measure is inherently sensitive to variability around the relevant cut point. In the case of Florida's growth measure, the vulnerability lies less in how a single year's scale is constructed at one point in time, because the measure depends on a student's prior-year score as well. So the psychometric vulnerability comes from two sources: the general characteristics of the tests in the two years, and the added variability you get from comparing scores across years (there's measurement error in both scores, and the error in the comparison is greater than the error in either the base-year or the following-year score alone).
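If the two years' errors are independent, the standard error of the gain is the square root of the summed squared errors, so it is always larger than either year's alone. A quick back-of-the-envelope, with made-up SEM values rather than Florida's published ones:

```python
import math

sem_2009 = 20.0  # hypothetical standard error of measurement, prior-year score
sem_2010 = 20.0  # hypothetical SEM, current-year score

# Assuming independent errors, the SEM of the gain is the quadrature sum.
sem_gain = math.sqrt(sem_2009**2 + sem_2010**2)
print(round(sem_gain, 1))  # 28.3, roughly 40 percent larger than either year alone
```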

Since the two-year-variability issue has been a fact of life for this measure for a number of years, I would be surprised if that were the issue. So then the question is whether this year's fourth- or fifth-grade reading test scores have unusual distributions that would cause interesting problems at the thresholds for "making gains" for students who were low-performing in the prior year. A particularly fat tail at the low end might cause that, but that's speculation, and I suspect an obviously fat-tailed distribution would have been picked up by the main auditor, Buros.
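Here's the kind of quick check I have in mind: give this year's gains a heavier lower tail and watch what happens to the share of prior-year low performers who clear a fixed gain threshold. All the numbers are invented; the point is only that the threshold reacts to tail shape, not just to the mean.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# Hypothetical year-to-year gains for prior-year low performers.
# Version A: roughly normal. Version B: same core, but 10 percent of students
# get dragged into a fat lower tail (a crude contaminated-normal stand-in).
gain_a = rng.normal(35, 25, n)
gain_b = np.where(rng.random(n) < 0.10, rng.normal(-20, 60, n), rng.normal(35, 25, n))

threshold = 33  # hypothetical required gain on the developmental scale
print(round((gain_a >= threshold).mean(), 3))  # about 0.53
print(round((gain_b >= threshold).mean(), 3))  # about 0.50, noticeably lower
```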

But you can have a non-psychometric wrench in the works, because Florida's dichotomous variable is highly sensitive to one other matter: the correct matching of student test scores from year to year. If the student data files were messed up, and 2009 scores were matched to the wrong students' 2010 scores, you'd have all sorts of problems with growth. I strongly suspect that's what tipped people off to problems with the data files earlier in the spring. If the failures were general, you'd see a skewed distribution of the dichotomous growth variable: the lowest-performing students from 2009 would be the most likely to be matched (incorrectly) to higher scores in 2010, and vice versa, so the first clue would be markedly high growth indicators for 2009's low-performing students and markedly low growth indicators for 2009's high-performing students.
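As a sanity check on that logic, here's a quick simulation of a botched merge. The scale numbers and the 25 percent mismatch rate are invented, but the pattern is the point: scrambled matches push low performers' growth up and high performers' growth down.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

score_2009 = rng.normal(300, 50, n)              # hypothetical scale scores
score_2010 = score_2009 + rng.normal(35, 25, n)  # genuine growth plus noise

# Botched merge: 25 percent of students get some other student's 2010 score.
mismatched = score_2010.copy()
bad = rng.random(n) < 0.25
mismatched[bad] = rng.permutation(score_2010)[bad]

threshold = 33
low = score_2009 < np.percentile(score_2009, 20)    # 2009's low performers
high = score_2009 > np.percentile(score_2009, 80)   # 2009's high performers

for label, current in (("clean match", score_2010), ("25% scrambled", mismatched)):
    gains = (current - score_2009) >= threshold
    print(label, round(gains[low].mean(), 2), round(gains[high].mean(), 2))
```

With the clean match, both groups clear the bar at about the same rate; with the scrambled match, the low group's rate jumps and the high group's rate drops, which is exactly the signature I'd expect a general matching failure to leave.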

But that's not what school districts are reporting: they're reporting unusually low growth proportions for low-performing students from 2009. I can think of a few different ways you'd get that after Pearson tried to correct any obvious problems it saw earlier, but that's speculation. What needs to happen is an examination of the physical artifacts from this year for a sample of schools: the booklets, the student demographic sheets, and the score sheets. We're talking about more than a million students tested, but we can start with a sample of schools the urban-system superintendents are worried about and track the data from beginning to end, a set small enough to establish exactly what happened to the satisfaction of local school officials, policymakers, and the general public.

And if Pearson destroyed all physical artifacts so you can't trace the path of data? Cue "expensive lawyer" music...

Posted in Accountability Frankenstein on July 14, 2010 7:09 AM