June 19, 2010

What uses of test scores will pass legal muster in teacher evaluations?

Legal considerations on the use of test-score-derived statistics in teacher evaluation: On June 2, Scott Bauries started an interesting discussion of value-added measures and teacher evaluations from a legal perspective. It's very important to read the comment thread, as Bruce Baker and Preston Green challenge his conclusions, especially with regard to disparate-impact claims. Bauries claims that employers need to defend the procedural due process but are probably safer on the substance, regardless of the problems with value-added measures.


Reading the main entry and discussion, I lean strongly toward Bauries's conclusion, with one important caveat (below). My impression of the 2000 G.I. Forum v. Texas Education Agency case on the disparate impact of high-stakes graduation tests, which the state won, was that the plaintiffs were not prepared for the last burden-shifting test on disparate impact. My rough impression of disparate-impact claims of illegal discrimination based on the Civil Rights Act: it's like a series of penalty kicks or shots in soccer or hockey, or maybe the games with alternating possession in overtime. I'm not a lawyer, and this is primarily based on my understanding of Title VI rather than Title VII law, but on to the probably inapt analogy: First, the plaintiffs try to demonstrate that a mechanism such as a test affected a property interest of the plaintiffs and had a disparate impact on one of the protected classes. If the plaintiffs succeed, the defendant tries to demonstrate that the mechanism meets an important interest, was properly constructed and applied, and that members of the affected class had a fair chance at succeeding in the mechanism.

So far, we're describing lots of situations that have evolved in the past 25-30 years, especially with high-stakes testing. Debra P. v. Turlington established the basic federal expectations in terms of student tests, and as a number of states created a new round of graduation tests in the 1990s, they relied on Debra P. v. Turlington as a guide to meeting the basic questions and getting to the final round all tied up. And this sort of makes sense if you think about the maturity of various mechanisms: you can argue that there is a rational state interest in a certain outcome (an adequate measure of achievement in the case of graduation requirements). Satisfying the "fair chance at succeeding" test is then often a question of meeting a set of criteria rather than achieving perfection, and that's often a reflection of the organization's experience and capacity.

The final test is whether there is a better option: could the defendant have feasibly chosen an alternative mechanism that satisfies the same interest with less impact? I've never read all of the materials in the G.I. Forum case, but the following is a key passage in Judge Prado's ruling:

The Plaintiffs were able to show that the policies are debated and debatable among learned people. The Plaintiffs demonstrated that the policies have had an initial and substantial adverse impact on minority students. The Plaintiffs demonstrated that the policies are not perfect. However, the Plaintiffs failed to prove that the policies are unconstitutional, that the adverse impact is avoidable or more significant than the concomitant positive impact, or that other approaches would meet the State's articulated legitimate goals. In the absence of such proof, the State must be allowed to design an educational system that it believes best meets the need of its citizens. (emphasis added)

In the end, the plaintiffs' lawyers in the Texas case were unable to provide a clear alternative to high-stakes testing that they could demonstrate was both feasible (i.e., wouldn't cost an arm and a leg) and would have a lower disparate impact. I'm not too worried about the state interest, since you can usually construct alternative mechanisms that have facial validity and that have roughly the same "noise" as whatever you're arguing against. And the not-an-arm-and-a-leg criterion is tougher to meet if you're arguing for portfolios, since they increase the cost... but that increase starts from a relatively low base of per-pupil cost. Ultimately, though, it is hard to argue that a prospective alternative would result in a lower disparate impact if it is only prospective and you thus have no evidence of whether the protected class you're worrying about would be helped by the alternative.

So in the discussion over at EdJurist, Bauries's clinching argument is really that, for all their flaws, value-added measures are going to look reasonable to a judge in that they try to adjust for the incoming achievement of students, and that plaintiffs will have to put forward an alternative with concrete evidence that it does a demonstrably better job of treating teachers fairly. The catch-22: without a working model of alternatives with that record, plaintiffs are going to be sunk on disparate-impact claims.

Bruce Baker has followed up on Bauries with a set of tongue-in-cheek impossible criteria to make the use of value-added measures reasonably fair. I understand the temptation, but he's onto one thing: ultimately, local K-12 unions will have to figure out how to respond. This will include whether they have separate evaluation procedures for the 20% of teachers for whom value-added measures are even possible, how to mix the data, and so forth.

And now for the caveat: a good part of the legal consequences of using student test scores for personnel decisions will depend on how stupid local administrators are in the first jurisdictions to use them, and the first that are challenged. I can imagine districts where administrators are careful to fire experienced teachers only where there is a record of several years of low statistical measures of student achievement and only where that is consistent with low marks in other areas, such as administrator and peer observations. I can also imagine districts where administrators purge teachers based on a single year's worth of data and with no checks of consistency with other sources of information. If the legal tests are in jurisdictions with the first set of practices, they're far more likely to pass muster than if the first cases are for terminations that don't meet a basic smell test of rationality.

Posted in Education policy on June 19, 2010 11:40 AM