July 10, 2009

Combining qualitative and quantitative evidence for teacher evaluation: What does "predominant" mean?

According to Gotham Schools, former NSVF and current USDOE official Joanne Weiss "said the Obama administration aims to reward states that use student achievement as a 'predominant' part of teacher evaluations with the extra stimulus funds" (emphasis added). I followed up with a USDOE representative, who emphasized after talking with Weiss that she meant a predominant part, not the predominant part of teacher evaluations, and that is how Walz reported the comment. The department representative added that department leaders "consider it illogical to remove student achievement from teacher evaluation, and we want states and districts to remove any existing barriers."

This came on the heels of TNTP's Widget Effect argument and Joan Baratz-Snowden's Fixing Tenure. I know that the political context of Weiss's remarks is to push the Duncan line that New York State's moratorium on the use of test scores in personnel decisions is wrong, and that, if necessary, Weiss will bar New York from Race to the Top funds if the legislature doesn't get its act in gear. Stand in line, please; I have a feeling a few million New Yorkers have first dibs on dunking the entire state senate in the Hudson near Albany sometime in late November.

Back to policy, though: the word predominant pricked up my ears, because the Florida legislature's language has evolved from the dominance of student achievement toward quantification. The current language on personnel evaluation is a legacy of language first written in 1999:

The assessment must primarily use data and indicators of improvement in student performance assessed annually as specified in s. 1008.22 and may consider results of peer reviews in evaluating the employee's performance. [emphasis added]

Florida's current performance-pay language, the Merit Award Program, stipulates that for the purposes of merit pay, achievement data "shall be weighted at not less than 60 percent of the overall evaluation" (F.S. 1012.225(3)(c)).

I need to think about this in some depth, but it strikes me that the Florida legislature mandated one of several options for combining quantitative and qualitative judgments of teacher effectiveness: the point system. You can probably come up with other variations that meet the statutory language, but my guess is that real-world implementations would almost all be linear combinations of different subscores, and I will use incredibly technical measurement language to call this the point system of combining different sources of information about teaching effectiveness. But it is not the only option, and I am always troubled when a clunky system becomes the default because it is the first option anyone thinks of rather than the result of a deliberate decision among options. I understand why a point system sits in the bureaucratic and political gravity well, and it may well be that this particular clunky point system is the best option. But it should be weighed against the other clunky systems that might be appropriate.
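To make the point system concrete: the statutory floor can be read as a constraint on a weighted linear combination of subscores. Here is a minimal sketch in Python; the category names, weights, and scores are invented for illustration, and only the 60 percent floor comes from the statute.

```python
# Hypothetical sketch of a "point system": a linear combination of subscores,
# with student achievement data weighted at no less than 60 percent
# (the floor in F.S. 1012.225(3)(c)). All names and numbers are invented.

def composite_score(subscores, weights):
    """Weighted sum of subscores; weights must sum to 1."""
    assert abs(sum(weights.values()) - 1.0) < 1e-9
    return sum(weights[k] * subscores[k] for k in weights)

weights = {
    "student_achievement": 0.60,   # statutory minimum weight
    "principal_observation": 0.25, # illustrative
    "peer_review": 0.15,           # illustrative
}

subscores = {  # each on a 0-100 scale, purely illustrative
    "student_achievement": 72,
    "principal_observation": 88,
    "peer_review": 90,
}

print(composite_score(subscores, weights))
```

Whatever else you think of it, the appeal is obvious: the arithmetic is transparent, and compliance with the 60 percent floor can be checked by looking at a single weight.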

For example, there is also the holistic review of teacher effectiveness, such as exists in the teacher evaluation system of the new Green Dot-UFT collective bargaining agreement. There is no specific way that test scores enter the judgment as such, though the implication is that teachers will have to show, at the very least, that they use assessment to shape instructional practices (what the document calls action research).

But those aren't all: a flow-chart is at least theoretically possible, though I do not have a real-life example. Yes, there are process flow-charts such as exist in Denver (and in the Green Dot system), but those essentially describe when and how you schedule meetings, not how you make decisions in a meeting. (Step 1: Can you understand this chart? Yes: read the rest of it while walking to your secretary's desk; no: pretend to read it while walking to your secretary's desk. Step 2a [at secretary's desk]...)

Most theoretical: a Bayesian bump algorithm. I am guessing that there is a high probability that any subjective Bayesian statistician reading this blog will have thought of this idea already, but I'll adjust that guess after some data comes in. Since even well-trained evaluators are making subjective judgments about people, you could treat a principal's or peer's judgment as a prior judgment about the probability that a teacher should be retained/rewarded, given help, or fired. In the Bayesian world, that prior judgment can and should be shifted based on data, to form a posterior estimate of the probabilities of what should be done (you can play with a Bayesian calculator here, in a medical-test context). That adjustment is why I'm calling it a "bump" -- start with a professional assessment on various grounds and allow that to be bumped somewhat by test data, with the magnitude of the bumping depending on the data. Going down this path would involve some interesting studies, and it would probably be working with Bayesian posterior odds (which provide an interesting possible back door to a point system). This is a little out of my league in terms of specific characteristics, but the Bayesian perspective on statistics makes it possible to combine qualitative and quantitative data in a framework that already exists.
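A minimal sketch of the bump, assuming a single likelihood ratio is enough to summarize the test data (a big assumption); the function name and all the numbers below are invented for illustration.

```python
# Hypothetical sketch of a "Bayesian bump": a principal's subjective prior
# probability that a teacher is effective, updated by test-score evidence
# via posterior odds. Numbers are illustrative, not real parameters.

def posterior_probability(prior, likelihood_ratio):
    """Update a prior probability using posterior odds.

    likelihood_ratio = P(data | effective) / P(data | not effective)
    """
    prior_odds = prior / (1 - prior)
    posterior_odds = prior_odds * likelihood_ratio
    return posterior_odds / (1 + posterior_odds)

# Principal's holistic judgment: 70% confident the teacher is effective.
prior = 0.70

# Suppose the observed growth data is 3x as likely for effective teachers:
bumped_up = posterior_probability(prior, 3.0)

# ...or 3x as likely for ineffective teachers:
bumped_down = posterior_probability(prior, 1 / 3)

print(bumped_up, bumped_down)
```

Note the asymmetry the odds form builds in: the same strength of evidence moves a confident prior less in absolute terms than a fence-sitting one, which is roughly the behavior you would want from a "bump" that respects professional judgment.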

So we have four large categories of ways to combine essentially qualitative and quantitative data. While I am busy reading student work and doing other stuff in the next week, you all have a chance to dive in and describe what you think are strengths and weaknesses of each approach, as well as any additional categories (or disagreements with my classification entirely). After I have a weekend and get other tasks finished, I will return to explain (a) why a Bayesian approach is not only philosophically appropriate but serves the needs of unions, students, and anyone Alexander Russo describes as reformy; (b) why a Bayesian approach is not that different from a point system, at least in theory; and (c) what characteristics you would look for in a point system for teacher evaluation to meet the political interests described in (a).

Posted in Accountability Frankenstein on July 10, 2009 12:20 PM