November 28, 2005

Growth models

All right—having beaten a future article for Education Policy Analysis Archives halfway into shape, I'm taking some time for relaxation and my sidewise way of looking at education policy, or at what passes for it.

Since the announcement this month that the Department of Education would promote the piloting of so-called growth models of accountability, there have been a number of reactions, many of them skeptical: from George Miller and Ted Kennedy, from the Citizens' Commission on Civil Rights (a private organization, despite the similarity in name to the official U.S. Civil Rights Commission), from the Education Trust, and from Clintonista Andrew Rotherham, who points out that only a few states have anything close to the longitudinal-database elements needed to carry this off.

While a few journalists have had a reaction-fest with this, there has been no acknowledgment of the existing literature on so-called growth models, their political implications, or the gaps in that literature....

I'll state up front that it's fine to focus on political questions—indeed, I've argued in The Political Legacy of School Accountability Systems that the political questions are ultimately the important ones, and that it's impossible to have a technocratic solution to political problems—just so long as you don't ignore the technical issues (and for those, see Linn, 2004). Haycock of the Education Trust is ultimately right about the focus on philosophical questions, regardless of whether I agree with her on specifics.

Big political questions

So what are the policy/political questions? A few to consider:

  • The dilemma between setting absolute standards and focusing on improvement. As Hochschild and Scovronick (2003) have pointed out, there's a real tension between the two, and it's impossible to resolve it completely. On the one hand, there are concrete skills adults need to be decent (yea, even productive) citizens. On the other hand, focusing entirely on absolute standards without acknowledging the work that many teachers do with students with low skills is unfair to the teachers who voluntarily choose to work in hard environments. And, no, I'm not going to take BS from either side claiming that, on the one hand, we need to be kind to kids (and deny them the skills they need??) or, on the other, that we need to take a No Excuses approach toward those lazy teachers (and whom are you going to find to teach in high-poverty schools once the teachers you've insulted have left??).
  • The question of how much improvement to expect. Here, Bill Sanders' model (we'll take it on faith for the moment that he's accurately representing his model—more later on this point) expects close to an average of one year's growth per year in school (see Ballou, Sanders, & Wright, 2004, for the most recent article on his approach). But for students who are behind their peers, or behind where we'd like them to be, Haycock is right: one year's growth is not enough (see Fuchs et al., 1993, for a more technical discussion and the National Center on Student Progress Monitoring for resources).
  • The tension between the public equity purposes of schooling and the private uses of schooling to gain or maintain advantage. Here's one thought experiment: try telling wealthy suburban parents, We want your kids to improve this year, but not too much, because we want poor kids in the city or the older suburb nearby to catch up with your children in achievement and life chances. If anyone can keep a straight face while claiming the parents so told would just sit back and say, Sure, then I have some land to sell you in Florida.
  • Where is intervention best applied? Andrew Rotherham's false dichotomy between demographic determinists and accountability hawks aside, arguments by David Berliner are about where to intervene to improve children's learning, not about giving up. (I should state here that I have, of course, heard teachers and some of my students fall into the trap of this dichotomy, but that's a constructed dynamic from which we can and must escape. To dismiss Berliner and others as if they fall into the trap is to shut off one escape route. Shame on those who carelessly elide the two.)
  • Assumptions that technocratically-triggered sanctions based on either growth or absolute formulae work. I have yet to be convinced that such a kick-in-the-pants effect is strong enough or free of side effects. This is not to say that I don't believe in coercion. I am just a believer in shrewd coercion, not the application of statistical tubafors (you'll have to search for the term on that page).

Statistical issues with multilevel modeling

Among education researchers, probably the tool of choice right now for measuring growth is so-called multilevel modeling. That multilevel modeling is the tool of choice is probably an accident of recent educational history (the push for accountability coinciding with the development of multilevel statistical tools), but it allows a variety of accommodations to the real life of schools, where students are affected not only by a teacher but by a classroom environment shared with other kids, by the school, and by their own and their families' characteristics. That's a mouthful, and it only skims the surface.
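
To make that nesting concrete, here is a minimal sketch of a two-level model with a random intercept for schools. The data are simulated and the package choice (Python's statsmodels) is mine, not anything any state system uses:

```python
# A minimal sketch of a two-level model: students (level 1) nested in
# schools (level 2). All data below are simulated; nothing here reflects
# any real accountability system.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(42)
n_schools, per_school = 30, 40
school_id = np.repeat(np.arange(n_schools), per_school)
school_effect = rng.normal(0, 5, n_schools)[school_id]  # shared school context
prior = rng.normal(200, 20, n_schools * per_school)     # prior-year score
score = 50 + 0.8 * prior + school_effect + rng.normal(0, 10, n_schools * per_school)

df = pd.DataFrame({"school_id": school_id, "prior": prior, "score": score})

# Fixed effect: prior achievement. Random intercept: one per school,
# capturing school-level variation shared by classmates.
result = smf.mixedlm("score ~ prior", data=df, groups=df["school_id"]).fit()
print(result.summary())
```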

Of the multilevel modeling pioneers, the best of the bunch by far (beyond Bryk and Raudenbush, whose names are most familiar in the U.S.) is Harvey Goldstein, whose downloadable papers are a treasure-trove of introductory material for those with some statistical background. The Centre for Multilevel Modelling (which he founded) is one broader source, as are UCLA's multilevel modeling page and Wolfgang Ludwig-Mayerhofer's. A Journal of Educational and Behavioral Statistics special issue on value-added assessment (Spring 2004) is now required reading for anyone looking at multilevel modeling and the question of adjustment for demographic factors.

But there are both technical and policy/political issues with the use of multilevel modeling software (and I use that more generic term rather than referring to specific software packages or procedures). Let me first address some of the technical issues:

  • Vertical scaling. In some statistical packages, there is a need for a uniform scale, where the achievement of students at different grades and ages is on the same scale. That way, the score of a student who is 7 can be compared to an 8-, 9-, or 10-year-old's achievement, allowing comparison across grades. This is not necessary with packages that use prior scores as covariates, but anything that looks at a measure of growth in some way strongly begs for a uniform (or vertical) scale. The problems with such vertical scaling stem from the fact that it is very, very difficult to do the type of equating across different grades (and equivalent curricula!) that is necessary to put students on a single scale. Learning and achievement are not like weight, where you can put a 7-year-old and a 17-year-old on the same scale. Essentially, equating is a type of piecemeal process of pinning together a few points of separate scales (each more closely normed). At least two consequences follow:
    1. Measurement errors in a vertical scale will be larger than errors in a single-grade scale, which test manufacturers have far more experience norming.
    2. The interpretation of differences on a vertical scale will be rather difficult. One reason is the change in academic expectations between grades, unless you narrow testing to a limited range of skills. But the other reason is subtler: the construction of a vertical scale can only be guaranteed to be monotonic (higher scores on a single-grade test will map to higher scores on the cross-grade, vertical scale), not linear. There will almost inevitably be some compression and expansion of the scale relative to single-grade test statistics. That nonlinearity is not a problem for estimation (since models of growth can easily be nonlinear). But the compression/expansion possibility makes interpretation of growth difficult. Does 15-point growth between ages 10 and 11 mean the same thing as 15-point growth between ages 15 and 16? Who the heck knows! (See the toy sketch after this list.)
  • Swallowing variance. As Tekwe et al. (2004) point out in a probably-overlooked part of their article, the more complex models of growth swallow a substantial part of the available variance before getting to the "effects" of individual schools and teachers. This is inevitable with any statistical estimation technique with multiple covariates (or factors, independent variables, or whatever else you want to call them), but it has some serious consequences for using growth models for accountability purposes. It erodes the legitimacy of such accountability models among statistically-literate stakeholders, who see that most variance is accounted for (even if in a noncausal sense) by factors other than schools and teachers. In addition, this process leaves the effect estimates for individual teachers and schools very close to zero and to each other. Thus, with Sanders' model as used in Tennessee, the vast majority of effects for teachers (in publicly-released distributions) are statistically indistinguishable. Never mind all my other concerns about judging teachers by technocracy: this just isn't a powerful tool, even for summative judgments.
  • Convergence of estimates. In the packages I know, the models don't always converge (result in stable parameter estimates), given the data. Researchers with specific, focused questions will often fiddle manually with the equations and variables to achieve convergence, but you can't really make idiosyncratic adjustments in an accountability system that claims to be stable and uniform over time—or, rather, you shouldn't make such idiosyncratic adjustments and keep a straight face while claiming that the results are uniform and stable over time.
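
Here's the toy sketch promised above: a monotonic but nonlinear mapping from a single-grade scale to a hypothetical vertical scale. The square-root shape is entirely invented; the point is only that equal gains on one scale can become unequal gains on the other:

```python
import numpy as np

# Hypothetical vertical scale: monotonic in the raw single-grade score
# (higher raw always maps to higher vertical) but not linear. The
# square-root shape is made up purely for illustration.
def to_vertical(raw):
    return 100 * np.sqrt(raw)

# The same 15-point raw gain at the bottom and near the top of the scale:
low_gain = to_vertical(40) - to_vertical(25)    # about 132 vertical points
high_gain = to_vertical(90) - to_vertical(75)   # about 83 vertical points

print(f"15 raw points at the bottom: {low_gain:.1f} vertical points")
print(f"15 raw points near the top:  {high_gain:.1f} vertical points")
```

Identical gains on the single-grade scale, very different "growth" on the vertical scale: that is the compression/expansion problem in miniature.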

Political complications of multilevel models

In addition to the technical considerations, there are issues with multilevel modeling that are more political in nature than technical/statistical:

  • Omissions of student data. This is true of any accountability system that allows exemptions, but it's especially true of any model of growth that omits students who move between test dates. It's a powerful incentive for schools to perform triage on marginal students in high school, either subtly or openly. I've heard of such triage efforts in Florida, though it's hard to demonstrate intentionality. But even apart from the incentive for triage, it's hard to claim that any accountability system targets the most vulnerable when those are frequently the students who move between schools, systems, and states. And the more years included in a model, the less that movers count in accountability.
  • The complexity factor. Technical issues with complex statistical models are, well, complex and difficult to understand without some statistical background, and such complexity requires care in educating policymakers. That's especially important with growth models, which are pretty easy to sell to lawmakers looking for a technocratic model that they don't have to think too hard about. Here's a reasonable test: will Andrew Rotherham's blog ever mention the technical problems with growth models? Will the briefs put out by various education policy think tanks explain the technical issues, or will they prove the term to be an oxymoron?
  • Proprietary software. I think that William Sanders still holds all data and the internal workings of his package to be proprietary trade secrets, even though they're used as public accountability mechanisms in Tennessee, at least (anywhere else, dear readers?) (Fisher, 1996). How can anyone justify using a secret algorithm for public policy in an environment (education) where everyone expects transparency, and where the justification for accountability itself rests on transparency? (For other commentaries about Sanders' model, see Alicias, 2005; Camilli, 1996; Kupermintz, 2003, and an older description of my own involvement in the earlier discussions of Tennessee's system. For his own description, see Ballou, Sanders, & Wright, 2004; Sanders & Horn, 1998.)

Life-course models

One of my concerns with the increasingly complex world of statistical models of growth is their amazing disconnect from fields that should be natural allies. We have great statistical packages that are incredibly complex, but some days they seem more like solutions in search of problems than a logical outgrowth of the need to model growth and development in children.

As stated earlier, one problem is the attempt to put student skills, knowledge, and that vague thing we call achievement in an area on one scale. Unlike weight, there isn't a cognitive measuring tool I'm aware of in which all children would have interpretable scores—nonzero measures on an equal-interval scale, to choose one goal. But for now, let's assume that someday psychometricians find the Holy Grail of vertical scales (or maybe that would be a Holy Belay Line to climb down after scaling the...). Even waving away that problem, I'm still troubled by the almost gory use of statistical packages without some thought about the underlying models.

Even if you were interested largely in describing rather than modeling growth, you could start with nonparametric tools such as locally-weighted regression (LOESS) and move on to functional data analysis. Those areas of statistics seem like logical ways to approach the types of longitudinal analysis that the call for modeling growth seems to require.
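
As a taste of the descriptive approach, here is a minimal LOESS sketch over made-up age-by-achievement data. The lowess smoother from Python's statsmodels is my choice for illustration, not anything the literature prescribes:

```python
import numpy as np
from statsmodels.nonparametric.smoothers_lowess import lowess

rng = np.random.default_rng(0)

# Made-up cross-sectional data: noisy "achievement" that rises quickly
# at younger ages and flattens out later. Nothing here is real data.
ages = rng.uniform(6, 17, size=500)
scores = 200 + 40 * np.log(ages) + rng.normal(0, 8, size=500)

# Locally weighted regression: no assumed functional form for growth,
# just a smooth descriptive curve through the age-score cloud.
smoothed = lowess(scores, ages, frac=0.4)  # returns sorted (age, fit) pairs
print(smoothed[:5])
```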

Then there is demography. I'll admit I'm a bit partial to it (having a master's from Penn's demography group), but few education researchers have any formal training in a field whose model assumptions are closer to epidemiology and statistical engineering analysis than to psychometrics. In demography, the basic conceptual apparatus revolves around analyzing the risk of events to which a population is exposed. The bread and butter of demography are births and deaths, or fertility and mortality. The fundamental measure is the event-occurrence rate, and the conceptual key to mathematical demography is the assumption that behind any living population is a corresponding stationary population equivalent: a hypothetical or synthetic cohort that one can conceive of as exposed to the conditions a population faces during a period of time, rather than the conditions a birth cohort experiences. It's as if you had a time machine at the end of December 31, 1997, and flipped everyone in a group of 1,000 babies born at the first instant of January 1, 1997, who survived to the end of the year, back to the beginning of the year to live it again at the next age, over and over for a lifetime. It's an imaginary, lifelong version of Groundhog Day, but one with the happy consequence that the synthetic cohort would never hear of Monica Lewinsky. What happens to that synthetic cohort never happens to a real birth cohort, but it does capture the population characteristics of 1997. You can find the U.S. period life table for 1997 online in a PDF file, with absolutely no mention of Monica Lewinsky. (There is much I'm omitting in this description of a stationary population equivalent, I know!)
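
For the mechanics behind that thought experiment, here's a bare-bones sketch of how a period life table applies one year's age-specific death probabilities to a synthetic cohort. The probabilities below are invented round numbers for illustration, not the actual 1997 U.S. rates:

```python
# Bare-bones period life table: apply one year's age-specific death
# probabilities to a hypothetical cohort of 1,000 newborns, as if the
# survivors relived that year's conditions at every age. The
# probabilities are invented round numbers, not real 1997 U.S. rates.
death_prob = {0: 0.007, 1: 0.0010, 2: 0.0005, 3: 0.0004, 4: 0.0003}

survivors = 1000.0
for age in sorted(death_prob):
    deaths = survivors * death_prob[age]
    print(f"age {age}: {survivors:7.1f} alive at start, {deaths:4.1f} deaths")
    survivors -= deaths
# The survivorship column describes a cohort that no real group of
# babies ever was, but it summarizes the period's mortality conditions.
```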

Demography offers a few aids to this business of modeling growth, because its bailiwick is looking at age-associated processes. Or, as a program officer for the National Institute on Aging explained at a conference session I attended a few weeks ago, aging is a lifelong process. Trite, I know, but it's something that the growth-modeling wannabes should learn from, for two reasons.

One is the equally obvious (almost Yogi Berra-esque) observation that as children grow older, their ages get bigger. Unfortunately, most school statistics are reported by administrative grade, not age, which makes comparability on almost any subject (from graduation to achievement) virtually impossible. The only reputable source of national information about achievement that I'm aware of based on age, not grade, is the NAEP Long-Term Trends reports, pegged to 9-, 13-, and 17-year-olds tested in various years from 1971 to 2004. Some school statistics used to be reported by age—age-grade tables, which I'm finally figuring out how to use reasonably. But you could have some achievement testing conducted by age and ... well, enough of that rant.

The broader use of demography should be the set of perspectives and tools that demographers have developed for measuring and modeling lifelong processes. Social historians have an awkward term for this—life-course analysis. What changes and processes occur over one's life, and how do you analyze them? Some education researchers acknowledge at least a chunk of this perspective, most notably in the literature on retention, where you cannot take achievement in a specific grade's curriculum as evidence of the (in)effectiveness of retention in improving achievement. You can only find out the answer by looking at what happens to children as they grow older.

Some of the more sophisticated mathematical models of population processes have direct parallels in education that could be explored fruitfully. To take one example unrelated to achievement growth, parity progression (women's moves from having 0 children to 1 to 2 to ...) is an analog of progression through grades, and more could be done using parity progression ratio estimates to see what happens with grade progression.
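
To make the analogy concrete, here is a small sketch (with invented counts) showing that the same arithmetic computes parity progression ratios and grade progression ratios:

```python
# Progression ratios: of those who reach one stage, what proportion go
# on to the next? Demographers compute these across parities (number of
# births); the identical arithmetic applies to grades. Counts invented.
def progression_ratios(counts):
    """counts[i] = number ever reaching stage i; returns stagewise ratios."""
    return [round(counts[i + 1] / counts[i], 3) for i in range(len(counts) - 1)]

parity_counts = [1000, 820, 560, 210]   # women reaching parity 0, 1, 2, 3
grade_counts = [1000, 940, 880, 800]    # students reaching grades 9-12

print("parity progression ratios:", progression_ratios(parity_counts))
print("grade progression ratios: ", progression_ratios(grade_counts))
```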

But, to growth... Variable-rate demographic models hold considerable promise, at least in theory, for analyzing changes from cross-sectional data. In the standard (multilevel model) view, you focus on longitudinal data and toss cross-sectional information, because (you think) there is no way to separate cohort effects from real growth. Aha! But here demography has an idea—stationary population equivalents—and a tool—variable-rate modeling. The risk model of demography requires proportionate changes, natural logs, and e to the power of ... well, you get the idea. I'm going to provide a brief sketch and two possible directions. For more details, see Chapter 8 of Preston, Heuveline, and Guillot (2001). (And remember, we're magically waving away all psychometric concerns. We'll get back to that a bit later.)

We're going to consider the measured achievement of 10-year-olds in 2006 (on a theoretically perfect vertically-scaled instrument) in two different ways: first, in terms of changes among 10-year-olds across years; second, in terms of the experience of a cohort. We can then use those two perspectives to relate observed information from two cross-sectional testing administrations to the underlying population dynamics (in this case, achievement growth through childhood).

First, let's compare the achievement of 10-year-olds in 2006 to that of 10-year-olds in 2005. It doesn't matter which group does better (or whether they're equal). My son is now 10 years old (and will still be 10 for the next round of annual tests here in Florida), so let's suppose that the achievement of 10-year-olds in 2006 is higher than for 10-year-old students the year before. Then we could think of achievement as follows:
The achievement of 10-year-olds in 2006 = achievement of 10-year-olds in 2005 and some growth factor in achievement among 10-year-olds between 2005 and 2006
For now, it doesn't matter whether the "and" refers to an additive growth factor, a proportionate one, or some other function. And if the 10-year-olds in 2005 did better, the growth factor is negative, so it doesn't matter who did better.

Second, let's compare the achievement of 10-year-olds in 2006 to 9-year-olds in 2005 in a parallel way:

The achievement of 10-year-olds in 2006 = achievement of 9-year-olds in 2005 and some growth factor in achievement between the ages of 9 and 10 for 2005-06.
Note: this "growth factor" is part of the underlying population characteristic that we are interested in (implied growth in achievement between ages, across the ages of enrollment).

Now, let's combine the two statements into one:

the achievement of 10-year-olds in 2005 and some growth factor in achievement among 10-year-olds between 2005 and 2006 =
the achievement of 9-year-olds in 2005 and some growth factor in achievement between the ages of 9 and 10 for 2005-06.
Without assuming any specific function here, this statement expresses the relationship between cross-sectional information across ages as one that combines changes within a single age (across the period) and changes across ages (within the period). Demographers' models of population numbers and mortality are proportional, so the "and" in both cases is a multiplicative function. But one could also assume an additive function, or something else entirely, and the concept would still work. Once one estimates the changes within single years of age, one can accumulate those differences and, within the model, estimate the underlying achievement growth between ages, which is the critical information of interest. When the interval between test administrations is equal to the interval between the ages (four years, for NAEP long-term trends), the additive version with linear interpolation of age-specific change measures is identical to the change between 9-year-olds in 1980 and 13-year-olds in 1984, etc. But this method allows estimating those period-specific rates when the test dates aren't as convenient, and there the exponential estimates differ.
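
To make the bookkeeping concrete, here's a small numeric sketch (all scores invented) of the additive version and the demographers' proportional version of the combined statement above:

```python
import math

# A numeric sketch of the combined statement above. All scores invented.
# score[(age, year)] = mean score of that age group in that year's
# cross-sectional administration (on our magically perfect vertical scale).
score = {
    (9, 2005): 210.0, (10, 2005): 221.0,
    (9, 2006): 212.5, (10, 2006): 224.0,
}

# Change within a single age across the period (10-year-olds, 2005 -> 2006):
within_age_change = score[(10, 2006)] - score[(10, 2005)]

# Additive version of the identity:
#   score(10, 2005) + within_age_change = score(9, 2005) + age_growth
age_growth = score[(10, 2005)] + within_age_change - score[(9, 2005)]

# Proportional version (the demographers' habit): compose ratios, not sums.
growth_factor = math.exp(
    math.log(score[(10, 2005)] / score[(9, 2005)])
    + math.log(score[(10, 2006)] / score[(10, 2005)])
)

# With a one-year testing interval matching the one-year age interval,
# each version collapses to the straight cohort comparison in its own
# metric: a 14.0-point gain, or a factor of 224/210.
print(f"additive growth, ages 9 to 10 in 2005-06: {age_growth:.1f} points")
print(f"proportional growth factor:               {growth_factor:.4f}")
```

The interesting cases are the inconvenient ones: when test dates and age intervals don't line up, the interpolation of age-specific change measures does real work, and the additive and exponential versions no longer have to agree.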

Of course, this assumes perfect measurement, something that I'd be very cautious of, especially given the paucity of data sets apart from the NAEP long-term trends tables. I've played around with those, and the additive and proportionate models come up with virtually identical results with national totals, assuming linear change in the age-specific growth measures (since we only have measures for 9-, 13-, and 17-year-olds).

[Figure: NAEPmath.gif. Units for the vertical axis come from the NAEP scale.]

[Figure: NAEPreading.gif. Changing the interpolation of age-specific growth rates to a polynomial fit doesn't change the additive model much; it shrinks the estimates of growth in the exponential model a bit but doesn't change trends. And, yes, I'm aware of the label problem: arithmetic should be additive or linear.]

There are odd results (does anyone know why the reading results were unusually high in 1992? Are the results for 17-year-olds in 2004 unusually low for any reason? I was using the bridge results), and there are all sorts of caveats one should attach to this type of analysis, from the complexity of estimating standard errors of derived data, to changes in the administration for students with disabilities, to the comparability of 2004 results, and, oh, I'm sure there's more. The point is that demographic methods provide some feasible tools precisely for looking at age-related processes, if we'd only look.

References

Alicias, E. R. Jr. (2005). Toward an objective evaluation of teacher performance: The use of variance partitioning analysis, VPA. Education Policy Analysis Archives, 13(30).

Ballou, D., Sanders, W., & Wright, P. (2004). Controlling for student background in value-added assessment of teachers. Journal of Educational and Behavioral Statistics, 29(1), 37–65.

Camilli, G. (1996). Standard errors in educational assessment: A policy analysis perspective. Education Policy Analysis Archives, 4(4).

Fisher, T. H. (1996, January). A review and analysis of the Tennessee Value-Added Assessment System. Part II. Nashville, TN: Comptroller of the Treasury.

Fuchs, L. S., Fuchs, D., Hamlett, C. L., Walz, L., & Germann, G. (1993). Formative evaluation of academic progress: How much growth can we expect? School Psychology Review, 22, 27–48.

Hochschild, J. L., & Scovronick, N. B. (2003). The American dream and the public schools. New York: Oxford University Press.

Kupermintz, H. (2003). Teacher effects and teacher effectiveness: A validity investigation of the Tennessee Value-Added Assessment System. Educational Evaluation and Policy Analysis, 25(3), 287–298.

Linn, R. L. (2004). Accountability models. In S. H. Fuhrman & R. F. Elmore (Eds.), Redesigning accountability systems for education (pp. 73–95). New York: Teachers College Press.

Preston, S.H., Heuveline, P., & Guillot, M. (2001). Demography: Measuring and modeling population processes. Malden, MA: Blackwell Publishers.

Sanders, W. L., & Horn, S. P. (1998). Research findings from the Tennessee Value-Added Assessment System (TVAAS): Implications for educational evaluation and research. Journal of Personnel Evaluation in Education, 12(3), 247–256.

Tekwe, C. D., Carter, L. R., Ma, C., Algina, J., Lucas, M. E., Roth, J., Ariet, M., Fisher, T., & Resnick, M. B. (2004). An empirical comparison of statistical models for value-added assessment of school performance. Journal of Educational and Behavioral Statistics, 29(1), 11–35.

Update! (12/2)

Today, the Financial Times is publishing an article on the UK system of league tables, and reporter Robert Matthews cites Harvey Goldstein extensively. Thanks to Crooked Timber for the tip.

Update (12/8)

I foolishly forgot to mention a 2004 RAND publication, Evaluating Value-Added Models for Teacher Accountability, which describes the limits of growth models for accountability. Thanks to UFT's Edwize blog for pointing it out (though I have a few bones to pick with the larger post—I don't have enough time right now...).

Update (12/13)

Andrew Rotherham discusses two technical issues with growth models (longitudinal databases and vertical scaling of measures), to his credit.