March 13, 2008

A rescue for Swanson's CPI?

(Note: I've changed a few things in the first graph, and someone pointed out to me that I had fallen prey to one of the many Excel glitches, but I'll show the changes below...)

I've written informally here on graduation measures, expressing my concern about Chris Swanson's Cumulative Promotion Index, known among the grad-rate numerati as the CPI (e.g., 4/16/06, 6/21/06, 6/12/07). One of the CPI's weaknesses is its reliance on the annual numbers reported to the Common Core of Data. Another is the assumption that the ratio of enrollment in 10th grade in 2008 to 9th grade enrollment in 2007 is a meaningful gauge of cohort retention/promotion. In 2005, Rob Warren explained (PDF, iPaper) why that is problematic: 9th grade retention and student transfers pollute the measure. Transfers pollute all of the components of the CPI (net in-transfers artificially inflate CPI), but 9th grade retention is particularly problematic (deflating CPI).
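To see why both problems bite, here is a toy sketch of the 10th-to-9th ratio. It is an illustration under made-up assumptions, not Warren's analysis: the function name, the steady-state setup, and the example numbers are all hypothetical.

```python
def tenth_to_ninth_ratio(entrants, repeat_rate, net_transfers_in=0):
    """Next fall's 10th-grade count divided by this fall's 9th-grade count.

    Toy steady-state assumptions (for illustration only):
      * `entrants` new students start 9th grade every year;
      * a share `repeat_rate` of them repeats 9th grade exactly once;
      * nobody drops out, so the true cohort promotion rate is 100%.
    """
    # This fall's 9th grade holds the new entrants plus last year's repeaters.
    ninth_now = entrants * (1 + repeat_rate)
    # Next fall's 10th grade holds one full cohort's worth of promoted
    # students, plus any net in-transfers.
    tenth_next = entrants + net_transfers_in
    return tenth_next / ninth_now


print(tenth_to_ninth_ratio(1000, 0.0))        # no retention, no transfers
print(tenth_to_ninth_ratio(1000, 0.25))       # retention pushes the ratio below 1
print(tenth_to_ninth_ratio(1000, 0.25, 300))  # in-transfers push it above 1
```

Even though every student in this toy world eventually reaches 10th grade, the cross-sectional ratio drops below 1 as soon as some students repeat 9th grade, and it can climb past 1 when enough students transfer in.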

Smoothing down the 9th-grade bump and year-to-year jiggles

I've been returning to the issue of measuring attainment (my Holy Grail, I suppose), and there are a few ways I've thought to improve on the CPI. Two obvious ones are to smooth the data and remove 9th grade as an issue:

  • Use three years of enrollment and diploma data, to smooth over single-year bumps and reporting problems.
  • Start in 8th grade and use a two-year 10th-to-8th grade enrollment ratio as the first term in CPI.

I tried that on state-level data from the beginning of the Common Core of Data to the latest year (1986-2005), and the effect of smoothing is what I had hoped: the measures of central tendency for each state are similar, but the "bumpiness" of the data is dramatically reduced. And if one looks at data within each state over the entire time series, state medians for the Swanson CPI are highly correlated with state medians for the smoothed measure (r2=.92, N=51). By starting with 8th grade, state medians for CPI tend to rise between 3% and 8%. That's not surprising, since skipping 9th grade takes the retention bump out of the first term.
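For readers who want to play with the two tweaks, here is a minimal sketch. It assumes the usual reading of CPI as a product of three grade-to-grade enrollment ratios and a diplomas-to-12th-grade ratio, and it assumes that "three years of data" means pooling (summing) numerators and denominators over three consecutive years. The year indexing, the data layout, and all names are assumptions of the sketch, not the exact calculation used for the figures above.

```python
from math import prod

# Each step is (from_grade, to_grade, lag_in_years).
SWANSON_STEPS = [(9, 10, 1), (10, 11, 1), (11, 12, 1)]
SKIP_NINTH_STEPS = [(8, 10, 2), (10, 11, 1), (11, 12, 1)]  # two-year 8th-to-10th term


def cpi(enroll, diplomas, year, steps=SWANSON_STEPS, window=1):
    """Product of pooled promotion ratios and a pooled diploma ratio.

    enroll[(y, grade)] -> fall enrollment in school year y (hypothetical layout)
    diplomas[y]        -> diplomas awarded at the end of school year y
    window             -> number of consecutive base years pooled together
    """
    years = range(year, year + window)
    terms = []
    for from_grade, to_grade, lag in steps:
        # Pool the later-grade counts and the earlier-grade counts separately,
        # then take one ratio, so a single odd year carries less weight.
        later = sum(enroll[(y + lag, to_grade)] for y in years)
        earlier = sum(enroll[(y, from_grade)] for y in years)
        terms.append(later / earlier)
    # Final term: diplomas relative to 12th-grade enrollment in the same years.
    terms.append(sum(diplomas[y] for y in years)
                 / sum(enroll[(y, 12)] for y in years))
    return prod(terms)
```

A single-year Swanson-style figure would then be `cpi(enroll, diplomas, 2000)`, and the smoothed, skip-9th-grade variant would be `cpi(enroll, diplomas, 2000, steps=SKIP_NINTH_STEPS, window=3)`, given enrollment counts through the fall of 2004.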

Reframing CPI

Then I had another thought: what if we looked at grade level not as a proxy for the time in high school but as a set of gateways, requirements to meet on the path to graduation? Then the concepts behind the CPI terms could be thought of as a standard probability problem. Through a few razzle-dazzle maneuvers, I snatched some cross-sectional data from CCD, took the natural log of everything, and tossed it through regression to see if the cross-sectional data could predict the smoothed CPI, at least with the state-level data. Here's the result (with the regression prediction on the X axis and the smoothed, skip-9th-grade CPI estimate on the Y axis):


The previous graph has the same points but a more ambiguous indication of r2. For the geeks, I ran the regression on the logs of everything (there's a clear reason tied to the background for all this), and the top r2 refers to that regression. But you can also look at the translation back into percentages/ratios, and the second r2 is for the plotted graph. In this case, they're virtually identical. N=867 (17 x 51). Pretty snazzy, eh? No, I'm not releasing the details. Not until it's been highly vetted...
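Since the actual model is under wraps, the following is only a generic sketch of how one regression can yield two r2 values. One common motivation for logging everything, which may or may not be the one at work here, is that a product of ratios becomes a sum of logs, and a sum is what linear regression is built for. The sketch uses a single hypothetical predictor and made-up function names; it is not the model behind the graph.

```python
from math import exp, log


def _mean(values):
    return sum(values) / len(values)


def fit_line(x, y):
    """Ordinary least squares for y = intercept + slope * x (one predictor)."""
    mx, my = _mean(x), _mean(y)
    sxx = sum((a - mx) ** 2 for a in x)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    slope = sxy / sxx
    return slope, my - slope * mx


def r_squared(a, b):
    """Squared Pearson correlation between two equal-length sequences."""
    ma, mb = _mean(a), _mean(b)
    saa = sum((v - ma) ** 2 for v in a)
    sbb = sum((v - mb) ** 2 for v in b)
    sab = sum((v - ma) * (w - mb) for v, w in zip(a, b))
    return sab ** 2 / (saa * sbb)


def log_fit_r2(predictor, smoothed_cpi):
    """Fit on the log scale; report r2 on the log scale and on the ratio scale."""
    lx = [log(v) for v in predictor]
    ly = [log(v) for v in smoothed_cpi]
    slope, intercept = fit_line(lx, ly)
    pred_log = [intercept + slope * v for v in lx]
    pred_ratio = [exp(v) for v in pred_log]  # translate back to ratios
    # With an intercept in the model, the first value matches the usual
    # regression R-squared; the second is what a scatterplot of predicted
    # against observed ratios would show.
    return r_squared(ly, pred_log), r_squared(smoothed_cpi, pred_ratio)
```

The two numbers need not agree, because exponentiating stretches errors unevenly across the range. When the fit is tight and the ratios sit in a narrow band, they tend to land close together.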

And I'm not going to break out the champagne, either (especially after a bit of embarrassment with the Excel glitch). At the lower end of the range for states, the prediction underestimates the smoothed CPI, and there are no guarantees how it'll perform at that low range. The different points for each "year" are not truly independent within a state, since we're working with multiple years of data (closely related to a moving average). And, as noted above, student migration can easily bias CPI, leading to CPIs above 100% with substantial migration.

States don't hit extremely low levels of graduation or high levels of migration/transfers, but districts do. I took California districts with average enrollment in grades 8-12 over several recent years of at least 3,333, removed a few elementary- or high-school-only districts (California has that odd combination) as well as others with some data anomalies, and ended up with 127 districts that make up 59% of the 8th-12th enrollment over the years in question earlier this decade. I snagged the same cross-sectional elements from the CCD. How well does that idea work (the following graph was before fixing the spreadsheet formula error)?


That was clearly not nearly as nice as the state-level picture. A few things are important to note here: larger aggregations tend to look different in any statistical analysis, and you'll see here a broader range of smoothed CPI predictions and estimates, including the dreaded and improbable over-100% measure.

But there was an error in the spreadsheet (caused by copying a column instead of a formula). Here's the new graph:



Much better, no? Again, I've noted both the r2 for the regression model and r2 for the translated figure, a little lower than with states and not quite as close. As before, smaller entities are going to have broader variation, but I'm a little more encouraged. Yes, I have a few ideas on how to attack that over-100% CPI, but I think that's enough for saving the world today. I have chauffeuring and journal editing to do in the next few hours... (the driving is done but I'll see if I have a bit more energy tonight)

One last point: The implications of this are a bit subtle, apart from the utility of smoothing the data with multiple years and starting with 8th grade. The ability to connect a few cross-sectional data elements tightly to the synthetic CPI does not mean that CPI is without flaws but rather that the roots of any bias in CPI are at least parallel to and probably identical to the biases in the cross-sectional data. If you slide up and down the potential biases in the cross-sectional elements, you also slide up and down the biases for the CPI.

Posted in Education policy on March 13, 2008 3:11 PM