June 17, 2006

Longitudinal student database glitches

It's rare that you can combine SAS (stats package) geekery and education policy analysis, so I have to take advantage of this opportunity. This morning, I had a discussion with a staff member of the Florida Department of Education, specifically in its data-warehouse unit. Very kindly, the staff in the data warehouse allow state researchers to use the data (once identifying information such as the real school ID, name, etc., is removed). Over the past 17 months, I've been playing off and on with several data sets they sent me from the 1999-2000 and 2000-01 school years, as I've been tinkering with my ideas for measuring graduation and other attainment indicators. Someone pointed out a while back that the enrollment numbers I was working with for 1999-2000 were a chunk smaller than the next year's (more than 10% smaller). That's embarrassing! I finally did some follow-up (checking through the monthly figures) and discussed this with my acquaintance in the FDOE.

I learned today that the data set I was using (what's called the attendance file, which has enrollment and disenrollment dates) is not what the FDOE uses. For their annual enrollment count, they use a database of students uploaded by each district, covering those enrolled during the relevant week (e.g., Oct. 11-15, 1999). But this enrollment file doesn't have dates of entry and doesn't always have an exit date. And the attendance file (the one I was using) isn't as reliable as the enrollment file, according to my informant. Practically speaking, after I merge the two sets of data, I'm left with student records for which I sometimes have enrollment/disenrollment dates and codes and sometimes don't.
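
For the record-keeping-minded, the merge itself is a plain match-merge by the anonymized student ID. A minimal sketch, with hypothetical data set and variable names (the FDOE's actual field names differ):

    /* Match-merge the weekly enrollment file with the attendance
       file by (anonymized) student ID. Names are hypothetical. */
    proc sort data=enroll; by student_id; run;
    proc sort data=attend; by student_id; run;

    data merged;
       merge enroll(in=in_enroll) attend(in=in_attend);
       by student_id;
       has_enroll = in_enroll;  /* in the official enrollment count */
       has_attend = in_attend;  /* has (maybe) entry/exit dates     */
    run;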

My strategy is fairly simple at this point: after merging the data, I impute one of the endpoints of the enrollment interval and then impute the enrollment length. Because of the structure of the data (monotone, for my readers who know multiple imputation: the variables can be ordered so that each one's missingness nests inside the next's, which lets you impute them in sequence), I'm first imputing the withdrawal date and then the length of enrollment. I'm too tired to follow up with the analysis tonight, so that will wait until tomorrow.
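
A handy way to verify that monotone claim before committing to it: PROC MI with NIMPUTE=0 reports the missing-data patterns without doing any imputation. (Variable names here are hypothetical stand-ins.)

    /* Display the missing-data patterns only; no imputation is
       performed when NIMPUTE=0. Variable names are hypothetical. */
    proc mi data=merged nimpute=0;
       var grade birth_yrmo lunch withdraw_date enroll_length;
    run;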

But the gaps in enrollment coverage should give pause to anyone who thinks building a longitudinal database of individual-student records is at all easy. Florida has been at this longer than anyone (I think a few years longer than Texas), and we still have problems. Essentially, the data is split among many tables, and key information is entered by poorly-paid data-processing clerks at each school without significant edit checks in the software. Sometimes that leads to records that are just silly: I found a few individuals whose records show they were born before 1900 or after 2000, including one child born in 2027 whose enterprising parents or grandparents enrolled her or him about a quarter-century before her or his birth. Now, that problem could be solved with a simple software check on dates, including an explicit question along the lines of "The data you entered indicate that this student is X years old. Is that correct?" Others are harder: as my acquaintance told me, records that should be uploaded (attendance records for students who are enrolled in a school) sometimes aren't.
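
Even an after-the-fact screen in SAS would catch the silliest cases. A minimal sketch, with hypothetical names and an age range I picked for illustration:

    /* Flag records with implausible ages for the 1999-2000 year;
       a real entry system would prompt the clerk to confirm
       instead. Names and cutoffs are hypothetical. */
    data suspect;
       set attendance;
       age_in_fall = 1999 - birth_year;
       if age_in_fall < 3 or age_in_fall > 21 then output;
    run;

    proc print data=suspect(obs=20);
       var student_id birth_year age_in_fall;
    run;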

And that doesn't touch the questions of auditing the withdrawal codes (how do we know someone showed up at another school when they said they were transferring, not dropping out?) or anything that touches on the longitudinal record of achievement. Please remember that Florida is one of the best-case scenarios for data integrity, as there's considerable investment in this data in terms of infrastructure, training, and an incremental approach to adding elements. Even with that, it's clunky and prone to errors, errors that might appear small but affect everything we have come to assume about schools (i.e., the official statistics).

Update: I forgot the SAS geeking. Last night I discovered PROC MI and PROC MIANALYZE, two procedures that make the type of multiple imputation Rubin's (1987) book describes much easier. I realized this morning that I had made an error in the merging by including records for which one of the other variables clearly indicated the student had not attended, so there was a spurious set of rare cases with withdrawal but not entry dates. Removing those cases means that the entry and withdrawal dates are now missing for exactly the same records. Technically, I can impute either variable first and then impute the length of enrollment. (Quick logic puzzle for the reader: why wouldn't I just want to impute the two dates independently?)
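
For anyone who hasn't seen the pair in action: PROC MI stacks the imputed data sets with an _Imputation_ index, you run your analysis by that index, and PROC MIANALYZE combines the results using Rubin's rules. A minimal sketch, with hypothetical names:

    /* Estimate mean enrollment length within each imputed data
       set, then combine estimates across imputations. */
    proc means data=imputed noprint;
       by _imputation_;
       var enroll_length;
       output out=ests mean=mean_len stderr=se_len;
    run;

    proc mianalyze data=ests;
       modeleffects mean_len;
       stderr se_len;
    run;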

The other information I have: school (and county), school year, race, gender, ethnicity, birth year and month, lunch-program participation, and grade (first grade, second, etc.). Obviously, the imputation has to be done separately by year (otherwise I might have starting and ending dates in the wrong academic year), and I could have separate imputations by county. I'm using predictive mean matching for the endpoint date (to avoid dates that are beyond the ends of the school year; I'm so glad my campus's version of SAS has that option), and I'm not sure whether to use predictive mean matching or straight regression for the interval. The obvious thing is to try it both ways and see whether it makes a difference.
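
In PROC MI terms, the by-year setup looks roughly like this; REGPMM is the predictive-mean-matching method in recent SAS releases, and whether the interval gets REGPMM or REG is exactly the open question. Names are hypothetical:

    /* Impute within each school year so no imputed date lands in
       the wrong academic year; REGPMM draws from observed values,
       keeping imputed dates inside the observed range. */
    proc sort data=merged; by school_year; run;

    proc mi data=merged out=imputed nimpute=5 seed=20060617;
       by school_year;
       monotone regpmm(withdraw_date)  /* PMM keeps dates in range */
                reg(enroll_length);    /* or regpmm; trying both   */
       var grade birth_yrmo lunch withdraw_date enroll_length;
    run;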

Further update: Oh, rats. Imputing dates doesn't work, because either regression or predictive mean matching gives me dates that are about 250 days apart (give or take a few days) no matter what, since the vast majority of students are in the same school for the whole academic year. That produces less variation than I'd get by calculating the variable of interest (was the person in school on day X in that year?) and imputing that variable directly, so I'm going with the direct approach. But the nasty bit is that a lower proportion of 1999-2000 records than 2000-01 records needs such imputation, and imputation only fills in fields on records I have; it can't conjure up records that were never uploaded. So this doesn't take care of the undercoverage in the 1999-2000 record set. Yes, it's a problem.
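
Imputing the indicator directly is straightforward with MONOTONE LOGISTIC, assuming a SAS release whose PROC MI supports the CLASS statement. Names are hypothetical:

    /* Impute "was this student enrolled on day X?" directly as a
       binary variable instead of backing it out of imputed dates. */
    proc mi data=merged out=imputed nimpute=5 seed=20060617;
       by school_year;
       class enrolled_dayx;
       monotone logistic(enrolled_dayx);
       var grade birth_yrmo lunch enrolled_dayx;
    run;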

Posted in Education policy on June 17, 2006 12:24 AM