December 26, 2004

Thank you, Steven Ruggles

Sometimes, there are ways to conduct research that would be impossible without the internet. In the last few days, I've culled key data sets to get a better picture of 20th-century graduation and educational attainment than I was able to put in Creating the Dropout (1996), all from data that anyone can simply download from the Integrated Public Use Microdata Series (IPUMS) project at the University of Minnesota.

Let me focus a bit on what I produced this evening, in a few hours. I've been struggling for years with how to put together a decent portrait of high-school graduation. For my dissertation and first book, I spent months getting access to public use microdata samples on mainframes, programming them, and waiting for the results, often for hours late at night in my first apartment. Looking for possible new or arcane techniques was fairly painstaking.

This week, while looking again at some mid-1980s techniques I've been pondering for about a year, I did a "citation search" to see who had cited a key article from 1985. Lo and behold, I discovered the following:

Carl P. Schmertmann, “A Simple Method for Estimating Age-Specific Rates from Sequential Cross-Sections,” Demography 39 (2002): 287-310.

Within a few minutes, I had found a copy through my library's electronic subscriptions, downloaded it, and puzzled out the key points. Then I went to IPUMS, downloaded census data from 1940 through 1980 (I'll need to get 1990 and 2000 separately to get the right education variables), and took a first stab. Then, tonight, I turned to the Current Population Surveys done every year in March, which IPUMS now has available from 1962. Except for 1963, there is an educational attainment question for everyone 15 and up. That's enough for me to take about 2 million cases, put them in a data set, get some simple summary measures by survey year and age, and then turn them into the following graph:

Graph of synthetic-cohort graduation probabilities at 18, 19, and 20 years old, 1962-2003

There are a number of things I need to check here, from the problems of estimating exact-age proportions by averaging the proportions in the surrounding intervals to the assumptions made by lumping GEDs and regular diplomas together. But at first glance, it appears that these data confirm my previous claims that high school graduation has plateaued since 1970, and that people are now graduating, on average, a little later in their teens.
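For readers curious about the mechanics, here is a minimal sketch of the two simplest steps mentioned above: the weighted proportion of graduates by survey year and age, and the rough exact-age estimate made by averaging the two surrounding one-year intervals. This is not Schmertmann's method, and it is not the code I actually ran. The record layout (survey year, age in completed years, a graduate flag that lumps GEDs with regular diplomas, and a person weight) and both function names are assumptions made up for illustration.

```python
from collections import defaultdict


def proportions_by_year_age(records):
    """Weighted share of graduates in each (survey_year, age) cell.

    `records` is an iterable of hypothetical tuples:
    (survey_year, age_in_completed_years, is_graduate, weight).
    """
    totals = defaultdict(float)
    grads = defaultdict(float)
    for year, age, is_graduate, weight in records:
        totals[(year, age)] += weight
        if is_graduate:
            grads[(year, age)] += weight
    # Skip empty cells so we never divide by zero.
    return {key: grads[key] / totals[key] for key in totals if totals[key] > 0}


def exact_age_proportion(props, year, exact_age):
    """Approximate the proportion graduated at an exact age.

    Someone recorded as age 17 is somewhere in [17, 18), and someone
    recorded as 18 is in [18, 19). Exact age 18 is the boundary between
    the two intervals, so this averages the two cell proportions.
    Returns None when either cell is missing.
    """
    below = props.get((year, exact_age - 1))
    above = props.get((year, exact_age))
    if below is None or above is None:
        return None
    return (below + above) / 2
```

With a real extract, each person record would be mapped into one of these tuples first, which is where the recoding of the education variable comes in. The averaging step is exactly the approximation I flagged above as needing a check.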

Now, to summarize how this five-hour analysis was possible: The federal government gave IPUMS money to make the data available to researchers all over the globe. I set up my data extract in about 90 seconds, waited about 2 minutes for the extract to be prepared, and waited another 3 minutes on the download. Processing the data and setting up the graph then took about 2 hours of work all told. The longest step on my laptop was waiting for the computer to read the raw data, about 90 seconds. There are other things I'm not explaining, such as the recoding of variables, but the larger point is that many of the tasks that once took months and caused enormous frustration are gone, letting me focus on the key issues that matter substantively.

No, I don't expect this graph to appear as is. This is, after all, a very first draft of work. But it's enormously fun to get this far this quickly on something.

Oh, and Steven Ruggles? He's the head of the IPUMS group, one of those changing how research gets done—and done more easily—with the internet.

Posted in Research on December 26, 2004 11:53 PM