Simpson’s Paradox Hides NAEP Gains (Again)

By Chad Aldeman | 05/01/2015


Is the education space mature enough to handle NAEP tests every two or four years? I’m not so sure.

NAEP is the “Nation’s Report Card.” It takes a representative sample of United States students and tests them in reading, math, social studies, science, the arts, etc. There are several versions of NAEP intended to sample different groups of students–the nation as a whole, individual states, or large cities–but its overall goal is to provide citizens a snapshot of how we’re doing as a country.

The United States is a big country, and it takes a long time to move the needle on student achievement scores. Depending on the subject and sample, NAEP releases test results every two or every four years. When those scores come out, they almost always look flat. Once this same “flat” result gets repeated over and over, that starts to seep into our collective consciousness about how American students are doing.

But that’s the wrong way to look at it. From a long-term perspective, the achievement levels of American students are at or near all-time highs. Some groups of students are doing particularly well. The achievement scores of black, Hispanic, and low-income students have increased dramatically.

Because NAEP takes a representative sample, it’s also vulnerable to something called Simpson’s Paradox, a mathematical paradox in which the composition of a group can create a misleading overall trend. As the United States population has become more diverse, a representative sample picks up more and more minority students, who tend to score lower overall than white students. That tends to make our overall scores appear flat, even as all of the groups that make up the overall score improve markedly.
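The composition effect described above can be made concrete with a small back-of-envelope calculation. The numbers below are invented for illustration (they are not real NAEP figures): two subgroups both gain points between two test years, yet the population average barely moves because the population mix shifts toward the lower-scoring subgroup.

```python
# Hypothetical illustration of Simpson's Paradox with made-up NAEP-style
# scores. Each group is a (population share, average score) pair.

def overall(groups):
    """Population average: the share-weighted mean of subgroup scores."""
    return sum(share * score for share, score in groups)

# Invented numbers: the higher-scoring group shrinks from 80% to 60% of
# the sample, while BOTH groups' scores rise (+5 and +8 points).
year_1 = [(0.8, 280), (0.2, 250)]
year_2 = [(0.6, 285), (0.4, 258)]

print(round(overall(year_1), 1))  # 274.0
print(round(overall(year_2), 1))  # 274.2 -- "flat" despite subgroup gains
```

Every subgroup improved, but the overall average moved only 0.2 points: the demographic shift almost exactly offset the within-group gains. That is the mechanism the article attributes to the flat-looking national trend.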

Recent NAEP results in history, geography, and civics illustrate this trend once again. Education Week reported that scores were “flat” from 2010 to 2014. That’s mostly true–the scores were all higher than in 2010 but didn’t meet the standard for statistical significance. But scores are up over longer periods of time. Here are the gains since 2001 on geography (* signifies statistically significant):

• All students: +1
• White students: +4*
• Black students: +7*
• Hispanic students: +9*
• Students with disabilities: +8*
• English Language Learners: +7

Here are the gains since 2001 on history:

• All students: +7*
• White students: +9*
• Black students: +11*
• Hispanic students: +17*
• Students with disabilities: +15*
• English Language Learners: +12*

And here are the gains since 1998 on civics (civics offers a slightly longer run of comparable data):

• All students: +3*
• White students: +6*
• Black students: +6
• Hispanic students: +14*
• Students with disabilities: +13*
• English Language Learners: +14*

A few things jump out from these longer-term results. First, overall scores are up a little bit, but particular groups of students are making big gains. One rule of thumb suggests that 10-15 points on the NAEP translates into one grade level. Applying that here, scores for most groups of students have improved by roughly a full grade level over the last 15 years or so. Second, achievement gaps are closing as lower-performing groups are catching up to higher-performing ones. Third, Simpson’s Paradox makes the overall scores look relatively “flat.” Don’t let that mislead you. Although we might wish for faster progress, American achievement scores are rising.

– Chad Aldeman

This first appeared on Ahead of the Heard

Comment on this article
  • Jay P. Greene says:

It is not appropriate to explain away the lack of aggregate progress in academic achievement by referencing Simpson’s Paradox and disaggregating results by racial/ethnic group. I explained this misuse of Simpson’s Paradox in a blog post a few years ago. See

    Here is a taste of the argument:

    “the unstated argument behind the use of Simpson’s Paradox to explain the lack of educational progress [is that] minority students are more difficult to educate and we have more of them, so holding steady is really a gain.

    The problem with this is that it only considers one dimension by which students may be more or less difficult to educate — race. And it assumes that race has the same educational implications over time. Unless one believes that minority students are more challenging because they are genetically different [which I do not think Chad believes], we have to think about race/ethnicity differently over time as the host of social and economic factors that race represents changes. Being African-American in 1975 is very different from being African-American in 2008. (Was a black president even imaginable back then?) So, the challenges associated with educating minority students three decades ago were almost certainly different from the challenges today.

    If we want to see whether students are more difficult to educate over time, we’d have to consider more than just how many minority students we have. We’d have to consider a large set of social and economic variables, many of which are associated with race. Greg Forster and I did this in a report for the Manhattan Institute in which we tracked changes in 16 variables that are generally held to be related to the challenges that students bring to school. We found that 10 of those 16 factors have improved, so that we would expect students generally to be less difficult to educate.” See

  • Sandy Kress says:

    I encourage folks to read my two comments on Ahead of the Heard. What’s most interesting in these data is not that there was some improvement over the 20 years, but that there was a flat-up-flat trend over the period.

    Chad is right: the results on a disaggregated basis have improved over 20 years. But it’s not a straight line. It was basically a flat line from the 90s to 2001. And, while there’s been some spotty improvement from 2010 to 2014, it’s been mostly flat in this most recent period, too.

The big gains were from 2001 to 2010. Was this due to the powerful force of consequential accountability, the movement that Hanushek identifies as beginning in the states in the late 90s and that was deepened and extended nationally by NCLB? There is no clear proof of it. Yet, it’s worth evaluating.

    My hypothesis is that this may have been so. Whether causal or correlational, here’s yet another piece of data that “something happened” in the late 90s and early-mid 2000s that corresponds with the best improvement we’ve seen in student achievement in decades.

  • Andrew J. Coulson says:

    If only there were a “Long Term Trends” version of the NAEP, to really get a handle on the, um, long term trends. And if only it tested kids near the end of high school–say when they’re 17! Boy wouldn’t that be helpful! But what a shame if it showed essentially a flat line across subjects not just for the aggregate student population but for the white subgroup in particular (who still make up the majority of test takers). Nothing paradoxical about that.
