The Weak Predictive Power of Test Scores

The school choice tent is much bigger than it used to be. Politicians and policy wonks across the ideological spectrum have embraced the principle that parents should get to choose their children’s schools and local districts should not have a monopoly on school supply.

But within this big tent there are big arguments about the best way to promote school quality. Some want all schools to take the same tough tests and all low-performing schools (those that fail to show individual student growth over time) to be shut down (or, in a voucher system, to be kicked out of the program). Others want to let the market work to promote quality and resist policies that amount to second-guessing parents.

In the following debate, Jay Greene of the University of Arkansas’s Department of Education Reform and Mike Petrilli of the Thomas B. Fordham Institute explore areas of agreement and disagreement around this issue of school choice and school quality. In particular, they address the question: Are math and reading test results strong enough indicators of school quality that regulators can rely on them to determine which schools should be closed and which should be expanded—even if parental demand is inconsistent with test results?

To a very large degree, education reform initiatives hinge on the belief that short-term changes in reading and math achievement test results are strong predictors of long-term success for students. We use reading and math test scores to judge the quality of teachers, schools, and the full array of pedagogical, curricular, and policy interventions. Math and reading test scores are the yardstick by which education reform is measured. But how good a yardstick is it?

Despite the centrality of test scores, there is surprisingly little rigorous research linking them to the long-term outcomes we actually care about. The study by researchers from Harvard and Columbia (Chetty et al.) showing that teachers who increase test scores improve the later-life earnings of their students is a notable exception to the dearth of evidence on this key assumption of most reform initiatives. But that is one study, it has received some methodological criticism (although I think that has been addressed to most people’s satisfaction), and its results from low-stakes testing may not apply to the high-stakes purposes for which we would now like to use them. This seems like a very thin reed on which to rest the entire education reform movement.

In addition, we have a growing body of rigorous research showing a disconnect between improving test scores and improving later-life outcomes. I’ve written about this at greater length elsewhere (see here and here), but we have eight rigorous studies of school choice programs in which the long-term outcomes of those policies do not align with their short-term achievement test results. In four studies, charter school programs that produce impressive test score gains appear to yield little or no improvement in educational attainment. In three studies of private school choice programs and one study of a charter school choice program, we observe large benefits in educational attainment and even earnings but little or no gain in short-term test scores.

If policy analysts and the portfolio managers, regulators, and other policy makers they advise were to rely primarily on test scores when deciding which programs or schools to shutter and which to expand, they would make some horrible mistakes. Even if we ignore the fact that most portfolio managers, regulators, and other policy makers rely on the level of test scores (rather than gains) to gauge quality, math and reading achievement results are not particularly reliable indicators of whether teachers, schools, and programs are improving later-life outcomes for students.

What explains this disconnect between math and reading test score gains and later-life outcomes? First, achievement tests are only designed to capture a portion of what our education system hopes to accomplish. In particular, they are not designed to measure character or non-cognitive skills. A growing body of research is demonstrating that character skills like conscientiousness, perseverance, and grit are important predictors of later-life success (see this, for example). And more recent research by Matt Kraft, Kirabo Jackson, and Albert Cheng and Gema Zamarro (among others) shows that teachers, schools, and programs that increase character skills are not necessarily the same as those that increase achievement test results. There are important dimensions of teacher, school, and program quality that are not captured by achievement test results. Second, math and reading achievement tests are not designed to capture what we expect students to learn in other subjects, such as science, history, and art. Prioritizing math and reading at the expense of other subjects that may be important for students’ later-life success would undermine the predictive power of those math and reading results. Third, many schools are developing strategies for goosing math and reading test scores in ways that may not contribute to (and may even undermine) later-life success. The fact that math and reading achievement results are overly narrow and easily distorted makes them particularly poor indicators of quality and weak predictors of later-life outcomes.

I do not mean to suggest that math and reading test results provide us with no information or that we should do away with them. I’m simply arguing that these tests are much less reliable indicators of quality than most policy analysts, regulators, and policy makers imagine. We should be considerably more humble about claiming to know which teachers, schools, and programs are good or bad based on an examination of their test scores. If parents think that certain teachers, schools, and programs are good because there is a waiting list demanding them, we should be very cautious about declaring that they are mistaken based on an examination of test scores. Even poorly educated parents may have much more information about quality than analysts and regulators sitting in their offices looking at spreadsheets of test scores.

I also do not mean to suggest that policy makers should never close a school or shutter a program in the face of parental demand. I’m just arguing that it should take a lot more than “bad” test scores to do that. Yes, parents can and will make mistakes. But analysts, authorizers, regulators, and other policy makers also make mistakes, especially if they rely predominantly on test results that are, at best, weak predictors of later-life success. The bar should be high before we are convinced that the parents are mistaken rather than the regulators poorly guided by test scores. Besides, we should prefer letting parents make mistakes for their own children over distant bureaucrats making mistakes for hundreds or thousands of children while claiming to protect them.

– Jay Greene

This first appeared on Flypaper.
