Last week, Mike Petrilli, President of the Thomas B. Fordham Institute, published a series of blog posts at the Education Gadfly and Education Next critiquing an AEI study by Dr. Collin Hitt, Dr. Michael McShane, and me that discusses the surprising disconnect between the achievement and attainment effects of school choice programs in the US. Throughout the week, Mike committed the Ecological Inference Fallacy, engaged in specification searching, generalized from outlier cases, and declared that we must prove a negative.
Before I begin, I want to make it clear that I consider Mike to be a friend. I am confident that Mike’s heart was in the right place when he struck at us for releasing a report containing the inconvenient (for Mike) truth that test score effects only weakly predict attainment effects in school choice studies. He wants to defend the practices that he thinks benefit children, including the practice of closing schools of choice with low test scores. That’s fine, but in doing so it is only fair to require that his major criticisms of our work be appropriate. They are not.
First, some background. In our study we draw upon the findings from 24 evaluations of various types of school choice programs to show that the achievement effects from those programs are only weakly and inconsistently predictive of their subsequent attainment effects. We use a “vote counting” form of meta-analysis to classify both the achievement and attainment impacts in each study as positive and statistically significant, positive and not statistically significant, negative and not statistically significant, or negative and statistically significant. Then we systematically compare the codes of the achievement findings with the codes of the attainment findings within distinct groups of studies.
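For readers who want to see the mechanics, here is a minimal sketch, in Python, of the four-way coding scheme described above. It is illustrative only; the function name, significance threshold, and example numbers are mine, not taken from the report.

```python
# Minimal sketch of the vote-counting classification: each reported effect is
# coded by its sign and by whether it is statistically significant at a chosen
# alpha. Names and numbers here are illustrative, not from the AEI report.

def code_effect(estimate: float, p_value: float, alpha: float = 0.05) -> str:
    """Return one of the four vote-counting categories for a single finding."""
    significant = p_value < alpha
    if estimate >= 0:
        return "positive, significant" if significant else "positive, not significant"
    return "negative, significant" if significant else "negative, not significant"

# Example: a positive achievement effect with p = .12 is coded
# "positive, not significant" and would then be cross-tabulated against the
# code assigned to the same study's attainment finding.
print(code_effect(0.08, 0.12))
```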
In our six formal statistical tests of the hypothesis that school choice test score impacts reliably predict future attainment impacts, five of the tests do not support the hypothesis. Only test score effects in English Language Arts (ELA) are significantly predictive of any attainment results, and then only of a small set of college completion findings, not of the much larger set of college enrollment or high school graduation results. In a conference paper presented at Harvard University’s Program on Education Policy and Governance on April 19, we subject our initial findings to a variety of robustness tests, all of which they pass.
Due to this general disconnect between achievement and attainment effects of choice programs and, in a few cases in our sample, individual choice schools, we caution commentators and regulators to be more humble and circumspect in judging school choice programs and schools of choice based solely on their test score effects.
In Mike’s second post criticizing our study he claims that the test score effects of choice programs do reliably and positively predict their attainment effects because, after throwing out some cases (I’ll get to that later), “both short-term test scores and long-term outcomes are overwhelmingly positive.” Voila, our claim is destroyed, Mike suggests.
As legendary ESPN commentator Lee Corso would say, “Not so fast.” Mike committed the classic Ecological Inference Fallacy. An ecological inference assumes that because two frequencies drawn at the aggregate level from the same sample look similar, the underlying factors must be connected at the individual level. The fallacy is in assuming that the majority of choice studies reporting positive effects on test scores is the same majority of choice studies reporting positive effects on attainment. We demonstrate in our analysis that they are not.
Here is a hypothetical that demonstrates the fallacy of Mike’s claim. Say we want to evaluate the quality of our meteorologist here in Northwest Arkansas. One way to do that would be to see how many times he predicted that it would rain, and how many days it actually did rain. Using Mike’s method, at the end of the year we would simply count up the number of days that he said it would rain and the number of days that it did rain. If he said it would rain 137 days and it rained 137 days, we’d say he did a perfect job of predicting the weather.
Do you see the problem? We don’t actually care about the aggregate total, we care if it rained when he said it would. According to Mike’s methodology, if the meteorologist said that it would rain on Monday, but it didn’t, and instead rained on Tuesday, he would score that as perfect. One prediction of rain, one day of rain. Our methodology would only score it as correct if it actually rained on Monday. This logic is why counting the aggregate totals led Mike astray.
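To make that concrete, here is a toy illustration in Python, using entirely made-up forecasts rather than any real data, of how aggregate totals can match perfectly while the day-by-day predictions are all wrong:

```python
# Toy illustration of the Ecological Inference Fallacy: the forecaster predicts
# rain on Monday only, but it rains on Tuesday only. The data are hypothetical.

predicted_rain = [True, False]   # forecast: rain Monday, dry Tuesday
actual_rain    = [False, True]   # reality: dry Monday, rain Tuesday

# Aggregate comparison: one predicted rainy day, one actual rainy day --
# by this standard the forecaster looks perfect.
print(sum(predicted_rain), sum(actual_rain))   # prints: 1 1

# Day-level comparison: how often did the forecast match the weather?
hits = sum(p == a for p, a in zip(predicted_rain, actual_rain))
print(f"{hits} of {len(actual_rain)} days predicted correctly")   # 0 of 2
```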
Second, Mike argues that we use an improperly expansive definition of school choice programs. We define a school choice program as any arrangement whereby parents or students themselves select the school to attend instead of being assigned to a specific school by residence. Our definition is standard in the social science literature and largely mirrors the definition of school choice provided by Wikipedia:
School choice is a term for K–12 public education options in the United States, describing a wide array of programs offering students and their families alternatives to publicly provided schools, to which students are generally assigned by the location of their family residence.
Note the emphasis on “wide array of programs” in the conventional definition of school choice. Mike claims that we committed a “fatal flaw” in our analysis by including studies of early college high schools, selective enrollment high schools (i.e., magnet schools), and career/technical education schools of choice. Mike declares that none of these arrangements are school choice programs, even though students attend the schools exclusively by choice, because, in his opinion, choosing the school is not a key element of the program. He also argues that vocational education (a.k.a. Career/Technical Education) is a category where we might expect achievement effects to diverge from attainment effects. I show in the Harvard conference paper that, contrary to Mike’s claims, the connection between achievement and attainment findings actually is stronger in the Career/Technical Education studies than in the average school choice study.
Mike declares his restrictive definition of what is a school choice program “common-sense.” It certainly isn’t common, nor do I think it makes sense. We adopted the standard definition of school choice programs, developed by others, before we actually collected the studies and data, to guarantee that we weren’t engaging in “specification searching.”
What is “specification searching”? Paul Peterson and William Howell define specification searching as “rummaging theoretically barefoot through data in the hopes of finding desired results.” Generally, it involves modifying the content of your sample or the control variables in your statistical regression models until you stumble upon the combination that produces your “preferred” results. Because specification searchers actively seek a specific result until they find it, instead of testing for it scientifically, results from specification searches are almost always either random or contrived and seldom reliable.
Specification searching notwithstanding, Mike’s analysis has another problem. Our inclusive definition of school choice isn’t the “fatal flaw” that he says it is. Here is a summary of the main results from our study:
Associations between Specific Test Score & Attainment Results from School Choice Evaluations
| Association | N | Trace (%) | Pearson χ² | P-Value | Gamma | Asymptotic Standard Error |
| --- | --- | --- | --- | --- | --- | --- |
| ELA-HS Graduation | 34 | 38 | 5.84 | .76 | .20 | .22 |
| ELA-College Enrollment | 19 | 47 | 5.01 | .29 | .69 | .19 |
| ELA-College Graduation | 11 | 64 | 11.05 | .03 | .81 | .20 |
| Math-HS Graduation | 33 | 27 | 4.90 | .84 | -.12 | .24 |
| Math-College Enrollment | 18 | 39 | 0.69 | .95 | -.06 | .36 |
| Math-College Graduation | 11 | 45 | 3.18 | .53 | -.28 | .46 |
See our report for the details regarding the statistical measures we used. In order for us to conclude that school choice achievement results reliably predict subsequent attainment results, the P-value from the χ² (chi-square) test needs to be below .05 and the Gamma needs to be positive. We only observe those necessary conditions in one of the six cases, the ELA-College Graduation row, when ELA effects are used to predict college graduation effects. The other five tests of the hypothesis fail to confirm that test score outcomes are a reliable predictor of attainment outcomes in school choice studies.
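For readers who want to reproduce the flavor of these tests, the sketch below shows how a Pearson chi-square test and a Goodman-Kruskal gamma can be computed on a cross-tabulation of achievement codes by attainment codes. The 4x4 counts are hypothetical, the decision rule is the one stated above, and the code assumes NumPy and SciPy are available; it is not the code used for the report.

```python
# Hedged sketch: chi-square and Goodman-Kruskal gamma on a hypothetical 4x4
# cross-tabulation of achievement codes (rows) by attainment codes (columns),
# both ordered from "negative, significant" to "positive, significant".

import numpy as np
from scipy.stats import chi2_contingency

table = np.array([            # illustrative counts, not our data
    [2, 1, 0, 0],
    [1, 3, 2, 1],
    [0, 2, 4, 2],
    [0, 1, 2, 3],
])

# Pearson chi-square test of independence between the two sets of codes.
chi2, p_value, dof, _ = chi2_contingency(table)

# Goodman-Kruskal gamma = (concordant - discordant) / (concordant + discordant),
# counting pairs of cells that agree (or disagree) in the ordering of categories.
concordant = discordant = 0
n_rows, n_cols = table.shape
for i in range(n_rows):
    for j in range(n_cols):
        for k in range(i + 1, n_rows):
            for m in range(n_cols):
                if m > j:
                    concordant += table[i, j] * table[k, m]
                elif m < j:
                    discordant += table[i, j] * table[k, m]
gamma = (concordant - discordant) / (concordant + discordant)

# Decision rule used above: prediction is supported only if p < .05 AND gamma > 0.
print(f"chi2 = {chi2:.2f}, p = {p_value:.3f}, gamma = {gamma:.2f}")
print("supports prediction" if p_value < 0.05 and gamma > 0 else "does not support prediction")
```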
What happens when we exclude from our sample the school choice programs that Mike doesn’t think are school choice programs? Let’s take a look:
Associations between Specific Test Score & Attainment Results from School Choice Evaluations as Defined by Mike Petrilli
| Association | N | Trace (%) | Pearson χ² | P-Value | Gamma | Asymptotic Standard Error |
| --- | --- | --- | --- | --- | --- | --- |
| ELA-HS Graduation | 22 | 27 | 6.91 | .65 | .12 | .25 |
| ELA-College Enrollment | 8 | 50 | 8.07 | .09 | .50 | .50 |
| ELA-College Graduation | 2 | 50 | NA | NA | NA | NA |
| Math-HS Graduation | 21 | 14 | 9.82 | .37 | .01 | .25 |
| Math-College Enrollment | 7 | 43 | 7.78 | .10 | .27 | .55 |
| Math-College Graduation | 2 | 50 | NA | NA | NA | NA |
In short, the results from the Petrilli-defined set of school choice evaluations are even worse, from Mike’s standpoint, than the ones from our original analysis. None of the six tests yields statistical results consistent with achievement effects reliably predicting attainment effects. So, you shouldn’t truncate the sample in the way that Mike did, but even if you do, it doesn’t matter!
Mike also abandons any consideration of statistical significance and consolidates all positive findings together and all negative findings together, regardless of how random or reliable those individual results appear to be. Statistical significance is a major signal of the reliability of any results that emerge in the education field. When a positive effect of a program is not statistically significant, there is a decent chance that it really is zero or even negative. There is an unacceptable level of uncertainty regarding whether or not the effect really is positive. Negative and insignificant effects, similarly, could just as easily be zero or positive. We just don’t know.
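The point about insignificant positives is easy to see with a quick simulation. The sketch below, with made-up samples and an assumed true effect of exactly zero, shows how often a study would report a positive but statistically insignificant estimate even when nothing is going on; it assumes NumPy and SciPy and is purely illustrative.

```python
# Simulation: when the true effect is zero, small studies frequently produce a
# positive but statistically insignificant estimate, which is why insignificant
# positives should not be pooled with significant ones. Numbers are illustrative.

import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(0)
trials = 2000
positive_insignificant = 0
for _ in range(trials):
    treatment = rng.normal(0.0, 1.0, 50)   # true effect is exactly zero
    control = rng.normal(0.0, 1.0, 50)
    diff = treatment.mean() - control.mean()
    p = ttest_ind(treatment, control).pvalue
    if diff > 0 and p >= 0.05:
        positive_insignificant += 1

print(f"{positive_insignificant / trials:.0%} of zero-effect trials "
      "yield a positive but insignificant estimate")
```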
That is why we treat positive and statistically significant effects separately from positive and not statistically significant effects, and likewise for negative effects, throughout our analyses. To claim that one knows that positive insignificant effects are truly positive is mere hubris. One might quibble over where the line should be drawn in deciding if a finding is significant or not, but to combine statistically insignificant effects with statistically significant ones is to abandon any semblance of science in policy work. Mike does so, inappropriately, here. Combining his novel definition of what is and is not a school choice program with his ignoring of statistical significance is the only way he can arrive at the results he celebrates.
Those results favored by Mike are drawn from a small subgroup of the 126 achievement-attainment pairs in our study. In his third post criticizing our study Mike makes much out of the 8 school choice studies (as he defines school choice) with ELA results and college enrollment results and the 7 studies combining math effects and college enrollment effects. Mike thinks our conclusions should be drawn from 12% of the available data. As such, he is committing the fallacy of generalizing from outliers, also known as hasty generalization.
According to logicallyfallacious.com, hasty generalization is “Drawing a conclusion based on a small sample size, rather than looking at statistics that are much more in line with the typical or average situation.” I wrote about the danger of hasty generalization in one of my earliest scientific publications, on the subject of representative bureaucracy and schools.
A sample N of 126 is modest in size for drawing conclusions, especially since the observations are not all independent of each other. The ELA effects are related to the math effects, and the high school graduation, college enrollment, and college graduation effects are all related to each other if drawn from the data of a single study. Our sample is effectively fewer than 126 observations, but Mike wants to shrink it further, by excluding cases from the sample and excluding the outcomes of high school graduation (too squishy) and college graduation (too few cases). That leaves Mike with a hasty generalization from 15 cases.
Our largest sets of independent achievement-attainment pairings come from 34 cases of ELA results paired with high school graduation results and 33 cases of math effects paired with high school graduation effects. Mike argues, reasonably, that high school graduation effects can be unreliable because they are gameable, and we have concrete evidence that DC Public Schools has gamed them.
If the high school graduation effects of school choice programs in our study are unreliable, due to gaming, then we wouldn’t expect them to predict subsequent school choice attainment effects. High school graduation effects, however, are the single best predictor of future college enrollment effects of any group of pairings in our study. The relationship is strongly positive and statistically significant with 99% confidence. In short, there is no empirical justification for Mike to exclude the moderately large sample of 67 “achievement-to-high-school-graduation” pairings in our study in favor of the mere 15 “achievement-to-college-enrollment” pairings that he prefers. That is hasty generalization, from outlier cases.
In Mike’s fourth post criticizing our study he chastises us for generalizing the results from evaluations of entire school choice programs to regulator behavior regarding individual schools. First, some of the studies in our analysis, for example the Harlem Children’s Zone (HCZ), are of individual schools of choice. It is true, however, that most of them are of choice programs that span multiple schools. Second, our analysis has the advantage of drawing from statistical evaluations of programs, many of them with large samples, in order to test whether test score effects reliably predict attainment effects. At the individual school level, with a few exceptions such as the large HCZ, there are fewer data on school test score effects and attainment effects.
Mike claims that he, as a charter school authorizer, looks for multiple signs of poor performance before ordering a school of choice to be closed. I don’t doubt that, but I also take cold comfort from his assurance that all other charter school authorizers try to “be like Mike.” In fact, the National Association of Charter School Authorizers promotes “default closure” policies that automatically close schools based on low test score performance. Based on the evidence from our study, we wonder if those decisions aren’t premature.
Mike concludes his fifth and final post criticizing our study with the sentence: “Do impacts on test scores even matter? Yes, it appears they do. We certainly do not have strong evidence that they don’t.” There is only one way to interpret that sentence. Mike is claiming that the burden is on us to prove, conclusively, that test score effects are not associated with attainment effects in school choice evaluations, however defined.
The first rule of science is that you can’t prove a negative. The second rule of science is that the burden of proof is always on the person claiming that a relationship between two factors actually exists. One develops a theoretical hypothesis, such as “The achievement effects from school choice evaluations reliably predict their attainment effects.” One then collects as much good data as possible to test that hypothesis, certainly employing an expansive definition of school choice unless and until one has an overwhelming number of cases. One then conducts appropriate statistical tests on the data. If the results are largely consistent with the hypothesis, then one conditionally accepts the hypothesis: “Hey, it looks like achievement effects might predict attainment effects just as hypothesized.” If the results are largely inconsistent with the hypothesis, as in the case of our study, one retains a healthy amount of doubt regarding the association between achievement and attainment results of school choice evaluations. That’s what scientists do.
As a result of our findings of no consistent statistical association between the achievement and attainment effects in school choice studies, we urged commentators and policymakers “to be more humble” in judging school choice programs or schools of choice based solely or primarily on initial test score effects. Nothing in Mike Petrilli’s critiques of our study leads us to alter that position.
— Patrick J. Wolf
Dr. Patrick J. Wolf is Professor and 21st Century Chair in School Choice in the Department of Education Reform at the University of Arkansas College of Education and Health Professions.