This post also appears on Rick Hess Straight Up.
Last summer, the Los Angeles Times created a furor with its hotly debated decision to post the value-added scores for thousands of Los Angeles teachers and to identify individual teachers, by name, as more or less effective. This week, the situation roared back to life when University of Colorado professor Derek Briggs and coauthor Ben Domingue issued a report titled “Due Diligence and the Evaluation of Teachers,” which charged that the L.A. Times analysis was “based on unreliable and invalid research” and that the use of an alternative value-added model might have changed how half of 3,300 fifth-grade teachers were rated when it came to reading. Even the Huffington Post has gotten in on the action, running a solid post by Chuck Kerchner.
The issue has been rife with drama, with the L.A. Times breaking the story on an examination of its own research and controversy about the fact that Briggs’ analysis was sponsored by an entity that gets substantial support from teacher unions. The Washington Post’s invaluable Nick Anderson penned an illuminating take on the whole question Tuesday, in which the bottom line was provided by Harvard’s Tom Kane. The dueling results showed that “when you control for different sets of variables, the estimates vary,” said Kane. “But we still don’t know yet which model was the right one and how far off from the truth the various estimates are.”
Beyond that, after communicating with the various parties, it strikes me that there are four key issues. Despite the heated language that’s been used, I don’t think the questions here are generally black-or-white. Rather, it strikes me that both the L.A. Times’ and Briggs’ analyses have merit, and that, perhaps more than anything, this dispute shows just how sensitive value-added determinations of teacher effectiveness are to technical considerations.
First, there’s the question of how much variability we can live with in any kind of rating scheme. Briggs and Domingue argue that thousands of teachers identified as ineffective or effective under the L.A. Times analysis appear to be average if the model is modified. More than half of the teachers had a different effectiveness rating in reading using the Briggs analysis than they did using the L.A. Times model, and about 40 percent had a different effectiveness rating in math. Moreover, when it came to reading, Briggs and Domingue reported, “8.1% of those teachers identified as effective under our alternative model are identified as ineffective in the L.A. Times model, and 12.6% of those identified as ineffective under the alternative model are identified as effective by the L.A. Times model.” With regard to math, they found, “1.4% of those teachers identified as effective under the alternative model are identified as ineffective in the L.A. Times model, and 2.7% would go from a rating of ineffective under the alternative model to effective under the L.A. Times model.”
Lead L.A. Times reporter Jason Felch has pointed out that the Briggs reanalysis focused on the white paper by economist Richard Buddin that provided the basis for the L.A. Times study, rather than the L.A. Times data itself. This turns out to matter a good deal, because the actual L.A. Times results were based on a more restricted sample. The L.A. Times only used teachers who had at least 60 student records, which means the data are less noisy (i.e., less likely to bounce around) than the larger population examined in Buddin’s paper. Briggs acknowledges this point but argues that the takeaway remains the same, which is that “it is likely that there are a significant number of false positives (teachers rated as effective who are really average), and false negatives (teachers rated as ineffective who are really average) in the L.A. Times’ rating system.”
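For readers who want to see why a 60-record floor matters, here is a minimal sketch in Python. It is a toy simulation, not the L.A. Times or Buddin methodology: the function name `estimate_spread`, the normally distributed student gains, and the comparison class size of 15 are all assumptions made purely for illustration. It shows only the general statistical point that an average over more student records bounces around less from sample to sample.

```python
import random
import statistics

def estimate_spread(n_students, true_effect=0.0, student_sd=1.0,
                    n_trials=2000, seed=0):
    """Spread (standard deviation) of a simulated teacher's estimated effect.

    Each trial draws n_students hypothetical student gains around the same
    true effect and averages them; the spread of those averages is the noise.
    """
    rng = random.Random(seed)
    estimates = []
    for _ in range(n_trials):
        gains = [rng.gauss(true_effect, student_sd) for _ in range(n_students)]
        estimates.append(statistics.fmean(gains))
    return statistics.stdev(estimates)

# Hypothetical comparison: a teacher with 15 records vs. one with 60.
spread_small = estimate_spread(15)
spread_large = estimate_spread(60)
print(f"15 records: spread {spread_small:.3f}")
print(f"60 records: spread {spread_large:.3f}")
```

Under these assumptions, quadrupling the number of records roughly halves the noise, since the spread of an average shrinks with the square root of the sample size. That is consistent with Felch’s point, though it says nothing about whether 60 records is enough for any particular use.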
My take: There is no “right” answer here. It’s inevitable that any system will unfairly label teachers, or unfairly fail to recognize them (as is all too common today). And, as the Hoover Institution’s Rick Hanushek told me, “It’s not clear how much year to year variation there is in a teacher’s performance anyway. These results capture the stable part, but teachers may get divorced or married or have problems with their own kids.” Similarly, “Another reason for instability might be that the teacher is going to school to get a master’s degree and is spending a lot of time on these activities.” All of this means a teacher’s value-added may simply vary year to year. The question is whether a rating system in which results are consistent 60% of the time, or 80% of the time, or 99% of the time, is accurate enough to be used in a particular fashion. That’s a judgment call, and one that’s the prerogative of policymakers, parents, and educators–not statisticians.
Second, Buddin’s white paper is built around a relatively simple value-added model, and that model provided the basis for the Times’ reporting. There are many ways to specify value-added models, and the results fluctuate pretty substantially depending on the specification that you use. It’s not clear that Buddin’s simpler model was “wrong,” but Briggs raises fair questions about the specifications used–and the absence of discussion in Buddin’s white paper about alternative models or the sensitivity of his model to those specifications. On that point, Felch notes, “It is perhaps fair criticism to note that we did not include all our due diligence in Buddin’s white paper, but to say we never did it is simply wrong.” Unless those analyses are made available, there’s no way for any observer to gauge how sensitive value-added results are to alternative, more complex specifications that control for, say, peer effects, a longer test score history, and so forth.
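To make the specification point concrete, here is a rough Python sketch of how adding a single control can reshuffle ratings. Everything in it is hypothetical: the 300 simulated teachers, the `class_mix` covariate standing in for something like peer effects, the effect sizes, and the quintile rating scheme are assumptions for illustration, not the models Buddin or Briggs actually estimated.

```python
import random

def quintiles(values):
    """Label each value 0 (lowest fifth) through 4 (highest fifth) by rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    labels = [0] * len(values)
    for rank, i in enumerate(order):
        labels[i] = rank * 5 // len(values)
    return labels

def residualize(y, x):
    """Residuals from a one-variable least-squares fit of y on x."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    slope = (sum((a - mx) * (b - my) for a, b in zip(x, y))
             / sum((a - mx) ** 2 for a in x))
    return [b - my - slope * (a - mx) for a, b in zip(x, y)]

rng = random.Random(1)
n_teachers = 300  # hypothetical sample size

# Assumed data-generating process: observed gains mix a true teacher effect,
# a classroom-composition effect, and noise.
class_mix = [rng.gauss(0, 1) for _ in range(n_teachers)]
true_effect = [rng.gauss(0, 0.2) for _ in range(n_teachers)]
mean_gain = [t + 0.3 * c + rng.gauss(0, 0.15)
             for t, c in zip(true_effect, class_mix)]

# "Simple" specification: rate teachers on raw average gains.
simple_rating = quintiles(mean_gain)
# "Alternative" specification: rate on gains after controlling for class mix.
adjusted_rating = quintiles(residualize(mean_gain, class_mix))

share_changed = sum(a != b for a, b in
                    zip(simple_rating, adjusted_rating)) / n_teachers
print(f"{share_changed:.0%} of simulated teachers change quintile")
```

How many teachers move depends entirely on the made-up parameters above, so the printed share is not an estimate of anything in Los Angeles. The sketch only shows the mechanism: two defensible specifications applied to the same data can hand the same teacher different labels.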
That said, Buddin responds, “The Colorado study does not report the results of the [alternative value-added] model that they champion. They report no regression coefficients, no standard errors, and no indication of regression diagnostics.” Therefore, Buddin asks, “Why should we believe that [the alternative model] is better than [the L.A. Times model]?”
My take: It’s manifestly unclear, at least at this point, what the “best” models are. As Rick Hanushek notes wryly, “The real model isn’t known.” There are lots of ways to specify these models, and we’re learning as we go. This argues for everyone being transparent about their assumptions and for an appropriate dose of humility when using value-added data. And, to be fair, the standards for journalism and academic discourse are quite different. More than anything else, perhaps, this suggests that journalistic outlets need to move deliberately when wading into this space.
Third, Buddin points out that the data he used and the data Briggs used are different. For reasons that are unclear, Buddin notes, Briggs’ data include “900 fewer teachers and 100,000 fewer student test scores.” There’s no way to be sure how much these extra exclusions might have affected the results. When Briggs tells me that he was unable to replicate Buddin’s results, even when using the exact specifications that Buddin provided, I wonder how much of this can be attributed to this data question.
Finally, the academic-journalistic tension here is substantial, including questions about how the L.A. Times should cover analysis of its own work. I think Briggs makes a fair point when he says, “At a minimum, it should be clear that [Felch] mischaracterized the nature of our findings in his coverage of the report in the story released on Monday. Recall the headline for that story: ‘Separate study confirms many Los Angeles Times findings on teacher effectiveness.’ That’s just straight out bogus.” Like I’ve noted, I’m not sure Briggs’ results are as damning as he thinks they are, but neither the findings nor the tenor of the analysis suggests that the researchers thought the takeaway was a confirmation of the Times’ work. At the same time, it’s not as simple as that. Felch fairly points out that the Briggs results reinforce the L.A. Times’ findings regarding the variability of performance across LAUSD teachers–perhaps the finding most relevant to public debate. And I’ve much sympathy for Felch’s argument that, “Part of this is a culture clash between journalism and academia. Briggs made no attempt to contact us before releasing his study, and simply assumed we had conducted no due diligence. The truth is quite the contrary. In journalism, we would never write a lengthy critique of someone’s work without picking up the phone to see if our assumptions were right.”
Like I said, it strikes me that this is a case where both sides make some valid points. I think Buddin could’ve done more to explain his specifications and I would’ve liked to see the Times proceed more carefully, but I also think the Briggs critique is more cautionary than damning. So my takeaway: this is a case where I think the results mostly highlight the import of moving carefully and thoughtfully on value-added. That said, the standard in crafting value-added systems ought not be perfection, because nobody anywhere in the private or public sector has got a system that can meet the standard. The question is whether a given system is better than the alternative. And the truth is that today’s personnel systems are so insensitive to performance, so protective of mediocrity, and so dismissive of excellence, that value-added systems need not be flawless to be good and useful tools.
(And, for those who are wondering, my stance on the L.A. Times exercise hasn’t changed a bit. I’m still just where I was last summer: I think it’s good for policymakers and educators to be exploring ways to employ value-added in teacher evaluation and pay, but that it’s a mistake to publicly issue individual scores.)
-Frederick Hess