My good friend Jay Greene is back this week with yet another assault on the Gates Foundation and its Measures of Effective Teaching (MET) project, the final results of which appeared on Tuesday. Jay accuses the foundation of failing to disclose the limited power of classroom observation scores in predicting future test score gains over and above what one would predict based on value-added scores alone. In fact, he goes so far as to imply that classroom observations are not predictive at all, rendering them useless as a source of diagnostic feedback. His arguments on these points are not compelling. (Full disclosure: MET principal investigator Tom Kane is a senior colleague of mine at the Harvard Graduate School of Education – do with that what you will.)
First, the idea that Gates has somehow suppressed information on the predictive power of observations is downright silly. In addition to featuring relevant results in Table 1 of the accompanying technical report (which Jay obviously had no trouble finding), the second “key finding” in the summary document states that “The composite that best indicated improvement on state tests heavily weighted teachers’ prior student achievement gains based on those same tests.” In other words, if all one were trying to do is to predict gains on state tests, one would use an evaluation system that places a great deal of weight – perhaps as much as 80 percent, we learn from Figure 3 – on value added.
MET argues for a more balanced set of weights among value added, classroom observations, and feedback from student surveys on other grounds. When you move toward a more balanced set of weights you lose correlation with state tests, but you gain two things (at least initially): (1) modestly higher correlation with gains on tests designed to measure “higher order” thinking and (2) far higher reliability. To their credit, however, the MET researchers also note that one should not go too far. They explain that if you assign less than 33 percent of the weight to value added, you end up LOSING not only correlation with state test score gains, but reliability and correlation with those other tests as well.
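To make the weighting discussion concrete, here is a minimal sketch of how a composite effectiveness score is typically formed from standardized component measures and how shifting weight away from value added changes it. This is purely illustrative: the component names, z-scores, and weight vectors below are my own stand-ins, not the MET team's actual data or code, though the 80 percent and roughly one-third figures echo those discussed above.

```python
# Illustrative sketch only: hypothetical teacher-level measures, not MET data.
# Each component is assumed to be standardized (a z-score) before weighting,
# a common approach when combining measures reported on different scales.

def composite(value_added, observation, student_survey, weights):
    """Weighted composite of three standardized component scores."""
    w_va, w_obs, w_survey = weights
    assert abs(w_va + w_obs + w_survey - 1.0) < 1e-9, "weights should sum to 1"
    return w_va * value_added + w_obs * observation + w_survey * student_survey

# One hypothetical teacher, each component expressed as a z-score.
teacher = {"value_added": 0.40, "observation": 0.10, "student_survey": 0.25}

# A value-added-heavy weighting (roughly the kind that best predicts gains on
# state tests) versus a more balanced one (roughly a third on value added).
weight_schemes = {"VA-heavy": (0.80, 0.10, 0.10), "balanced": (0.33, 0.33, 0.34)}

for name, w in weight_schemes.items():
    score = composite(teacher["value_added"], teacher["observation"],
                      teacher["student_survey"], w)
    print(f"{name:9s} composite = {score:.3f}")
```

The point of the exercise is simply that the choice of weights is a design decision with tradeoffs: the same teacher gets a different composite under the two schemes, and the report's argument is about which scheme best balances correlation with state-test gains against reliability and correlation with other outcomes.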
Now, most of the gains in reliability from a more balanced set of weights come from the student survey feedback. The real argument for investing in classroom observations is their potential diagnostic value. Jay dismisses this possibility on spurious grounds. He claims that “if they are not useful as predictors, they can’t be used effectively for diagnostic purposes.”
But the report shows that they are predictive of test score gains. In Table 1 of the technical report (on which Jay bases his critique), the MET team uses evaluation measures from 2009-10 to test their ability to “post-dict” teachers’ effectiveness the previous year. Columns (4) and (8) report results using observations alone; in three of the four tests (the exception is elementary grades ELA), observations are statistically significant predictors. This, however, is not their strongest evidence.
The single most important contribution of the MET study is its use of random assignment (of teachers to classrooms) to validate their overall effectiveness measure and its constituent parts. Table 10 confirms that classroom observations pass this test with flying colors. (So do value added and student surveys.) So when they directly test Jay's argument – whether observations predict student achievement gains – they find that the answer is yes, both when "post-dicting" gains on a non-experimental basis (Table 1) and when predicting gains following random assignment (Table 10).
Does this prove that the information observations provide can be used to improve teacher effectiveness? Of course not. And certainly they don’t yet (and likely never will) provide the detailed guidance that Jay, in a follow-up post, faults Bill Gates for promising. But we do have an existence proof in the form of a recent paper by Stanford’s Eric Taylor and Brown’s John Tyler, which shows that veteran teachers in Cincinnati improved after undergoing an intensive observation-based evaluation program. It should also be noted that the MET results are all based on existing off-the-shelf observation protocols. At least in theory, these protocols could be refined over time to improve their predictive power and diagnostic value.
Jay’s clearly right about one thing: observations are costly. He may even be right that these costs outweigh the benefits, but we do not know enough to say – and that is not what he argued. (A focus of Kane’s ongoing work is figuring out how to use technology to bring the costs down while still generating reliable information.) I also share Jay’s broader concerns about the ongoing effort to prescribe mechanistic teacher evaluation systems on a district- or state-wide basis. This effort may well turn out to be an “expensive flop.” But that is hardly the right descriptor for the MET project.
-Martin West