When Principals Rate Teachers
The best—and the worst—stand out
Elementary- and secondary-school teachers in the United States traditionally have been compensated according to salary schedules based solely on experience and education. Concerned that this system makes it difficult to retain talented teachers and provides few incentives for them to work to raise student achievement while in the classroom, many policymakers have proposed merit-pay programs that link teachers’ salaries directly to their apparent impact on student achievement.
Until recently, only a handful of isolated districts had attempted such programs. Now entire state systems are moving toward merit pay, with new policies established recently in Florida and Texas requiring districts to set teachers’ salaries based in part on the gains their students are making on the state’s accountability exam.
Implementing a merit-pay system, however, comes with challenges. Students often have more than one teacher but take only one high-stakes test. How do we know which teacher to reward? If students are not tested annually in each subject, how do we determine the merit of a teacher in a year without testing? How do we fairly assess the impact of a teacher during a testing year if we do not know how students performed during the previous school year? Can a merit-pay system overcome these obstacles?
One option is to turn to principals and ask them to help determine the size of pay raises. Such subjective performance assessments are already used to evaluate untenured teachers, and they play a large role in promotion and compensation decisions in other occupations. While principals can and do judge teachers’ performance, there is little good evidence on the accuracy of their judgments.
The research reported in this paper fills this gap. We found that principals in a western school district did a good job of assessing teachers’ effectiveness. In fact, principals are quite good at identifying those teachers who produce the largest and smallest standardized achievement gains in their schools (the top and bottom 10–20 percent). They are less able to distinguish among teachers in the middle of this distribution (the middle 60–80 percent), suggesting that merit-pay programs that reward or sanction teachers should be based on evaluations by principals and should be focused on the highest- and lowest-performing teachers.
A Representative Sample
We surveyed all 13 elementary-school principals in a midsized school district in the western United States that asked to remain anonymous. We asked them to rate the teachers in their schools on a variety of performance dimensions. The survey, conducted in February 2003, provides principals’ evaluations of 202 elementary-school teachers in grades 2 through 6.
The teachers included in the study are fairly representative of elementary-school teachers nationwide. Sixteen percent of them are men, the average age is 42, and average teaching experience is 12 years. Most of these teachers attended a local university; 10 percent attended another in-state college; and 6 percent attended a school out of state. Seventeen percent of them have a master’s degree or higher, and most are licensed in either early childhood education or elementary education. Finally, 8 percent of the teachers in our sample taught in a mixed-grade classroom in 2002–03, and 5 percent were in a “split” classroom, sharing a single contract and dividing the school day with another teacher. The students in grades 2 through 6 in the district are predominantly white (73 percent), with a sizable ethnic minority (Latino students compose 21 percent of the elementary population); 48 percent of them receive a free or reduced-price lunch. Achievement levels in the district are almost exactly at the average of the nation (49th percentile on the Stanford Achievement Test).
All elementary-school students in the district take a set of exams each year, in reading and math. These multiple-choice, criterion-referenced tests cover topics that are closely linked to the district’s learning objectives. While student achievement results have not been linked to rewards or sanctions for schools until recently, the results of the exams have been distributed to parents annually for at least the past decade, years before implementation of the No Child Left Behind law. This latter fact is important because our study relies on a consistent data set covering the years 1998 through 2003. The district has not had a merit-pay program for teachers at any time during this period.
To ensure that we could link student achievement data to the appropriate teacher, we limited our sample to classroom teachers, omitting music and gym teachers as well as librarians. We excluded kindergarten and first-grade teachers because earlier achievement exams were not available for their students; this prevented us from developing a “value-added” measure of student learning. We retain in our analysis the small number of teachers who share a contract, each teaching only half of the school day. For our analysis, the gains made by students in these classes count toward the estimated value added of each of the two teachers.
Can Principals Identify Effective Teachers?
Principals were asked not only to provide a rating of overall teacher effectiveness, but also to assess, on a scale from one (inadequate) to ten (exceptional), specific teacher characteristics (ten altogether), including dedication and work ethic, classroom management, parent satisfaction, positive relationship with administrators, and ability to improve math and reading achievement. Principals were assured that their responses would be completely confidential and would not be revealed to the teachers or to any other employee of the school district.
While there was some variation among principals, the overall assessments they gave teachers were generally quite high, with an average of 8.1. Only 10 percent of the assessments fell below a 6, and the average rating for the least-generous principal was still a 6.7. At the same time, principals did not simply assign similar scores to each of their teachers. In fact, the principals generally used 5 to 6 different ratings for the teachers in their school.
Because principals differ in the generosity and degree of variation in the ratings they give, we placed all the ratings on the same scale by subtracting from each teacher’s rating the average rating given by that teacher’s principal and then dividing by the principal’s standard deviation. We did this separately for each specific aspect of teacher performance about which principals were asked.
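The rescaling described above can be sketched in a few lines of Python. This is an illustration rather than the study's own code: the function name, the tuple layout, the assumption that teacher IDs are unique, and the choice of the population standard deviation (the text does not say which variant was used) are all assumptions.

```python
from collections import defaultdict
from statistics import mean, pstdev


def standardize_within_principal(ratings):
    """Convert raw ratings to z-scores relative to each principal's own ratings.

    ratings: list of (principal_id, teacher_id, rating) tuples (hypothetical layout).
    Returns {teacher_id: z_score}.
    """
    # Collect every rating each principal gave.
    by_principal = defaultdict(list)
    for principal, _teacher, rating in ratings:
        by_principal[principal].append(rating)

    z_scores = {}
    for principal, teacher, rating in ratings:
        group = by_principal[principal]
        mu = mean(group)
        sd = pstdev(group)  # assumption: population SD
        # A principal who gave everyone the same score provides no ranking
        # information, so those teachers are placed at zero.
        z_scores[teacher] = (rating - mu) / sd if sd > 0 else 0.0
    return z_scores


example = [
    ("A", "t1", 8), ("A", "t2", 10), ("A", "t3", 9),  # generous principal
    ("B", "t4", 6), ("B", "t5", 8),                   # stricter principal
]
z = standardize_within_principal(example)
```

In this made-up example, a raw rating of 8 from the generous principal A becomes about -1.22, while the same raw 8 from the stricter principal B becomes +1.0, which is the point of putting all ratings on a common scale.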
We compared a principal’s assessment of how effective a teacher is at raising student reading or math achievement, one of the specific items principals were asked about, with that teacher’s actual ability to do so as measured by value added: the difference in student achievement that we can attribute to the teacher. To estimate the value added by a teacher, we examined the performance of that teacher’s students after accounting for a wide variety of student and classroom characteristics that could affect achievement independent of the teacher’s ability. These characteristics include race, gender, eligibility for the federal lunch program, limited English proficiency, and, most important, previous student achievement. We also took advantage of the availability of data on the same teachers from as far back as the 1997–98 school year; this enabled us to distinguish long-term teacher quality from the possibly idiosyncratic performance of a class in any one year.
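To make the idea of value added concrete, here is a deliberately simplified Python sketch. It is not the model used in the study, which controls for many student and classroom characteristics and pools several years of data. This version controls only for prior achievement, uses a single year, and takes each teacher's average residual as the estimate; the function name and record layout are hypothetical.

```python
from collections import defaultdict
from statistics import mean


def simple_value_added(records):
    """Estimate value added as each teacher's mean residual.

    records: list of (teacher_id, prior_score, current_score) tuples,
    one per student (hypothetical layout). Assumes prior scores vary.
    """
    priors = [r[1] for r in records]
    currents = [r[2] for r in records]
    mean_prior, mean_current = mean(priors), mean(currents)

    # One-covariate OLS: current = intercept + slope * prior.
    sxx = sum((x - mean_prior) ** 2 for x in priors)
    sxy = sum((x - mean_prior) * (y - mean_current)
              for x, y in zip(priors, currents))
    slope = sxy / sxx
    intercept = mean_current - slope * mean_prior

    # A student's residual is how far the actual score falls above or
    # below what prior achievement alone would predict.
    residuals = defaultdict(list)
    for teacher, prior, current in records:
        residuals[teacher].append(current - (intercept + slope * prior))

    return {teacher: mean(vals) for teacher, vals in residuals.items()}
```

A positive estimate means a teacher's students scored higher than their prior achievement alone would predict; a negative estimate means they scored lower.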
We find a positive correlation between a principal’s assessment of how effective a teacher is at raising student achievement and that teacher’s success in doing so as measured by the value-added approach: 0.32 for reading and 0.36 for math. These correlations are based not on a principal’s overall rating of the teacher, but rather on the principal’s personal assessment of how effective the teacher is at “raising student math (or reading) achievement.” Previous studies of evaluations by principals have used only the overall rating of the teacher, a less direct assessment of a teacher’s ability to raise student performance. Using the overall rating in that way could compromise the accuracy of subjective performance evaluations, especially if principals value characteristics of teachers that are unrelated to their effect on student performance. Our findings lead us to conclude that principals are able to identify accurately this dimension of teacher effectiveness.
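For readers unfamiliar with the statistic, the sketch below shows how a correlation of this kind could be computed. It assumes the standard Pearson correlation coefficient, which the text does not name explicitly. The inputs would be two equal-length lists, one of standardized principal ratings and one of value-added estimates for the same teachers.

```python
from math import sqrt
from statistics import mean


def pearson(xs, ys):
    """Pearson correlation between two equal-length lists of numbers.

    Assumes both lists have some variation (otherwise the denominator is zero).
    """
    mean_x, mean_y = mean(xs), mean(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)
```

A value of 1.0 would mean the principal's ratings rank and space teachers exactly as the value-added measure does, and 0 would mean there is no linear relationship between the two.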
Why aren’t these correlations even higher? One possible explanation is that principals focus on the average test scores in a teacher’s classroom rather than on student improvement. There is some evidence for this conjecture. The correlation between ratings by principals and the average test scores of a teacher’s students is significantly higher than the correlation between ratings by principals and the teacher’s value-added rating in reading (0.56 versus 0.32), though not in math.
Another reason could be that principals focus on their most recent observations of teachers. We do find, for example, that the average achievement gains in a teacher’s classroom in 2002–03 are a modestly stronger predictor of the principal’s rating than the gains in any previous year. In theory, it is possible that principals are correct in assuming that a teacher’s effectiveness changes over time so that teachers’ most recent experience is the best indicator of their actual effectiveness. If that were the case, however, we would expect to find that principals’ ratings are more highly correlated with value-added measures that have been adjusted to account for the fact that teachers tend to be less effective in their first one or two years in the classroom. In fact, the correlation between principals’ ratings and experience-adjusted value-added measures is no higher than the correlation with our baseline value-added measures. The bigger mistake principals make, it seems, is not adequately accounting for students’ incoming ability.
While informative about principals’ overall abilities, a simple correlation does not tell us whether principals are more or less effective at identifying teachers at certain points on the ability distribution. We therefore estimated the percentage of teachers that a principal can correctly identify in the top group within his or her school. We found that the teachers identified by principals as being in the top category were, in fact, in the top category according to the value-added measures about 52 percent of the time in reading and 69 percent of the time in mathematics. If principals randomly assigned ratings to teachers, we would expect the corresponding probabilities to be 14 and 26 percent, respectively. This suggests that principals have considerable ability to identify teachers in the top of the distribution. The results are similar if one examines principals’ ability to identify teachers in the bottom of the ability distribution.
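The comparison with random assignment can be illustrated with a short Python sketch for a single school. The definitions here are assumptions, since the text does not spell them out: the principal's top group is taken to be the teachers who received that principal's highest rating, and the value-added top group is the same number of teachers with the highest value-added estimates. Function and variable names are hypothetical.

```python
def top_group_hit_rate(ratings, value_added):
    """Compare a principal's top-rated teachers with the top value-added teachers.

    ratings, value_added: dicts keyed by teacher_id for one school.
    Returns (hit_rate, chance_rate).
    """
    teachers = list(ratings)
    n = len(teachers)

    # Teachers given the principal's highest rating (ties included).
    best = max(ratings.values())
    rated_top = {t for t in teachers if ratings[t] == best}
    k = len(rated_top)

    # The k teachers with the highest value-added estimates.
    ranked = sorted(teachers, key=lambda t: value_added[t], reverse=True)
    va_top = set(ranked[:k])

    hit_rate = len(rated_top & va_top) / k
    # If the principal picked k teachers at random, each pick would land
    # in the value-added top k with probability k / n.
    chance_rate = k / n
    return hit_rate, chance_rate


ratings = {"a": 10, "b": 10, "c": 8, "d": 7, "e": 6}
value_added = {"a": 0.3, "b": -0.1, "c": 0.2, "d": 0.0, "e": -0.2}
hit, chance = top_group_hit_rate(ratings, value_added)  # (0.5, 0.4)
```

In this made-up school the principal's two top-rated teachers include one of the two top value-added teachers, a hit rate of 50 percent against a chance rate of 40 percent. Because the chance rate depends on how many teachers fall into the top category in each school, the random benchmark can differ between reading and math.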
Despite their success with the top and bottom of the distribution, principals are significantly less successful at distinguishing among teachers in the middle of the ability distribution. Principals correctly identify only 49 percent of teachers as being better than the median teacher in their school in boosting students’ reading scores, relative to the 33 percent that one would expect if principals’ ratings were randomly assigned. Principals appear somewhat better at distinguishing between teachers in the middle of the distribution in math (they correctly placed 54 percent of teachers above the median, compared with the 26 percent expected if ratings were random), but they again appear to be better at identifying the best and worst teachers.
One reason that principals might have difficulty distinguishing between teachers in the middle is that the distribution of teachers’ value-added ratings is highly compressed. However, our analysis of the data suggests that this is not the case. Teachers who receive ratings at or close to the median in the school have estimated value-added measures that are quite widely dispersed.
What Characteristics of Teachers Do Principals Value?
Of course, the effects of moving to a system of compensation based on assessment by principals depend on the relative importance they place on a teacher’s ability to raise standardized test scores when making overall assessments of teachers’ effectiveness. While such preferences could theoretically be set by district administrators or other policymakers, it is likely that principals would retain some autonomy over personnel decisions, so their preferences are important to investigate. We therefore compared principals’ overall rating of each teacher with their assessment of various teacher attributes to examine how principals value different dimensions of quality in teachers.
Perhaps not surprisingly, teachers’ ratings on many (though not all) of the individual survey items are highly correlated. Based on the relationships between the questions, we created three groups of teachers’ quality characteristics and reanalyzed the results. The first group captures what might be described as traditional teaching ability and includes the ratings of classroom management, organization, and ability to improve students’ test scores. The second, including the principal’s assessments of a teacher’s relationship with colleagues and administrators, measures a teacher’s collegiality. The third measures student satisfaction and includes the principal’s ratings of student satisfaction and the teacher as a role model.
Ability, collegiality, and student satisfaction all contribute independently to a principal’s overall evaluation of a teacher, but principals weigh the set of questions measuring teachers’ ability to improve student achievement and to manage a classroom most heavily. An increase of one standard deviation in a principal’s evaluation of a teacher’s management and teaching ability, for example, is associated with an increase of 0.56 standard deviations in the principal’s overall rating. In comparison, an increase of one standard deviation in teacher collegiality is associated with an increase of roughly one-third of a standard deviation in the overall rating. Meanwhile, teachers scoring one standard deviation higher in student satisfaction score just 0.15 standard deviations higher in their overall rating, all else being equal.
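One way to read these figures is as weights in a linear, additive relationship among standardized variables. The short Python sketch below works through that reading. The additive form, the dictionary keys, and the use of exactly 1/3 for "roughly one-third" are assumptions made for illustration, not details confirmed by the text.

```python
# Weights taken from the figures reported above, in standard-deviation units.
# "Roughly one-third" is approximated as 1/3 (an assumption).
WEIGHTS = {
    "ability": 0.56,
    "collegiality": 1 / 3,
    "student_satisfaction": 0.15,
}


def predicted_overall_shift(shifts):
    """Predicted change in overall rating (in SDs) for given factor changes.

    shifts: dict mapping a factor name to its change in SD units;
    factors not listed are treated as unchanged.
    """
    return sum(WEIGHTS[name] * shifts.get(name, 0.0) for name in WEIGHTS)


# A teacher one SD above average in ability but one SD below average in
# student satisfaction: 0.56 - 0.15 = 0.41 SDs above average overall.
example = predicted_overall_shift({"ability": 1.0, "student_satisfaction": -1.0})
```

Under these assumptions, strength in teaching and management more than offsets an equally large weakness in student satisfaction in the principal's overall judgment.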
We should care about the quality of principals’ assessments of teacher quality not just for their reliability in a merit-pay system, but also for their ability to identify teachers who will continue to improve student achievement. In order to get a sense of how well principals’ assessments forecast teachers’ performance, we examined how well these assessments predict future student achievement gains. For our February 2003 survey of principals, that meant evaluating scores on the spring 2003 tests. We compared the predictive accuracy of a principal’s assessment of teacher effectiveness with the predictive accuracy of a teacher’s value-added rating. We also measured the accuracy of the traditional determinants of teachers’ salaries, experience and education, in predicting those scores. Throughout, we accounted for differences in previous student achievement, student demographics, and classroom characteristics.
Our findings suggest that ratings by principals, both overall ratings and ratings of a teacher’s ability to improve achievement, effectively predict a student’s future achievement gains (see Figure 1). Students whose teachers receive an overall rating one standard deviation above the mean are predicted to score roughly 0.06 standard deviations higher in reading than students whose teachers receive an average rating. By way of comparison, students receiving free or reduced-price lunch in the same district experience achievement gains approximately 0.16 standard deviations lower than similar students who are not eligible for such programs. Assignment to a teacher with a favorable evaluation by her principal appears to be more important for math performance. An increase of one standard deviation in the principal’s evaluation predicts an increase of 0.14 standard deviations in math performance, roughly on par with the disadvantage associated with coming from a low-income family.
Measures of teachers’ value added in previous years are an even better predictor of future gains in students’ achievement than are principal ratings. These results, which are similar for math and reading, suggest that teachers’ impact on student achievement, as measured by simple value-added measures of teacher effectiveness, remains fairly stable over time and that principals’ ratings effectively capture a substantial fraction of these stable differences in teachers’ effectiveness.
We do not find any statistically significant relationship between the number of years a teacher has taught and students’ achievement, though this is probably due to the necessary omission of first-year teachers (because we cannot measure their value added for a previous school year). Other studies have found that first-year teachers tend to perform worse on average than experienced teachers. Education does have some predictive power. Teachers with advanced degrees have students who score roughly 0.10 standard deviations higher. We hesitate to say that education itself is producing these gains, because a teacher’s level of education is likely to be associated with personal characteristics not accounted for in our analysis, and these may be the very factors responsible for the improvements in student achievement.
Perhaps our most interesting finding is that the salaries teachers in this district received in 2002–03 bore no relation at all to their impact on student achievement. Students with highly paid teachers made no more progress than those with teachers who had low salaries.
In sum, our results suggest that student achievement (as measured by standardized test scores) would probably improve more under a system based on principals’ assessments than in systems where compensation is based solely on education and experience. This is because principals would be able to identify and reward the very best teachers while, at the same time, identifying the least competent teachers for remediation or dismissal.
To the extent that the most important staffing decisions involve sanctioning incompetent teachers and rewarding the very best teachers, a principal-based assessment system may affect achievement as positively as a merit-pay system based solely on student test results. Moreover, evaluation by the principal has the potential to offset some of the potential negative consequences of test-based accountability systems. If principals can observe inputs as well as outputs, they may be able to ensure that teachers increase student achievement through improvements in pedagogy, classroom management, or curriculum rather than teaching to the test. Principals can also evaluate teachers on the basis of a broader spectrum of educational outputs in addition to test scores that parents may value. At the same time, the inability of principals to distinguish between a broad middle range of teacher quality suggests caution in relying on principals for fine-grained performance determinations, as might be required under certain merit-pay policies.
Two important caveats should be considered when interpreting our results. First, we conducted our analysis in a context where principals were not being evaluated on the basis of their ability to identify effective teachers. It is possible that principals’ ability to identify the best-performing teachers would be enhanced by a school system where the principals had more responsibility for monitoring teachers’ effectiveness. At the same time, social or political pressures might make principals less willing to assess teachers honestly if their judgments directly influenced teachers’ compensation. Second, our analysis focuses on the source of the teacher assessment; we do not address the type of rewards or sanctions associated with teacher performance. This is clearly an important dimension of any performance management system, and one would not expect either a principal-based or a test-based assessment system to have a substantial impact on student outcomes unless it were accompanied by meaningful consequences.
Brian Jacob is assistant professor of public policy at the John F. Kennedy School of Government, Harvard University, and a faculty research fellow with the National Bureau of Economic Research. Lars Lefgren is assistant professor of economics at Brigham Young University.