Methodological Appendix for the Crystal Bridges Experimental Study

Education Next, Winter 2014

Empirical Strategy

Because the randomized controlled trial approach has the important feature of generating comparable treatment and control groups, we can use a straightforward set of analytic techniques, designed for use in social experiments, to estimate the impact of a school tour of an art museum on student outcomes. In its simplest form, this technique estimates mean differences using the following equation for outcome Y of student i in matched pair m:

(1)  Y_im = α + β₁Treat_i + β₂Match_im + ε_im

The binary variable Treat_i is equal to 1 if the student is in the treatment group that was randomly assigned to visit the museum for a school tour and is equal to 0 otherwise. Because the groups were created using a stratified randomization procedure within matched applicant group pairs, Match_im is also included in the model as a vector of dummy variables that have the statistical effect of estimating within, as opposed to across, matched pairs. Finally, ε_im is a stochastic error term clustered at the applicant group level to take into account the spatial correlation from students nested within applicant groups.
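To illustrate what "estimating within, as opposed to across, matched pairs" means, the sketch below computes β₁ from equation (1) by subtracting each pair's mean from the outcome and from the treatment indicator, which gives the same point estimate as including a dummy variable for every matched pair. It is a minimal sketch, not the study's code: the function name, the tuple layout, and the toy outcome values are all hypothetical, and it does not compute the clustered standard errors described above.

```python
from collections import defaultdict


def within_pair_effect(rows):
    """Estimate beta_1 in equation (1) by demeaning within matched pairs.

    rows: iterable of (pair_id, treat, y) tuples, one per student.
    Subtracting each pair's mean from both the outcome and the treatment
    indicator, then regressing one on the other without an intercept,
    gives the same coefficient as OLS with a dummy for every pair.
    """
    by_pair = defaultdict(list)
    for pair_id, treat, y in rows:
        by_pair[pair_id].append((treat, y))

    num = 0.0  # sum of (demeaned treat) * (demeaned y)
    den = 0.0  # sum of (demeaned treat) squared
    for members in by_pair.values():
        mean_t = sum(t for t, _ in members) / len(members)
        mean_y = sum(y for _, y in members) / len(members)
        for t, y in members:
            num += (t - mean_t) * (y - mean_y)
            den += (t - mean_t) ** 2
    if den == 0:
        raise ValueError("no within-pair variation in treatment")
    return num / den


# Hypothetical toy data: two matched pairs with made-up outcome values.
toy = [
    ("pair_a", 1, 5.0), ("pair_a", 1, 7.0),
    ("pair_a", 0, 4.0), ("pair_a", 0, 6.0),
    ("pair_b", 1, 10.0), ("pair_b", 0, 8.0), ("pair_b", 0, 9.0),
]
effect = within_pair_effect(toy)
print(round(effect, 3))
```

In the toy data the treatment-control gap is 1.0 in the first pair and 1.5 in the second; the estimate is a weighted average of the two gaps, so differences in outcome levels between pairs do not contribute to it.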

Proper randomization generates experimental groups that are comparable but not necessarily identical. The basic regression model can, therefore, be improved by adding controls for observable characteristics to increase the reliability of the estimated impact by accounting for minor differences and improving the precision of the overall statistical model. This yields the following equation to be estimated:

(2)  Y_im = α + β₁Treat_i + β₂Match_im + β₃Gender_i + β₄Grade_i + ε_im

where Gender_i is a dummy variable equal to 1 if the student is female and 0 otherwise, and Grade_i is a vector of dummy variables indicating the grade level of student i. In this model, β₁ is the parameter of interest and represents the effect of a school tour for students in the treatment group. Equation (2) is our preferred model for estimating overall impacts.

In addition, we are interested in the possibility of heterogeneous effects on particular subgroups of students. Subgroup effects are estimated by augmenting the basic analytic equation with indicator variables and an interaction term where S_i indicates that a student is a member of a particular subgroup:

(3)  Y_im = α + β₁Treat_i + β₂Match_im + β₃Gender_i + β₄Grade_i + β₅S_i + β₆S_i*Treat_i + ε_im

These models are used to estimate impacts on the separate components of the subgroups (e.g., impacts on minority and non-minority students separately) and test for the difference in impacts between the two groups. In our analyses, we examine the subgroup effects for students in schools that have higher (> 50%) or lower (< 50%) proportions of students who are FRL-eligible; students attending schools located in smaller towns (< 10,000 population) and larger towns (> 10,000 population); white and non-white students; and students making their first visits to the museum. When examining the impact of a first visit, we restrict our dataset to students in the treatment group who had only visited the museum once (i.e., on the school visit) and students in the control group who had never visited the museum. This excludes students who had been to Crystal Bridges outside of the school visit program prior to being surveyed.
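The way subgroup impacts follow from the interaction model can be written out explicitly. The short derivation below is a reading of equation (3) rather than something stated in the original text; X is shorthand introduced here for the other covariates (Match_im, Gender_i, Grade_i), which are held fixed, and the error term is assumed to have mean zero given treatment status.

```latex
\begin{align*}
E[Y_{im} \mid Treat_i = 1, S_i = 0, X] - E[Y_{im} \mid Treat_i = 0, S_i = 0, X] &= \beta_1 \\
E[Y_{im} \mid Treat_i = 1, S_i = 1, X] - E[Y_{im} \mid Treat_i = 0, S_i = 1, X] &= \beta_1 + \beta_6 \\
(\beta_1 + \beta_6) - \beta_1 &= \beta_6
\end{align*}
```

Under this reading, β₁ is the impact for students outside the subgroup, β₁ + β₆ is the impact for students in it, and a test of β₆ = 0 is a test of whether the two impacts differ.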

Comparability of Treatment and Control Groups

Even within randomized controlled trials, treatment and control groups may differ significantly from each other by chance. To explore whether that occurred in our experiment, we compare the observed characteristics of treatment and control group students. We find no significant differences on observed characteristics.

Different outcomes in our study are based on different samples. The tolerance and historical empathy outcomes are based on items included in the survey administered to students during the spring of 2012. The critical thinking measure is based on an exercise given to students during the fall of 2012. The interest in art museums measure is based on items in surveys given to students during both semesters. The demographic characteristics for all three samples (Spring 2012, Fall 2012, and combined) are presented below in appendix tables 1, 2, and 3. None of the 27 differences between the observed characteristics of treatment and control group students presented in those tables is statistically significant at the conventional p < .05 level. The town population in the Spring 2012 sample differed at the p < .10 level, but with 27 comparisons, finding one such difference could easily occur by chance. We conducted joint F-tests for all three samples and found that, taken as a whole, the characteristics of our treatment and control groups did not differ significantly from each other.
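The claim that one difference at the p < .10 level among 27 comparisons could arise by chance can be checked with simple arithmetic. The sketch below assumes the 27 comparisons are independent, which is a simplification made here for illustration (the combined sample overlaps the two semester samples, so the tests are in fact correlated); the function name is hypothetical.

```python
def chance_findings(n_tests, alpha):
    """Return (expected false positives, P(at least one false positive)).

    Assumes every null hypothesis is true and the tests are independent.
    """
    expected = n_tests * alpha
    p_at_least_one = 1.0 - (1.0 - alpha) ** n_tests
    return expected, p_at_least_one


expected, p_any = chance_findings(27, 0.10)
print(f"expected: {expected:.1f}, P(at least one): {p_any:.3f}")
# prints "expected: 2.7, P(at least one): 0.942"
```

Under these assumptions, roughly two or three differences at the .10 level would be expected even with perfectly balanced groups, so a single one is unremarkable.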

We also administered a different survey to students in kindergarten through 2nd grade. We collected fewer descriptive characteristics about the K-2nd grade sample, but as shown in appendix table 4, we find no significant differences between the younger treatment and control group students either.

Appendix Table 1: Treatment/Control Balance of the Spring 2012 Sample, Grades 3-12

Characteristic	Treatment (n = 1,899)	Control (n = 2,106)	Difference
Percent females	53.25	51.52	1.72
Percent white	62.82	59.73	3.09
Percent Hispanic	19.06	19.85	-0.79
Percent black	2.95	4.70	-1.75
Percent other	15.71	15.72	-0.55
School % FRL	50.44	52.73	-2.29
Average grade	6.06	6.09	-0.02
Miles from museum	35.23	37.10	-1.86
Town population	40,157	55,654	-15,497*

* p < .10, two-tailed.

Appendix Table 2: Treatment/Control Balance of the Fall 2012 Sample, Grades 3-12

Characteristic	Treatment (n = 1,860)	Control (n = 2,431)	Difference
Percent females	50.76	50.62	0.14
Percent white	55.65	60.59	-4.95
Percent Hispanic	18.49	17.52	0.97
Percent black	2.69	2.96	-0.27
Percent other	23.17	18.92	4.25
School % FRL	58.10	58.56	-0.46
Average grade	5.75	5.57	0.19
Miles from museum	34.90	43.64	-8.73
Town population	39,675	31,537	8,138

Appendix Table 3: Treatment/Control Balance of the Combined Sample, Grades 3-12

Characteristic	Treatment (n = 3,759)	Control (n = 4,537)	Difference
Percent females	52.02	51.04	0.98
Percent white	59.27	60.19	-0.92
Percent Hispanic	19.06	19.85	0.79
Percent black	2.82	3.77	-0.95
Percent other	19.13	17.43	1.69
School % FRL	54.20	55.86	-1.66
Average grade	5.91	5.81	0.10
Miles from museum	35.07	40.60	-5.53
Town population	39,919	42,732	2,813

Appendix Table 4: Treatment/Control Balance of the Combined Sample, Grades K-2

Characteristic	Treatment (n = 1,445)	Control (n = 1,189)	Difference
Percent females	47.71	48.86	-1.15
School % FRL	41.66	55.56	-13.90
Average grade	1.22	1.44	0.40
Miles from museum	14.95	22.43	-7.48
Town population	43,356	37,813	5,543

Critical Thinking Skills Inter-Coder Reliability

Our measure of critical thinking skills was developed and validated by Adams, Foutz, Luke, and Stein (2007) in their study of the School Partnership Program at the Isabella Stewart Gardner Museum in Boston. Students in 3rd through 12th grade during the fall of 2012 were shown a copy of Bo Bartlett’s painting, The Box. Students were asked to write a short essay in response to the questions: “What do you think is going on in this painting?” and “What do you see that makes you think that?” Each answer was scored blindly by one of two researchers, and both researchers scored an overlapping set of 750 responses.

The critical thinking measure is based on the number of instances that students engaged in the following in their essays: observing, interpreting, evaluating, associating, problem finding, comparing, and flexible thinking. Our measure of critical thinking is the sum of the counts of these seven items.

Based on the sample of 750 essays scored by both researchers, we are able to calculate inter-coder reliability. Our researchers were highly consistent in their scoring of the composite critical thinking score as well as on most of its seven components. As can be seen in appendix table 5, the Cronbach’s Alpha for the composite critical thinking score is .94. For the components, the Cronbach’s Alpha was between .73 and .85 for five of the seven items. Inter-coder reliability was weaker when scoring problem finding and comparisons, but those components were displayed only rarely and make little difference to the composite score.

Appendix Table 5: Inter-Coder Reliability for Critical Thinking Items

Item	Average (Std. Dev.)	Cronbach’s Alpha
Composite (Sum of 7)	8.16 (3.85)	0.94
Observation	3.97 (2.40)	0.85
Interpretation	3.90 (2.35)	0.82
Evaluation	0.02 (0.18)	0.75
Association	0.06 (0.25)	0.73
Problem Finding	0.01 (0.12)	0.44
Comparison	0.02 (0.15)	0.20
Flexible Thinking	0.17 (0.43)	0.82
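For readers unfamiliar with the statistic reported in the table above, the sketch below computes Cronbach's Alpha with the standard formula, treating each coder's scores as one column. It is a minimal illustration only: the function name is hypothetical and the five pairs of scores are made up, not drawn from the study's 750 double-scored essays.

```python
from statistics import variance


def cronbach_alpha(columns):
    """Cronbach's alpha for k columns of scores on the same cases.

    columns: list of k equal-length lists, one per coder (or per item).
    alpha = k / (k - 1) * (1 - sum of column variances / variance of totals)
    """
    k = len(columns)
    if k < 2:
        raise ValueError("need at least two columns")
    totals = [sum(scores) for scores in zip(*columns)]
    column_variance = sum(variance(col) for col in columns)
    return (k / (k - 1)) * (1.0 - column_variance / variance(totals))


# Hypothetical composite scores from two coders on five essays (made up).
coder_1 = [8.0, 5.0, 12.0, 7.0, 10.0]
coder_2 = [9.0, 5.0, 11.0, 6.0, 10.0]
alpha = cronbach_alpha([coder_1, coder_2])
print(round(alpha, 3))
```

When the two coders rank and score essays almost identically, as in the toy data, the variance of the summed scores is much larger than the sum of the separate variances and alpha approaches 1. Items that are almost always scored zero, such as problem finding and comparison, have very little variance, so a handful of disagreements can pull their alpha down sharply.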

Outcome Scales

We used multiple survey items to measure tolerance, historical empathy, and the extent to which students were developing interest in art museums. Because we had theoretical reasons for expecting that these items measured the same underlying constructs and for brevity of presentation, we standardized and combined items into three scales, one for each of these constructs. Cronbach’s Alpha tests show that the items reliably measure historical empathy and developing interest in art museums. The Cronbach’s Alpha for the tolerance scale, however, falls short of conventional standards for reliably measuring the same underlying construct. We nevertheless present the tolerance result as a combined scale for two reasons. Presenting the four items in the scale separately would still show a positive relationship between school tours and tolerance and would just be less parsimonious. In addition, the consistent empirical result and our theoretical expectation that these items measure the same construct outweigh our concerns about a weaker than normal Cronbach’s Alpha.
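One common way to standardize and combine items into a scale is to convert each item to z-scores and average them for each student. The text does not say exactly how the items were combined, so the averaging step below is an assumption, and the function names and toy responses are hypothetical.

```python
from statistics import mean, pstdev


def zscores(values):
    """Standardize one item to mean 0 and standard deviation 1."""
    m = mean(values)
    s = pstdev(values)
    if s == 0:
        raise ValueError("item has no variation")
    return [(v - m) / s for v in values]


def build_scale(items):
    """Average standardized items into one score per student.

    items: list of equal-length lists, one list of responses per item.
    Averaging (rather than summing) is an assumption made for this sketch.
    """
    standardized = [zscores(item) for item in items]
    return [mean(student) for student in zip(*standardized)]


# Hypothetical responses from four students to two survey items (made up).
item_a = [1.0, 2.0, 3.0, 4.0]
item_b = [2.0, 2.0, 4.0, 4.0]
scale = build_scale([item_a, item_b])
print([round(score, 3) for score in scale])
```

Standardizing first keeps an item with a wider response range from dominating the combined score.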

Appendix Table 6: Cronbach’s Alpha for Outcome Scales

Scale	Number of Items	Cronbach’s Alpha
Tolerance	4	0.40
Historical Empathy	3	0.65
Cultural Consumer	8	0.90
