Reanalysis of the Miguel and Kremer deworming experiment

Introduction

A series of studies examining a deworming program in Kenya - starting with Miguel and Kremer 2004 - argues that deworming has long-term beneficial impacts, via higher school attendance and higher earnings. We have discussed these studies in some depth before, first here, and then more recently here. Over the past year, we have been seeking to understand these studies as well as we can because:

  • Along with Bleakley 2007, they represent the only evidence we know of that (a) has a plausible strategy for attributing long-term impacts to deworming; (b) argues for major measurable benefits for deworming.
  • We had reservations about these studies, and believed that a reanalysis of the raw data would either strengthen or weaken our reservations significantly.

Fortunately, the authors of Miguel and Kremer 2004 and Baird et al. 2012 have been kind enough to share the data and code from the original paper and the followup that we consider most significant, along with the survey instruments and part of the NIH grant application from the follow-up. This is especially generous on their part because Baird et al. 2012 is still a working paper.

Over the past few months, we have been looking closely at the data from these studies and running some analyses of our own. This work has increased our confidence in the findings of Miguel and Kremer 2004 and Baird et al. 2012, while also clarifying some of our remaining concerns. We believe that these studies are among the best examples we've seen of the potential value of pre-registration, and also among the leading candidates we have seen for replication.

In the remainder of this page, we describe the reservations that we were attempting to address and the steps that we took to do so.

Our initial reservations

We decided to seek out the raw data and code from these two papers because of the following reservations about their findings:

  1. Data mining: the following factors in these studies led us to have reservations about potential data mining:

    • The definition of the treatment group changed across different studies. The underlying experiment gradually phased in treatment across three different groups. In Miguel and Kremer 2004, the main comparisons are between Group 1 and Group 2. (Group 3 did not begin receiving treatment during the window of the study.) In Baird et al. 2012, Group 1 and Group 2 together are the treatment group and Group 3 is the control, and in Baird 2007 (Baird's unpublished dissertation), the number of years a student was assigned to receive deworming is considered the treatment variable.
    • Baird et al. 2012 covers many outcomes, but only finds statistically significant improvements on some of them. Of the 40 main effects (i.e. not counting gender breakdowns) included in the key tables, 20 are statistically significant at the 10% level. This is substantially more than would be expected by chance, but it also struck us that the authors might have had some latitude in defining outcomes and choosing which ones to include in the paper.
    • There does not appear to be a clear pattern of significance in the externality terms. One of the most striking results from Miguel and Kremer 2004 was the importance of externality effects in health and school attendance outcomes (i.e. many of the benefits of treatment accrued to individuals who were not treated), but the pattern of externality effects in Baird et al. 2012 was less pronounced. This led us to wonder how likely the initially observed externality results would be to occur by mere chance.

    These opportunities for data mining are concerning because they increase the chances of a “false positive”: the more different ways there are to define the results of a study, the more likely some effects will appear to be statistically significant, even if there is no real underlying effect. We have written previously about why we find the potential for data mining concerning.

  2. External validity: Rates of heavy infection - in particular, rates of heavy schistosomiasis infection - were abnormally high in the experiment. This was due both to the location in which it was conducted and to the timing of the El Nino weather system. We wondered whether these results would generalize to other settings with more typical infection rates.

While issues with data mining and external validity are the main ones that concerned us, the headline finding of Baird et al. 2012, that receiving an extra two years of deworming in elementary school leads men in adulthood to work 3.4 more hours each week and earn 25% more income, has continually struck us as counterintuitive. (Our prior is that such a strong effect would be unlikely for an intervention of this type.) Accordingly, we also tried to ascertain whether:

  • other treatments or events might explain the observed improvements in the treatment group
  • there were clear outliers that might be driving the observed effects.

Below, we describe the different analyses we undertook to address these issues. Throughout the writeup, we pay particular attention to the outcomes we considered most important prior to this investigation, as laid out at our writeup on deworming:

Baird et al. 2011 compared the first two groups of schools to receive deworming (as treatment group) to the final group (as control); the treatment group received 2.41 extra years of treatment on average. The study's headline effect is that those in the treatment group worked, and earned, substantially more, driven largely by a shift into the manufacturing sector. It also found a positive impact on meals consumed though not on overall consumption, small but non-statistically significant gains in school performance (though not on IQ), and gains on self-reported health though not on height or weight-for-height (and the treatment group had higher health expenditures).

Data mining

Treatment group definition

We re-ran the code for Baird et al. 2012 with the two other definitions of the treatment group that had been used in previous work on this experiment (Group 1 only and the number of years assigned to deworming). We find largely the same results regardless of how treatment is defined, though coefficients change, in some cases significantly. This spreadsheet (XLS) contains the results from Tables 2, 3, 5, and 6 of the most recent draft of Baird et al. 2012 (PDF), under both the authors' definition of the treatment group (columns D-F) and our alternative specifications (columns H-N). (Results that are highlighted in light blue are statistically significant at the 10% level; results in darker blue are statistically significant at the 5% level; the miscarriage results are excluded because they were run in a different regression model.)1 These results increase our confidence in the main results of Baird et al. 2012, by showing that they are reasonably robust to plausible alternative specifications of the treatment group.

The number of years assigned to deworming struck us intuitively as the most logical way to define treatment, and that is what we focus on in the remainder of this analysis.

Multiple outcomes

Although there are statistical formulas for adjusting results to take into account multiple comparisons, the process for determining what should count as a “comparison” is not straightforward. As we noted above, 20 out of 40 main effects included in the main tables of Baird et al. 2012 are statistically significant at the 10% level.2 However, as we wrote above, we find “the number of years assigned to deworming” the most intuitive definition of the treatment group; under that definition, 17/40 outcomes from the key tables were statistically significant.3
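As one illustration of the kind of formula referred to above, the sketch below applies the Holm-Bonferroni step-down adjustment to a list of p-values. The choice of Holm's method is ours for illustration, not something the authors used, and the p-values are made up rather than taken from Baird et al. 2012.

```python
def holm_adjust(p_values):
    """Return Holm-Bonferroni adjusted p-values in the original order."""
    m = len(p_values)
    # Indices sorted from the smallest to the largest p-value.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, idx in enumerate(order):
        # The smallest p-value is multiplied by m, the next by m - 1, etc.
        candidate = min(1.0, (m - rank) * p_values[idx])
        # Enforce monotonicity so adjusted values never decrease with rank.
        running_max = max(running_max, candidate)
        adjusted[idx] = running_max
    return adjusted


# Hypothetical p-values for four outcomes (illustrative only).
example_p = [0.01, 0.04, 0.03, 0.20]
adjusted_p = holm_adjust(example_p)
significant_at_10 = [p < 0.10 for p in adjusted_p]
print(adjusted_p, significant_at_10)
```

Note that the adjustment depends directly on m, the number of tests counted, which is exactly the quantity that is hard to pin down in this case.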

Some of the statistically significant outcomes were clearly selected from a larger universe of data. For instance, the tables include information about statistically significant changes in employment in manufacturing and casual labor, but not in the 18 other industry categories for which the authors collected and shared data with us. To try to adjust for some of the selection process between analysis and the creation of tables, we did a straightforward test in which we ran every outcome from the dataset shared with us (except miscarriages, which were formatted differently) against the main set of controls and our preferred treatment definition (years assigned to deworming). This resulted in 24/86 estimates that were statistically significant at p<.10 and 14/86 estimates that were statistically significant at p<.05 (~3x more than expected due to random chance in both cases).4
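The "expected due to random chance" comparison above can be sketched as follows. This is a rough check under the assumption that the 86 tests are independent, which (as discussed below) they are not, so the tail probability should be read as illustrative rather than as a formal test. The function names are ours.

```python
from math import comb


def expected_false_positives(n_tests, alpha):
    """Expected number of significant results if no effect is real."""
    return n_tests * alpha


def binomial_tail(n_tests, k_observed, alpha):
    """P(X >= k_observed) for X ~ Binomial(n_tests, alpha)."""
    return sum(
        comb(n_tests, k) * alpha**k * (1 - alpha) ** (n_tests - k)
        for k in range(k_observed, n_tests + 1)
    )


n = 86  # outcomes run against the main controls and years assigned
for alpha, observed in [(0.10, 24), (0.05, 14)]:
    expected = expected_false_positives(n, alpha)
    ratio = observed / expected
    tail = binomial_tail(n, observed, alpha)
    print(
        f"alpha={alpha}: expected {expected:.1f}, observed {observed}, "
        f"ratio {ratio:.1f}x, independent-tests tail probability {tail:.2e}"
    )
```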

However, this strategy did not fully solve the problem of selection from a larger universe of data: because of the different ways variables could be constructed, some underlying variables continued to receive more weight than others. Employment in manufacturing and casual labor were each counted twice, for instance, because the authors created a second variable that was defined for each one over the whole population, while the data for employment in each of the 20 industries examined was only defined for the subpopulation that had a job. Similarly, there were multiple statistically significant variables covering different definitions of hours worked over different periods and in different subpopulations (e.g. cells E71, E77, and E86 of this spreadsheet).

Looking more closely at results that appear to show an effect makes sense if you're writing a paper, but it makes it difficult for us to determine how likely observed results are to have occurred by chance. In addition, some of the norms of academic publishing (e.g. length, “interestingness”) cause less than the full universe of studied results to be reported.

Accordingly, we asked the authors to share the underlying survey that was used to generate the data that they shared with us, and they did so, also sending us a portion of the NIH grant application used to fund the research and a previous draft of the Baird et al. 2012 working paper.

In the NIH grant application they shared, the authors write:

Some of the specific research questions to be examined in the proposed study include:

1.) Over ten years, do child health and nutritional gains from deworming:
a) Improve school attendance, test scores, and educational attainment?
b) Improve adult labor market outcomes, including wages and formal sector participation?
c) Increase entrepreneurial activity, change cropping patterns (for farmers), or affect individual participation in community self-help groups (e.g., informal credit groups)?
d) Improve adult cognitive performance, overall happiness and mental well-being?
e) Increase adult height, weight, self-reported health, or physical strength?
f) Affect the likelihood of migration to cities in search of employment?
g) Alter stated reproductive and marital intentions, such as intended number or timing of children, or planned age at marriage?
h) Alter reproductive and marital outcomes, including timing of first marriage and first birth, and spouse characteristics?
i) Improve the health outcomes of program beneficiaries’ children (e.g., infant mortality)? In other words, how persistent are health gains across generations?

This is the closest we believe we can come to a pre-registered plan for analyzing the results. We take this to show that there were a number of plausible headline outcomes that the authors surveyed, analyzed, and did not find statistically significant (including test scores, educational attainment, formal sector participation, entrepreneurial activity, cognitive performance, overall happiness, mental health, height or weight). In addition, some of the headline results that the authors did find, such as effects on meals eaten and hours worked, were not explicitly included in the list above.

Further complicating matters is the fact that there are many possible ways of constructing measures to get at each of the concepts above. In the dataset shared with us, there are eight different wage/salary indicators, along with another four measures combining wages and self-employment income, all of which might fit under the rubric of “wages,” above. There were also several indicators each for formal employment and hours worked, and another dozen for self-employment outcomes (including both profits and things like number of employees). (The estimated impacts on these indicators are included in this spreadsheet. Indicators that are in some way related to income - wages/salary, self-employment earnings, or a combination - are in rows 161-199, 209-214 and 224-232; of 18 total indicators, 5 are significant at p<0.1 and 3 are significant at p<0.05.) The authors stated to us that labor market earnings, which accounts for several of the positive and statistically significant treatment effects described above, is the "canonical" measure for the field of labor economics.

Even if we knew the number of different specifications run, we believe that different indicators for the same outcome are likely to be correlated, but imperfectly and unpredictably, making effective statistical adjustments for multiple tests difficult.

We view this as an especially compelling case in which pre-registration, particularly pre-registration at the level of specific indicators tied to individual survey questions, would have substantially increased our confidence in a study. This is not to say that the authors have done anything unusual or wrong; our understanding is that preregistration was extremely rare (perhaps nonexistent) in economics at the time the study was done. In fact, Ted Miguel, one of the authors of the study, has gone on to co-author one of the first studies we've seen in this field that does utilize (and discuss) preregistration. Our point here is just that the practice of preregistration carries substantial credence benefits with us as consumers of research, and would affect our qualitative assessment of these findings.

Cross-school externality effects

In Miguel and Kremer 2004, much of the focus of the paper is on health and attendance externality effects across schools. In Baird et al. 2012, conversely, there does not seem to be a clear pattern of externality effects, and the externality terms are only significant at the 10% level in 9 of the 40 outcomes (see Column X of this spreadsheet; the miscarriages result is not included in the spreadsheet but is included in the 9 statistically significant outcomes). In addition, Baird et al. 2012 did not follow Miguel and Kremer 2004 in looking at externalities at a distance of both 0-3 and 3-6 km, instead collapsing the two into a single externality covering 0-6 km because they did not find meaningful differences between the two.5 (The authors have since told us that a later draft of the follow-up paper contains new analysis strengthening the case for cross-school externalities, by showing that although the externality terms are not consistently statistically significant, the level of correlation of the externality term signs and t-values with the main treatment effect signs and t-values is extremely unlikely to have occurred by chance.)

These divergences led us to want to look more closely at the externality effects in Miguel and Kremer 2004. The key tables with respect to cross-school externalities in Miguel and Kremer are Table 7, Table 9, and Table A3. Table 7 shows statistically significant externalities of deworming treatment within 3km and between 3 and 6 km for moderate-heavy schistosomiasis infections and for any moderate-heavy infections (including both schistosomiasis and soil-transmitted helminths), and an effect within 3km for moderate-heavy soil-transmitted helminth infections. Table A3 extends these results to an alternative formulation of the externality term (using a ratio of pupils treated rather than the absolute number), finding the same substantive result, and Table 9 shows a positive treatment externality within 3km (but not between 3 and 6 km) on school attendance.

We decided to test whether the externality effects from Miguel and Kremer were robust to alternative specifications. In particular, for each of the four outcomes (schistosomiasis infections, soil-transmitted helminth infections, any helminth infections [i.e. combining both schistosomiasis and STHs], and school attendance), we ran 52 specifications, covering 26 ranges of distances (e.g. within 1km, within 2km, between 3 and 6km) for both the absolute number and proportional approach to each externality. The externality effects on schistosomiasis infections and attendance appear to be relatively robust, while the externality effect on geohelminth infections does not. Out of 52 random uncorrelated outcomes we would expect 5.2 to have p<.1 and 2.6 to have p<.05. We found:

  • 35 have p<.1 and 33 have p<.05 in moderate-heavy schistosomiasis infections
  • 22 have p<.1 and 10 have p<.05 in any moderate-heavy infection
  • 6 have p<.1 and 4 have p<.05 in moderate-heavy geohelminth infections; and
  • 22 have p<.1 and 13 have p<.05 in attendance.
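The comparison between these counts and the chance benchmark can be tabulated as in the sketch below. The counts are the ones reported above; the ratios are purely descriptive, since the 52 specifications use overlapping distance ranges and are therefore not independent tests.

```python
N_SPECS = 52  # 26 distance ranges x 2 externality definitions

# Counts reported in the text: (number with p<.1, number with p<.05).
observed = {
    "moderate-heavy schistosomiasis": (35, 33),
    "any moderate-heavy infection": (22, 10),
    "moderate-heavy geohelminth": (6, 4),
    "attendance": (22, 13),
}

# Benchmark counts if all 52 results were random and uncorrelated.
expected_10 = N_SPECS * 0.10
expected_05 = N_SPECS * 0.05

ratios = {}
for outcome, (n10, n05) in observed.items():
    r10, r05 = n10 / expected_10, n05 / expected_05
    ratios[outcome] = (r10, r05)
    print(
        f"{outcome}: {n10}/{N_SPECS} at p<.1 ({r10:.1f}x expected), "
        f"{n05}/{N_SPECS} at p<.05 ({r05:.1f}x expected)"
    )
```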

These results are summarized in this spreadsheet. We only tested combinations of distances up to 6km, but the possibility that the authors tested combinations above that level would increase the potential for data-mining. The specifications using absolute numbers treated generally found more statistically significant results than the specifications using proportions, but there was no radical difference.

These results are encouraging in that they show that the externality terms for schistosomiasis infections and school attendance are robust to some other specifications, though not to every other plausible specification.

External validity

The area in which the deworming experiment discussed in these papers was conducted had unusually high infection rates; in fact, infection rates rose substantially over the course of the study, as shown in Tables II and V of Miguel and Kremer 2004:

Measure                                        Year 1 prevalence   Year 2 prevalence
Moderate to heavy schistosomiasis infection    7%                  18%
Moderate to heavy hookworm infection           15%                 22%
Moderate to heavy roundworm infection          16%                 24%
Moderate to heavy whipworm infection           10%                 17%

The above table likely substantially understates the degree of change, because the second-year figure includes the benefits of treatment externalities experienced by the control group (discussed above). A calculation sent to us by the authors implied that the 18% prevalence of moderate to heavy schistosomiasis infection in the control group in year 2 shown above should be augmented by the 22 percentage point externality effect of treatment to get a genuine counterfactual infection rate of 40% - despite the fact that the initial prevalence was (as shown above) only 7%. This implies that without the program, the area would have seen an extreme rise in prevalence of moderate-to-heavy schistosomiasis infections. A footnote in Miguel and Kremer 2004 attributes this phenomenon to "the extraordinary flooding in 1998 associated with the El Niño weather system, which increased exposure to infected fresh water (note the especially large increases in moderate-to-heavy schistosomiasis infections), created moist conditions favorable for geohelminth larvae, and led to the overflow of latrines, incidentally also creating a major outbreak of fecal-borne cholera" (Pg 174).
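The arithmetic behind the counterfactual figure is simple enough to write out. The sketch below uses only the numbers quoted above; the fold-increase line is our own derived illustration, not a figure from the authors.

```python
control_year2 = 0.18    # observed year 2 prevalence in the control group
externality_pp = 0.22   # externality effect of treatment, in percentage points
baseline_year1 = 0.07   # year 1 prevalence

# Add back the reduction that control schools received via externalities.
counterfactual = control_year2 + externality_pp

# Implied rise relative to year 1 had there been no program (our derivation).
fold_increase = counterfactual / baseline_year1

print(f"counterfactual prevalence: {counterfactual:.0%}")
print(f"implied rise over baseline: {fold_increase:.1f}x")
```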

Because of this unusual situation, we worry that the results of studies from this place and time may not generalize well to other circumstances in which rates are at lower, more typical levels. To follow up on this concern, we followed Baird et al. 2012 in running some specifications to determine whether treatment effects were stronger in areas with higher initial worm prevalence, and also simply ran the main models from the paper excluding the schools that received treatment for schistosomiasis (which seems to have been particularly affected by the unusual conditions, on the basis of the figures above). The results did not affect our confidence much in either direction. We saw some weak evidence that schistosomiasis, as opposed to the soil-transmitted helminths, might be the key factor.

To test whether treatment effects were stronger in areas with higher initial worm prevalence, we ran a specification that included baseline worm infection rates (at the zonal level) and an interaction term between baseline rates and treatment against all of the outcome data that was shared with us. These results are summarized in columns G-I of this spreadsheet. The interaction terms for treatment with baseline worm infection (combining STHs and schistosomiasis) are statistically significant at the 10% level in 27 out of the 86 outcomes examined (some redundantly, as discussed above), about three times as many as would be expected by chance, and more of the schistosomiasis-specific interaction terms are significant. However, there is no clear pattern between the interaction terms and the main effects; many of the interaction terms are significant in cases where the main effects are not, and vice versa.
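The shape of the interaction specification described above can be sketched as follows. This is a minimal illustration on synthetic, noise-free data, not the code we ran: it omits the main set of controls and any standard error calculation, and all variable and function names are hypothetical. It fits outcome = a + b1*years + b2*baseline + b3*(years x baseline), where b3 is the interaction term of interest.

```python
def solve_linear_system(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    # Augmented matrix so row operations apply to both sides at once.
    aug = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    # Back substitution.
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        tail = sum(aug[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (aug[i][n] - tail) / aug[i][i]
    return x


def ols(rows, y):
    """Ordinary least squares via the normal equations (X'X) b = X'y."""
    k = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * yi for r, yi in zip(rows, y)) for i in range(k)]
    return solve_linear_system(xtx, xty)


# Synthetic data (illustrative only), generated without noise from
# outcome = 1.0 + 0.5*years + 2.0*baseline + 3.0*years*baseline
design, outcome = [], []
for years in (0, 1, 2, 3):
    for baseline in (0.1, 0.3, 0.5, 0.7):
        design.append([1.0, years, baseline, years * baseline])
        outcome.append(1.0 + 0.5 * years + 2.0 * baseline + 3.0 * years * baseline)

intercept, b_years, b_baseline, b_interaction = ols(design, outcome)
print(f"interaction coefficient: {b_interaction:.3f}")
```

A positive interaction coefficient in such a model would indicate a larger estimated treatment effect where baseline prevalence was higher.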

In addition to running the specifications including interaction terms with baseline prevalence, we also ran the regressions from the main tables in Baird et al. 2012, simply excluding all the schools that received praziquantel (which is used to treat schistosomiasis and was used in this case only in schools with relatively high prevalence of schistosomiasis). These results are summarized in columns P-R of this spreadsheet.

Using this approach, we find fewer statistically significant outcomes: 12 out of 40 at the 0.1 level, and 8 out of 40 at the 0.05 level (including the effect on miscarriages, which is not in the spreadsheet). (Most of the results are unchanged, though the estimated impacts on income amongst those employed are reduced by about half and cease to be statistically significant.) This is still more than would be expected by chance, and it's important to keep in mind that excluding the schools we excluded reduces the total sample size of the study, making it more difficult for any given effect to be statistically significant.

Other potential issues

Alternative hypotheses

We considered a number of alternative hypotheses that might explain the observed results from this experiment. The one we consider most plausible, especially with respect to the school attendance effects, is that efforts to encourage students to attend school in order to receive treatment might have bled over to later days, increasing attendance in treatment schools over the following years. The particular piece of data that led us to examine this possibility is that the entire attendance effects of treatment appear to be at the school level, with no differential effect for students who actually received treatment: "Within-school participation externality benefits were positive and statistically significant at 99 percent confidence (5.6 percentage points) for untreated pupils in the treatment schools in the first year of the program (regression 5), and there is no significant difference in school participation rates between treated and untreated pupils in these schools” (Miguel and Kremer 2004 pg 196). The unpublished data usage guide shared by the authors says, "treatment dates were announced at the school in advance in an attempt to boost take-up," and the authors said in conversation that some efforts were undertaken to boost attendance on drug distribution days.6

Another possibility we have considered is that parents/students may have sought to switch into the schools that "won" early deworming treatments, which could cause the treatment and control groups to differ in ways not picked up by the baseline data measured by the studies. On this point, the authors stated to us that the deworming program was announced during the same term (term 1 of 1998) in which student names were collected and assigned to treatment/control groups. They also observed that transfers after treatments were announced were relatively rare and didn't seem to systematically favor treatment over control schools (this is seen in Table IV of Miguel and Kremer 2004), making it less likely that students would have asymmetrically transferred into treatment schools prior to treatment actually being conducted. We now consider this concern far less likely.

Outliers

We conducted a visual examination of the data, looking for outliers at the school level, using these charts (big PDF). We collapsed outcome data for all of the outcomes shared with us by school, and then charted them by treatment group (randomization was at the school level and treatment was rolled out to Group 1 in the first year, Group 2 in the second year, and Group 3 in the fourth year). For each outcome shared with us, there are two charts: the "Outcome" chart, which is just the mean value of the outcome by school, and the "Residual" chart, which takes the mean residual by school of a regression of the outcome on the main set of controls (but not including any treatment indicator). The "Residual" chart basically uses the regression to soak up any variance explained by known factors other than treatment; in more technical terms, it represents the difference between the observed and the predicted values (without taking into account treatment status). A more positive residual in the treatment groups would imply that treatment caused a positive effect on the outcome (see the first two pages of the PDF, which are a made-up example in which deworming has a large positive impact on the outcome in question).
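The construction of the "Residual" chart values can be sketched as below. This is a simplified illustration with a single hypothetical control variable and made-up records, not our actual code: the real regressions used the main set of controls. The outcome is regressed on the control without any treatment indicator, and the residuals are then averaged by school.

```python
from collections import defaultdict
from statistics import mean


def fit_simple_regression(x, y):
    """Return (intercept, slope) of a one-variable least squares fit."""
    x_bar, y_bar = mean(x), mean(y)
    covariance = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    variance = sum((xi - x_bar) ** 2 for xi in x)
    slope = covariance / variance
    return y_bar - slope * x_bar, slope


def mean_residual_by_school(records):
    """records: list of (school_id, control, outcome) tuples."""
    controls = [c for _, c, _ in records]
    outcomes = [o for _, _, o in records]
    # Treatment status is deliberately left out of the regression.
    intercept, slope = fit_simple_regression(controls, outcomes)
    by_school = defaultdict(list)
    for school, control, outcome in records:
        by_school[school].append(outcome - (intercept + slope * control))
    return {school: mean(values) for school, values in by_school.items()}


# Made-up records: pupils in school "T" score 1.0 higher than pupils in
# school "C" with the same control value.
records = [
    ("C", 1.0, 2.0), ("C", 2.0, 4.0), ("C", 3.0, 6.0),
    ("T", 1.0, 3.0), ("T", 2.0, 5.0), ("T", 3.0, 7.0),
]
residuals = mean_residual_by_school(records)
print(residuals)
```

In this toy example the gap between the two schools' mean residuals equals the 1.0 difference that the control variable cannot explain.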

Visual inspection of the charts did not raise any major issues for us, but we noted that five schools in particular appeared to have especially high manufacturing employment (which the authors suggest drives the income effect). To try to assess whether those schools might be driving the income effect, we ran the key tables from Baird et al. 2012 on a dataset that excludes the five schools with the largest manufacturing employment (results are in column T of this spreadsheet). The effect on manufacturing employment, along with nearly all of the other positive outcomes, continues to be positive and statistically significant (though one of the income effects does drop out of significance, without a substantial change in magnitude).

Based on these visual inspection results, it did not appear to us that outlier schools were driving any key results from Baird et al. 2012.

Sources

  • 1.

    The miscarriage regressions were run in a probit model, rather than the ordinary least squares models used for the other outcomes. The probit model does not work with the software used to export regression to Excel, so the miscarriage regression results are not included in the linked Excel files; they are described below and are saved in log files.

  • 2.

    See column D in this spreadsheet. The spreadsheet double-counts “total hours worked” because it appears in two different tables in Baird et al. 2012, and excludes the miscarriages outcome because it was run in a model that wasn't compatible with the software that exports to Excel. Fixing both of those issues returns 20 out of 40 statistically significant results.

  • 3.

    The spreadsheet double-counts “total hours worked” because it appears in two different tables in Baird et al. 2012. Counting it twice and excluding the miscarriages result, the spreadsheet shows 17 out of 40 results as statistically significant at the 10% level. Correcting for the double-counting, this goes to 16 out of 39, and then adding the miscarriage result back in (from the unpublished log file) gets 17 out of 40.

  • 4.

    See Column E of this spreadsheet for the results. Note that some of the results that are included in the tables in Baird et al. 2012 (and therefore in our summary spreadsheet) are for subgroups, whereas this spreadsheet includes only results for the entire population for which a given outcome is defined.

  • 5.

    “Miguel and Kremer (2004) also separately estimate effects of the number of pupils between 0-3 km and 3-6 km, but since the analysis in the current paper does not generally find significant differences in impacts across these two ranges, we focus on 0-6 km for simplicity.” Baird et al. 2012 pgs 18-19.

  • 6.

    Nekesa 2007, pg 14.
