This page discusses the nature and extent of two common problems we see with formal evaluations: selection bias and publication bias.
We believe that these problems tend to skew evaluations of non-profits in the positive direction and that these problems are partially mitigated by the use of the randomized controlled trial (RCT) methodology (and can be mitigated by other techniques as well).
Below, we discuss both selection bias and publication bias: what they are, what sort of skew they are likely to bring about, why randomized controlled trials may suffer less from these issues, and what evidence is available on the extent and nature of these issues.
We then note two highly touted and studied interventions - microlending and Head Start - for which initial studies (highly prone to selection bias and publication bias) gave a much more positive picture than later results from randomized controlled trials. We see these cases as suggestive evidence for the view that lower-quality studies tend to give an exaggerated case for optimism about effectiveness.
Studies of social programs commonly compare people who participated in the program to people who did not, with the implication being that any differences are caused by the program. (Some studies report only on improvements or good performance among participants, but even in these cases there is often an implicit comparison to non-participants - for example, an implicit presumption that non-participants would not have shown improvement on the reported-on measures.)
However, program participants are different from non-participants, by the very fact of their participation. An optional after-school tutoring program may disproportionately attract students/families who place a high priority on education (so its participants will have better reading scores, graduation rates, etc. than non-participants even if the program itself has no effect); a microlending program may disproportionately attract people who have higher incomes to begin with; etc.
Selection bias may skew a study in a positive or negative direction. Say that an after-school program targets struggling schools; in this case, comparing its participants to "average" students all across the city may be overly unfavorable to the tutoring program (since its students likely do not score as well as students in better-off schools), but comparing its participants to other students at the same schools may be overly favorable (since, as discussed above, its participants may tend to place higher priority on education).
One of the reasons we are concerned about selection bias is that it gives researchers substantial room for judgment calls in their choice of comparison group. When it comes to studies on non-profits' impacts, we believe that researchers generally prefer to present the programs in a positive light, and thus tend to choose comparisons that favor the programs (more on this below under "Publication bias"). Thus, we feel that selection bias is generally likely to skew apparent results in favor of non-profits' programs.
Certain study designs are much less vulnerable to selection bias than others. A randomized evaluation,1 also known as a randomized controlled trial, generally avoids the problem of selection bias by using random assignment to assign some people and not others to a program; then people who were "lotteried in" (randomly assigned) to the program are tracked and compared to people who were "lotteried out." Intuitively speaking, this methodology seems to significantly reduce the risk that there will be any systematic differences between program participants and non-participants, other than whether they participated in the program.
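To make the contrast concrete, here is a minimal simulation sketch. Everything in it is a made-up assumption for illustration (the `simulate` function, the "motivation" variable, the enrollment probabilities, and the score formula); it is not drawn from any study discussed on this page. The program is given zero true effect, so any gap between groups is produced by who ends up in each group.

```python
import random
import statistics


def simulate(n=20000, seed=0):
    """Compare a self-selected comparison with a lottery, under zero true effect."""
    rng = random.Random(seed)
    self_selected, non_participants = [], []
    lotteried_in, lotteried_out = [], []
    for _ in range(n):
        # Hypothetical model: scores depend on family motivation, not on the program.
        motivation = rng.gauss(0, 1)
        score = 50 + 10 * motivation + rng.gauss(0, 5)
        # Self-selection: more motivated families are more likely to enroll.
        enrolls = rng.random() < (0.8 if motivation > 0 else 0.2)
        (self_selected if enrolls else non_participants).append(score)
        # Random assignment: a coin flip that ignores motivation.
        (lotteried_in if rng.random() < 0.5 else lotteried_out).append(score)
    naive_gap = statistics.mean(self_selected) - statistics.mean(non_participants)
    randomized_gap = statistics.mean(lotteried_in) - statistics.mean(lotteried_out)
    return naive_gap, randomized_gap


if __name__ == "__main__":
    naive_gap, randomized_gap = simulate()
    print(f"participants vs. non-participants: {naive_gap:.2f}")
    print(f"lotteried in vs. lotteried out:    {randomized_gap:.2f}")
```

Under these assumptions the self-selected comparison shows a sizable gap even though the program does nothing, while the lottery comparison stays close to zero. The size of the gap depends entirely on the invented parameters.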
There are other ways of addressing the problem of selection bias, fully or partially. Speaking broadly, we feel that randomization is the single most reliable indicator that a study's findings can be interpreted without fear of selection bias, and most of the studies we refer to as "high-quality" involve randomization. However, there are studies we consider "high-quality" that do not involve randomization, such as the impact evaluation of VillageReach's pilot project.
Peikes, Moreno, and Orzol (2008) evaluated the impact of the US State Partnership Initiative employment promotion program, using two methods: (a) a randomized controlled trial, with very low vulnerability to selection bias (see discussion above regarding randomization); (b) propensity-score matching, a relatively popular method for attempting to simulate a comparison between program participants and identical non-participants without the benefit of randomization (using available observable characteristics of participants and non-participants).2 Despite "seemingly ideal circumstances" for method (b),3 the two methods produced meaningfully different results: in two of the three locations, method (b) implied large, positive, statistically significant impacts of the program on earnings, while method (a) implied negative, non-statistically significant impacts of the program on earnings.4 The authors concluded:5
In this case, the attempt to compare program participants to similar non-participants using observable characteristics (which is what method (b) relied on) implied that the participants earned much more than non-participants; however, comparing lotteried-in to lotteried-out people showed no such thing. This implies that there were unobservable ways in which participants differed from non-participants, ways that were significant enough to create the illusion of a strong program effect.
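The mechanism described above can be sketched with a toy simulation. This is not a reconstruction of the Peikes, Moreno, and Orzol analysis: it uses exact matching on a single observed covariate as a simplified stand-in for propensity-score matching, and every name and number (`compare_methods`, "education," "motivation," the earnings formula) is a hypothetical assumption. The program again has zero true effect on earnings.

```python
import random
from statistics import mean


def compare_methods(n=30000, seed=1):
    """Return (naive, matched, randomized) earnings gaps under zero true effect."""
    rng = random.Random(seed)
    people = []
    for _ in range(n):
        education = rng.choice([0, 1, 2])   # observable to the researcher
        motivation = rng.gauss(0, 1)        # NOT observable to the researcher
        earnings = 1000 * education + 2000 * motivation + rng.gauss(0, 1000)
        # Joining depends on both the observable and the unobservable trait.
        p_join = 0.2 + 0.1 * education + (0.3 if motivation > 0 else 0.0)
        joined = rng.random() < p_join
        lottery = rng.random() < 0.5        # random assignment, for comparison
        people.append((education, earnings, joined, lottery))

    def gap(rows, flag):
        treated = [r[1] for r in rows if r[flag]]
        control = [r[1] for r in rows if not r[flag]]
        return mean(treated) - mean(control)

    naive = gap(people, 2)

    # Match on the observable: compare within each education level, then
    # average the within-level gaps, weighted by the number of participants.
    total, weight = 0.0, 0
    for level in (0, 1, 2):
        stratum = [r for r in people if r[0] == level]
        k = sum(1 for r in stratum if r[2])
        total += k * gap(stratum, 2)
        weight += k
    matched = total / weight

    randomized = gap(people, 3)
    return naive, matched, randomized


if __name__ == "__main__":
    naive, matched, randomized = compare_methods()
    print(f"naive: {naive:.0f}  matched: {matched:.0f}  randomized: {randomized:.0f}")
```

In this sketch, matching on the observable trait removes only the part of the gap explained by that trait. The part driven by the unobserved trait remains and looks like a program effect, while the lottery comparison does not show it.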
We conducted a search for literature reviews of studies directly comparing the results of randomized and non-randomized estimates of social programs' effects. The most complete and recent literature reviews are summarized here.
Our overall take on these studies is that they (a) focus on the best-designed non-randomized studies; (b) show mixed results, and give substantial reason for concern that non-randomized studies' results can diverge significantly from randomized studies' results.
They do not show that selection bias systematically skews results in one direction or another; they do show that the presence of selection bias introduces a substantial source of skew. We believe that in the case of programs run by non-profits this skew is likely to be positive more often than negative.
Review 1: Glazerman, Levy, and Myers (2003). This review examines twelve studies "in the context of welfare, job training, and employment services programs."6 Each of the studies estimates a program’s impact by using a randomized controlled trial, and separately estimates the impact by using one or more nonrandomized methods.7 Each of the programs aimed to raise earnings.8
Review 2: Bloom, Michalopoulos, and Hill (2005) reviews the question of randomized vs. nonrandomized evaluation in a variety of sectors.
Review 3: Cook, Shadish, and Wong (2008) analyzes twelve within-study comparisons of randomized and nonrandomized methods.19 These twelve comparisons are from ten publications, and span a variety of social programs, mainly from the US. The publications are not included in Glazerman, Levy, and Myers (2003), and only one is included in Bloom, Michalopoulos, and Hill (2005).
Cook, Shadish, and Wong (2008) finds that in two of the comparisons, the nonrandomized method sometimes achieves the same result as the randomized method and sometimes does not; in two other comparisons the nonrandomized methods fail to achieve the same result; and in the other eight comparisons the nonrandomized methods replicate the randomized methods reasonably.20 The review concludes that "the strong but still imperfect correspondence in causal findings reported here contradicts the monolithic pessimism emerging from past reviews of the within-study comparison literature."21 However, the review explicitly concentrates on the "best possible [nonrandomized] design and analysis practice,"22 and also states:
"Publication bias" is a broad term for factors that systematically bias final, published results in the direction that the researchers and publishers (consciously or unconsciously) wish them to point.
Interpreting and presenting data usually involves a substantial degree of judgment on the part of the researcher; consciously or unconsciously, a researcher may present data in the most favorable light for his/her point of view. In addition, studies whose final conclusions aren't what the researcher (or the study funder) hoped for may be less likely to be made public.
As discussed below, the existing literature on publication bias often concludes that studies are skewed toward showing (a) more "surprising" findings; and (b) more "positive" findings (indicating that medical treatments, social policies, etc. "work").
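As a rough illustration of how selective publication can produce a skew toward "positive" findings, the sketch below simulates many studies of an intervention whose true effect is zero and "publishes" only those with a positive, statistically significant estimate. The function names, the range of standard errors, and the 1.96 cutoff are assumptions chosen for illustration. The last step computes the correlation between published estimates and their standard errors, similar in spirit to the reporting-bias check quoted in the notes from Donohue and Wolfers 2006.

```python
import random
from statistics import mean


def simulate_literature(n_studies=5000, seed=2):
    """Simulate studies of a zero-effect program with a publication filter."""
    rng = random.Random(seed)
    all_estimates, published = [], []
    for _ in range(n_studies):
        se = rng.uniform(0.05, 0.5)      # hypothetical: precision varies by study
        estimate = rng.gauss(0.0, se)    # true effect is zero; only noise remains
        all_estimates.append(estimate)
        if estimate / se > 1.96:         # assumed filter: positive and "significant"
            published.append((estimate, se))
    return all_estimates, published


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5


if __name__ == "__main__":
    all_estimates, published = simulate_literature()
    pub_estimates = [e for e, _ in published]
    pub_ses = [s for _, s in published]
    print(f"mean of all estimates:       {mean(all_estimates):.3f}")
    print(f"mean of published estimates: {mean(pub_estimates):.3f}")
    print(f"corr(estimate, std. error):  {pearson(pub_estimates, pub_ses):.2f}")
```

Under these assumptions the full set of estimates averages close to zero, but the published subset averages well above zero, and the least precise published studies report the largest effects.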
We have not identified any studies specifically on publication bias in evaluations of non-profit programs. We would guess, however, that such evaluations are skewed to the optimistic side: the non-profits cooperating in the studies and the funders paying for them have incentives to portray their work in a positive light, and we know of no study funders or implementers with incentives to skew results in the pessimistic direction.
We are less concerned about publication bias in studies that have the following qualities, in descending order of importance:
We have not seen systematic investigations of the hypotheses laid out above.
We have not yet conducted a systematic review of literature on publication bias, but we have come across several studies on the subject.
Medicine. Hopewell et al. (2009) reviewed five studies examining patterns in which clinical trials did and didn't have their results published in medical literature:
Ioannidis (2005a and 2005b) explored the magnitude of the problem and concluded that from both a theoretical and empirical perspective, there is reason to be skeptical of many (even most) of the conclusions published in medical literature.25 These studies also provide some loose arguments that studies with less flexibility, particularly randomized controlled trials, are likely to be less susceptible to these issues.26
Economics. De Long and Lang (1992) gives some evidence for a broad form of publication bias in the field of economics. It examines published papers that fail to reject their central "null hypothesis" (the "null hypothesis" generally referring to a "general or default position, such as that there is no relationship between two measured phenomena"27) and finds that an aggregate analysis of these papers' results suggests that the individual results are erroneous - i.e., most or all of the central "null hypotheses" that the papers fail to reject are in fact false. It concludes that the best explanation for this phenomenon involves publication bias: papers that fail to reject their central "null hypothesis" are not published without prejudice, but rather published largely (or only) when that failure to reject is "exciting."28
Publication bias in more narrow topics.
If publication bias is a real and significant problem, this could be expected to imply that studies of social programs will tend to exaggerate the programs' impact - especially studies that are prone to selection bias and otherwise leave significant room for judgment calls on the part of the researchers. This idea is similar to Rossi's (1987) "Stainless Steel Law of Evaluation," which is the proposition that:32
This law means that the more technically rigorous the net impact assessment, the more likely are its results to be zero—or no effect. Specifically, this law implies that estimating net impacts through randomized controlled experiments, the avowedly best approach to estimating net impacts, is more likely to show zero effects than other less rigorous approaches.
We have encountered no formal empirical study of this "law."33 We believe it to be valid based partly on our own experience of reviewing the highest-quality academic literature we can find compared with our experience of reviewing evaluations submitted by non-profits; we intend to document this comparison more systematically in the future.
Here, we discuss two cases that we believe provide suggestive evidence for the above proposition: microlending and Head Start. In both of these cases, we are able to compare a systematic overview of relatively low-quality studies (i.e., highly prone to selection bias, and with substantial room for judgment in their construction) to later evidence from randomized controlled trials. In both of these cases, the earlier, lower-quality research presents a much more optimistic picture than the randomized controlled trials.
Microlending, the practice of making small loans to low-income people (generally in the developing world), was the subject of many impact studies prior to 2005. These studies were collected and discussed in a 2005 literature review.34 This review concluded that the evidence for microfinance's impact was strong, and implied that randomized controlled trials could be expected to demonstrate impact as well.35
However, to date the results from the two randomized controlled trials on microlending have been far less encouraging:
Head Start. A 2001 review examined studies on Head Start, a federal early childhood care program in the U.S., and found overwhelmingly positive, long-term effects on measures including achievement test scores and grade and school completion, while acknowledging the lack of a truly high-quality randomized study.38 In 2010, the first results from a very large, high-quality study became available and were far less encouraging.39
Speaking intuitively, we feel that the combination of selection bias and publication bias will cause most studies of non-profits' programs to exaggerate the case for optimism. We focus on studies that we think are less prone to these two biases, and believe that the randomized controlled trial (RCT) design is one way (though not the only way) of mitigating these issues. We believe that higher-quality studies are likely to give a less positive picture of non-profit effectiveness than lower-quality studies.
For definition see Poverty Action Lab, "Methodology: Overview."
"Over the past 25 years, evaluators of social programs have searched for nonexperimental methods that can substitute effectively for experimental ones. Recently, the spotlight has focused on one method, propensity score matching (PSM), as the suggested approach for evaluating employment and education programs." Peikes, Moreno, and Orzol 2008, Pg 222.
Peikes, Moreno, and Orzol 2008, Pg 222.
Peikes, Moreno, and Orzol 2008, Pgs 222-223.
"To assess nonexperimental (NX) evaluation methods in the context of welfare, job training, and employment services programs, the authors reexamined the results of twelve case studies." Glazerman, Levy, and Myers 2003, Pg 63.
"The authors reexamined the results of twelve case studies intended to replicate impact estimates from an experimental evaluation by using NX methods." Glazerman, Levy, and Myers 2003, Pg 63.
"To be included in the review, a study had to meet the following criteria…. The intervention’s purpose was to raise participants’ earnings. This criterion restricts our focus to programs that provide job training and employment services…. All of the interventions involved job training or employment services” Glazerman, Levy, and Myers 2003, Pg 68.
Glazerman, Levy, and Myers 2003, Pg 74.
"The average of the absolute bias over all studies was more than $1,000, which is about 10 percent of annual earnings for a typical population of disadvantaged workers." Glazerman, Levy, and Myers 2003, Pg 86.
Glazerman, Levy, and Myers 2003, Pg 86.
"Within-Study Comparisons of Impact Estimates for Employment and Training Programs.... Daniel Friedlander and Philip K. Robins (1995) benchmarked nonexperimental methods using data from a series of large-scale random-assignment studies of mandatory welfare-to-work programs in four states." Bloom, Michalopoulos, and Hill 2005, Pg 180, 186; italics in the original.
Friedlander and Robins 1995, Pg 935.
"The remainder of the chapter measures the selection bias resulting from nonexperimental comparison-group methods by benchmarking them against the randomized experiments that made up the National Evaluation of Welfare-to-Work Strategies (NEWWS), a six-state, seven-site evaluation that investigated different program approaches to moving welfare recipients to work." Bloom, Michalopoulos, and Hill 2005, Pg 194.
"With respect to what methods could replace random assignment, we conclude that there are probably none that work well enough in a single replication, because the magnitude of the mismatch bias for any given nonexperimental evaluation can be large. This added error component markedly reduces the likelihood that nonexperimental comparison-group methods could replicate major findings from randomized experiments such as NEWWS. Arguably more problematic is the fact that it is not possible to account for mismatch error through statistical tests or confidence intervals when nonexperimental comparison group methods are used.
Our results offer one ray of hope regarding nonexperimental methods. Although nonexperimental mismatch error can be quite large, it varies unpredictably across evaluations and has an apparent grand mean of 0. A nonexperimental evaluation that used several comparison groups might therefore be able to match a randomized experiment's impact estimate and statistical precision. It is important to recognize, however, that this claim rests on an empirical analysis that might not be generalizable to other settings.... It is possible that comparison-group approaches can be used to construct valid counterfactuals for certain types of programs and certain types of data. Considered in conjunction with related research exploring nonexperimental comparison-group methods, however, the findings presented here suggests that such methods, regardless of their technical sophistication, are no substitute for randomized experiments in measuring the impacts of social and education programs. Thus, we believe that before nonexperimental comparison-group approaches can be accepted as the basis for major policy evaluations, their efficacy needs to be demonstrated by those who would rely on them." Bloom, Michalopoulos, and Hill 2005, Pgs 224-225.
"Within-Study Comparisons of Impact Estimates for Education Programs.... The School Dropout Prevention Experiment Roberto Agodini and Mark Dynarski (2004) compared experimental estimates of impacts for dropout prevention programs in eight middle schools and eight high schools with alternative nonexperimental estimates.... Using extensive baseline data, the authors tested propensity-score matching methods, standard OLS regression models, and fixed-effects models.... The Tennessee Class-Size Experiment Elizabeth Ty Wilde and Robinson Hollister (2002) compared experimental and nonexperimental estimates of the impacts on student achievement of reducing class size.... Wilde and Hollister (2002) used propensity-score methods to find matches for the school's program-group students in the pooled sample of control-group students in the other ten schools. The authors also compared experimental impact estimates with nonexperimental estimates obtained from OLS regression methods without propensity-score matching." Bloom, Michalopoulos, and Hill 2005, Pgs 190-191; italics in the original.
"A number of meta-analyses ... have compared summaries of findings based on experimental studies with summaries based on nonexperimental studies.... The most extensive such comparison was a meta-analysis of meta-analyses in which Mark W. Lipsey and David B. Wilson (1993) synthesized earlier research on the effectiveness of psychological, education, and behavior treatments. In part of their analysis, they compared the means and standard deviations of experimental and nonexperimental impact estimates from seventy-four meta-analyses for which findings from both types of studies were available. Representing hundreds of primary studies, this comparison revealed little difference between the mean effect estimated on the basis of experimental studies.... Lipsey and Wilson (1993, 1193) concluded:
'These various comparisons do not indicate that it makes no difference to the validity of treatment effect estimates if a primary study uses random versus nonrandom assignment. What these comparisons do indicate is that there is no strong pattern or bias in the direction of the difference made by lower quality methods.... In some treatment areas, therefore, nonrandom designs (relative to random) tend to strongly underestimate effects, and in others, they tend to strongly overestimate effects.'" Bloom, Michalopoulos, and Hill 2005, Pgs 192-193.
"This paper analyzes 12 recent within-study comparisons contrasting causal estimates from a randomized experiment with those from an observational study sharing the same treatment group." Cook, Shadish, and Wong 2008, Pg 724.
"We identify three studies comparing experiments and regression-discontinuity (RD) studies. They produce quite comparable causal estimates at points around the RD cutoff. We identify three other studies where the quasi-experiment involves careful intact group matching on the pretest. Despite the logical possibility of hidden bias in this instance, all three cases also reproduce their experimental estimates, especially if the match is geographically local. We then identify two studies where the treatment and nonrandomized comparison groups manifestly differ at pretest but where the selection process into treatment is completely or very plausibly known. Here too, experimental results are recreated. Two of the remaining studies result in correspondent experimental and nonexperimental results under some circumstances but not others, while two others produce different experimental and nonexperimental estimates, though in each case the observational study was poorly designed and analyzed. Such evidence is more promising than what was achieved in past within-study comparisons, most involving job training." Cook, Shadish, and Wong 2008, Pg 724.
"Eight of the comparisons produced observational study results that are reasonably close to those of their yoked experiment, and two obtained a close correspondence in some analyses but not others. Only two studies claimed different findings in the experiment and observational study, each involving a particularly weak observational study. Taken as a whole, then, the strong but still imperfect correspondence in causal findings reported here contradicts the monolithic pessimism emerging from past reviews of the within-study comparison literature.... RD [regression-discontinuity] is one type of nonequivalent group design, and three studies showed that it produced generally the same causal estimates as experiments.... The basic conclusion, though, is that RD estimates are valid if they result from analyses sensitive to the method’s main assumptions. We can also trust estimates from observational studies that match intact treatment and comparison groups on at least pretest measures of outcome." Cook, Shadish, and Wong 2008, Pg 745.
Cook, Shadish, and Wong 2008, Pg 745.
"In the job training work, the quasi-experimental design structures were heterogeneous in form and underexplicated relative to the emphasis the researchers placed on statistical models and analytic details. It is as though the studies’ main purpose was to test the adequacy of whatever nonexperimental statistical practice for selection bias adjustment seemed current in job training at the time. This is quite different from trying to test best possible quasi-experimental design and analysis practice, as we have done here." Cook, Shadish, and Wong 2008, Pg 748.
Cook, Shadish, and Wong 2008, Pg 746.
Duflo and Kremer 2003, Pg 24.
Phrasing from Wikipedia.
"Very low t-statistics appear to be systematically absent--and therefore null hypotheses are overwhelmingly false - only when the universe of null hypotheses considered are the central themes of published economics articles.
This suggests, to us, a publication-bias explanation of our finding. What makes a journal editor choose to publish an article which fails to reject its central null hypothesis, which produces a value of É(a) > 0.1 for its central hypothesis test? The paper must excite the editor's interest along some dimension, and it seems to us that the most likely dimension is that the paper is in apparent contradiction to earlier work on the same topic: either others working along the same line have in the past rejected the same null, or because theory or conventional wisdom suggests a significant relation." De Long and Lang 1992, Pg 13-14.
"Fortunately, we can test for reporting bias. The intuition for this test begins by noting that different approaches to estimating the effect of executions on the homicide rate should yield estimates that are somewhat similar. That said, some approaches yield estimates with small standard errors, and hence these should be tightly clustered around the same estimate, while other approaches yield larger standard errors, and hence the estimated effects might be more variable. Thus, there is likely to be a relationship between the size of the standard error and the variability of the estimates, but on average there should be no relationship between the standard error and the estimated effect. By implication, if there is a correlation between the size of the estimate and its standard error, this finding suggests that reported estimates comprise an unrepresentative sample. One simple possibility might be that researchers are particularly likely to report statistically significant results, and thus they only report on estimates that have large standard errors if the estimated effect is also large. If this were true, we would be particularly likely to observe estimates that are at least twice as large as the standard error, and therefore coefficient estimates would be positively correlated with the standard error … the reported estimates appear to be strongly correlated with their standard errors: we find a correlation coefficient of 0.88, which is both large and statistically significant. Second, among studies with designs that yielded large standard errors, only large positive effects are reported, despite the fact that such designs should be more likely to also yield small effects or even large negative effects. And third, we observe very few estimates with t-statistics smaller than two, despite the fact that the estimated deterrent effect required to meet this burden rises with the standard error.
Moreover, while Figure 9 focuses only on the central estimate from each study, Figure 10 shows the pattern of estimated coefficients and standard errors reported within each study. Typically these various estimates reflect an author’s attempt to assess the robustness of the preferred result to an array of alternative specifications. Yet within each of these studies (except Katz, Levitt, and Shustorovich) we find a statistically significant correlation between the standard error of the estimate and its coefficient, which runs counter to one’s expectations from a true sensitivity analysis." Donohue and Wolfers 2006, Pg 839-840.
See our review.
See our review.
Rossi 1987. Excerpted in Roodman.
Note that we do discuss direct comparisons of randomized to nonrandomized studies above. However, in these comparisons, the nonrandomized studies are constructed purely for the purpose of comparison to randomized studies, i.e., for methodological reasons and not investigative ones. Therefore, they are not truly "evaluations" of the social programs in question and are not prone to the same concerns about publication bias that evaluations would be.
Goldberg 2005. Also see our 2008 review of these studies expressing concerns about selection bias.
"It would be hard to read through all of the many positive findings in these dozens of studies - noting how rarely the comparison groups showed better outcomes than clients - and not feel that microfinance is an effective tool for poverty eradication.
On the other hand, considering all the ways we have seen in which subtle differences between clients and comparisons groups can affect the conclusions we draw, the evidence, as convincing as it is, is not quite good enough. It will be an enormous benefit to the entire industry when the first "incontrovertible" study is published. The only way to achieve this is through randomized control trials. Fortunately, the first of these studies is already underway. While the first use of randomized evaluations may be to prove the effectiveness of microfinance programs, MFI managers, as consumers of information, may soon start to demand randomized trials for informing their management decisions." Goldberg 2005.
Banerjee et al. 2009, abstract.
"Esther Duflo presented the second set of new data of the morning, “fresh from the oven” in her words. Duflo’s study with the microfinance institution Al Amana took place in rural Morocco in areas previously unserved by formal financial institutions. In all, around 5000 households were captured in the study of the impact of a group liability microcredit product.
Since the people in the study would not have been exposed to formal financial services, the target method was expressly designed to offer services to a higher proportion of people who, based on assessments of baseline data, would be more likely to take-up loans. Despite these efforts and the heavy marketing of the bank, only 16 percent of those who were offered loans took them (interestingly, as with the Karlan data, a lot of the study participants lie in follow-on interviews about having taken a loan – why would they do that?)
So what was the impact on those credit recipients?
The study found no impact on household consumption.
The study found no improvements in welfare.
The study found no effect on the likelihood that a recipient would start a new business.
The study did not show an increased ability to deal with shocks.
The study did find for people who already had a business, however, that loan recipients were more likely to stop engaging in wage work and invest more in their businesses. Livestock owners were more likely to buy more livestock and of a different variety than they had previously owned (so cow farmers diversified with sheep, and vice-versa, creating a de facto savings). And agricultural business sales increased, they took on more employees and those employee wages went up. Non-agricultural businesses did not show the same positive effects and income, on average, did not increase, partly because increases in the household business were offset by the “substitution” effect of decreased wage work." Starita 2010. Note that we refer here to a summary of a conference presentation because the study itself is not yet published.
See Currie 2001, Pg 223, Table 2. Context: "there has never been a large-scale, randomized trial of a typical Head Start program, although plans for such a trial are now underway at the U.S. Department of Health and Human Services … Table 2 provides an overview of selected studies, focusing on those which are most recent and prominent and on those which have made especially careful attempts to control for other factors that might affect outcomes" (Currie 2001, Pg 222).
See GiveWell Blog, "High-quality study of Head Start early childhood care program," for a summary of the study, which is U.S. Department of Health and Human Services, Administration for Children and Families 2010.