This page discusses the nature and extent of two common problems we see with formal evaluations: selection bias and publication bias.
- Selection bias arises when participants in a program are systematically different from non-participants (even before they enter the program). Many evaluations compare program participants to non-participants in order to infer the effect of the program; selection bias can affect the legitimacy of these evaluations, and in particular, we believe that its presence is likely to skew evaluations of non-profits in the positive direction.
- Publication bias refers to the tendency of researchers to slant their choice of presentation and publication in a positive direction.
We believe that these problems tend to skew evaluations of non-profits in the positive direction, and that they are partially mitigated by the use of the randomized controlled trial (RCT) methodology (and can be mitigated by other techniques as well).
Below, we discuss both selection bias and publication bias: what they are, what sort of skew they are likely to bring about, why randomized controlled trials may suffer less from these issues, and what evidence is available on the extent and nature of these issues.
We then note two highly touted and studied interventions - microlending and Head Start - for which initial studies (highly prone to selection bias and publication bias) gave a much more positive picture than later results from randomized controlled trials. We see these cases as suggestive evidence for the view that lower-quality studies tend to give an exaggerated case for optimism about effectiveness.
What is selection bias?
Studies of social programs commonly compare people who participated in the program to people who did not, with the implication being that any differences are caused by the program. (Some studies report only on improvements or good performance among participants, but even in these cases there is often an implicit comparison to non-participants - for example, an implicit presumption that non-participants would not have shown improvement on the reported-on measures.)
However, program participants are different from non-participants, by the very fact of their participation. An optional after-school tutoring program may disproportionately attract students/families who place a high priority on education (so its participants will have better reading scores, graduation rates, etc. than non-participants even if the program itself has no effect); a microlending program may disproportionately attract people who have higher incomes to begin with; etc.
What sort of skew is selection bias likely to cause?
Selection bias may skew a study in a positive or negative direction. Say that an after-school program targets struggling schools; in this case, comparing its participants to "average" students all across the city may be overly unfavorable to the tutoring program (since its students likely do not score as well as students in better-off schools), but comparing its participants to other students at the same schools may be overly favorable (since, as discussed above, its participants may tend to place higher priority on education).
One of the reasons we are concerned about selection bias is that it gives researchers substantial room for judgment calls in their choice of comparison group. When it comes to studies on non-profits' impacts, we believe that researchers generally prefer to present the programs in a positive light, and thus tend to choose comparisons that favor the programs (more on this below under "Publication bias"). Thus, we feel that selection bias is generally likely to skew apparent results in favor of non-profits' programs.
Selection bias in low- vs. high-quality studies
Certain study designs are much less vulnerable to selection bias than others. A randomized evaluation, also known as a randomized controlled trial, generally avoids the problem of selection bias by using random assignment to assign some people and not others to a program; then people who were "lotteried in" (randomly assigned) to the program are tracked and compared to people who were "lotteried out." Intuitively speaking, this methodology seems to significantly reduce the risks that there will be any systematic differences between program participants and non-participants, other than whether they participated in the program.
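The contrast between a self-selected comparison and a lottery can be illustrated with a small simulation. The sketch below is purely hypothetical: the "motivation" trait, the score formula, and the enrollment rule are assumptions invented for illustration, not taken from any study discussed on this page. The simulated program has no effect at all, so any gap between groups is produced by who ends up in each group.

```python
import random
from statistics import mean


def compare_designs(n=20000, seed=0):
    """Return (self_selected_gap, lottery_gap) for a program with zero true effect."""
    rng = random.Random(seed)
    true_effect = 0.0  # assumption: the program does nothing at all

    enrolled, not_enrolled = [], []
    lotteried_in, lotteried_out = [], []

    for _ in range(n):
        # Hypothetical unobserved trait (e.g., priority placed on education).
        motivation = rng.gauss(0, 1)
        # Test score depends on motivation plus noise, not on the program.
        score = 50 + 5 * motivation + rng.gauss(0, 1)

        # Design 1: families choose; more motivated families enroll more often.
        chooses_program = motivation + rng.gauss(0, 1) > 0
        outcome = score + (true_effect if chooses_program else 0.0)
        (enrolled if chooses_program else not_enrolled).append(outcome)

        # Design 2: a coin flip decides, independent of motivation.
        wins_lottery = rng.random() < 0.5
        outcome = score + (true_effect if wins_lottery else 0.0)
        (lotteried_in if wins_lottery else lotteried_out).append(outcome)

    return (mean(enrolled) - mean(not_enrolled),
            mean(lotteried_in) - mean(lotteried_out))


if __name__ == "__main__":
    self_selected_gap, lottery_gap = compare_designs()
    print(f"self-selected gap: {self_selected_gap:.2f}")
    print(f"lottery gap:       {lottery_gap:.2f}")
```

Under these assumptions the self-selected comparison shows a gap of several points in favor of participants, while the lottery comparison shows a gap close to zero, which is the true effect in this toy setup.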
There are other ways of addressing the problem of selection bias, fully or partially. Speaking broadly, we feel that randomization is the single most reliable indicator that a study's findings can be interpreted without fear of selection bias, and most of the studies we refer to as "high-quality" involve randomization. However, there are studies we consider "high-quality" that do not involve randomization, such as the impact evaluation of VillageReach's pilot project.
Example of selection bias
Peikes, Moreno, and Orzol (2008) evaluated the impact of the US State Partnership Initiative employment promotion program, using two methods: (a) a randomized controlled trial, with very low vulnerability to selection bias (see discussion above regarding randomization); (b) propensity-score matching, a relatively popular method for attempting to simulate a comparison between program participants and identical non-participants without the benefit of randomization (using available observable characteristics of participants and non-participants). Despite "seemingly ideal circumstances" for method (b), the two methods produced meaningfully different results: in two of the three locations, method (b) implied large, positive, statistically significant impacts of the program on earnings, while method (a) implied negative, non-statistically significant impacts of the program on earnings. The authors concluded:
Despite these seemingly ideal conditions, and the passing of tests that, according to the literature, indicate PSM [propensity-score matching] had worked, PSM produced impact estimates that differed considerably from the gold standard experimental estimates in terms of statistical significance, magnitude, and most important, sign. Specifically, the PSM approach would have led policymakers to conclude incorrectly that the interventions increased earnings, when they actually decreased or had no effects on earnings. Based on this experience, our goal is to caution practitioners that PSM can generate incorrect estimates, even under seemingly ideal circumstances.
In this case, the attempt to compare program participants to similar non-participants using observable characteristics (which is what method (b) relied on) implied that the participants earned much more than non-participants; however, comparing lotteried-in to lotteried-out people showed no such thing. This implies that there were unobservable ways in which participants differed from non-participants, ways that were significant enough to create the illusion of a strong program effect.
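The way an unobserved difference can survive matching on observables can be sketched in a few lines. This is not a reproduction of Peikes, Moreno, and Orzol's analysis and it is not propensity-score matching proper: it uses exact matching on a single observed covariate as a simplified stand-in. All names and numbers (the observed covariate `x`, the unobserved trait `u`, the earnings formula) are hypothetical, and the simulated program has zero true effect.

```python
import random
from collections import defaultdict
from statistics import mean


def naive_and_matched_estimates(n=40000, seed=1):
    """Return (naive_gap, matched_gap) in earnings for a program with zero true effect."""
    rng = random.Random(seed)
    # strata[x][True] holds participants' earnings, strata[x][False] non-participants'.
    strata = defaultdict(lambda: {True: [], False: []})

    for _ in range(n):
        x = rng.randrange(5)   # observed covariate (hypothetical, e.g., education level)
        u = rng.gauss(0, 1)    # unobserved trait (hypothetical, e.g., motivation)
        # Participation depends on both the observed and the unobserved trait.
        joins = 0.5 * (x - 2) + u + rng.gauss(0, 1) > 0
        # Earnings depend on x and u only; there is no program term.
        earnings = 10000 + 1000 * x + 2000 * u + rng.gauss(0, 1000)
        strata[x][joins].append(earnings)

    participants = [e for g in strata.values() for e in g[True]]
    others = [e for g in strata.values() for e in g[False]]
    naive_gap = mean(participants) - mean(others)

    # Compare participants only to non-participants with the same observed x,
    # weighting each stratum by its number of participants.
    total, weight = 0.0, 0
    for groups in strata.values():
        if groups[True] and groups[False]:
            k = len(groups[True])
            total += k * (mean(groups[True]) - mean(groups[False]))
            weight += k
    matched_gap = total / weight

    return naive_gap, matched_gap


if __name__ == "__main__":
    naive_gap, matched_gap = naive_and_matched_estimates()
    print(f"naive gap:   {naive_gap:.0f}")
    print(f"matched gap: {matched_gap:.0f}")
```

In this toy setup, matching removes the part of the gap that is explained by the observed covariate, but a large positive gap remains because the unobserved trait still differs between the matched groups.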
Studies on selection bias
We conducted a search for literature reviews of studies directly comparing the results of randomized and non-randomized estimates of social programs' effects. The most complete and recent literature reviews are summarized here.
Our overall take on these studies is that they (a) focus on the best-designed non-randomized studies; (b) show mixed results, and give substantial reason for concern that non-randomized studies' results can diverge significantly from randomized studies' results.
They do not show that selection bias systematically skews results in one direction or another; they do show that the presence of selection bias introduces a substantial source of skew. We believe that in the case of programs run by non-profits, this skew is likely to be positive more often than negative.
Review 1: Glazerman, Levy, and Myers (2003).
This review examines twelve studies "in the context of welfare, job training, and employment services programs." Each of the studies estimates a program’s impact by using a randomized controlled trial, and separately estimates the impact by using one or more nonrandomized methods. Each of the programs aimed to raise earnings.
- “Four studies concluded that NX [nonrandomized] methods performed well, four found evidence that some NX methods performed well while others did not, and four found that NX methods did not perform well or that there was insufficient evidence that they did perform well.”
- Aggregate analysis of the studies implied that the average effect found by nonrandomized studies differed by over $1,000 compared to the effect found by randomized studies. This was "about 10 percent of annual earnings for a typical population of disadvantaged workers."
- In the concluding section, the authors pose the question, "Can NX [nonrandomized] methods approximate the results from a well-designed and well-executed experiment?" Their answer is: "Occasionally, but many NX [nonrandomized] estimators produced results dramatically different from the experimental benchmark."
Review 2: Bloom, Michalopoulos, and Hill (2005)
This review addresses the question of randomized vs. nonrandomized evaluation in a variety of sectors.
- Employment/earnings-related: several of the studies examined overlap with the studies discussed above. The exceptions to this overlap are:
- Friedlander and Robins (1995), which uses data from a series of large-scale studies of welfare-to-work programs in four states and concludes that "estimates of program effects from cross-state comparisons can be quite far from the true effects, even when samples are drawn (as ours were) with the same sample intake procedures and from target populations defined with the same objective characteristics."
- An original comparison in Bloom, Michalopoulos, and Hill (2005) using a "six-state, seven-site evaluation that investigated different program approaches to moving welfare recipients to work." The authors conclude that a nonrandomized evaluation using several comparison groups might be able to match a randomized study's impact estimate and precision, but that "with respect to what methods could replace random assignment, we conclude that there are probably none that work well enough in a single replication, because the magnitude of the mismatch bias for any given nonexperimental evaluation can be large."
- Education. The review discusses two within-study comparisons of randomized and nonrandomized estimates of the impacts of school programs, one aiming to prevent dropout and the other reducing class size. Both studies conclude that nonrandomized methods have a high risk of leading to misleading conclusions about impacts; the second study specifically states that "[in] 35 to 45 percent of the 11 cases … [nonrandomized methods] would have led to the 'wrong decision,' i.e., a decision about whether to invest which was different from the decision based on the experimental [randomized] estimates."
- Other. The review also discusses a meta-analysis (Lipsey and Wilson 1993) of other evaluations of psychological, education, and behavior programs. This study states that "In some treatment areas ... nonrandom designs (relative to random) tend to strongly underestimate effects, and in others, they tend to strongly overestimate effects. The distribution of differences on methodological quality ratings shows a similar pattern."
Review 3: Cook, Shadish, and Wong (2008)
This review analyzes twelve within-study comparisons of randomized and nonrandomized methods. These twelve comparisons are from ten publications, and span a variety of social programs, mainly from the US. The publications are not included in Glazerman, Levy, and Myers (2003), and only one is included in Bloom, Michalopoulos, and Hill (2005).
Cook, Shadish, and Wong (2008) finds that in two of the comparisons, the nonrandomized method sometimes achieves the same result as the randomized method and sometimes does not; in two other comparisons the nonrandomized methods fail to achieve the same result; and in the other eight comparisons the nonrandomized methods replicate the randomized methods reasonably. The review concludes that "the strong but still imperfect correspondence in causal findings reported here contradicts the monolithic pessimism emerging from past reviews of the within-study comparison literature." However, the review explicitly concentrates on the "best possible [nonrandomized] design and analysis practice," and also states:
This review showed that use of “off-the-shelf” (mostly demographic) covariates consistently failed to reproduce the results of experiments. Yet such variables are often all that is available from survey data. They are also, alas, all that some analysts think they need. The failure of “off-the-shelf” covariates indicts much current causal practice in the social sciences, where researchers finish up doing propensity score and OLS analyses of what are poor quasi-experimental designs and impoverished sets of covariates. Such designs and analyses do not undermine OLS or propensity scores per se—only poor practice in using these tools. Nonetheless, we are skeptical that statistical adjustments can play a useful role when population differences are large and quasi-experimental designs are weak … This means we are skeptical about much current practice in sociology, economics, and political science.
What is publication bias?
"Publication bias" is a broad term for factors that systematically bias final, published results in the direction that the researchers and publishers (consciously or unconsciously) wish them to point.
Interpreting and presenting data usually involves a substantial degree of judgment on the part of the researcher; consciously or unconsciously, a researcher may present data in the most favorable light for his/her point of view. In addition, studies whose final conclusions aren't what the researcher (or the study funder) hoped for may be less likely to be made public.
What sort of skew is publication bias likely to cause?
As discussed below, the existing literature on publication bias often concludes that studies are skewed toward showing (a) more "surprising" findings; and (b) more "positive" findings (indicating that medical treatments, social policies, etc. "work").
We have not identified any studies specifically on publication bias in evaluations of non-profit programs, but we would guess that these studies would be skewed to the optimistic side, simply because the non-profits cooperating in the studies and the funders paying for them have incentives to portray their work in a positive light, and we know of no study funders or implementers with incentives to skew results in the pessimistic direction.
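One mechanism by which selective publication produces an optimistic skew can be sketched with a simulation. Everything here is an assumption made for illustration: the sample size per study, the number of studies, and especially the publication rule (only positive, statistically significant results are made public). The simulated program again has zero true effect.

```python
import random
from math import sqrt
from statistics import mean, stdev


def run_study(rng, n=50, true_effect=0.0):
    """Simulate one two-group study; return (estimated_effect, z_statistic)."""
    treated = [rng.gauss(true_effect, 1) for _ in range(n)]
    control = [rng.gauss(0, 1) for _ in range(n)]
    diff = mean(treated) - mean(control)
    standard_error = sqrt(stdev(treated) ** 2 / n + stdev(control) ** 2 / n)
    return diff, diff / standard_error


def simulate_literature(n_studies=2000, seed=2):
    """Return (mean effect over all studies, mean effect over published studies, number published)."""
    rng = random.Random(seed)
    results = [run_study(rng) for _ in range(n_studies)]
    all_effects = [diff for diff, _ in results]
    # Hypothetical publication rule: only positive, significant findings appear.
    published = [diff for diff, z in results if z > 1.96]
    return mean(all_effects), mean(published), len(published)


if __name__ == "__main__":
    mean_all, mean_published, n_published = simulate_literature()
    print(f"mean effect, all studies:       {mean_all:.3f}")
    print(f"mean effect, published studies: {mean_published:.3f}")
    print(f"studies published:              {n_published}")
```

Averaged over every study that was run, the estimated effect is close to zero. Averaged over only the studies that pass the hypothetical publication rule, the estimated effect is clearly positive, even though the program does nothing.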
Publication bias in low- vs. high-quality studies
We are less concerned about publication bias in studies that have the following qualities, in descending order of importance:
- Registration. ClinicalTrials.gov is an example of a registry where researchers post the design, methodology, and hypothesis for each study before data is actually collected. In our view, this makes researchers accountable to public scrutiny if results are later buried or interpreted in a skewed way.
- Randomized design. Above, we discuss the design of a randomized controlled trial (RCT), a study in which a lottery determines who is and isn't enrolled in a program. We agree with Esther Duflo's argument that a study with this sort of design is less susceptible to publication bias:
Publication bias is likely to be a particular problem with retrospective studies. Ex post the researchers or evaluators define their own comparison group, and thus may be able to pick a variety of plausible comparison groups; in particular, researchers obtaining negative results with retrospective techniques are likely to try different approaches, or not to publish … In contrast, randomized evaluations commit in advance to a particular comparison group: once the work is done to conduct a prospective randomized evaluation the results are usually documented and published even if the results suggest quite modest effects or even no effects at all.
- High expense. In our view, a study that is very expensive to carry out is likely to be published regardless of what it shows and how favorable its findings are to the researchers' hopes. The presentation of the data may still be skewed, but the threat that the study is "buried" seems smaller.
We have not seen systematic investigations of the hypotheses laid out above.
Studies on publication bias
We have not yet conducted a systematic review of literature on publication bias, but we have come across several studies on the subject.
Hopewell et al. (2009) reviewed five studies examining patterns in which clinical trials did and didn't have their results published in medical literature:
These studies showed that trials with positive findings … or those findings perceived to be important or striking, or those indicating a positive direction of treatment effect), had nearly four times the odds of being published compared to findings that were not statistically significant … or perceived as unimportant, or showing a negative or null direction of treatment effect.
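Because the finding above is stated in terms of odds rather than probabilities, a short worked conversion may help. The baseline publication probabilities used here are hypothetical and are not taken from Hopewell et al.; only the approximate odds ratio of four comes from the quoted passage.

```python
def shifted_probability(baseline_prob, odds_ratio):
    """Probability implied by multiplying the baseline odds by odds_ratio."""
    baseline_odds = baseline_prob / (1 - baseline_prob)
    new_odds = baseline_odds * odds_ratio
    return new_odds / (1 + new_odds)


if __name__ == "__main__":
    # Hypothetical baselines: chance that a null or negative trial is published.
    for baseline in (0.2, 0.5):
        positive = shifted_probability(baseline, 4)
        print(f"baseline {baseline:.0%} -> positive-result trials {positive:.0%}")
```

With a hypothetical baseline of 50%, four times the odds corresponds to an 80% chance of publication; with a baseline of 20%, it corresponds to 50%. In other words, "four times the odds" does not mean "four times as likely."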
Ioannidis (2005a and 2005b) explored the magnitude of the problem and concluded that from both a theoretical and empirical perspective, there is reason to be skeptical of much (even most) of the conclusions published in medical literature. These studies also provide some loose arguments that studies with less flexibility, particularly randomized controlled trials, are likely to be less susceptible to these issues.
De Long and Lang (1992) gives some evidence for a broad form of publication bias in the field of economics. It examines published papers that fail to reject their central "null hypothesis" (the "null hypothesis" generally referring to a "general or default position, such as that there is no relationship between two measured phenomena") and finds that an aggregate analysis of these papers' results suggests that the individual results are erroneous - i.e., most or all of the central "null hypotheses" that the papers fail to reject are in fact false. It concludes that the best explanation for this phenomenon involves publication bias: papers failing to reject their central "null hypothesis" are not published without prejudice, but rather published largely (or only) when their failure to reject is "exciting."
Publication bias in more narrow topics.
- Donohue and Wolfers (2006) presents evidence that papers on the deterrent effect of the death penalty seem skewed toward publishing positive results (i.e., results that imply a real deterrent effect).
- The Campbell Collaboration frequently tests for publication bias in studies on specific interventions such as volunteer tutoring programs and programs seeking to improve parental involvement in children's academics. In both of these cases, no evidence for publication bias was found. We intend to investigate the Campbell Library more thoroughly in the future for better context on the risks of publication bias.
Suggestive evidence on the combined effects of selection bias and publication bias: the cases of microlending and Head Start
If publication bias is a real and significant problem, this could be expected to imply that studies of social programs will tend to exaggerate the programs' impact - especially studies that are prone to selection bias and otherwise leave significant room for judgment calls on the part of the researchers.
This idea is similar to Rossi (1987)'s "Stainless Steel Law of Evaluation," which is the proposition that:
The better designed the impact assessment of a social program, the more likely is the resulting estimate of net impact to be zero.
This law means that the more technically rigorous the net impact assessment, the more likely are its results to be zero—or no effect. Specifically, this law implies that estimating net impacts through randomized controlled experiments, the avowedly best approach to estimating net impacts, is more likely to show zero effects than other less rigorous approaches.
We have encountered no formal empirical study of this "law." We believe it to be valid based partly on our own experience of reviewing the highest-quality academic literature we can find compared with our experience of reviewing evaluations submitted by non-profits; we intend to document this comparison more systematically in the future.
Here, we discuss two cases that we believe provide suggestive evidence for the above proposition: microlending and Head Start. In both of these cases, we are able to compare a systematic overview of relatively low-quality studies (i.e., highly prone to selection bias, and with substantial room for judgment in their construction) to later evidence from randomized controlled trials. In both of these cases, the earlier, lower-quality research presents a much more optimistic picture than the randomized controlled trials.
Microlending, the practice of making small loans to low-income people (generally in the developing world), was the subject of many impact studies prior to 2005. These studies were collected and discussed in a 2005 literature review. This review concluded that the evidence for microfinance's impact was strong, and implied that randomized controlled trials could be expected to demonstrate impact as well.
However, to date the results from the two randomized controlled trials on microlending have been far less encouraging:
- Banerjee et al. (2009) conducted a randomized controlled trial of a microlending program in India and concluded that "15 to 18 months after the program, there was no effect of access to microcredit on average monthly expenditure per capita, but durable expenditure did increase … We find no impact on measures of health, education, or women's decision-making."
- A more recent study in rural Morocco found similar results, seeing different effects on different borrowers but no aggregate effect on measures of well-being.
A 2001 review examined studies on Head Start, a federal early childhood care program in the U.S., and found overwhelmingly positive, long-term effects on measures including achievement test scores and grade and school completion, while acknowledging the lack of a truly high-quality randomized study. In 2010, the first results from a very large, high-quality study became available and were far less encouraging.
Speaking intuitively, we feel that the combination of selection bias and publication bias will cause most studies of non-profits' programs to exaggerate the case for optimism. We focus on studies that we think are less prone to these two biases, and believe that the randomized controlled trial (RCT) design is one way (though not the only way) of mitigating these issues. We believe that higher-quality studies are likely to give a less positive picture of non-profit effectiveness than lower-quality studies.
- Agodini, Roberto and Mark Dynarski. 2004. Are experiments the only option? A look at dropout prevention programs. The Review of Economics and Statistics 86(1): 180-194.
- Banerjee, Abhijit, et al. 2009. The miracle of microfinance? Evidence from a randomized evaluation (PDF).
- Bloom, Michalopoulos, and Hill. 2005. Using experiments to assess nonexperimental comparison-group methods for measuring program effects. In Learning More from Social Experiments, ed. Howard S. Bloom, 173-236. New York: Russell Sage Foundation.
- ClinicalTrials.gov. Homepage. http://www.clinicaltrials.gov/ (accessed November 24, 2010). Archived by WebCite® at http://www.webcitation.org/5uUT3tx5V.
- Cook, Thomas D., William R. Shadish, and Vivian C. Wong. 2008. Three conditions under which experiments and observational studies produce comparable causal estimates: New findings from within-study comparisons (PDF). Journal of Policy Analysis and Management 27(4): 724-750.
- Currie, Janet. 2001. Early childhood education programs (PDF). Journal of Economic Perspectives 15(2): 213-238.
- De Long, J. Bradford and Kevin Lang. 1992. Are all economic hypotheses false? (PDF). Journal of Political Economy 100(6): 1257-72.
- Donohue, John J. and Justin Wolfers. 2006. Uses and abuses of empirical evidence in the death penalty debate. Stanford Law Review 58: 791-846. Abstract available at http://www.nber.org/papers/w11982 (accessed November 29, 2010). Archived by WebCite® at http://www.webcitation.org/5ubSlMHqE.
- Duflo, Esther and Michael Kremer. 2003. Use of randomization in the evaluation of development effectiveness (PDF). In Conference on Evaluation and Development Effectiveness, Washington DC, 2003. Washington DC: World Bank Operations Evaluation Department.
- Friedlander, Daniel and Philip K. Robins. 1995. Evaluating program evaluations: New evidence on commonly used nonexperimental methods. American Economic Review 85(4): 923-937.
- GiveWell Blog. High-quality study of Head Start early childhood care program.
- Glazerman, Steven, Dan M. Levy, and David Myers. 2003. Nonexperimental versus experimental estimates of earnings impacts. Annals of the American Academy of Political and Social Science 589(1): 63-93. Abstract available at http://ann.sagepub.com/content/589/1/63.short (accessed November 29, 2010). Archived by WebCite® at http://www.webcitation.org/5ubSEV2OZ.
- Goldberg, Nathanael. 2005. Measuring the impact of microfinance: Taking stock of what we know (PDF). Washington DC: Grameen Foundation USA.
- Hopewell, S., et al. 2009. Publication bias in clinical trials due to statistical significance or direction of trial results. Cochrane Database of Systematic Reviews 2009, Issue 1. Summary available at http://www2.cochrane.org/reviews/en/mr000006.html (accessed on November 24, 2010). Archived by WebCite® at http://www.webcitation.org/5uUR0dH36.
- Ioannidis 2005a. Contradicted and initially stronger effects in highly cited clinical research (PDF). JAMA 294(2): 218-228.
- Ioannidis 2005b. Why most published research findings are false (PDF). PLoS Medicine 2(8): e124.
- Peikes, Deborah N., Lorenzo Moreno, and Sean Michael Orzol. 2008. Propensity score matching. American Statistician 62(3): 222-231. Abstract available at http://pubs.amstat.org/doi/abs/10.1198/000313008X332016 (accessed November 29, 2010). Archived by WebCite® at http://www.webcitation.org/5ubSOccyz.
- Poverty Action Lab. Methodology: Overview. http://www.povertyactionlab.org/methodology (accessed November 24, 2010). Archived by WebCite® at http://www.webcitation.org/5uUL7mgH8.
- Roodman, David. Rossi's rules (accessed October 19, 2010). David Roodman's Microfinance Open Book Blog, July 13, 2009. Archived by WebCite® at http://www.webcitation.org/5tfUdibYt.
- Rossi, Peter H. 1987. The iron law of evaluation and other metallic rules. Research in Social Problems and Public Policy 4: 3–20.
- Starita, Laura. Microfinance impact and innovation: Microfinance impacts (accessed November 24, 2010). Philanthropy Action News & Commentary, October 21, 2010. Archived by WebCite® at http://www.webcitation.org/5uUS9HHEn.
- Wilde, Elizabeth Ty and Robinson Hollister. 2002. How close is close enough? Testing nonexperimental estimates of impact against experimental estimates of impact with education test scores as outcomes (PDF). Institute for Research on Poverty Discussion Paper no. 1242-02.
- Wikipedia. Null hypothesis. http://en.wikipedia.org/wiki/Null_hypothesis (accessed November 24, 2010). Archived by WebCite® at http://www.webcitation.org/5uUS1Wa39.
- U.S. Department of Health and Human Services, Administration for Children and Families. Head Start impact study technical report (2010) (PDF).