In a nutshell
At GiveWell, we regularly scrutinize our research and grantmaking decisions. This year, we experimented with using AI to "red team" our work—asking AI models to critique our research and identify weaknesses we should investigate. We developed a series of LLM prompts based on input from people with experience using AI tools, then used these prompts to review research on six of our grantmaking areas. (more)
We found that the AI red teaming surfaced some useful critiques. Some were issues we hadn’t yet considered fully and that we think are worth looking into further (e.g., whether improvements in housing quality over time might explain some of the mortality reduction currently attributed to bed nets). Others were issues we had already considered but that weren’t included in the source material we gave the AI, which helps validate that AI can catch real problems in our work (e.g., concerns about monitoring and evaluation for seasonal malaria chemoprevention programs that we had independently surfaced via an intensive, human-led red teaming effort over several weeks) (more).
However, there were substantial limitations to the AI-generated critiques. Most of the critiques (~85%) seemed unlikely to affect our bottom line or represented misunderstandings of our work, and some included hallucinated details. Our researchers had to filter the output to identify useful critiques that were worth the time to dig into.
This ~85% figure reflects testing from earlier in 2025. While we haven't systematically retested, our informal impression as of December 2025 is that the share of useful critiques has improved to around 30%, primarily due to reduced hallucination and better model understanding of our research context.
Going forward, we plan to use AI red teaming selectively when reviewing our work. We think it's particularly valuable for newer grantmaking areas or research questions where we haven't done extensive research yet. We don't currently expect AI red teaming to result in major changes to our grantmaking decisions, but we think the low time investment required to produce moderately useful critiques (~75 minutes of human input per report) will be worthwhile in some cases. (more)
We also plan to keep testing new models as they're released, since AI capabilities are advancing and today's results may significantly understate what's possible in the near future.
We also plan to look for ways to make our work more accessible to AI and experiment with alternative LLM approaches (e.g., multi-agent workflows). Beyond red teaming, we're investigating other ways AI could strengthen our research and grantmaking processes. (more)
Published: January 2026
What did we do?
We've found “red teaming” (deliberately challenging our research to find ways we could be wrong) to be useful in our work in the past (see, for example, our reports on red teaming our top charities and iron grantmaking).
But we don't currently have capacity to thoroughly red team all the interventions or grants we’ve funded or considered funding. We were curious to see if we could use AI to expand our red teaming capacity. We also wanted to test whether AI could surface issues we hadn't yet considered.
From May to September 2025, we experimented with using AI to red team our research—asking AI to act as a critical reviewer who could identify weaknesses, blind spots, or alternative interpretations in our analysis that might significantly affect our conclusions.
We developed a series of prompts through trial and error1 and conversations with several people with experience using AI, including:
- Dominic Vinyard, Founding AI Lead, Wordware
- James Aung, AI technical product manager
- Mashfir Mohiuddin, Data engineer, OpenAI
- Nathan Labenz, AI founder and researcher, Waymark & Cognitive Revolution
We ended up with the following approach:
- Step 1: Compile source materials and background research. For each grantmaking area, we first used AI (specifically, ChatGPT 5 Pro’s Deep Research feature) to compile an overview of the academic literature on the topic. This produced a 15-20 page summary of key studies, debates, and methodological approaches. We then used this output alongside our published intervention report and internal research materials (like recent grant approval documents or transcripts of calls with experts) as inputs for the “red teamer”.
- Step 2: Run the red teaming prompt. We fed these materials to an AI model with detailed instructions on how to critique our work. The prompt had several key features (a simplified sketch of this prompt structure appears after the list of steps):
- Explicit thinking process. We instructed the AI to first generate 20-30 potential critiques without filtering, then verify each one for novelty (explicitly checking "Is this already in the report?"), assess potential impact on cost-effectiveness, and finally select the top 15 based on both novelty and impact.
- Guidelines on using source materials. We told the AI to use our documents to understand the intervention deeply, but to generate insights that go beyond what any of the sources discuss. We explicitly instructed it not to echo our existing research—the value was in identifying what our analyses had overlooked.
- Structured categories for critiques. We asked the AI to consider critiques in different categories (evidence quality, methodology limitations, alternative interpretations, implementation challenges, external validity, and overlooked factors) and to specify for each critique why it's novel, its potential impact magnitude, and the AI's confidence level.
- Step 3: Human review and prioritization. Researchers on the relevant grantmaking teams reviewed the AI-generated critiques and identified which seemed most important and worth investigating further.
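To make the Step 2 prompt structure concrete, here is a simplified sketch of how such a prompt could be assembled. The prompt text paraphrases the features described above; the helper functions and the call_model stub are illustrative placeholders rather than the actual scripts we used.

```python
# Simplified sketch of the red teaming prompt structure described above.
# The helper names and the call_model stub are illustrative placeholders,
# not the scripts we actually used.

RED_TEAM_PROMPT = """You are a critical reviewer of GiveWell's research on {intervention}.

1. Brainstorm: generate 20-30 potential critiques without filtering.
2. Verify novelty: for each critique, explicitly check "Is this already in the report?"
   and drop critiques the source materials already discuss.
3. Assess impact: estimate how much each critique could plausibly change our
   cost-effectiveness conclusions.
4. Select: return the top 15 critiques by novelty and impact.

Consider these categories: evidence quality, methodology limitations, alternative
interpretations, implementation challenges, external validity, overlooked factors.

For each selected critique, state why it is novel, its potential impact magnitude,
and your confidence level.

Use the materials below to understand the intervention deeply, but do not echo
them; the value is in identifying what the analyses have overlooked.

Materials:
{materials}
"""


def build_prompt(intervention: str, materials: list[str]) -> str:
    """Combine the literature overview, published report, and internal documents."""
    return RED_TEAM_PROMPT.format(
        intervention=intervention,
        materials="\n\n---\n\n".join(materials),
    )


def red_team(intervention: str, materials: list[str], call_model) -> str:
    """call_model is whatever LLM interface is available (passed in, not specified here)."""
    return call_model(build_prompt(intervention, materials))
```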
We tested red teaming with each of Anthropic’s Claude Opus 4.1, Google’s Gemini 2.5 Pro, and OpenAI’s ChatGPT 5 Pro.2 At the time of our latest round of testing (September 2025), we felt that ChatGPT 5 Pro produced the most consistently useful critiques of our work.
We applied this approach to the following grantmaking areas:
- Syphilis screening and treatment in pregnancy
- Malaria vaccines
- Community-based management of acute malnutrition (CMAM)
- Distribution of insecticide-treated bed nets
- Seasonal malaria chemoprevention
- Water quality interventions
We selected these based on the importance of the grantmaking area and grantmaking teams’ interest and capacity to participate in this pilot.
For each of the grantmaking areas below, we used AI to generate at least 15 critiques of our research. We then had grantmaking teams and other researchers review those critiques and identify which seemed important. Altogether, this process takes about 75 minutes: 15 minutes to collect context for the AI and run our prompts, and about an hour for grantmakers to review the output.
What did we learn about our grantmaking areas?
Water chlorination
Chlorination is a method of disinfecting drinking water. We've funded several programs to increase uptake of chlorinated water in low- and middle-income countries, including dispensers for safe water and in-line chlorination systems.3
The AI red teaming generated ~15 critiques of our water chlorination research. Our water team reviewed these and identified four issues worth considering further. While we don't currently think these are likely to have a major impact on our assessment of chlorination programs, we think they merit investigation. Of these, one was a critique we had not previously considered, and three were issues we were already planning to look into but that weren't in the source materials we provided (which gives us some additional confidence in AI's ability to generate meaningful critiques of our work in the future, especially in areas we've looked at in less depth).
Critiques we hadn't previously considered:
- Water turbidity.4 Seasonal floods and extreme rain events increase suspended solids in water (e.g., agricultural runoff or soil washed in from erosion), which can significantly reduce chlorine's effectiveness.5 The AI noted that none of the mortality trials occurred during severe flood years.6 We think it would be worth better understanding whether chlorine dosing is adjusted for increased turbidity and/or how this moderates chlorine’s effect.
Critiques we’d identified before (but weren’t in the materials provided to the AI):
- Crowding out water infrastructure investments.7 The AI flagged that free chlorine dispensers might reduce local officials' incentive to invest in piped water systems. While the specific evidence the AI cited appears to be fabricated,8 the underlying concern is plausible. If chlorination programs delay piped water investments by even two years, the welfare costs could be substantial. We plan to investigate this by talking to local officials and experts in countries where we fund chlorination programs.
- Optimal chlorine dosing.9 Our models treat chlorine residual as binary (present or absent), but the relationship is more complex. Disinfection effectiveness depends on concentration-time10 (e.g., 0.5 mg/L for 60 minutes can achieve the same pathogen kill as 1.0 mg/L for 30 minutes; see the worked example after this list), while taste aversion is driven by overall concentration.11 Finding the right balance could improve both effectiveness and adherence.
- Disinfection byproducts.12 In rural areas with high natural organic matter, repeated chlorine dosing can generate trihalomethanes above WHO guidelines.13 These compounds may pose carcinogenic or pregnancy risks that partially offset mortality benefits. This is on our list to investigate.
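To illustrate the concentration-time point using the numbers from the example above (illustrative only; the required value varies by pathogen, temperature, and pH):

```latex
% Concentration-time (CT) value: disinfectant concentration multiplied by contact time.
\[
CT = C \times t, \qquad
0.5~\mathrm{mg/L} \times 60~\mathrm{min}
  \;=\; 30~\mathrm{mg\cdot min/L}
  \;=\; 1.0~\mathrm{mg/L} \times 30~\mathrm{min}
\]
```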
Full output is here.
Community-based management of acute malnutrition (CMAM)
CMAM involves identifying and treating cases of acute malnutrition in children. We've funded several organizations to deliver CMAM programs in Africa.14
The AI red teaming generated ~25 critiques of our CMAM research. Our nutrition team reviewed these and identified two issues worth investigating further, though most concerns were either already addressed in our models or highly speculative.
Critiques we hadn't previously considered:
- Regression to the mean in measuring treatment effects.15 The AI flagged that our model relies on before-and-after weight measurements that may overstate CMAM's impact.16 Children's weight naturally fluctuates due to hydration, measurement error, and time of day.17 Since programs typically measure children at admission (when they're at their worst) and discharge (when they're better), some apparent recovery would likely occur even without treatment. We agree this likely inflates our effect estimates. However, this may be partially offset by the fact that CMAM provides benefits beyond weight gain—including micronutrients, antibiotics, and medical referrals—that our current model doesn't capture.18 Both issues point to serious limitations in using observational data to estimate CMAM's effects. We’re in the process of exploring alternative ways to model the impact of malnutrition treatment.
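To build intuition for the regression-to-the-mean concern, here is a toy simulation (our illustration with arbitrary numbers, not part of our cost-effectiveness model): no child's underlying status improves, but the group admitted on the basis of low measured scores still looks better when re-measured, because the measurement noise at admission does not repeat.

```python
# Toy illustration of regression to the mean in before-and-after measurements.
# The numbers are arbitrary; this is not our CMAM model.
import random
from statistics import mean

random.seed(0)
N = 100_000
ADMIT_THRESHOLD = -3.0  # admit children whose measured z-score falls below this cutoff

admitted_before, admitted_after = [], []
for _ in range(N):
    true_z = random.gauss(-1.5, 1.0)                    # underlying status (never changes)
    measured_before = true_z + random.gauss(0.0, 0.5)   # hydration, time of day, error
    if measured_before < ADMIT_THRESHOLD:               # only the worst *measurements* get admitted
        admitted_before.append(measured_before)
        admitted_after.append(true_z + random.gauss(0.0, 0.5))  # re-measured later, no treatment

print(f"Mean measured z-score at admission:  {mean(admitted_before):.2f}")
print(f"Mean measured z-score at follow-up:  {mean(admitted_after):.2f}")
# The follow-up mean is higher even though no child's underlying status improved.
```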
Critiques we’d identified before (but weren’t in the materials provided to the AI):
- Workforce displacement.19 When NGOs hire local staff to deliver CMAM, those workers may be diverted from other valuable activities.20 We don't currently include an adjustment for this in our cost-effectiveness models because we remain uncertain about the net effect of CMAM programs on the uptake of health services. While we think it's plausible that some health worker displacement occurs, it's also plausible that CMAM programs increase uptake of health services by offering free malnutrition treatment, which could lead to more engagement with the health system. We're hoping to learn more about these competing effects through surveys of treatment rates for common illnesses before and after a CMAM program enters an area, and through investigation of employment patterns among staff hired by the programs we fund.
Full output is here.
Syphilis screening and treatment
Syphilis screening and treatment during pregnancy can prevent the complications of syphilis infection for the child and mother. We've funded programs to increase screening and treatment rates in low- and middle-income countries, including Liberia, Zambia, Cameroon and the Philippines.21
The AI red teaming generated ~15 critiques of our syphilis research. Our researcher reviewed these and identified three issues worth investigating further, though most concerns were either already addressed in our models, not relevant to our specific grant context, or represented misunderstandings of our work.
Critiques we hadn't previously considered:
- False-positive rates with dual HIV/syphilis rapid tests.22 The AI noted that in low-prevalence districts, test specificity of 95% or less could lead to overtreatment, increased costs, and stigma from false-positive diagnoses23 (a rough illustration follows this list). Our models implicitly assume test performance comparable to tests used in treatment effect studies, but we haven't explicitly modeled the implications of false positives in different prevalence settings.
- Treatment quality and supply chain concerns. The AI flagged that substandard or falsified treatment could reduce program effectiveness.24 While we expect the impact of this to be modest, we feel we've dedicated insufficient time to investigating these concerns given the amount of funding that we've directed to syphilis programs.25
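As a rough illustration of the false-positive concern (the 1% prevalence, ~100% sensitivity, and 95% specificity below are assumptions chosen for the example, not figures from our models), the positive predictive value of a single test in a low-prevalence district can fall well below one half:

```latex
% Positive predictive value with assumed prevalence 1%, sensitivity 1.00, specificity 0.95.
\[
\mathrm{PPV}
  = \frac{\mathrm{sens}\cdot\mathrm{prev}}
         {\mathrm{sens}\cdot\mathrm{prev} + (1-\mathrm{spec})\,(1-\mathrm{prev})}
  = \frac{1.00 \times 0.01}{1.00 \times 0.01 + 0.05 \times 0.99}
  \approx 0.17
\]
```

In that illustrative setting, roughly five of every six positive results would be false positives.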
Critiques we'd identified before (but weren't in the materials provided to the AI):
- Male partner treatment rates.26 The AI flagged that partner treatment rates may be much lower than assumed, citing studies showing <20% attendance.27 While reinfection is included in our models (implicitly through the treatment effect used, and explicitly in our adjustment for subsequent pregnancies), we haven't deeply investigated empirical partner attendance data, how this might differ across settings, or optimal strategies for partner notification.28 This connects to work we've been planning to do, and the AI's compilation of relevant studies and numbers will be useful for that investigation.
Full output is here.
Malaria vaccines
Malaria is the second largest cause of death for children under five in sub-Saharan Africa, accounting for 15% of deaths in 2024 according to the United Nations Inter-agency Group for Child Mortality Estimation.29 Malaria vaccines attempt to stimulate immune response against the parasites that cause malaria and thus avert morbidity and mortality.30 We've funded programs to support the rollout of malaria vaccines in sub-Saharan Africa.31
The AI red teaming generated ~15 critiques of our malaria vaccine research. Our researcher reviewed these and identified two issues worth investigating further, though most concerns were either not applicable to this earlier-stage intervention report (which is meant to be illustrative rather than comprehensive), represented misunderstandings of our approach, or were overstated relative to expert consensus.
Critiques we hadn't previously considered:
- Allele-specific efficacy and vaccine-driven selection.32 The AI flagged that if circulating parasite strains differ from the vaccine target, or if vaccination creates selection pressure for resistant strains, field effectiveness could decline over time.33 We haven't seen this raised as a major concern by malaria experts, but we plan to follow up to understand whether this is something we should be modeling.
Critiques we'd identified before (but weren't in the materials provided to the AI):
- Implementation bottlenecks. The AI identified two important operational challenges: (1) human resource constraints within routine immunization programs34 (adding malaria vaccine doses could strain staff capacity and crowd out other services like measles vaccination),35 and (2) systematic exclusion of the highest-risk, "zero-dose" children who don't access routine immunization.36 The first concern was independently flagged by an external reviewer as high priority.
Full output is here.
Seasonal malaria chemoprevention (SMC)
SMC involves administering antimalarial drugs to children during the high-transmission season to prevent malaria infections.37 Malaria Consortium’s SMC program is one of our Top Charities and we've directed over $500 million to several programs to deliver SMC in sub-Saharan Africa since 2017.38
Unlike for the other interventions in this report, we ran three separate variations of AI red teaming for SMC. This was because another team was simultaneously conducting a targeted red teaming of SMC program monitoring, implementation, and costs, which allowed us to test whether AI could surface similar issues independently. Our malaria and red teaming teams reviewed these outputs and identified three issues worth investigating further, though most concerns were either already addressed in our models, already raised by the other red teaming group, or represented misunderstandings of our approach.
Critiques we'd identified before (but weren't in the materials provided to the AI):
- Risk compensation.39 The AI flagged that recipients of SMC might reduce other protective behaviors, such as sleeping under insecticide-treated nets. While we haven't seen empirical evidence of this effect in SMC programs specifically, similar concerns have been raised in other malaria prevention contexts.40 This critique is speculative but testable—we could investigate whether SMC recipients show different net use patterns and consider pairing SMC delivery with messaging reinforcing the importance of continued net use.
- Validating adherence for days 2-3.41 SMC involves a three-day course of medication with the first day’s dose overseen by a healthcare worker and the second and third left to caretakers. Adherence rates for the full course are a critical parameter in our cost-effectiveness models, and our current −6% adjustment42 may be incorrect in either direction. The AI recommended validating adherence through dried blood spot sampling and direct observation rather than relying solely on caregiver recall, something that we plan to consider funding.
- Target population and administrative coverage triangulation.43 The AI identified that errors in estimating target populations can inflate administrative coverage rates and thus our cost-effectiveness estimates. We currently rely primarily on administrative data, which may overstate program reach if denominators are inaccurate. We’re considering several potential methods of generating alternative population estimates, such as estimates based on satellite and GIS data.
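As a simple illustration of how denominator error inflates administrative coverage (hypothetical numbers, not from our models):

```latex
% Administrative coverage = children treated / estimated target population.
\[
\text{administrative coverage} = \frac{\text{children treated}}{\text{estimated target population}}
\]
\[
\frac{90{,}000}{80{,}000} \approx 113\%
\qquad \text{vs.} \qquad
\frac{90{,}000}{100{,}000} = 90\%
\]
```

With the same 90,000 children treated, an underestimated denominator makes the program look far more complete than it is, which flows directly into our cost-effectiveness estimates.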
Full output is here.
Insecticide-treated bed nets (ITNs)
ITNs are bed nets treated with insecticides that kill or repel mosquitoes.44 The Against Malaria Foundation’s ITN program is one of our Top Charities and we've directed over $380 million to programs to distribute ITNs in sub-Saharan Africa since 2014.45
The AI red teaming generated ~15 critiques of our ITN research. Our malaria team reviewed these and identified three issues worth investigating further, though most concerns were either already addressed in our models or represented misunderstandings of our approach.
Critiques we hadn't previously considered:
- Housing quality changes over time.46 The AI flagged that improvements in housing quality may explain a portion of the observed decline in malaria burden attributed to bed nets. While we think this is plausible, we think the relationship between housing quality, net effectiveness, and mortality reduction isn't so straightforward—nets might be more effective in either poor or good housing depending on how bite reduction translates to mortality.47 Our broader concern is that conditions in the locations where we support ITN distributions are diverging in multiple ways from the contexts in which the academic trials were conducted: housing quality, mosquito species, and human behavior (e.g., electrification leading to later bedtimes and more evening exposure).48 We plan to review whether our blanket external validity adjustment adequately accounts for this growing distance from trial conditions.
Critiques we'd identified before (but weren't in the materials provided to the AI):
- Bioassay-field efficacy gap.49 The AI identified that standard laboratory tests (cone bioassays and tunnel tests) for measuring insecticide potency may correlate weakly with actual community-level malaria outcomes.50 If laboratory test results don't reliably predict field performance, we could be funding nets with the wrong concentration or type of insecticide. We funded bioefficacy monitoring in our first campaign in Ondo State, Nigeria, though results aren't yet available. While we've seen some monitoring data from other sources, we don't currently conduct this type of testing as part of routine quality control for our net distributions.
- Pre-deployment storage effects on insecticide potency.51 The AI noted that factory quality assurance testing doesn't necessarily reflect insecticide potency at the time of installation. ITNs are often shipped by sea in containers that can exceed 50°C, substantially higher than the temperatures used in product specification testing.52 While some of this degradation should be captured in field durability studies and monitoring data, we think it’s possible that randomized controlled trials had simpler and more carefully managed supply chains than routine distributions, potentially leading us to overestimate effectiveness.
Full output is here.
What did we learn about AI red teaming?
While we believe we're still early in the process of learning how to use AI, this experiment has provided some initial lessons about how best to leverage AI as a tool for criticism. We're reluctant to draw firm conclusions, but a few early takeaways are:
- AI works best as a brainstorming tool, not a fact-checker. The AI frequently hallucinated specific details even when raising valid concerns.
- For instance, it cited "Kenya and Malawi focus groups from 2024" that don't exist while making a reasonable point about crowding out water infrastructure.53
- AI also cited WHO procurement alerts for benzathine-penicillin G (BPG), which we think were hallucinated.54
We see AI red teaming as a starting point for identifying potential issues with our work—our researchers need to identify which critiques have merit and prioritize which to look into.
- AI red teaming may be most valuable for less-researched areas. Several issues the AI flagged were ones we'd already identified through intensive human review but weren't in our source materials. This suggests AI could help surface problems earlier in areas we haven't deeply explored yet. Since these are often smaller grantmaking areas or lower-priority research questions that don't justify extensive human review, AI red teaming might offer a cost-effective way to get additional scrutiny.
- Subject knowledge matters more than we expected. AI red teaming seemed to work better for research areas where we haven't fully reviewed the academic literature. For our syphilis research, the AI pointed us toward studies we hadn't yet examined.55 For bed nets—one of our most thoroughly researched interventions—it rarely surfaced anything that we hadn’t already considered.
- The process works best as a dialogue. Our researchers sometimes found it helpful to ask follow-up questions, request evidence, and push back on weak arguments, rather than accepting initial output.56 This back-and-forth helped refine critiques and identify which concerns were most substantive.
- Quantitative estimates aren't reliable. The AI often suggested specific impacts like "could reduce cost-effectiveness by 15-25%" without solid basis. This may be because of AI's current inability to work with complex spreadsheets—our cost-effectiveness models span multiple sheets that are difficult to feed into AI tools.
- Context gets lost easily. The AI frequently raised concerns about data or methods we don't actually use. For example, it might critique the reliability of a data source that isn't in our models. This reinforced the need for researchers familiar with our actual methodology to review AI output.
Given that AI capabilities have been improving steadily, we expect that the limitations we've noted and the ~15% useful critique rate we observed may significantly understate what will be possible even in the near term. If future models show a greater ability to synthesize information, a better understanding of our context, or fewer hallucinations, the red teaming process we've developed could become substantially more valuable over time.
Next steps
Going forward, we plan to use AI red teaming as one additional tool when reviewing our work—a quick way to brainstorm potential criticisms that human investigators can then prioritize and verify. We don't expect substantial grantmaking impacts in the near term, but the low cost makes it worthwhile to consider in some cases.
What we'll continue doing:
- Using our current approach when teams want quick additional perspective on their research
- Testing new models as they're released, particularly those better at interpreting spreadsheets
- Monitoring developments through conversations with AI practitioners
There are several improvements to our AI red teaming approach we didn’t prioritize for this initial test but may consider in the future:
- Making our research more AI-accessible. AI output quality depends heavily on the context provided. We could experiment with better ways to feed our internal research to models, but meaningful improvements likely require AI advances in handling complex spreadsheets and large document volumes.
- Multi-agent workflows. We've considered using multiple AI agents in sequence (one for external validity, another for alternative approaches, a third for synthesis); a rough sketch of this idea appears after this list. Our current approach appears sufficient, and we don't know whether the added complexity would yield proportional benefits.
- Specialized research tools. While AI-assisted research platforms continue emerging, we haven't identified any that would meaningfully improve on our current approach.
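For concreteness, here is a rough sketch of what the sequential multi-agent idea could look like, reusing the same kind of stubbed model call as the earlier prompt sketch. The roles and prompt wording are hypothetical; we have not built this workflow.

```python
# Sketch of the sequential multi-agent idea (hypothetical roles and prompts;
# the model call is a stub passed in by the caller, not a specific vendor API).

AGENT_PROMPTS = {
    "external_validity": (
        "Critique how well the trial evidence generalizes to the contexts we fund:\n\n{materials}"
    ),
    "alternative_approaches": (
        "Propose alternative interventions or modeling approaches we may have overlooked:\n\n{materials}"
    ),
    "synthesis": (
        "Combine and rank the critiques below by novelty and likely impact on cost-effectiveness:\n\n{critiques}"
    ),
}


def multi_agent_red_team(materials: str, call_model) -> str:
    """Run specialist critique agents in sequence, then a synthesis pass."""
    critiques = [
        call_model(AGENT_PROMPTS[role].format(materials=materials))
        for role in ("external_validity", "alternative_approaches")
    ]
    return call_model(AGENT_PROMPTS["synthesis"].format(critiques="\n\n".join(critiques)))
```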
Beyond red teaming, we're investigating other ways AI could strengthen our research and grantmaking processes, including:
- Automating grant page writing. We've started using AI to help draft our public grant pages and plan to continue refining this approach.
- Literature tracking. We're piloting AI systems to monitor new research in 1-2 intervention areas to see if automated literature reviews could scale across our portfolio.
- Vetting empirical studies. We may explore whether AI can help identify methodological issues in the academic papers underlying our cost-effectiveness analyses.
- Quick cost-effectiveness assessments. AI could provide rapid initial estimates to help us prioritize which programs deserve deeper investigation.
- Forecasting support. AI-generated forecasts could complement our team's predictions about grant outcomes and program impacts.
Sources
- 1
We began with a straightforward method of providing an AI with our published research on a specific intervention and asking for critiques. The initial results were mixed. The AI often generated critiques that were:
- Repetitive: Highlighting concerns we had already explicitly discussed in our research.
- Irrelevant: Focusing on academic critiques that didn't materially affect our cost-effectiveness analyses.
- Unfounded or sensationalized: Raising speculative issues without a strong evidence base or exaggerating the potential impact of a specific critique.
- 2
Since we began drafting this page, there have been several model advances, such as the release of Anthropic’s Claude Sonnet 4.5. We tested our red teaming prompts with some of these newer models, but we’ve had less time to engage with their critiques and they were not featured on this page. Our testing of these models has not caused us to update our views on the utility or limitations of using AI for red teaming.
- 3
A list of our past grants to water chlorination programs can be seen here.
- 4
See the AI output related to this critique here.
- 5
While we have not deeply researched this claim, it appears to be true both that flooding results in suspended solids in water (see e.g. Si et al. 2022) and that this reduces chlorine’s effectiveness (see e.g. LeChevallier et al. 1981).
- 6
We have not verified how significant an issue this may be, nor how many of the trials we rely on occurred during flood years.
- 7
See the AI output related to this critique here.
- 8
The AI cited "Kenya and Malawi focus groups" as evidence for this critique, but on further probing, it appears that this evidence was hallucinated.
- 9
See the AI output related to this critique here.
- 10
“The strength of a chemical disinfectant (e.g., chlorine, chlorine dioxide, ozone) for inactivating pathogens when in contact with water can be measured by its [concentration-time] value” United States Environmental Protection Agency, “Disinfection Profiling and Benchmarking Technical Guidance Manual,” p.2. See chapter 4 on how to calculate concentration-time.
- 11
Crider et al. 2018 investigated the taste acceptability thresholds for chlorine in Bangladesh finding, “The median detection threshold was 0.70 mg/L (n = 25, SD = 0.57) for water dosed with liquid sodium hypochlorite (NaOCl) and 0.73 mg/L (n = 25, SD = 0.83) for water dosed with solid sodium dichloroisocyanurate (NaDCC). Median acceptability thresholds (based on user report) were 1.16 mg/L (SD = 0.70) for NaOCl and 1.26 mg/L (SD = 0.67) for NaDCC. There was no significant difference in detection or acceptability thresholds for dosing with NaOCl versus NaDCC. Although users are willing to accept treated water in which they can detect the taste of chlorine, their acceptability limit is well below the 2.0 mg/L that chlorine water treatment products are often designed to dose. For some settings, reducing dose may increase adoption of chlorinated water while still providing effective disinfection.”
- 12
See the AI output related to this critique here.
- 13
From WHO, “Guidelines for drinking-water quality,” March 21, 2022.
- “[Trihalomethanes (THMs)] are formed in drinking-water primarily as a result of chlorination of organic matter present naturally in raw water supplies. The rate and degree of THM formation increase as a function of the chlorine and humic acid concentration, temperature, pH and bromide ion concentration.” p. 475
- “It is emphasized that adequate disinfection should never be compromised… Nevertheless, in view of the potential link between adverse reproductive outcomes and THMs, particularly brominated THMs, it is recommended that THM levels in drinking-water be kept as low as practicable.” p. 477
- Two THMs (chloroform and bromodichloromethane) are categorized as Group 2B (possibly carcinogenic to humans). p. 478
- 14
Some examples of our past CMAM grantmaking include:
- A 2024 $2 million grant to the Alliance for International Medical Action (ALIMA) to extend our support for its malnutrition treatment programs in N'Djamena and Ngouri, Chad, for one year.
- A 2024 $4.8 million grant to Taimaka to support three years of malnutrition treatment in three local government areas (LGAs) in Gombe state, Nigeria.
- A 2024 $7.5 million grant to the International Rescue Committee (IRC) to support one year of malnutrition treatment in Burkina Faso, Chad, the Democratic Republic of the Congo (DRC), Niger, and Somalia.
- 15
See the AI output related to this critique here.
- 16
See here for our internal meta-analysis, which informs our estimate of CMAM’s effectiveness. This analysis relies on baseline and endline height and weight scores.
- 17
“…the weight of a child can change owing to different factors, so re-measurement should be done… at most 3–4 days after the first measurement.” WHO, “Recommendations for Data Collection, Analysis, and Reporting on Anthropometric Indicators in Children under 5 years old,” May 28, 2019, p. 47–48
- 18
We discuss some of the other benefits beyond weight gain in our intervention report on CMAM.
- 19
See the AI output related to this critique here.
- 20
We describe the details of CMAM programs in more detail in our intervention report on the topic.
- 21
Some examples of our past syphilis grantmaking include:
- A 2022 grant of $15 million to Evidence Action to provide technical assistance to the governments of Zambia and Cameroon to support the scale-up of syphilis testing and treatment in pregnancy
- A 2020 grant of $3.9 million from anonymous donors to Evidence Action to provide technical assistance to the Liberian government to support the scale-up of syphilis testing and treatment in pregnancy
- 22
See the AI output related to this critique here.
- 23
The WHO notes this possibility in their Consolidated guidelines on HIV testing for a changing epidemic and recommends increasing the number of tests needed to confirm a positive diagnosis when population prevalence is low: “At a population level the number of people testing for HIV who receive an HIV-positive diagnosis affects the likelihood of a correct diagnosis. Specifically, when the proportion of people testing for HIV who receive an HIV-positive result drops below 5%, at least three consecutive reactive tests are needed to maintain a 99% positive predictive value and thus ensure that HIV-positive diagnoses are accurate. For this reason, since 1997 WHO has recommended that countries with lower HIV burden (less than 5% HIV prevalence) use three consecutive reactive tests to provide an HIV-positive diagnosis (Fig. 2). In contrast, in high burden countries with 5% HIV prevalence or greater, WHO recommended using two consecutive reactive tests to provide an HIV-positive diagnosis.”
- 24
See the AI output related to this critique here.
- 25
A list of our previous grants to syphilis programs can be seen here by filtering for Topic = Syphilis. As of October 2025, we’ve directed almost $30 million to these programs.
- 26
See the AI output related to this critique here.
- 27
Parkes-Ratanshi et al. 2020 investigated syphilis treatment for pregnant women and their partners in Kampala, Uganda finding, “Only 18.3% of partners of pregnant women who tested positive for syphilis received treatment. Female partners of non-attendant men had worse birth outcomes. Encouraging men to accompany women to the ANC and testing both may address the urgent need to treat partners of pregnant women in sub-Saharan Africa to reduce poor fetal outcomes.”
- 28
In our cost-effectiveness model from our 2022 grant to Evidence Action’s syphilis treatment program, we estimated that 32% of women treated for syphilis will be re-infected before their next birth. If it’s true that >80% of partners of women with syphilis are not receiving treatment, it’s possible that we’re under-rating this concern.
- 29
The United Nations Inter-agency Group for Child Mortality Estimation (UN IGME) reports that in 2024 malaria was the cause of 15% of deaths of children under 5 in sub-Saharan Africa. See Figure 8 of this report on child mortality.
- 30
See our intervention report on malaria vaccines for more on how malaria vaccines work and how effective we think they are.
- 31
Some examples of our past malaria vaccine grantmaking include:
- A 2024 three-year, $18.2 million grant to PATH to provide technical assistance (TA) to the governments of Burkina Faso, the Democratic Republic of the Congo (DRC), Mozambique, Nigeria, and Uganda to support the nationwide rollouts of malaria vaccines.
- A 2022 $5 million grant to PATH to support ministries of health in Ghana, Kenya, and Malawi in the implementation of the RTS,S malaria vaccine through the end of 2023.
- 32
See the AI output related to this critique here.
- 33
- Neafsey et al. 2015 find that malaria vaccines’ efficacy depends heavily on matching of the CSP protein: “the 1-year cumulative vaccine efficacy was 50.3% (95% confidence interval [CI], 34.6 to 62.3) against clinical malaria in which parasites matched the vaccine in the entire circumsporozoite protein C-terminal (139 infections), as compared with 33.4% (95% CI, 29.3 to 37.2) against mismatched malaria (1951 infections) (P=0.04 for differential vaccine efficacy).”
- Evidence on vaccine-driven selection is limited; however, modeling work suggests it is plausible. See Masserey et al., 2024 preprint.
- 34
See the AI output related to this critique here.
- 35
There is some evidence suggesting that health worker capacity in immunization programs may be strained as vaccine schedules expand. Since 1974, the number of routine vaccines has grown from 6 to 13, while the health workforce has not expanded proportionally (see Siekmans et al. 2024). WHO estimates an 18 million health worker gap by 2030, primarily in low- and middle-income countries (see Beacham et al. 2023). Some studies suggest that prioritizing new immunization campaigns can crowd out other health services from limited budgets (see Ahmed et al. 2021). However, we have not found direct evidence on whether adding malaria vaccine doses specifically reduces coverage of other routine vaccines like measles. The magnitude of any crowding out effect likely varies by health system context and program design.
- 36
See the AI output related to this critique here.
- 37
See our intervention report on SMC for more on the program.
- 38
See this table for a list of our previous grantmaking to SMC programs.
- 39
See the AI output related to this critique here.
- 40
For example, Diawara et al. 2025 found that malaria vaccine uptake led to reduced interest in SMC for some caretakers in Mali. “Acceptability of the R21/Matrix-M vaccine was driven mainly by the high burden of malaria in the highly seasonal study area and consequent demand for a malaria vaccine, a perceived high efficacy of the R21/Matrix-M vaccine, and a high level of trust and confidence in the trial and trial team. These perceptions of the acceptability of the R21/Matrix-M vaccine led to a reduced perceived importance of seasonal malaria chemoprevention (SMC) among some caregivers, while others viewed R21/Matrix-M, SMC and insecticide-treated nets as complementary.”
- 41
See the AI output related to this critique here.
- 42
We make a −6% adjustment for adherence when calculating the intervention's cost per child reached. We calculated this adjustment based on previous Malaria Consortium monitoring reports here.
- 43
See the AI output related to this critique here.
- 44
For more information on ITNs, see our intervention report on the topic.
- 45
See this table for a list of our previous grantmaking to ITN programs.
- 46
See the AI output related to this critique here.
- 47
ITNs might prevent more bites on the margin when housing quality is worse (e.g. moving from 12 mosquito bites per night to 3), or they might offer more complete protection in better conditions (e.g. moving from 3 bites to 0). We’re unsure how this plays out in practice and which scenario would translate to greater mortality reductions.
- 48
We rely on a meta-analysis by Pryce et al. 2018 for many of the parameters of our ITNs cost-effectiveness analysis. While we believe it is the most reliable source of this information, 18 of the 23 studies included in the analysis were conducted before 2000. While we make several adjustments to account for differences between the trial contexts and the contexts where we fund ITN campaigns (see here), we remain uncertain whether these adjustments are sufficient.
- 49
See the AI output related to this critique here.
- 50
The WHO now treats lab bioassays as indicators of bioavailability, not predictors of public-health impact. WHO’s updated prequalification guideline and implementation guidance state that decisions should no longer rely on historic cone/tunnel thresholds; bioassays “provide an indication of the bioavailability of [the] active ingredient… rather than a prediction of efficacy,” and product-specific indicators/endpoints should match the mode of action and appropriate test systems. WHO, “Prequalification of Vector Control Products,” November 2023.
- 51
See the AI output related to this critique here.
- 52
“Storage conditions used for product specifications are lower than those encountered under product shipping and storage that may exceed 50 °C, and should be reconsidered.” Skovmand et al. 2021
- 53
See here. When we asked a follow-up question, the AI responded that it could not substantiate the claim about the focus groups, and our own quick scan of the literature did not find any such focus groups.
- 54
See the AI output related to this critique here. When asked a follow-up question, the AI could not substantiate the claim, and our own quick scan of the literature did not find any such alerts.
- 55
Specifically, the Deep Research report found studies that we had not identified. However, our own, later literature review turned up additional studies that the Deep Research AI did not find.
- 56
For example, the AI red teaming for our malaria vaccine research flagged a concern about selective pressures reducing the efficacy of the vaccines over time (see the AI output related to this critique here). This was a concern we were unfamiliar with, so it was helpful to follow up and request more fully fleshed-out reasoning and sources from the AI to substantiate the concern.