World Development symposium on RCTs

World Development has a great collection of short pieces on RCTs.

Here is Martin Ravallion’s submission: 

….practitioners should be aware of the limitations of prioritizing unbiasedness, with RCTs as the a priori tool-of-choice. This is not to question the contributions of the Nobel prize winners. Rather it is a plea for assuring that the “tool-of-choice” should always be the best method for addressing our most pressing knowledge gaps in fighting poverty.

… RCTs are often easier to do with a non-governmental organization (NGO). Academic “randomistas,” looking for local partners, appreciate the attractions of working with a compliant NGO rather than a politically sensitive and demanding government. Thus, the RCT is confined to what NGOs can do, which is only a subset of what matters to development. Also, the desire to randomize may only allow an unbiased impact estimate for a non-randomly-selected sub-population—the catchment area of the NGO. And the selection process for that sub-sample may be far from clear. Often we do not even know what “universe” is represented by the RCT sample. Again, with heterogeneous impacts, the biased non-RCT may be closer to the truth for the whole population than the RCT, which is (at best) only unbiased for the NGO’s catchment area.

And here is David McKenzie’s take:

A key critique of the use of randomized experiments in development economics is that they largely have been used for micro-level interventions that have far less impact on poverty than sustained growth and structural transformation. I make a distinction between two types of policy interventions and the most appropriate research strategy for each. The first are transformative policies like stabilizing monetary policy or moving people from poor to rich countries, which are difficult to do, but where the gains are massive. Here case studies, theoretical introspection, and before-after comparisons will yield “good enough” results. In contrast, there are many policy issues where the choice is far from obvious, and where, even after having experienced the policy, countries or individuals may not know if it has worked. I argue that this second type of policy decision is abundant, and randomized experiments help us to learn from large samples what cannot be simply learnt by doing.

Reasonable people would agree that the question should drive the choice of method, subject to the constraint that we should all strive to stay committed to the important lessons of the credibility revolution.

Beyond the questions about inference, we should also endeavor to address the power imbalances that are part of how we conduct research in low-income states. We want to always increase the likelihood that we will be asking the most important questions in the contexts where we work; and that our findings will be legible to policymakers. Investing in knowing our contexts and the societies we study (and taking people in those societies seriously) is a crucial part of reducing the probability that our research comes off as well-identified instances of navel-gazing.

Finally, what is good for reviewers is seldom useful for policymakers. We could all benefit from a bit more honesty about this fact. Incentives matter.

Read all the excellent submissions to the symposium here.

People Are Brains, Not Stomachs

Alex Tabarrok over at MR has a fantastic summary of some of the works of this year’s three Nobel Prize winners in Economics. This paragraph on one of Michael Kremer’s papers stood out to me:

My second Kremer paper is Population Growth and Technological Change: One Million B.C. to 1990. An economist examining one million years of the economy! I like to say that there are two views of humanity, people are stomachs or people are brains. In the people are stomachs view, more people means more eaters, more takers, less for everyone else. In the people are brains view, more people means more brains, more ideas, more for everyone else. The people are brains view is my view and Paul Romer’s view (ideas are nonrivalrous). Kremer tests the two views. He shows that over the long run economic growth increased with population growth. People are brains.

Here is the abstract from Kremer’s QJE paper:

The nonrivalry of technology, as modeled in the endogenous growth literature, implies that high population spurs technological change. This paper constructs and empirically tests a model of long-run world population growth combining this implication with the Malthusian assumption that technology limits population. The model predicts that over most of history, the growth rate of population will be proportional to its level. Empirical tests support this prediction and show that historically, among societies with no possibility for technological contact, those with larger initial populations have had faster technological change and population growth.
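A rough sketch of the mechanics, in my own notation rather than Kremer’s (the functional forms below are the standard ones from this literature, so treat it as illustrative rather than a transcription of the paper): output is Cobb–Douglas in population and fixed land, the Malthusian assumption pins income per person at subsistence, and nonrival ideas mean technology grows faster when there are more people. Put together, the population growth rate ends up proportional to the population level, which is the prediction taken to the long-run data.

```latex
% Stylized version of the model's logic (my notation; illustrative only)
\begin{align*}
  Y &= A\,P^{\alpha}\,T^{1-\alpha}, \qquad T \equiv 1
      && \text{output with technology $A$, population $P$, fixed land $T$} \\
  \frac{Y}{P} &= A\,P^{\alpha-1} = \bar{y}
      \;\;\Longrightarrow\;\; P = \left(\frac{A}{\bar{y}}\right)^{\frac{1}{1-\alpha}}
      && \text{Malthus: income per person stuck at subsistence $\bar{y}$} \\
  \frac{\dot{A}}{A} &= g\,P
      && \text{nonrival ideas: more people, more ideas} \\
  \frac{\dot{P}}{P} &= \frac{1}{1-\alpha}\,\frac{\dot{A}}{A} = \frac{g}{1-\alpha}\,P
      && \text{so the population growth rate is proportional to its level}
\end{align*}
```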

Read Tabarrok’s entire post here. Highly recommended.

Since Sunday I’ve been asking around if the Prize got any mention on local radio in Busia, Kenya — the cradle of RCTs, if you will, and where Kremer conducted field experiments. No word yet. Will report if I hear anything.

More on the apparently *transient* effects of unconditional cash transfers

Berk Ozler over at Development Impact has a follow-up post on GiveDirectly’s three-year impacts. The post looks at multiple papers analyzing results from the same cash transfer RCT in southwestern Kenya:

First, on the initial studies:

On October 31, 2015, after the release of the HS (16) working paper in 2013, but before the eventual journal publication of HS (16), Haushofer, Reisinger, and Shapiro released a working paper titled “Your Gain is My Pain.”  In it, they find large negative spillovers on life satisfaction (a component of the psychological wellbeing index reported in HS 16) and smaller, but statistically significant negative spillovers on assets and consumption. The negative spillover effects on life satisfaction, at -0.33 SD and larger than the average benefit on beneficiaries, imply a net decrease in life satisfaction in treated villages. Furthermore, the treatment (ITT) effects are consistent with HS (16), but the spillover effects are not. For example, the spillover effect on the psychological wellbeing index in Table III of HS (16) is approximately +0.1, while Table 1 in HRS (15) implies an average spillover effect of about -0.175 (my calculations: -0.05 * (354/100)). There appear to be similar discrepancies on the spillovers implied for assets and consumption in the HRS (15) paper and HS (16). I am not sure what to make of this, as HRS (15) is an unpublished paper – there must [be] a good explanation that I am missing. Regardless, however, these findings of negative spillovers foreshadow the three-year findings in HS (18), which I discuss next.

Then on the three-year findings:

As I discussed earlier this week, HS (18) find that if they define ITT=T-S, virtually all the effects they found at the 9-month follow-up are still there. However, if ITT is defined in the more standard manner of being across villages, i.e. ITT=T-C, then there is only an effect on assets and nothing else.

… As you can see, things have now changed: there are spillover effects, so the condition for ITT=T-S being unbiased no longer holds. This is not a condition that you establish once in an earlier follow-up and stick with: it has to hold at every follow-up. Otherwise, you need to use the unbiased estimator defined across villages, ITT=T-C.

To nitpick with the authors here, I don’t buy that [….] lower power is responsible for the finding of no significant treatment effects across villages. Sure, as in HS (16), the standard errors are somewhat larger for across-village estimates than the same within-village estimates. But, the big difference between the short- and the longer-term impacts is the gap between the respective point estimates in HS (18), while they were very stable (due to no/small spillovers) in HS (16). Compare Table 5 in HS (18) with Appendix Table 38 and you will see. The treatment effects disappeared, mainly because the differences between T and C are much smaller now, and even negative, than they were at the nine-month follow-up.

And then this:

If we’re trying to say something about treatment effects, which is what the GiveDirectly blog seems to be trying to do, we already have the estimates we want – unbiased and with decent power: ITT=T-C. HS (18) already established a proper counterfactual in C, so just use that. Doesn’t matter if there are spillovers or not: there are no treatment effects to see here, other than the sole one on assets. Spillover estimation is just playing defense here – a smoke screen for the reader who doesn’t have the time to assess the veracity of the claims about sustained effects.
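To make the estimator distinction concrete, here is a minimal simulation sketch. The numbers (a faded direct effect and a negative spillover of 0.2 SD) are made up for illustration and are not the HS estimates; T denotes treated households, S untreated households in treated villages, and C households in pure control villages. With a spillover on S, the within-village contrast T-S picks up that spillover and misstates the treatment effect, while the across-village contrast T-C does not.

```python
import numpy as np

rng = np.random.default_rng(0)

n = 10_000          # households per group (illustrative sample size)
true_effect = 0.0   # assume the direct treatment effect has faded by year three
spillover = -0.2    # assume a negative spillover on untreated neighbors (hypothetical)

# Simulated standardized outcome index for the three groups
control = rng.normal(0.0, 1.0, n)                    # C: households in pure control villages
treated = rng.normal(true_effect, 1.0, n)            # T: treated households
untreated_neighbors = rng.normal(spillover, 1.0, n)  # S: untreated households in treated villages

itt_across = treated.mean() - control.mean()              # ITT = T - C (unbiased for the treatment effect)
itt_within = treated.mean() - untreated_neighbors.mean()  # ITT = T - S (contaminated by the spillover)

print(f"ITT = T - C: {itt_across:+.3f}")  # close to 0.0: no lasting effect
print(f"ITT = T - S: {itt_within:+.3f}")  # close to +0.2: the spillover shows up as an 'effect'
```

The point is mechanical: T-S is only unbiased for the treatment effect when the spillover on S is zero, which is exactly the condition Ozler argues no longer holds at the three-year follow-up.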

Chris has a Twitter thread on the same questions.

Bottom line: we need more research on UCTs, which GiveDirectly is already doing with a (hopefully) better-implemented really long-term study.


On Field Experiments

Two quick thoughts:

  1. The world is a better place because more and more policymakers realize that evidence-based policymaking beats flying blind. Now if only we invested more in passing on policy design, implementation, and evaluation skills to bureaucrats….
  2. Whenever academics get involved in field experiments, we typically try to maximize the likelihood of publication (see Humphreys below). But what is good for journal reviewers may not always be useful for policymakers. This is not necessarily a bad thing. We just need to be up front about it, and have it inform our evaluation of the ethics of specific interventions.

Below are some excellent posts (both old and new) on the subject.

NYU’s Cyrus Samii:

Whether one or another intervention is likely to be more effective depends both on the relevant mechanisms driving outcomes and, crucially, whether the mechanisms can be meaningfully affected through intervention. It is in addressing the second question that experimental studies are especially useful. Various approaches, including both qualitative and quantitative, are helpful in identifying important mechanisms that drive outcomes. But experiments can provide especially direct evidence on whether we can actually do anything to affect these mechanisms — that is, experiments put “manipulability” to the test.

Columbia’s Chris Blattman:

I’m going to go even further than Cyrus. At the end of the day, the great benefit of field experiments to economics and political science is that it’s forced some of the best social scientists to try to get complicated things done in unfamiliar places, and deal with all the constraints, bureaucrats, logistics, and impediments to reform you can imagine.

Arguably, the tacit knowledge these academics have developed about development and reform will be more influential to their long run work and world view than the experiments themselves.

Columbia’s Macartan Humphreys on the ethics of social experimentation:

Social scientists are increasingly engaging in experimental research projects of importance for public policy in developing areas. While this research holds the possibility of producing major social benefits, it may also involve manipulating populations, often without consent, sometimes with potentially adverse effects, and often in settings with obvious power differentials between researcher and subject. Such research is currently conducted with few clear ethical guidelines. In this paper I discuss research ethics as currently understood in this field, highlighting the limitations of current approaches and the need for the construction of appropriate ethics, focusing on the problems of determining responsibility for interventions and assessing appropriate forms of consent.

…. Consider one concrete example where many of the points of tension come to a head. Say a researcher is contacted by a set of community organizations that want to figure out whether placing street lights in slums will reduce violent crime. In this research the subjects are the criminals but seeking informed consent of the criminals would likely compromise the research and it would likely not be forthcoming anyhow (violation of the respect for persons principle); the criminals will likely bear the costs of the research without benefitting (violation of the justice principle); and there will be disagreement regarding the benefits of the research—if it is effective, the criminals in particular will not value it (producing a difficulty for employing the benevolence principle). Any attempt at a justification based on benevolence gives up a pretense at neutrality since not everyone values outcomes the same way. But here the absence of neutrality does not break any implicit contract between researchers and criminals. The difficulties of this case are not just about the relations with subjects however. Here there are also risks that obtain to nonsubjects, if for example criminals retaliate against the organizations putting the lamps in place. The organization may be very aware of these risks but be willing to bear them because they erroneously put faith in the ill-founded expectations of researchers from wealthy universities who are themselves motivated in part to publish and move their careers forward.

University of Maryland’s Jessica Goldberg (Africanists, read Goldberg’s work):

Researchers have neither the authority nor the right to prohibit a control group from attending extra school, and they cannot require attendance from the treatment group. Instead, researchers randomly assign some study participants to be eligible for a program, such as tutoring.  Those in the control group are not eligible for the tutoring provided by the study, but they are not prohibited from seeking out tutoring of their own.

The difference may seem subtle, but it is important.  The control group is not made worse off or denied access to services it would have been able to access absent the experiment. It might not share in all of the benefits available to the treatment group, but that disadvantage is not necessarily due to the evaluation.

Georgetown’s Martin Ravallion:

I have worried about the ethical validity of some RCTs, and I don’t think development specialists have given the ethical issues enough attention. But nor do I think the issues are straightforward. So this post is my effort to make sense of the debate.

Ethics is a poor excuse for lack of evaluative effort. For one thing, there are ethically benign evaluations. But even focusing on RCTs, I doubt if there are many “deontological purists” out there who would argue that good ends can never justify bad means and so side with Mulligan, Sachs and others in rejecting all RCTs on ethical grounds. That is surely a rather extreme position (and not one often associated with economists). It is ethically defensible to judge processes in part by their outcomes; indeed, there is a long tradition of doing so in moral philosophy, with utilitarianism as the leading example. It is not inherently “unethical” to do a pilot intervention that knowingly withholds a treatment from some people in genuine need, and gives it to some people who are not, as long as this is deemed to be justified by the expected welfare benefits from new knowledge.

A call for “politically robust” evaluation designs

Heather Lanthorn cites Gary King et al. on the need for ‘politically robust’ experimental designs for public policy evaluation:

scholars need to remember that responsive political behavior by political elites is an integral and essential feature of democratic political systems and should not be treated with disdain or as an inconvenience. instead, the reality of democratic politics needs to be built into evaluation designs from the start — or else researchers risk their plans being doomed to an unpleasant demise. thus, although not always fully recognized, all public policy evaluations are projects in both policy analysis and political science.

The point here is that what pleases journal reviewers is seldom useful for policymakers.

H/T Brett

Can RCTs be useful in evaluating the impact of democracy and governance aid?

The Election Guide Digest has some interesting thoughts on the subject. Here is quoting part of the post:

The use of the RCT framework resolves two main problems that plague most D&G evaluations, namely the levels-of-analysis problem and the issue of missing baseline data. The levels-of-analysis problem arises when evaluations link programs aimed at meso-level institutions, such as the judiciary, with changes in macro-level indicators of democracy, governance, and corruption. Linking the efforts of a meso-level program to a macro-level outcome rests on the assumption that other factors did not cause the outcome.

An RCT design forces one to minimize such assumptions and isolate the effect of the program, versus the effect of other factors, on the outcome. By choosing a meso-level indicator, such as judicial corruption, to measure the outcome, the evaluator can limit the number of relevant intervening factors that might affect the outcome. In addition, because an RCT design compares both before/after in a treatment and control group, the collection of relevant baseline data, if it does not already exist, is a prerequisite for conducting the evaluation. Many D&G evaluations have relied on collecting only ex-post data, making a true before/after comparison impossible.

Yet it would be difficult to evaluate some “traditional” D&G programs through an RCT design. Consider an institution-building program aimed at reforming the Office of the Inspector General (the treatment group) in a country’s Ministry of Justice. If the purpose of the evaluation is to determine what effect the program had on reducing corruption in that office, there is no similar office (control group) from which to draw a comparison. The lack of a relevant control group and sufficient sample size is the main reason many evaluations cannot employ an RCT design.
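To see why baseline data in both arms matters, here is a toy difference-in-differences style calculation with entirely made-up numbers (the corruption index, group means, and effect size are all hypothetical). If the treated offices started out more corrupt than the comparison offices, an ex-post-only comparison confounds that pre-existing gap with the program effect, whereas a before/after comparison in both groups nets it out.

```python
import numpy as np

rng = np.random.default_rng(1)

n = 500                  # meso-level units (e.g., court offices) per arm -- hypothetical
common_trend = -0.05     # change every office experiences regardless of the program
program_effect = -0.10   # assumed true effect of the reform on the corruption index

# Hypothetical corruption index (higher = worse); the program targeted worse offices
baseline_treated = rng.normal(0.70, 0.10, n)
baseline_control = rng.normal(0.60, 0.10, n)

endline_treated = baseline_treated + common_trend + program_effect + rng.normal(0, 0.05, n)
endline_control = baseline_control + common_trend + rng.normal(0, 0.05, n)

# Before/after comparison in both groups (difference-in-differences)
did = (endline_treated.mean() - baseline_treated.mean()) \
      - (endline_control.mean() - baseline_control.mean())
print(f"before/after in both groups: {did:+.3f}")          # close to the true -0.10

# Ex-post-only comparison, as in evaluations that collect no baseline data
ex_post_only = endline_treated.mean() - endline_control.mean()
print(f"ex-post comparison only:     {ex_post_only:+.3f}")  # close to 0: the effect is masked
```

In a genuine RCT the baseline gap should be near zero by design, but collecting the baseline is still what makes the true before/after comparison possible.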

More on this here.

Food for thought

UPDATE: Gelman responds with the question: Why are there IRBs at all?

Ted Miguel and other similarly brilliant economists and political scientists (in the RCT mold) are doing what I consider R&D work that developmental states ought to be doing themselves. Sometimes it takes intensive experimental intervention to find out what works and what doesn’t. The need for such an approach is even higher when you are operating in a low-resource environment.

That said, I found the points in this post from the Monkey Cage (by Jim Fearon of my department) to be of great importance:

Why is there nothing like an IRB for development projects?   Is it that aid projects are with the consent of the recipient government, so if the host government is ok with it then that’s all the consent that’s needed?  Maybe, but many aid-recipient governments don’t have the capacity to conduct thorough assessments of likely risks versus benefits for the thousands of development projects going on in their countries.  That’s partly why they have lots of aid projects to begin with.

Or maybe there’s no issue here because the major donors do, in effect, have the equivalent of IRBs in the form of required environmental impact assessments and other sorts of impact assessments.  I don’t know enough about standard operating procedures at major donors like the World Bank, USAID, DFID, etc, to say, really.  But it’s not my impression that there are systematic reviews at these places of what are the potential political and social impacts of dropping large amounts of resources into complicated local political and social situations.

You can find the rest of the blog post here.

Look here for more information on RCTs.