Research Is Only as Good as Its Reproducibility

Complex systems have probabilistic, rather than deterministic, properties, and this fact made the social sciences look deficient next to the "real" hard sciences (as if hard sciences predicted weather or earthquakes better than economics predicts financial crises).

What’s the difference? When today’s results differ from yesterday’s, it’s not because the authors get science wrong. In most cases, they simply study slightly different contexts and may obtain seemingly contradictory results. Still, to benefit from generalization, it’s easier to treat “slightly different” as “the same” and regard the result as a random variable.

In this case, “contradictions” get resolved surprisingly simply: by replicating the experiment and collecting more data. In the end, you have a distribution of the impact across studies, not simply of the impact within a single experiment.
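To make the "result as a random variable" idea concrete, here is a toy simulation with hypothetical numbers: each replication draws a context-specific true effect plus its own sampling noise, and only the pooled distribution, not any single study, reveals both the typical impact and its spread.

```python
import random
import statistics

random.seed(42)

# Hypothetical setup: every study measures the "same" treatment in a
# slightly different context, so the true effect itself varies across
# studies, on top of the sampling error inside each study.
def run_study():
    true_effect = random.gauss(0.10, 0.05)  # context-specific true effect
    noise = random.gauss(0, 0.08)           # sampling error within the study
    return true_effect + noise

single_study = run_study()                   # one estimate; its sign may mislead
many_studies = [run_study() for _ in range(100)]

print(f"one study:      {single_study:+.3f}")
print(f"mean over 100:  {statistics.mean(many_studies):+.3f}")
print(f"spread (stdev): {statistics.stdev(many_studies):.3f}")
```

A single draw can easily come out negative here even though the average effect is positive; the spread across 100 replications is what tells you how seriously to take any one point on a chart like Schoenfeld and Ioannidis's.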

Schoenfeld and Ioannidis show the dispersion of results in cancer research (“Is everything we eat associated with cancer?”, 2012):


Each point indicates a single study that estimates how much a given ingredient may contribute to getting cancer. The bad news: onion appears more useful than bacon. The good news: we can say that a single estimate is never enough. A single study is not systematic, even after peer review.

A recent attempt to reproduce 100 major studies in psychology confirms the divergence: “A large portion of replications produced weaker evidence for the original findings.” The replicators also found a bias in reporting.

In economics, too, reported effects vary across papers. From Eva Vivalt (2014):


This chart reports how conditional cash transfers affect different outcomes, measured in standard deviations. Cash transfers exemplify the rule: the impact is often absent, and when present it varies (sometimes for the worse). For more, check these:

  • AidGrade: Programs by outcomes. A curated collection of popular public programs with their impact compared across programs and outcomes.
  • Social Science Registry. Registering a randomized trial in advance reduces positive-effect bias in publications and saves economists the data-mining effort when nothing interesting comes out of Plan A.

The dispersion of the impact is not a unique feature of randomized trials. Different estimates from similar papers appear elsewhere in economics. It’s most evident in literature surveys, especially those with nice summary tables: Xu, “The Role Of Law In Economic Growth”; Olken and Pande, “Corruption in Developing Countries”; DellaVigna and Gentzkow, “Persuasion.”

The problem, of course, is that evidence is only as good as its reproducibility. And reproducibility requires data on demand. But how many authors can claim that their results can be replicated? A useful classification by Levitt and List (2009):


Naturally-occurring data occurs naturally, so we cannot replicate it at will. A lot of highly cited papers rely on naturally occurring data from the right-hand-side methods. That’s, in fact, the secret. When an author finds a nice natural experiment that escapes accusations of endogeneity, his paper becomes an authority on the subject. (Either because natural experiments happen rarely and competing papers don’t appear, or because the identification looks so elegant that readers fall in love with the paper.) But this experiment is only one point on the scale. It doesn’t become reliable just because we don’t know where the other points would be.

Work based on controlled data gets less attention, but it gives a systematic account of causal relationships. Moreover, these papers cover treatments of a practical sort: well-defined actions that NGOs and governments can implement. This seamless connection is a big advantage, since taping “naturally-occurring” evidence to policies adds another layer of distrust between policy makers and researchers. For example, try to connect this list of references in labor economics to government policies.

Though to many researchers “practical” is an obscene word (and I don’t emphasize this quality), reproducible results are a scientific issue. What do reproducible results need? More cooperation, simpler inquiries, and less reliance on chance. More on this is coming.

Impact and Implementation of Evidence-Based Policies

Chris Blattman noted that economists lack evidence on important policies. That’s true for foreign aid programs, which Chris mentioned. But defined broadly, policy making in poor countries can source evidence from elsewhere. NBER alone supplies 20 policy-relevant papers each week. And so does the World Bank, which recently studied its own economy:

About 49 percent of the World Bank’s policy reports … have the stated objective of informing the public debate or influencing the development community. … About 13 percent of policy reports were downloaded at least 250 times while more than 31 percent of policy reports are never downloaded. Almost 87 percent of policy reports were never cited.

In an ideal world, policy makers would read more and adjust their economies to the models we already know thanks to decades of thorough research. This is not happening because policy makers are managers, not researchers with well-defined problems. And, as Russell Ackoff said, managers do not solve problems; they manage messes.

Governments have their own limits on the messes they can deal with. Economists in research, on the contrary, simplify messes into tractable models. Let’s take one of the most powerful ideas in development: structural change. Illustrated by Dani Rodrik:

McMillan and Rodrik - 2011 - Globalization, Structural Change and Productivity Growth

The negative slope of the fitted values says that people moved from more productive to less productive industries over time. That, of course, is a bad structural change. We can blame politics for this, but it’s hard to separate politics from, say, incompetence.
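The mechanics behind that chart can be sketched with the standard within/between decomposition of aggregate labor productivity (McMillan–Rodrik style). All sector numbers below are made up for illustration; only the sign pattern mirrors the figure: productivity rises within sectors while labor reallocates toward the less productive one.

```python
# Hypothetical two-sector economy: each entry is
# ((productivity at t0, t1), (employment share at t0, t1)).
sectors = {
    "agriculture":   ((1.0, 1.2), (0.60, 0.70)),
    "manufacturing": ((4.0, 4.4), (0.40, 0.30)),
}

# Within-sector term: initial employment shares times productivity changes.
within = sum(s0 * (p1 - p0) for (p0, p1), (s0, s1) in sectors.values())
# Structural-change term: final productivity times employment-share changes.
between = sum(p1 * (s1 - s0) for (p0, p1), (s0, s1) in sectors.values())

print(f"within-sector term:     {within:+.2f}")   # positive: sectors improved
print(f"structural-change term: {between:+.2f}")  # negative: bad reallocation
print(f"aggregate change:       {within + between:+.2f}")
```

With these made-up numbers both sectors become more productive, yet aggregate productivity falls, because employment shifts toward the low-productivity sector. That is exactly the pattern a negative fitted slope describes.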

Emerging (and not so emerging) economies love the idea of employment growing in productive sectors. Even reports on sub-Saharan Africa regularly refer to knowledge economies and high-value-added industries. But in the end, many nations have something like that picture. (Oh, those messes.)

Did economists learn to manage messes better than public officials? Well, that’s what development economics is trying to accomplish. While it doesn’t capture “general equilibrium effects” (the key takeaway from Daron Acemoglu), the baseline for judging the effectiveness of evaluation programs sits well below this and other criticisms. The baseline is ultimately the intuition of a local public official, and the policies he would otherwise enact if there were no evidence-based programs.

Instead, these assessment programs provide simple tools for clear objectives. NGOs and local governments can expect something specific.

What about big evidence-based policies? They require capacity building. At the extreme, look at the healthcare reform in the United States. Before anything happened, the Affordable Care Act already ran to about 1,000 pages. Implementation was difficult. Could a government in Central Africa implement a comparable reform, even with abundant evidence on healthcare in the US or at home?

Economists tend to ignore the problem of implementation just as the potential impact of their insights increases. The connection is not direct, but if you simplify a complex problem, you get a solution to the simplified problem. Someone else must complete the solution, and that becomes the problem.

A Billion-Dollar Bill on the Sidewalk

A new $29 bn. stimulus announced by Japan reminds us how much more effective the package could be if we knew more about the impact of fiscal spending. Christina Romer (2012) and the Council of Economic Advisers (2014, Ch. 3, Appx. 2) update the traditional aggregate estimates, but any such spending is also an opportunity for randomized trials. That is, a missed opportunity.

The motivation for experiments in macro is, of course, omitted variable bias. Macro has natural experiments to handle it. That’s what you find in research like Romer and Romer (1989), Card and Krueger on the minimum wage in New Jersey, and Card on the Cuban immigrants in Miami. Natural experiments are pure luck in this sense: you need to look for pseudo-random assignments, which are rare. In contrast, designed experiments make all kinds of random assignments at will, including those allowing for interactions between macro policies. Governments spend hundreds of billions on programs outside routine annual budgets. These programs have nice, open-minded goals of supporting specific sectors or people. However, a typical assignment is not random within target groups, and this greatly complicates estimation of how effectively the money has been spent.
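The contrast can be sketched in a few lines (purely hypothetical numbers): when assignment is random, a plain difference in means recovers the treatment effect even though every unit carries a large unobserved baseline that would otherwise demand careful modeling.

```python
import random

random.seed(0)

# Stylized illustration: a program with a true effect of 5.0 on some
# outcome, assigned at random to half of 10,000 hypothetical units.
TRUE_EFFECT = 5.0
population = range(10_000)

treated = set(random.sample(population, 5_000))  # random assignment at will

def outcome(i):
    baseline = random.gauss(100, 15)  # unobserved heterogeneity across units
    return baseline + (TRUE_EFFECT if i in treated else 0.0)

t = [outcome(i) for i in population if i in treated]
c = [outcome(i) for i in population if i not in treated]

estimate = sum(t) / len(t) - sum(c) / len(c)
print(f"estimated effect: {estimate:.2f}")  # close to 5.0
```

Had the assignment favored, say, units with higher baselines (the typical non-random targeting within groups), the same difference in means would mix the program's effect with the selection, which is exactly the estimation headache the text describes.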

The 2009 American Recovery and Reinvestment Act created a few incidental opportunities for evaluations to appear, but apart from these bottom-up initiatives, the stimulus was business as usual. Eventually, the 2014 Economic Report of the President recommended RCTs for microeconomic programs and grants (2014, Ch. 7). It was an important step, but with too little attention to macro RCTs, which will have to wait.

Waiting for randomized macro evaluations costs billions of dollars, as policy makers launch programs based on careful, but imprecise, expectations of the impact. That’s despite the fact that per-capita costs of macro evaluations are lower than the overheads of comparable microeconomic programs. Assignment in macro is simpler; household and firm responses appear in regular statistical reports. Why not run more evaluations? No sophisticated problems or conspiracies here. It just takes twenty years for any idea to travel from economists to policy makers. The stopwatch is somewhere in the middle right now.

That means yes, the joke about the $10 bill on the sidewalk is actually not about economists.

Not Taking History Seriously

Lant Pritchett asks development economists to fill in a table for impact evaluations (see questions in the header):



Lant has a fair point: do these popular evaluations measure the drivers of development seen before in now-developed countries? After all, shouldn’t we try the same old drivers in developing countries? Hmmm, many items on the list pass the smell test. Should we doubt the rest? We should, but maybe for other reasons.

Randomized evaluations usually seek to overcome the references to experience that Lant’s questions imply. In developed countries, an overwhelming fraction of policies don’t work when evaluated (see the Why It Matters section here). Those countries can afford ineffective policies; less fortunate countries can’t afford to copy them in that.

Nothing is wrong with taking these ideas and testing them in Africa. That happens, as with school meals or conditional transfers (food stamps and government-supported student loans are conditional transfers in the US, for instance). And sure, programs get studied, adapted, and adjusted along the way.

The general problem is this: if developing countries were like Europe in 1870 or Japan in 1950, they would grow without the much-respected economics troops from Boston. Europe had strong states controlling their territories and populations in the 19th century. Governments raised taxes when necessary (usually for wars), workers had strong organizations, and inventors created cutting-edge technologies for the world markets. An African worker may earn the same wage an English worker got in 1820, but they live in different world economies. And the development gap relative to the modern English worker is many times wider.

If you take the plain-vanilla 19th-century European experience and plant it in Africa, you do what any tribal chief in Congo can do with Adam Smith’s book under his pillow. Naturally, plain vanilla doesn’t work. But if it doesn’t, then what’s wrong with current impact evaluations? Not the past experience; evaluators do check history.

Evaluations have to meet many constraints. Bigger programs, à la the government policies of Asian modernization, have many stakeholders, each with an opinion. Evaluators negotiate the intervention with each one and compromise on scale. It’s not like you can start an Industrial Revolution in Congo. So RCT designers prefer reliable knowledge about a feasible intervention, because that’s what makes an RCT designer. If you can’t negotiate a proper design, the result is worth about the same as abundant historical data. Which is little, since that data doesn’t distinguish between impact and noise.

When ROI Hits the Roof


The Coalition for Evidence-Based Policy has a nice compilation of low-cost program evaluations. Example #2 tells us about a $75 million education program that improved nothing. The cost of finding this out was $50K. Simple math says that the return on money invested in the evaluation reached 150,000%. It kinda outperforms the S&P 500.
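The arithmetic behind that 150,000% figure is a one-liner: treat the $75M of avoidable spending as the return on the $50K evaluation.

```python
# Back-of-the-envelope check of the ROI claim: the $50K evaluation reveals
# that the $75M program improves nothing, so the avoidable program spending
# counts as the return on the evaluation.
program_cost = 75_000_000  # dollars spent on the ineffective program
evaluation_cost = 50_000   # cost of finding out that it doesn't work

roi_percent = program_cost / evaluation_cost * 100
print(f"ROI: {roi_percent:,.0f}%")  # ROI: 150,000%
```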

What’s the trick? First, as mentioned before, maybe the same program accomplishes something else. The program aimed at improving student results and attendance, and it didn’t improve them. But the teachers got $3K more each and bought themselves useful things. Nothing wrong with that, but we need other ideas to improve education.

Second, so-called unconditional money transfers rarely motivate better performance, counterintuitive as that may seem. Not only in education; public services just happen to be in full view of everybody. The ROI of evaluation then depends on how much the government or business puts into unchecked programs. This time it was $75 million; next time it’s $750 million. Big policies promise big returns, either through better selection or faster rejection.

Third, such opportunities exist because big organizations evaluate execution, not impact. Execution is easier to monitor, so public corporations have to have independent auditors who ensure that employees don’t steal. In contrast, an efficiency audit requires management’s genuine interest in rigorous evaluation, and there are no incentives for that. After all, stealing is a crime everywhere, while incompetence is not (despite incompetence being more wasteful).

With that said, an ROI of 150,000% is a fact. If you spend on a policy to do X and the policy does nothing to X, you have just left $75M on the table. Without that $50K evaluation, you’d lose it.

Making Informed Choices in the Complex World

Human nature is complex. The world has billions of interconnections. But understanding them is surprisingly simple.

Here’s a practical question: how can the government improve education? Google returns 600 million answers. Unfortunately, most suggestions can’t help. But we can still find out which of them would. The MIT Poverty Action Lab ran a series of evaluations:


This long page says that half of the programs developed by top experts had no impact on test scores (their horizontal lines touch the vertical zero on the left-hand plot). Though these programs may be useful for something else, they are money wasted as far as learning itself is concerned. The plot on the right is scarier: it shows the cost-effectiveness of the programs on a log scale. You can see a 100-fold difference in the cost-effectiveness of scholarships and information provision, with respect to their impact on test scores. It’s like having two shops on one street: one sells 1 apple for $100, the other sells 100 apples for the same $100.

Take Google or Microsoft, which question the impact of their actions too. Instead of education, they care about profits. They ran similar evaluations and found that 80–90% of their ideas don’t work.

The world is complex and punishes unjustified self-confidence. Health care, finance, government, and nonprofits employ policies that are supposed to work, but they don’t when tested. And these policies stay in force because, well, someone is already paid for being very confident in them. Besides, recognizing the opposite takes courage and doubt, both of which look harmful to a career. Costs and complexity aren’t the problem; evaluations are simple and often very cheap. The studies above separated out the impact with randomized evaluations, but the choices are many:


And more on them later.

Ordinary Government Failures

(comparing public policies against one of the deadliest diseases; source)

Governments make mistakes. But not those that typically get into the press.

Stories about government failures, the sort of Brookings and Heritage Foundation stories you see here and there, are inconclusive. It’s unclear where a “failure” starts because you have no baseline for “success.” As a result, the press and think tanks criticize governments for events anyone can be blamed for, mostly because of their huge negative effects.

The financial crisis of 2008 is a conventional government failure in public narratives. So is September 11. But neither was predicted by alternative institutions. The individual economists who forecast the 2008 crash came from different backgrounds, organizations, and countries. These diverse frameworks, though very valuable as dissent, are not a systematic improvement over mainstream economics. Predicting 9/11 has an even weaker record (Nostradamus and the like).

Governments make other, more systematic, mistakes. Studying and reporting these mistakes makes sense because a government can do better in the next iteration. The government can’t learn from the Abu Ghraib abuse, however terrible it was. But it can learn to improve domestic prisons, in which basically similar things happen routinely.

Systematic problems are easier to track, predict, and resolve. A good example unexpectedly comes from the least developed nations. Well, from the international organizations and nonprofits that run their anti-poverty programs there. These organizations use randomized evaluations and quasi-experimental methods to separate out the impact of public programs on predefined goals. The results show manifold differences in the efficacy of the programs, and that’s a huge success.

Organizations such as the MIT Poverty Action Lab and Innovations for Poverty Action have evaluated hundreds of public policies over the last ten years. Now, guess how much press coverage they got. Zero. The NYT can’t even find a mention of the Lab among its articles. Google returns 34 links for the same query, most of them to hosted blogs.

One explanation is the storytelling tradition in newspapers. Journalists are taught to tell stories (which is what readers like). Presenting systematic evidence makes a bad story. There is little drama in numbers, however important they are. And telling numbers reduces your readership, which is incompatible with a successful journalism career. Even the new data journalism comes from blogs, not well-established publishers.

More fundamentally, the mass media’s choice of priorities leads to little attention to systematic problems in general. Each day brings hot news that sounds interesting, however irrelevant and impractical it may be. Reporting public policy research can’t compete in hotness with political speeches and new dangerous enemies. It took a couple of decades for climate change to become a somewhat regular topic. And the survival rates of other important issues are much lower.