Replication Language

It is common to describe replication studies as “failed” when they don’t yield results in the same direction as the original study, or don’t have a p-value under the same threshold. Is this fair? What does a “failed replication” mean? Does it matter?

The answers are no, it depends, and yes.

What does it mean to fail?

Failure is often asserted when a replication study doesn’t yield results consistent with the original study. But calling this a failure is misleading: another investigation of the same phenomenon or construct can hardly be a failure unless the study fails to measure what it purported to measure. A replication study, when conducted properly, is a success and a valuable contribution to knowledge irrespective of the direction of its results. A sensible next step is to use meta-analysis to estimate the size of the underlying effect instead of making a dichotomous decision.

Why does this matter? Reporting the results of a replication study as failed implicitly reinforces dichotomous thinking. It’s p-value > .05 all the way down:

A well-known scientist (some say it was Bertrand Russell) once gave a public lecture on astronomy. He described how the earth orbits around the sun and how the sun, in turn, orbits around the center of a vast collection of stars called our galaxy. At the end of the lecture, a little old lady at the back of the room got up and said: “What you have told us is rubbish. The world is really a flat plate supported on the back of a giant tortoise.” The scientist gave a superior smile before replying, “What is the tortoise standing on?” “You’re very clever, young man, very clever,” said the old lady. “But it’s tortoises all the way down!” (Stephen Hawking, A Brief History of Time, 1988)

Let’s see what happens when we replace some of the words:

“What you have told us is rubbish. The truth is really a dichotomy supported on the back of a giant p-value.” The scientist gave a superior smile before replying, “What is the p-value standing on?” “You’re very clever, young man, very clever,” said the old lady. “But it’s p-values all the way down!”

Now, of course, answers to important questions can be dichotomous. When you propose marriage to someone, you probably do not want a confidence interval around some probability estimate. But scientists’ relationship to their experimental findings is unlike marriage: there is residual uncertainty, along with variable conditions and methodology, to take into account. Therefore dichotomy most often needs to be rejected in favor of magnitude or probability estimates.

What does all of this have to do with replication? Well, it so happens that when more information is reported than whether p fell above or below alpha, researchers can not only compare results against each other (this of course being a detour on our way to truth), but also arrive at more accurate estimates of the underlying “platonic” truth about the effect outside of the laboratory.

This also bears on the first question I asked: Is it fair to say that a replication failed? For one reason or another, these words have acquired a somewhat negative connotation, perhaps even implying that the original researchers did something funny, since we have now failed to obtain the same result. Moving away from dichotomous language (and research goals) makes it natural to incorporate new and old results into a meta-analysis for effect size estimates. In this view, both new and old studies are valuable, and both contribute to our quest for the “platonic” realm of truths.
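To make the meta-analytic view concrete, here is a minimal sketch of a fixed-effect (inverse-variance) pooled estimate from an original study and a replication. The function name `fixed_effect_meta` and the two effect sizes and standard errors are hypothetical, chosen only for illustration; a real analysis would probably also consider a random-effects model.

```python
import math

def fixed_effect_meta(estimates, std_errors, z=1.96):
    """Inverse-variance (fixed-effect) pooled estimate with an approximate 95% CI.

    Hypothetical helper for illustration; not taken from any particular library.
    """
    # Each study is weighted by the inverse of its sampling variance.
    weights = [1.0 / se ** 2 for se in std_errors]
    pooled = sum(w * e for w, e in zip(weights, estimates)) / sum(weights)
    pooled_se = math.sqrt(1.0 / sum(weights))
    ci = (pooled - z * pooled_se, pooled + z * pooled_se)
    return pooled, pooled_se, ci

# Made-up numbers: a small original study and a larger replication.
original_d, original_se = 0.50, 0.20
replication_d, replication_se = 0.10, 0.10

pooled, pooled_se, ci = fixed_effect_meta(
    [original_d, replication_d], [original_se, replication_se]
)
print(f"pooled d = {pooled:.2f}, SE = {pooled_se:.3f}, "
      f"95% CI = [{ci[0]:.2f}, {ci[1]:.2f}]")
```

Neither study “wins” here: the more precise replication simply gets more weight, and the pooled estimate is more precise than either study alone.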

Beyond dichotomy

Using quantitative language, instead of dichotomous, begins with you. When you tweet, blog or communicate your findings in another medium (maybe even journal articles!), consider whether you want to propagate the limited information contained in a dichotomy (~ p-value). Replications don’t fail (unless the replicators measured the wrong thing by accident, for example); they provide us with new, hopefully more accurate, estimates of the underlying effect size (or probability distribution, if Bayes strikes your fancy).

Update 1: Are we stuck riding on the back of a giant p-value? Here’s an argument for why psychologists can’t afford to study effect sizes. According to the authors of that blog post, we’d need three thousand subjects per condition to arrive at something we could call precision.
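To get a feel for why precision is so expensive, here is a rough sketch of how the 95% confidence interval around a standardized mean difference (Cohen’s d) narrows with sample size. It assumes two equal-sized groups and the usual large-sample approximation to the standard error of d; the function name `d_ci_half_width` is hypothetical, and the precision criterion used in the linked post may well differ from this one.

```python
import math

def d_ci_half_width(n_per_group, d=0.0, z=1.96):
    """Approximate 95% CI half-width for Cohen's d with two equal groups.

    Uses the large-sample variance 2/n + d^2 / (4n); a hypothetical helper,
    not the calculation from the linked post.
    """
    se = math.sqrt(2.0 / n_per_group + d ** 2 / (4.0 * n_per_group))
    return z * se

for n in (20, 100, 500, 3000):
    print(f"n per group = {n:5d}: d is estimated to within "
          f"+/- {d_ci_half_width(n):.3f}")
```

Under these assumptions, three thousand subjects per condition pins d down to roughly ±0.05, whereas twenty per condition leaves an interval wider than most effects psychologists study.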

Update 2: Shauna Gordon-McKeon recently discussed the same topic over at the OSC blog: Read the post.

Matti Vuorre
Postdoctoral Research Scientist