When Are Direct Replications Necessary?

We are told that replication is the heart of all sciences. As such, psychology has recently seen numerous calls for direct replication. Sanjay Srivastava says that replication provides an opportunity to falsify an idea (an important concept in science, but one rarely exercised in psychology). Brian Nosek and Jeffrey Spies suggest that replication would help identify “manufactured effects” rapidly. And Brent Roberts has proposed a three-step process, the last step of which is a direct replication of any unique study reported in the package of studies.

Not everyone thinks that direct replications are useful, though. Andrew Wilson has argued that replication will not save psychology and that better theories are needed. Jason Mitchell has gone so far as to say that failed replications offer nothing to science, as they are largely the result of practical mistakes on the part of the experimenters. So are direct replications necessary? My answer is a definitive: sometimes.

Let’s start by considering what I gather to be some of the main arguments for direct replications.

  • You might have screwed up the first study. This is one of the reasons Brent Roberts has proposed direct replications (see his response to my comment). Interestingly, this is the other side of the argument posed by Mitchell. That is, you could have typos in your measures, left questions off the survey, the coffee maker could have interfered with the EEG readings,[1] or the data could have been mishandled.
  • Direct replications, when combined with meta-analysis, yield more precise effect size estimates. What is better than one study with N=50? How about two studies with N=50 in each! Perspectives on Psychological Science is now accepting registered replication reports and one of the motivating principles is that “Direct replications are necessary to estimate the true size of an effect.” Likewise, Sean Mackinnon says “It is only through repeated experiments that we are able to center on an accurate estimate of the effect size.”
  • Direct replications can improve generalizability. All other things being equal, we would like our results to generalize to the largest group of people possible. If a study yields the expected results only when conducted on University of Michigan undergrads, we would not be so impressed. Direct replications by different investigators, in different locations, sampling from different sub-populations can offer critical information about generalizability.

But there are problems with these arguments:

  • You might have screwed up the first study. Yes, there may have been methodological problems and artifacts in the first study. But how is running a direct replication supposed to fix this problem?

My guess is that most modern labs gather data in a similar fashion to the way we gather data in my lab. One of my (super smart) graduate students logs into our survey software (we use Qualtrics) and types the survey in there, choosing the scale points, entering anchors, etc. We go through the survey ourselves checking for errors. Then we have Research Assistants do the same. Then we gather, say, N=5 data points (these data points are usually from members of the research team) and download the data to make sure we understand how the software is storing and returning the values we gave it. Then we run the study. Now, when it comes time to do another study, do we start all over again? No. We simply click “copy survey” and the software makes a copy of the same survey we already used for another study. We can do that with lots of different surveys, to the point that we almost never have to enter a survey by hand again.

Now, these are not direct replications we are running. These are new studies using the same measures. But if we were running a direct replication, how would the process be different? It would be even worse, because we wouldn’t even create anything new. We would just have new participants complete the same Qualtrics survey we created before, noting which participants were new. So if we screwed up the measures the first time, they are still screwed up now.

Is this a new-age internet survey problem? I doubt it. When I was an undergraduate running experiments we essentially did the same thing only with paper copies. So if the anchor was off the first time someone created the survey (and no one noticed it), it was going to be off on every copy of that survey. And if we were running a direct replication, we wouldn’t start from scratch. We would just print out another copy of the same flawed survey.

Here is the good news though. With Open Science everyone can see what measures I used and how I screwed them up. Further, with open data everyone can see how the data were (mis)handled and if coffee makers created implausible outliers. Moreover, with scripted statistical analysis languages like R everyone can reproduce my results and see exactly where I screwed up the analysis (you can’t do that with point-and-click SPSS!).

Direct replications are not the solution to the problem of methodological/statistical flaws in the first study; Open Science is.

  • Direct replications, when combined with meta-analysis, yield more precise effect size estimates. This is absolutely 100% correct. It is also absolutely 100% unnecessary.

Consider three scenarios: (a) one study with n=20, (b) one study with n=400, (c) 20 studies with n=20 in each that are meta-analytically combined. Which of these will yield the most precise effect size estimate? Obviously it isn’t (a), but what about between (b) and (c)? This was precisely the topic of a post by Felix Schönbrodt. In it, Felix showed empirically that the precision of (b) and (c) are identical. While his post has great insights about precision, the empirical demonstration was (sorry Felix!) a bit useless. The mathematics of meta-analysis are some of the simplest in psychology and it is trivial to show that the standard error of an effect size based on N=400 is the same as one based on…errr, N=400.
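The standard-error arithmetic can be made concrete in a few lines. This is a minimal sketch in Python (not R); the fixed-effect inverse-variance formulas are standard, but the function names and the sd value are mine, chosen for illustration:

```python
import math

def se_single(sd, n):
    """Standard error of a mean-based estimate from one study of size n."""
    return sd / math.sqrt(n)

def se_meta(sd, n_per_study, k):
    """Fixed-effect (inverse-variance) combined standard error of
    k identical studies, each of size n_per_study."""
    w = n_per_study / sd ** 2        # each study's weight: 1 / SE^2
    return math.sqrt(1 / (k * w))    # combined SE = sqrt(1 / sum of weights)

sd = 1.0
print(se_single(sd, 400))    # one study with N = 400 -> 0.05
print(se_meta(sd, 20, 20))   # 20 studies of n = 20, meta-analyzed -> 0.05
```

The two numbers are identical: meta-analytically combining 20 studies of n=20 buys exactly the precision of a single study with N=400, which is the point about scenarios (b) and (c).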

So to put this bluntly, if we are interested in more precise effect sizes (and I think we should be), we don’t need direct replications. We need larger initial studies. Indeed, Schönbrodt and Perugini (2013) suggested that stable effect size estimates result when N=250.[2] If editors and reviewers considered sample size a more important criterion for publication, there would be (a) fewer Type I errors in the literature, and (b) more precise effect size estimates (i.e., less over-estimation of effect sizes due to publication bias). To underscore this point, consider the recent Facebook experiment that received much attention. The study had a total sample of N=689,003. On the scale of r the effect size estimates have a margin of error of ± .002. In one blog post about the study, someone commented that the worst part is that, because Facebook is so dominant in the SNS market, no one else will be able to replicate it to see if the effect really exists.[3] Seriously?!? Sorry. This study does not need to be directly replicated. The effect size estimates are pretty precise.
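The ± .002 figure is easy to verify: near r = 0, the standard error of a Fisher z-transformed correlation is 1/√(N−3), so a rough 95% margin of error is 1.96/√(N−3). A quick check in Python (the function name is mine):

```python
import math

def moe_r(n, z=1.96):
    """Approximate 95% margin of error for a correlation near r = 0,
    using the Fisher z standard error 1 / sqrt(n - 3)."""
    return z / math.sqrt(n - 3)

print(round(moe_r(689_003), 4))   # Facebook experiment: ~0.0024
print(round(moe_r(250), 3))       # N = 250 study: ~0.125
```

With N=689,003 the margin of error is indeed about ± .002, two orders of magnitude tighter than a typical N=250 study.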

  • Direct replications can improve generalizability. YES! Direct replications are incredibly useful for improving generalizability.

The final argument for direct replications is to improve generalizability. This is the only reason that anyone should[4] call for direct replications. If I run a study on undergraduate students in south Florida (even with a large sample), you should in fact wonder if these results generalize to working adults in south Florida and to people from places all over the world. We probably have intuitions about which studies are likely to generalize and which ones aren’t, so we might rely on those to determine which studies are in most need of replication (for generalization purposes!). Or we might focus on trying to replicate studies that, if they are generalizable, would have the most practical and theoretical impact. I’d also suggest that we should periodically try to replicate important studies conducted some time ago (e.g., 20 years or more) just to be sure that the results generalize to modern people.

So to summarize, if we conduct good research in the first place, then we should[5] only need direct replications for generalizability purposes. If we find an error in a previously conducted study, then we should fix it. If that means running a new study with the error fixed, fine. But that isn’t a replication. It is a different (unflawed) study. And if we have good studies to begin with, we don’t need meta-analyses to provide more precise effect size estimates; we will already have them.

So, what constitutes a good study? We all probably have our own ideas, but I will provide my own personal criteria for study quality in another blog post.



[1] I have no idea if coffee makers can interfere with EEG machines. I don’t drink coffee.

[2] I think N=200 is a better number, but that is the subject of another blog post to come in the future (and also Simine Vazire’s suggestion!).

[3] I wish I had recorded the post and comment, but I didn’t. You’ll just have to take my word that it existed and I read it.

[4] I really want to emphasize should here because this is my fantasy-land where researchers actually care about precisely measuring natural relationships. If actions speak louder than words, we know that much research in psychology has little interest in precisely measuring natural relationships.

[5] See footnote 4. Please also note that I am only referring to direct replications here, not conceptual replications or extensions. Indeed, conceptual replications and extensions are crucial ways of demonstrating generalizability.

4 thoughts on “When Are Direct Replications Necessary?”

  1. Sanjay Srivastava

    Interesting post Ryne. Regarding your first point – experiments that are fully computerized can be copied precisely and audited by third parties. Even paper questionnaires can be copied with very high fidelity, as you point out.

    But experimental methods are diverse, and there are many aspects of other methods that are not easily copyable or auditable or both. A psychophys study might depend on sensors being attached correctly and consistently. An imaging study might depend on keeping subjects still to avoid motion artifacts. A social psych study might depend on the performance of a confederate. Many studies require that researchers who interact with subjects are blinded to conditions and hypotheses, and blinding can fail. (Sidenote: I wonder how many RAs in “blinded” studies can figure out the hypothesis by browsing their PI’s website.)

    I agree with you that openness can help a lot with these issues. (It solves other issues too — for example, reviewers and readers can make better evaluations when they have full access to stimuli, measures, etc.). But I think there’s enough room for procedural error in many studies that openness won’t fully solve it.

    Also, another reason for running replications that you left off your list is publication bias. What we see in the literature is biased by virtue of having been selected on significance, meaning that studies that overestimate effects (or find false positives, if you’re an NHST kind of person) are overrepresented in the published literature, and studies that underestimate effects (or fail to reject the null) are underrepresented. That might change if preregistration becomes common practice. But for the time being, there is value in running a direct replication of a published study to get an unbiased answer.

    1. Neuroskeptic

      “There is value in running a direct replication of a published study to get an unbiased answer.”

      But only if you publish the replication regardless of what you find 😉

  2. Neuroskeptic

    Great post. But I think that replications can help fix the “You might have screwed up the first study” problem, albeit not directly.

    As you say, it is really openness that helps spot screw-ups. Not replications per se. But in many cases people only bother looking into the details of something, once they decide to replicate it. If you upload the source code for your analysis then I would be willing to bet that, while a few people may browse the code, the first person who will actually run that code (apart from you) will be someone trying to replicate (or extend) it.

  3. Ryne Post author

Thanks for the comments Sanjay and Neuroskeptic. I agree with the points you two made and very much appreciate them, as I think they add to the post. Regarding the subtle details of an experiment that might not make it into a Method section (or that cannot quite be described in words), I *might* suggest that this actually falls under the generalizability heading (although I didn’t mention it before). Brunswik’s discussion of representative design comes to mind. He argued that in some sense it would be better if people were LESS descriptive of the details of their methods, so that if the findings did replicate we could be more confident they were not merely the result of minor details. I remember speaking with colleagues in graduate school about so-called ‘golden hands’ experiments, wherein the studies only worked when someone’s ‘golden hands’ were involved. If an intelligent and properly trained undergraduate cannot get the same results, it is hard for me to believe that they are real.

Regarding publication bias, if we changed the criteria for publication from the silly NHST, p < .05, framework to one based on margin of error, publication bias would disappear. That's another post I haven't finished yet. /Sigh/

