Replication | Sherman's Head

We are told that replication is the heart of all sciences. As such, psychology has recently seen numerous calls for direct replication. Sanjay Srivastava says that replication provides an opportunity to falsify an idea (an important concept in science, but rarely done in psychology). Brian Nosek and Jeffrey Spies suggest that replication would help identify “manufactured effects” rapidly. And Brent Roberts proposed a three step process, the last of which is a direct replication of any unique study reported in the package of studies.

Not everyone thinks that direct replications are useful though. Andrew Wilson has argued that replication will not save psychology and better theories are needed. Jason Mitchell has gone so far as to say that failed replications offer nothing to science as they are largely the result of practical mistakes on the part of the experimenters. So are direct replications necessary? My answer is a definitive: sometimes.

Let’s start by considering what I gather to be some of the main arguments for direct replications.

You might have screwed up the first study. This is one of the reasons Brent Roberts has proposed direct replications (see his response to my comment). Interestingly, this is the other side of the argument posed by Mitchell. That is, you could have typos in your measures, left questions off the survey, the coffee maker could have interfered with the EEG readings,[1] or the data could have been mishandled.
Direct replications, when combined with meta-analysis, yield more precise effect size estimates. What is better than one study with N=50? How about two studies with N=50 in each! Perspectives on Psychological Science is now accepting registered replication reports and one of the motivating principles is that “Direct replications are necessary to estimate the true size of an effect.” Likewise, Sean Mackinnon says “It is only through repeated experiments that we are able to center on an accurate estimate of the effect size.”
Direct replications can improve generalizability. All other things being equal, we would like our results to generalize to the largest group of people possible. If a study yields the expected results only when conducted on University of Michigan undergrads, we would not be so impressed. Direct replications by different investigators, in different locations, sampling from different sub-populations can offer critical information about generalizability.

But there are problems with these arguments:

You might have screwed up the first study. Yes, there may have been methodological problems and artifacts in the first study. But how is running a direct replication supposed to fix this problem?

My guess is that most modern labs gather data in a similar fashion to the way we gather data in my lab. One of my (super smart) graduate students logs into our survey software (we use Qualtrics) and types the survey in there, choosing the scale points, entering anchors, etc. We go through the survey ourselves checking for errors. Then we have Research Assistants do the same. Then we gather say N=5 data points (these data points are usually from members of the research team) and download the data to make sure we understand how the software is storing and returning the values we gave it. Then we run the study. Now, when it comes time to do another study do we start all over again? No. We simply click “copy survey” and the software makes a copy of the same survey we already used for another study. We can do that with lots of different surveys to the point that we almost never have to enter a survey by hand again.

Now these are not direct replications we are running. These are new studies using the same measures. But if we were running a direct replication, how would the process be different? It would be even worse because we wouldn’t even create anything new. We would just have new participants complete the same Qualtrics survey we created before noting which participants were new. So if we screwed up the measures the first time, they are still screwed up now.

Is this a new-age internet survey problem? I doubt it. When I was an undergraduate running experiments we essentially did the same thing only with paper copies. So if the anchor was off the first time someone created the survey (and no one noticed it), it was going to be off on every copy of that survey. And if we were running a direct replication, we wouldn’t start from scratch. We would just print out another copy of the same flawed survey.

Here is the good news though. With Open Science everyone can see what measures I used and how I screwed them up. Further, with open data everyone can see how the data were (mis)handled and if coffee makers created implausible outliers. Moreover, with scripted statistical analysis languages like R everyone can reproduce my results and see exactly where I screwed up the analysis (you can’t do that with point-and-click SPSS!).

Direct replications are not the solution to the problem of methodological/statistical flaws in the first study; Open Science is.

Direct replications, when combined with meta-analysis, yield more precise effect size estimates. This is absolutely 100% correct. It is also absolutely 100% unnecessary.

Consider three scenarios: (a) one study with n=20, (b) one study with n=400, (c) 20 studies with n=20 in each that are meta-analytically combined. Which of these will yield the most precise effect size estimate? Obviously it isn’t (a), but what about between (b) and (c)? This was precisely the topic of a post by Felix Schönbrodt. In it, Felix showed empirically that the precision of (b) and (c) are identical. While his post has great insights about precision, the empirical demonstration was (sorry Felix!) a bit useless. The mathematics of meta-analysis are some of the simplest in psychology and it is trivial to show that the standard error of an effect size based on N=400 is the same as one based on…errr, N=400.

So to put this bluntly, if we are interested in more precise effect sizes (and I think we should be), we don’t need direct replications. We need larger initial studies. Indeed, Schönbrodt and Perugini (2013) suggested that stable effect size estimates result when N=250.[2] If editors and reviewers considered sample size a more important criterion for publication there would be (a) fewer Type I errors in the literature, and (b) more precise effect size estimates (i.e., less over-estimated effect sizes due to publication bias). To underscore this point, consider the recent Facebook experiment that received much attention. The study had a total sample of N=689,003. On the scale of r the effect size estimates have a margin of error of ± .002. In one blog post about the study someone commented that the worst part is that because Facebook is so dominant in the SNS market that no one else will be able to replicate it to see if the effect really exists.[3] Seriously?!? Sorry. This study does not need to be directly replicated. The effect size estimates are pretty precise.

Direct replications can improve generalizability. YES! Direct replications are incredibly useful for improving generalizability.

The final argument for direct replications is to improve generalizability. This is the only reason that anyone should[4] call for direct replications. If I run a study on undergraduate students in south Florida (even with a large sample), you should in fact wonder if these results generalize to working adults in south Florida and to people from places all over the world. We probably have intuitions about which studies are likely to generalize and which ones aren’t, so we might rely on those to determine which studies are in most need of replication (for generalization purposes!). Or we might focus on trying to replicate studies that, if they are generalizable, would have the most practical and theoretical impact. I’d also suggest that we should periodically try to replicate important studies conducted some time ago (e.g., 20 years or more) just to be sure that the results generalize to modern people.

So to summarize, if we conduct good research in the first place, then we should[5] only need direct replications for generalizability purposes. If we find an error in a previously conducted study, then we should fix it. If that means running a new study with the error fixed, fine. But that isn’t a replication. It is a different (unflawed) study. And if we have good studies to begin with, we don’t need meta-analyses to provide more precise effect size estimates; we will already have them.

So, what constitutes a good study? We all probably have our own ideas, but I will provide my own personal criteria for study quality in another blog post.

[1] I have no idea if coffee makers can interfere with EEG machines. I don’t drink coffee.

[2] I think N=200 is a better number, but that is the subject of another blog post to come in the future (and also Simine Vazire’s suggestion!).

[3] I wish I had recorded the post and comment, but I didn’t. You’ll just have to take my word that it existed and I read it.

[4] I really want to emphasize should here because this is my fantasy-land where researchers actually care about precisely measuring natural relationships. If actions speak louder than words, we know that much research in psychology has little interest in precisely measuring natural relationships.

[5] See footnote 4. Please also note that I am only referring to direct replications here, not conceptual replications or extensions. Indeed, conceptual replications and extensions are crucial ways of demonstrating generalizability.

Sherman's Head

About Psychology, Statistics, and R (by Ryne Sherman)

Tag Archives: Replication

When Are Direct Replications Necessary?