What We Can Learn From the Reproducibility Project

I have a new piece up at the Daily Dot about the Reproducibility Project and why psychology isn’t doomed.

The Internet loves sharing psychology studies that affirm lived experiences, and even the tiniest tics of everyday people. But somewhere in the mix of all those articles and listicles about introverts, extroverts, or habits that “make people successful,” a debate still lingers: Is psychology a “real science”? It’s a question that doesn’t seem to be going away anytime soon. Last week, the Reproducibility Project, an effort by psychology researchers to redo older studies to see if their findings hold up, discovered that only 36 of the 100 studies it tested reproduced the same results.

Of course, many outlets exaggerated these findings, referring to the re-tested studies (or to psychology in general) as “failed” or “proven wrong.” However, as Benedict Carey explains in the New York Times, the project “found no evidence of fraud or that any original study was definitively false. Rather, it concluded that the evidence for most published findings was not nearly as strong as originally claimed.”

But “many psychology studies are not as strong as originally claimed” isn’t as interesting a headline. So, what’s really going on with psychology research? Should we be worried? Is psychology a “hopeless case”? It’s true that there’s a problem, but the problem isn’t that psychology is nonscientific or that researchers are designing studies poorly (though some of them probably are). The problem is a combination of two things: statistical methods that aren’t as strong as we thought and a lack of interest in negative findings.

A negative finding happens when a researcher carries out a study and does not find the effect they expected or hoped to find. For instance, suppose you want to find out whether or not drinking coffee every morning affects one’s overall satisfaction with their life. You predict that it does. You take a group of participants and randomly assign half of them to drink coffee every morning for a month, and the other half to abstain from coffee for a month. At the start and at the end of that month, you give them a questionnaire that assesses how satisfied each participant is with their life.

If you find that drinking coffee every day makes no difference when it comes to one’s life satisfaction, you have a negative result. Your hypothesis was not confirmed.

This result isn’t very interesting, as research goes. It’s much less likely to be published than a study with positive results—one that shows that drinking coffee does impact life satisfaction. Most likely, these results will end up gathering figurative dust on the researcher’s computer, and nobody outside of the lab will ever hear about them. Psychologists call this the file-drawer effect.
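To see why the file drawer matters, here is a minimal simulation sketch in Python. Everything in it is an assumption made for illustration: the function names, the small true effect of 0.1 standard deviations, the 30 participants per group, and the rule that only statistically significant results in the predicted direction get "published." It is not how any real study, or the Reproducibility Project itself, was analyzed.

```python
import random
import statistics


def run_study(true_effect, n_per_group, rng):
    """Simulate one coffee-vs-no-coffee study (hypothetical setup).

    Returns (observed difference in mean satisfaction, is_significant).
    """
    coffee = [rng.gauss(true_effect, 1.0) for _ in range(n_per_group)]
    control = [rng.gauss(0.0, 1.0) for _ in range(n_per_group)]
    diff = statistics.mean(coffee) - statistics.mean(control)
    # Standard error of a difference in means, from the sample variances.
    se = (statistics.variance(coffee) / n_per_group
          + statistics.variance(control) / n_per_group) ** 0.5
    # Rough two-sided test at the 5% level (normal approximation).
    return diff, abs(diff / se) > 1.96


def simulate(true_effect=0.1, n_per_group=30, n_studies=5000, seed=1):
    """Compare the average effect across all studies with the 'published' ones."""
    rng = random.Random(seed)
    results = [run_study(true_effect, n_per_group, rng) for _ in range(n_studies)]
    all_diffs = [d for d, _ in results]
    # Assumed publication rule: significant AND in the predicted direction.
    published = [d for d, significant in results if significant and d > 0]
    return {
        "mean_all": statistics.mean(all_diffs),
        "mean_published": statistics.mean(published),
        "share_published": len(published) / n_studies,
    }


if __name__ == "__main__":
    print(simulate())
```

Under these made-up settings, the studies that escape the file drawer report a much larger average effect than the full set of studies does, even though every simulated study was run honestly. A replication of one of those published studies would then tend to find a weaker effect than the original.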

Read the rest here.


5 thoughts on “What We Can Learn From the Reproducibility Project”

  1. 1

    Very interesting!

    The fact that new research has found that some older results don’t quite hold up isn’t a bug, it’s a feature of science working as it should.

    Fully agreed. Still, why wasn’t it happening earlier? Why such a surprise now? What bug – if any – is responsible for this?

    Sorry, my question is probably very naïve (yes, this is alien to me; I work in a formal discipline where the situation is quite different), but I’m really curious. For example, I read many times that in physics repeating the experiment is a standard procedure. Chemistry? Biology? I read about some difficulties, but is the situation as dramatic as the one described here? In other words, is psychology an outlier in this respect? Do the psychologists have some special problems with reproducibility, not encountered (on this scale) in other fields? If so, there must be a bug indeed! But what sort of a bug is it?

    From the New York Times piece (the link provided by Miri):

    The vetted studies were considered part of the core knowledge by which scientists understand the dynamics of personality, relationships, learning and memory.

    So, in psychology you do the research, publish your piece and then … it becomes a “part of the core knowledge” just like that, with no further verification?

    (Yeah, I know – that’s another naïve question. Sorry, it’s just so different from what I’m used to! In my field, it’s not that unusual that you publish your paper and a couple of months – or maybe years – later you receive an email with the information that it was all fucked up. Alright, such emails are usually very polite, of the form ‘Dear Dr. X, I cannot understand this and this step in your proof, could you explain, please?’)

    To end on a lighter note: some time ago two of my grads were trying to find in the literature a full and complete proof of a certain known lemma. Eventually they found a couple of old papers stating the lemma as known … and citing each other for support. (Like: the paper A states the lemma and refers to B, B refers to C, C refers to A. How was it possible? Well, one of these papers was circulating for quite a while in its preprint form.)

    My students were very proud of their discovery, but after hearing this I thought: “oh, my, could it be true about most of our knowledge? Are we mere parrots, trapped in the maze of circular cross-references?”

    Well, no, I assure you that we are not. Our knowledge is solid. Circular cross-references are very rare. Good, scientific proofs and arguments are out there in abundance. I know it for a fact. You can quote me!

    I will quote you back in return.

  2. 2

    Excellent article, Miri, but you’ve gotten something wrong:

    suppose my study finds that drinking coffee every day actually tends to increase life satisfaction. I might say that these results have a p-value of 0.04, or 4 percent. That means that there’s a 4 percent chance that I could’ve gotten results like these if drinking coffee actually doesn’t increase life satisfaction. There’s a 4 percent chance that these results are a complete fluke and don’t mean anything.

    That’s a common misconception about p-values among scientists, and even some statisticians make the same mistake. Here’s a quick re-write to fit the actual definition:

    suppose my study finds that drinking coffee every day actually tends to increase life satisfaction. I might say that these results have a p-value of 0.04, or 4 percent. That means that if I assumed coffee didn’t increase life satisfaction and repeated my experiment a bajillion times, four percent of the time I’d get the same result or something less suited to my assumption.

    Under typical conditions, a p-value of 0.05 actually means there’s a 29% chance the null hypothesis is bang-on. I’ve written a little on p-values myself, but I think you’ll prefer Jacob Cohen’s breezy treatment:

    Cohen, Jacob. “The earth is round (p < .05): Rejoinder.” (1995): 1103.

  3. 3

    Ohhh oh, I got this one!

    Ariel @1:

    Still, why wasn’t it happening earlier?

    It kinda was. That article by Cohen (see comment in moderation) cites studies calling this problem out in the 1960s, and Theodore Sterling’s paper on what is now called the “file drawer effect” came out in 1959. I suspect they were being ignored for three main reasons: these dodgy methods were taught as simple “cookbooks” that anyone could follow, leading to their widespread adoption; if your livelihood rests on getting papers published, dodgy methods with a high false-positive rate are your friends; and while dodgy, these methods still sorta-kinda worked.

    What bug – if any – is responsible for this?

    A number of different authors I’ve read have pointed to Daryl Bem’s 2011 paper on precognition. Here was someone following accepted statistical practices, with a solid number of internal replications, yet coming up with strong evidence for obvious nonsense. Worse, a number of researchers who tried to publish their replications of Bem’s work had great difficulty, because journals refused to publish null results. It was a wake-up call.

    For example, I read many times that in physics repeating the experiment is a standard procedure.

    If I may be glib, many physics experiments are replicated on equipment a university purchased forty years ago. After pressing a few buttons, grad students wait several months as the data trickles in, only to start from the beginning because some asshole turned on a hot plate in the room to cook their Ramen. On the plus side, no-one requires you to trade a few binders’ worth of paperwork with an IRB if you want to torture a few billion particles.

    In other words, is psychology an outlier in this respect? Do the psychologists have some special problems with reproducibility, not encountered (on this scale) in other fields?

    “Sorta” and “not really.” The culture of psychology makes it a bit more vulnerable to these shenanigans than other fields (plus people are complex beasties), but if you picked a random branch of science and dug a little you’re practically guaranteed to find the same issues in some form, in some cases just as bad or worse.
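For readers who want to check the two statistical claims in comment 2, here is a rough Python sketch. The first function follows the corrected definition literally: assume coffee does nothing, repeat the experiment many times, and count how often the result is at least as large as the one observed. The second computes one well-known bound that yields roughly the 29% figure, the Sellke, Bayarri and Berger calibration with a 50/50 prior on the null. Whether that is the calculation the commenter had in mind is an assumption, and the function names, group size, and normally distributed satisfaction scores are all made up for illustration.

```python
import math
import random
import statistics


def null_world_p_value(observed_diff, n_per_group, n_repeats=20000, seed=0):
    """One-sided p-value by brute force under the 'coffee does nothing' assumption.

    Repeats the hypothetical experiment n_repeats times in a world with no
    effect and returns the share of repeats whose difference in means is at
    least as large as observed_diff.
    """
    rng = random.Random(seed)
    at_least_as_extreme = 0
    for _ in range(n_repeats):
        coffee = [rng.gauss(0.0, 1.0) for _ in range(n_per_group)]
        control = [rng.gauss(0.0, 1.0) for _ in range(n_per_group)]
        diff = statistics.mean(coffee) - statistics.mean(control)
        if diff >= observed_diff:
            at_least_as_extreme += 1
    return at_least_as_extreme / n_repeats


def min_null_probability(p_value, prior_null=0.5):
    """Lower bound on P(null | data) from the -e * p * ln(p) calibration.

    The bound on the Bayes factor only applies for 0 < p < 1/e, and the
    50/50 prior is an assumption, not a fact about any real study.
    """
    if not 0.0 < p_value < 1.0 / math.e:
        raise ValueError("bound applies only for 0 < p < 1/e")
    bayes_factor_bound = -math.e * p_value * math.log(p_value)
    prior_odds = prior_null / (1.0 - prior_null)
    posterior_odds = bayes_factor_bound * prior_odds
    return posterior_odds / (1.0 + posterior_odds)


if __name__ == "__main__":
    # With 20 people per group, a difference of about 0.55 standard
    # deviations corresponds to a one-sided p-value near 0.04.
    print(null_world_p_value(0.5536, n_per_group=20))
    print(round(min_null_probability(0.05), 3))
```

Note what the first function does not tell you: it says how often a null world produces data like yours, not how likely the null is given your data. The second function is one way to move toward that second question, and it needs an explicit prior to do so.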

