CLEARER THINKING

with Spencer Greenberg
the podcast about ideas that matter

Episode 227: What do we know for sure about human psychology? (with Simine Vazire)


September 12, 2024

How much progress has psychology made on the things that matter most to us? What are some psychological findings we feel pretty confident are true? How much consensus is there about the Big 5 personality traits? What are the points of disagreement about the Big 5? Are traits the best way of thinking about personality? How consistent are the Big 5 traits across cultures? How accurately do people self-report their own personality? When are psychophysical measures more or less useful than self-report measures? How much credence should we lend to the concept of cognitive dissonance? What's the next phase of improvement in the social sciences? Has replicability improved among the social sciences in, say, the last decade? What percent of papers in top journals contain fraud? What percent of papers in top journals are likely unreplicable? Is it possible to set the bar for publishing too high? How can universities maintain a high level of quality in their professors and researchers without pressuring them so hard to publish constantly? What is the simplest valid analysis for a given study?

Simine Vazire's research examines whether and how science self-corrects, focusing on psychology. She studies the research methods and practices used in psychology, as well as structural systems in science, such as peer review. She also examines whether we know ourselves, and where our blind spots are in our self-knowledge. She teaches research methods. She is editor-in-chief of Psychological Science (as of 1 Jan, 2024) and co-founder (with Brian Nosek) of the Society for the Improvement of Psychological Science. Learn more about her and her work at simine.com.


JOSH: Hello and welcome to Clearer Thinking with Spencer Greenberg, the podcast about ideas that matter. I'm Josh Castle, the producer of the podcast, and I'm so glad you've joined us today. In this episode, Spencer speaks with Simine Vazire about the experience of reviewing and editing academic literature, and how to develop a culture of scrutiny and transparency in psychological research.

SPENCER: Simine, welcome.

SIMINE: Thank you so much. It's great to be here.

SPENCER: I have to say, I'm so excited about psychology as a field, because I think that it's one of the only fields that deals with many of the most important topics, like: how do we help people be happy? How do we deal with mental health problems? How do we reduce bias and prejudice? How do we become more rational? And yet, I'm also, in a way, very disappointed with the progress that's been made in some of these areas. There have been some great breakthroughs, but it feels to me like things are not moving as fast as they could. We're not actually understanding these things as quickly and as deeply as we could. And I'm wondering, do you think that we've made a lot of progress on these topics?

SIMINE: No, not really. I'm pretty pessimistic about the amount of progress we've made so far, but I think that the positive and the negative go together. I think it's partly because these topics are so important and so interesting; I think that makes it harder to do things the slow and careful way. It's so tempting to say, "We have an answer," and rush to share that with everybody. Not to absolve us of responsibility, but I do think that how hungry everybody is for answers to these questions is partly related to why we haven't made as much progress as we should.

SPENCER: Right. That counterintuitive result that way oversimplifies something about happiness or about prejudice, that's going to get a lot of media attention. Maybe it's going to get a TED talk, and that's pretty appealing, right?

SIMINE: Yeah, it's a lot to expect researchers to resist. I'm not saying they're doing it cynically; I think they believe they have an answer that they want to share with the public, but there's not a lot of things keeping us on the track of the slow and steady progress.

SPENCER: When you think about progress in psychology, what are some things you point to as, "Okay, we've really made progress understanding this topic and it's solid research that we can now build on for the rest of time."

SIMINE: Oh, that's a hard question. [laughs] I think there are plenty of answers in cognitive psychology and psychophysics. Psychophysics — maybe partly because it's less sexy — I think has a lot of success stories. I'm not that familiar with the field, but I just have this sense that it's been the more slow and steady progress approach.

SPENCER: When I think about psychophysics, I think of experiments like flashing lines on the screen and showing that people process information in a certain way, very low-level...

SIMINE: As I like to tease my partner — who does psychophysics — it's not really psychology. I kind of mean that as a compliment though. But yeah, I'm not sure we can claim that as a victory for psychology. I think the other softer sides of psychology, which is what most people think of, like clinical, developmental, social, there are definitely success stories in those areas — I don't mean to say that there aren't — but those are just harder problems. We don't have huge wins in many of those areas. I think one area in clinical, the effectiveness of therapy and its unrelatedness to the specific approach that people take... There's some evidence that, for specific disorders like phobias or something, there are specific approaches that work better, but for many things like depression, just the therapeutic relationship matters a lot. That's a pretty cool, interesting, important finding that I think is useful also to people in everyday life. Like maybe don't pick your therapist based on which approach they take or what training they have or whatever, but find someone that you have a good connection with. That's, I think, an interesting and important finding.

SPENCER: It's funny you mention that because that is definitely a conclusion many people draw. I'm not even that confident in that conclusion. I don't know about you. [laughs]

SIMINE: No, maybe you've listened to it more than I have. [laughs]

SPENCER: Well, the therapeutic alliance thing is complicated because the question is the causality: is it that, when you're improving, you have a good therapeutic alliance because you're getting better, or is it that your good therapeutic alliance is causing you to get better? And that is, I think, not fully settled.

SIMINE: And actually, that's also a good example of one of my beefs with a lot of the softer areas of psych — I'm in the softer areas of psych, to be clear — is that a lot of things, really, I'm not sure we needed empirical evidence for... Like the idea that you should maybe change therapists if you don't have a good relationship with your therapist; did we really need empirical evidence for that? But on the other hand, if the empirical evidence can help people make that decision that they were reluctant to make before, then it's helpful. It certainly doesn't hurt to have, even if it's just correlation, not causal, that not having a great connection with your therapist is a sign that things aren't going well. Sure, it's very, very obvious and intuitive, but it's still important, and people maybe should feel empowered to take that seriously.

SPENCER: Yeah, I wonder if you're even more cynical about psychology research than I am, because I would point to some things that I think are big wins. Like you mentioned clinical; I think that cognitive behavioral therapy is a huge win. I wouldn't say that I'm really confident that it's better than all the other therapies — there might be other therapies that are just as good — but I feel pretty confident saying it works pretty well.

SIMINE: And exposure therapy for phobias, from what I understand. I feel like I'm going way beyond my expertise here so I feel a little bit reluctant. I guess I should stick to social psych, which is more my area. But yeah, even in social psych, I've been too absorbed in all the negatives, and it's hard for me to zoom back out and be like, "Of course, there are wins." I think, honestly, in social psych, a lot of the progress is in developing experimental methods for studying these fuzzy, soft things. It's easy to manipulate treatment and control conditions for a drug. How do you do that for a happiness intervention? And I think we are getting better at how to do research on social psych topics, and that's really important groundwork, foundational work that doesn't get valued enough.

SPENCER: Well, one thing that I see as a bright spot is seeing these big experimental designs where they'll test lots of different interventions simultaneously to see what actually works in real life. There's this giant study of interventions to get people to the gym more often, and it tested, I think, over 40 different interventions, with really huge sample sizes to try to actually figure out what works in real life. It's not exactly answering maybe the deeper scientific question of why. But that seems to me like progress.

SIMINE: Yeah, I think there's not enough value put on descriptive research. I think we need to understand what before we can really ask why. And I think we're getting a little bit better at that. We were pretty obsessed with why and mechanisms for a while in social psych, but I think we're finally starting to course correct a bit and be like, "Maybe it's okay to just have a robust effect where we don't yet know the why. Let's start with that, and then we can build theories about the why."

SPENCER: What about personality psych? I know that's not your field.

SIMINE: It is, actually. Personality psych is my home field, yeah. I usually say I'm 60% personality, 40% social.

SPENCER: Oh, okay, great. Well, then, in personality, would you say The Big Five model of personality, that that's a big win and a step forward?

SIMINE: Yeah. It's controversial. Actually, if you went to a personality psych conference — which is a really bizarre thing to do if you're not a personality psychologist — we're still debating that. Actually, I think that many, many of us use The Big Five because it really helps to have standardized measures across studies and to communicate and be able to compare findings, but it's not like we all have drunk the Kool-Aid of The Big Five. I think sometimes it looks like that to the outside, but there's actually a ton of debate within personality psych about how many dimensions, whether traits are even the right unit of analysis to look at, and so on. And I actually give ourselves a lot of credit for that, that we don't just settle on something that's convenient and stop asking those questions. But I do think the idea of trying to identify core dimensions on which people differ, we have made a lot of progress on that. The debates now are at the margins, not about the core of that approach. So that's big. It's also not causal. There are some radical personality psychologists who think there are literally five modules in the brain and whatever. But very few personality psychologists think that traits do much more than describe individual differences. So this is maybe also an area where other parts of psych could learn a little bit from personality psych; although, famously, Roy Baumeister said that other parts of psych should learn lessons not to be like personality psych, because we do take descriptive research really seriously. We do these really boring correlational studies looking at just how predictive are traits of behavior, even if we don't think they're causal or whatever; we don't have a strong model about that. But that makes us boring, and we're like the ugly stepchild of psychology. We're smaller. We have a hard time getting jobs. Our publications aren't as prestigious and flashy and all that. I think that that's the right way to go, but it is much less lucrative and much less satisfying in the short term.

SPENCER: For listeners that are not as familiar with The Big Five, the basic idea is, it's based on research where you ask people to rate themselves on different personality attributes, or rate other people on personality attributes. And then you run the statistical method known as factor analysis, and you see what structure is statistically there in the data. And many researchers find five factors which go by the acronym OCEAN: openness, conscientiousness, extroversion, agreeableness and neuroticism. And then what Simine's referring to is maybe there's some debate around these factors. Simine, just to unpack that a little bit, I know there's the HEXACO model that says, "No, no, we need to add a sixth factor of honesty-humility." Is that one of the key debate points?
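
[Editor's note: for readers curious about the mechanics, here is a minimal sketch of the factor-analysis step Spencer describes, run on simulated questionnaire ratings rather than any real Big Five dataset. The item counts, loadings, and the use of scikit-learn's FactorAnalysis are illustrative assumptions, not the method of any particular study.]

```python
# Toy factor analysis: simulated ratings on 20 items driven by 5 latent traits.
import numpy as np
from sklearn.decomposition import FactorAnalysis

rng = np.random.default_rng(0)
n_people, n_items, n_factors = 1000, 20, 5

# Hypothetical ground truth: each item loads mainly on one latent trait.
loadings = np.zeros((n_items, n_factors))
for item in range(n_items):
    loadings[item, item % n_factors] = 0.8

traits = rng.normal(size=(n_people, n_factors))           # latent trait scores
noise = rng.normal(scale=0.6, size=(n_people, n_items))   # item-level noise
ratings = traits @ loadings.T + noise                     # observed (centered) item ratings

fa = FactorAnalysis(n_components=n_factors, random_state=0)
fa.fit(ratings)
print(fa.components_.round(2))   # estimated loadings: items cluster into five groups
```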

SIMINE: Yep, that's one of the key debate points. And then some people argue we need something a lot more fine-grained, so the five factors are actually not traits; they're clusters of traits. So within (say) extroversion, there's things that are as different as assertiveness and dominance versus sociability and enthusiasm, and those can come apart quite a bit. Some people argue five is way too coarse; we need 20 or 50 or whatever. And then other people go as far as to say that traits are the wrong way to think about things; we should be thinking about narratives that people have about themselves or motives or other things. So there's a whole range of debate from the narrow (is it five or six?) to the broad (are traits even the right way to think about things?).

SPENCER: When you get to the lower level, taking each of the big five or the big six and splitting them into facets or even narrower, clearly, there's a question of just compression, right? You can take any complex concept and you can subdivide it into more nuanced things. And my understanding is, if you do that for these traits, you find increased predictive accuracy. The narrower you slice it, the more you can predict about a person. It makes sense intuitively. If you're trying to build a simplified model, it's not going to capture every nuance. If you allow yourself more variables, you're going to get better predictions.

SIMINE: Yeah, it's almost definitionally true.

SPENCER: It would be pretty fascinating if it wasn't true, but it's not surprising that it's true. But on these other ones, like narratives or motives, how would that even work? Are they even talking about making measurements?

SIMINE: Yeah, to varying degrees. Sometimes it's more about just understanding the whole person. In that case, it's less of a purely empirical endeavor, and closer to a humanistic endeavor. That's a bit more along the lines of the narrative approach. Motives, I think, are more like traits. There's an alternative or a complementary framework. I guess it's starting to get a bit more explanatory. Traits are almost completely descriptive, like, "Does this person tend to do this or feel this?" And motives are a little bit more about what's driving that, and so trying to get a little bit more mechanistic.

SPENCER: I see. So you could measure the motive, which maybe is trying to get a deeper causal structure than just the trait, which might be an outcome of the motive.

SIMINE: Yeah, I think that's one way to think about it.

SPENCER: Another debate about The Big Five I've heard is how stable they are across the world. Is this something that's just in the West that we've measured, and then, if you go to Papua New Guinea and you give people personality tests, do you just get something different?

SIMINE: Yeah, I think that's a glass half-empty, glass half-full debate where people don't really disagree about the evidence. It's just a question of whether you focus on the stuff that is consistent across cultures, which most of The Big Five are. But then openness, for example, is the most different, and sometimes it gets renamed 'culture' because it is so culturally specific. But I think we don't actually disagree that much on the facts, just on what's more important, the differences or the similarities?

SPENCER: Another critique that's been levied at personality psych in particular is that there can be a circularity. There was a big study replicating lots of personality psych results, where they show that personality traits can indeed predict many things about a person, very cool. But then, if you look at some of those things it's predicting, it almost seems so close to the personality items that it begins to feel like, "Well, are you just showing that you can predict A using A?" It's like you're trying to predict mental health using questions about how worried and depressed they are. What do you think about that?

SIMINE: I think that's a totally fair criticism of that study, and the author is a very good friend of mine. I think it's really cool to see that evidence, but it absolutely needs to be interpreted in light of the fact that the predictors and the outcomes were measured with self-reports. Interestingly, that study — I think deservedly — got a lot of attention. It also deserved the criticism that it got. But it's also not a good representation of what the primary literature looks like. In the replication work, they did use self-reports for both measuring personality and measuring the outcomes that personality predicts, which is not ideal at all, and not really accepted practice in a lot of primary personality psych research. If you try to publish in a personality journal, most personality psych editors and reviewers will be like, "You can't just correlate self-reports to self-reports." It's no longer acceptable as a thing. I think it's reasonable that an exception was made for this huge replication project because the only way you're going to be able to test hundreds of correlations and questions is to simplify the methods. I think the author was upfront about that, but some of the coverage of those results didn't emphasize that limitation enough. But it's not a characteristic of modern personality research anymore. It was kind of an exception in the case of this really large-scale replication study.

SPENCER: I'm not sure I understand that because, if you ask someone a very clear-cut fact about their life, and there's no incentive to lie — they're just going to get paid the same amount of money to do the survey or the same college credit regardless of their answer — I don't see why we couldn't trust that. Obviously, people sometimes misremember things about their life, or they might be too embarrassed to admit it, but largely, don't people report accurately on just simple facts about their life?

SIMINE: Yeah, I have an extreme view on this. I would say no. I don't think it's that they're lying. I think that there are very, very few things that are so objective that there's no room for your self-concept to shape your answer. And it's interesting; I used to do a lot of this kind of research. I studied self-knowledge, the accuracy of people's self-views of their personality, and it is very tempting to measure the criterion measure — what they're actually like — with self-reports. It would be way easier if I could just ask you how many times last week you had an argument with a friend. And if you think about any of these psychologically interesting events and try to think of objective ones, it's really, really hard to think of one that's so objective that there's no room for your self-concept to influence how you define an argument, how you remember things and, to some extent, what you're willing to admit. But even if you remove any external pressure to self-present a certain way, I just think there's so much room for your self-concept to affect just how you interpret the question and what comes to mind and how you answer it. So I would very rarely treat a self-report as an objective measure of almost anything. I'm sure you could come up with an exception, but the more robust your example is, the less psychologically interesting it's going to be. So yes, I trust your answer of whether you brushed your teeth this morning, for example, but I don't really care whether you brushed your teeth this morning. I care about how often you get so upset that you cry, things that are much more in the gray area, where there's definitional ambiguity and recall problems and things like that.

SPENCER: What about things like, "Have you been arrested in the last ten years?" It seems to me like that kind of thing would be pretty reliable.

SIMINE: Yeah. There is self-presentational concern. Yeah, maybe you can keep throwing specific examples at me, and I can keep telling you; my objection is going to shift around a lot. But I think that one, yeah, people are motivated to lie about it. And honestly, I could imagine someone forgetting about being arrested if it wasn't very recent, or it wasn't technically an arrest; I could imagine some room for error.

SPENCER: So is the gold standard now actually to get real records on these things?

SIMINE: The gold standard is basically to triangulate however you can. It depends what you're measuring, but you would almost never want to use the same method for the predictor and the outcome. Usually, people use self-reports the most, so the main thing that would be hard to get away with would be to have self-reports of both predictor and outcome. And that doesn't mean you have to get an objective court record or something like that. You could get peer reports. You could have the experimenter actually interact with the participant a lot and form their own impression. Documents and archival records can be useful, too. Physiology is another one. But honestly, sometimes I think we go too far in the other direction, where we think, "Well, as long as it's not a self-report, it's good and it's objective." I also have been known to take issue with that, like, "Okay, you measured physiology, but that's gratuitous physiology. It doesn't actually get you any closer to the construct. A self-report would have been better." I think it's just hard to think through what is the best measure and what is feasible and what is different enough in method from the predictor and so on. All of those factors come into play, and there's not one right answer.

SPENCER: Yeah, I've seen this interesting (what I would say is) a bias in favor of things that feel objective, like giving a physiological measurement like galvanic skin response. What I find ridiculous about those measures is that they're so far removed from what we actually care about. The fact that it's measured with a machine is not that impressive to me, when we know (a.) it's full of noise, and (b.) it measures something very vague; your galvanic skin response goes up if you see an attractive person, or you stand up quickly, or you have an upsetting thought. What is it really measuring?

SIMINE: I think it's so hard because there are so many ways it can go wrong. We can tell people, "Don't just use self-reports," and then they'll go and get a physiological measurement. And you're like, "No, no, no, I didn't mean that." I feel very often, physiology is the wrong approach for many of these psychologically rich constructs that we want to measure, and I get frustrated that people treat it as a direct pipeline to the truth. Recently, I came across a paper where they were trying to use some self-report measure — I don't even remember of what — to predict an outcome related to depression. Let's say they were just calling it depression — I'm going to oversimplify a bit — and the way they measured depression to avoid the circularity with self-reports was to get records, official records, of whether they had ever filled a prescription for an antidepressant. That's pretty far from depression. You can be depressed and never fill a prescription for an antidepressant. You can fill a prescription for an antidepressant and not be depressed. Many of those medications are prescribed for other things as well. You could say, "Well, why didn't they just ask people if they were depressed or had ever been depressed?" And in that case, I definitely think self-reports are better than this archival measure, but the archival measure has the advantage of being methodologically independent from the predictor which was measured with self-report. I think the trade-off might be worth it in that case. It's a crude measure, it's noisy, etc., but it has the advantage of not being circular, so maybe it's good enough that it's worth it, or maybe you do it both ways and look at the difference.

SPENCER: Right, so maybe if you're asking something self-reported like, "Were you ever depressed?" or even, "Were you ever diagnosed with depression?" partly, you get a true measurement of what actually happened, but partly you get maybe some self-concept mixed in there, of how they think about themselves, or are they willing to admit this on the survey? And then that part of it that's a self-concept, might be correlated with their personality questions.

SIMINE: Oh, it absolutely will be correlated. Even just response sets, even just a tendency to circle a high number versus a low number creates an artificial correlation between two self-reports when there's no other shared variance. We know that, if you correlate two self-reports, you have to worry about shared method variance.

SPENCER: I've observed something that is kind of dramatic and insane on personality surveys where, whenever you ask someone about a rare trait that very few people have, what happens is... Let's just say it's simple true-false, and only five percent of people should really say 'True,' because it's a rare trait. The number of people who genuinely say 'True' starts getting matched by the number of people who are not paying attention, click the wrong button by mistake, are just fucking with you, or are total spammers. So you can end up in a situation where five percent actually have the thing, but eight percent say 'True,' because the extra few percent are full of shit. And now that trait looks like it correlates with all kinds of crazy nonsense.
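
[Editor's note: a toy simulation of the pattern Spencer describes. The exact rates are invented to roughly match his numbers: a trait with a 5 percent true base rate plus a small slice of inattentive or spam responders who answer everything at random, which alone is enough to make two genuinely unrelated rare traits look correlated.]

```python
import numpy as np

rng = np.random.default_rng(1)
n = 100_000

careless = rng.random(n) < 0.06          # ~6% not paying attention / spamming
trait_a = rng.random(n) < 0.05           # rare trait A, true 5% base rate
trait_b = rng.random(n) < 0.05           # rare trait B, independent of A

# Careless responders answer both items 50/50 at random.
resp_a = np.where(careless, rng.random(n) < 0.5, trait_a)
resp_b = np.where(careless, rng.random(n) < 0.5, trait_b)

print(resp_a.mean())                        # ~8% now say "True" to the 5% trait
print(np.corrcoef(trait_a, trait_b)[0, 1])  # ~0: the traits really are unrelated
print(np.corrcoef(resp_a, resp_b)[0, 1])    # clearly positive, driven by the junk responders
```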

SIMINE: We had an interesting experience like that in one of our studies where we asked participants after they completed some simple social psych study. We wanted to know how participants feel about questionable research practices, so we asked them, "The study you just completed — minimal risk, nothing intensive or whatever — how would you feel if the researchers decided not to publish it because they didn't like the results? Or decided to change their analysis plan so that they would get the result they predicted?" or things like that. And one of the questions was, "How would you feel if the researchers used your data to commit fraud and publish fraudulent results?" And we got something like (I don't know) more than one percent — I think it was in the single digits but it was a non-trivial percentage of people — who were for that, who were like, "I would prefer that they do that." And we're like, "Okay, probably the vast majority of those people are not paying attention," or whatever. But yeah, how do we know if any of them are real? It's very hard.

SPENCER: Yeah, it's almost like there's this crazy factor of people who are either fucking with you or just clicking at random, or spammers. And that weird, crazy factor is going to correlate with all kinds of crazy stuff [laughs] but it's not really about human psychology.

SIMINE: Or it's about human psychology in a different way.

SPENCER: In a different way, yeah. Going back to social psych, what do you think are some of the topics that social psych could make progress on, or maybe in ten years, will have progress on, that you'd like to see?

SIMINE: Well, certainly, prejudice is a big, big one, and I think there's some exciting work now happening, especially interdisciplinary work, drawing on methods and approaches from other disciplines: political science, sociology, even economics, things like that. Doing field studies or just doing a lot more rigorous work than what we've seen traditionally in social psych. So that's exciting and I'm cautiously optimistic about that. It's just so important that, obviously, we need to invest in better research in that area. I'm not creative enough to think of things that we're not already studying. But things like relationships, and there, I would like to see a lot more research, not just on romantic relationships, which consumes 90-plus percent of the relationships literature, but other kinds of relationships, I think are really important to study and we don't know that much about. I did a little bit of research on friendships myself, and we know a little bit about kids' friendships, but not that much beyond that.

SPENCER: What would that look like, to make progress on that? Would it look like being able to say things about what makes a good relationship and actually be able to predict who's going to be happy in a relationship?

SIMINE: Ultimately, it would be nice to be able to tell people what are some good approaches for going about making friends, for choosing friends, for dealing with conflict and repair in friendships, whether it's more important to invest your limited time in friendships or in exercise, or things like that. I think all those questions would be great. I remember, for example, one paper that I handled as an editor that was published. It was just a descriptive study looking at when people are considering ending a romantic relationship, what are the factors that they're weighing? And so it's very qualitative and descriptive, and I thought that was fascinating. Even that kind of basic descriptive research about what factors people consider when they form a friendship, or end a friendship, or change a friendship, or things like that. We haven't even asked those basic questions about friendships — at least adult friendships — hardly at all.

SPENCER: I wonder if you've seen this study where they gathered lots and lots of data sets on romantic relationships. It asked questions of both partners, then they looked at, over time, whether the relationship stayed together or not, and they tried to use a bunch of machine learning algorithms and other approaches to make predictions about — given the traits of the two people and their views about each other — what leads to people staying together. And to me, the mind-blowing result of that was that they could barely predict it. It was kind of shocking.

SIMINE: There's some research from what I remember — and this is a long time ago, so maybe if I reread it now, I'd be more skeptical — that looked at just, if you ask people, "How likely do you think you are to break up with this person?" and also they asked their friends, both were fairly predictive. The friends are more predictive, which I think is fascinating, because I'm sure that the people in the relationship have access to that information. I think it's a reporting thing or admitting-things-to-yourself thing, not a lack of information. But I don't think it's that hard to predict. I think many people, those in the relationship and those close to it, can predict. But that brings up another point. One of my beefs with relationships research is that they treat staying together as a positive outcome. And that's completely bonkers, especially when you consider that many of the participants are college students, very young. But even for older people, it's just not at all obvious that you should treat staying together as the positive outcome. For some people, breaking up is a very positive outcome; or just having had a good relationship for however long it lasts, is the ultimate goal, not for it to last forever. So there's a lot that needs to be 'rethunk' about how we approach relationships research.

SPENCER: Right. I guess even staying together could be a bad outcome.

SIMINE: Yeah, it can be terrible.

SPENCER: To your point, though, I don't think what you said contradicts this study, because this study was trying to look at aspects about the person and their views on each other, not, 'Do you think you're gonna break up?' which — I do agree with you — that seems like it could be predictive. But they were saying, "Well, maybe there's something about these people. You can tell if they're compatible." And that just seemed really, really difficult.

SIMINE: But I guess that's good because staying together or breaking up is not about the individuals. It's something emergent about... I guess they hope they would have picked that up, too, with interaction effects or something like that. But I think one really big challenge of social psych, and psych in general, is that there are probably super complicated interaction effects all over the place, but they're like eight-way interactions. They're not two- or three-way interactions, and we're never going to be able to detect them empirically, I think, even with all the data in the world. But in a way, I think that's a nice thing about human life, that things are not reducible to just a few predictors.

SPENCER: Yeah, and on the point of interaction effects, for those that don't know, imagine you're trying to, for example, predict who's gonna stay together. One thing you could look at is Person One's conscientiousness, and another thing you could look at is Person Two's conscientiousness. But then you can also look at the interaction between the two, like Person One's conscientiousness times Person Two's conscientiousness. And maybe it's the interaction of conscientiousness with each other that predicts whether they break up. That's the kind of thing we're talking about. And I don't know about you, but when I read interaction effects in a paper, I immediately become a little more skeptical.
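
[Editor's note: a small illustration of the interaction term Spencer describes, on simulated couples. The effect size and the use of a plain logistic regression are assumptions for the example only.]

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(2)
n = 5000

c1 = rng.normal(size=n)   # partner 1's conscientiousness (standardized)
c2 = rng.normal(size=n)   # partner 2's conscientiousness (standardized)

# Hypothetical world where only the interaction matters for breaking up.
logit = 0.5 * c1 * c2
broke_up = rng.random(n) < 1 / (1 + np.exp(-logit))

X_main = np.column_stack([c1, c2])           # main effects only
X_int = np.column_stack([c1, c2, c1 * c2])   # main effects plus the interaction term

print(LogisticRegression().fit(X_main, broke_up).score(X_main, broke_up))  # ~chance
print(LogisticRegression().fit(X_int, broke_up).score(X_int, broke_up))    # noticeably better
```

[With ten predictors there are already 45 possible two-way products, which is part of why an interaction reported after the fact deserves extra skepticism.]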

SIMINE: Yeah, definitely.

SPENCER: It's hard enough to find direct effects, but interaction effects, the number of them tends to blow up. If you have ten variables, you could just have a massive number of possible interactions, and then, why did they choose this interaction, not that interaction?

SIMINE: And I think, empirically, we know that they're less replicable.

SPENCER: Another thing that I think about with regard to social psychology is some of these classic findings that have gotten a lot of attention. Obviously, there's been a whole rejiggering of methods and thinking about ways that a lot of the past work wasn't replicable. But I'm wondering whether you think some of these really big findings have stood the test of time. For example, on conformity and obedience research, the famous Milgram experiment and Asch conformity experiments.

SIMINE: Yeah, I probably shouldn't give a strong opinion without having looked at the evidence. But my guess would be that some of those classic '60s findings would stand the test of time, despite the fact that probably a lot of the time, the original evidence was very, very thin, the sample sizes were small, etc. But I do think there were some robust effects there, nevertheless.

[promo]

SPENCER: It's really interesting to me that there used to be this type of study that's not done nearly as much anymore where, if they could get a thing to happen even a few times, it was incredibly impressive, because a priori, it seems so unlikely that anyone would behave this way, that if you could just show three people behaving that way, it's like, "Whoa! That's bizarre." And I think of the Asch conformity experiment where you're showing people these lines and asking them which one's longer, and people should be able to get this right, essentially 100% of the time. But if you have a bunch of confederates get the question wrong first, some of the people — not all the people — will then get it wrong. I'm assuming that they actually did the experiment; they actually got the result. It's pretty darn convincing, even without the use of statistics, because it would be so unlikely that someone would make that mistake otherwise. What do you think about that?

SIMINE: I think maybe one of the reasons we lost that kind of study is the move to online studies and the crazy factor. So now you just don't know if people are spamming you, like if they say the shorter line is longer, you're like, "Either this is a fascinating phenomenon, or these are people saying, 'Please use my data for fraud.'" [Spencer laughs] I think it's going to become harder and harder for those kinds of studies to be convincing. There was always noise, there was always error, etc. And now there's this extra source of noise and error where you could get completely crazy results that you just don't know what to attribute them to.

SPENCER: Oh, yeah, that's a really good point. And similarly, for the Milgram experiment, where they basically asked people to administer questions to someone, and whenever the person got the questions wrong, they were supposed to give them an electric shock. And the level of electric shock went up and up and up. And the fact that they could even get some reasonable percentage of people to continue shocking the person, even past the point where the person stopped being responsive, that's amazing from the point of view of, "Wow, they really saw something in human psychology." And I know that that experiment, the Milgram experiment, has come under criticism. People have said, "Well, maybe they misinterpreted why people were really behaving this way. Maybe some people realized that this was a confederate and nobody was really being shocked." But it still seems pretty darn compelling to me, even with those critiques, that they figured out something about human psychology that you wouldn't a priori necessarily think is true.

SIMINE: Yeah, I agree. I haven't dug into the critiques that well, but I do think that that one might stand up. I think it is really interesting. And I think in a lot of social psych findings, or psych findings in general, there's the question of, it might only matter if it generalizes. But I think you're right that some of these examples, maybe even if they don't generalize, just the fact that in any context, if you can reliably get this effect, I'd still want to see that it's replicable. I'd still want to see that it's not a statistical fluke, and I do think the questions about whether people really believed it and so on are important. But if you could rule out those issues, even if it doesn't generalize to other contexts, I think it's a fascinating phenomenon.

SPENCER: Because then you could start to wonder, well, maybe the Nazis managed to leverage whatever this thing is, or maybe cults managed to leverage whatever this thing is. Because there's something lurking in human psychology that is there; we have proof of concept. Another topic I think about is cognitive dissonance, because cognitive dissonance seems like it's almost just permeated society and now, just everyone accepts it as a fact about human nature. Do you think that we should believe in cognitive dissonance?

SIMINE: I don't feel like I can speak to that very confidently. I haven't dug into that literature. I guess this is one of the areas where, maybe independent of the empirical evidence, I feel like we shouldn't just throw out our common sense. And I think my common sense is somewhere in the middle, like surely there's something like wanting to rationalize something you did after the fact, so then you do post hoc cognitive dissonance reduction. I think there's something like that. I think many of us have had this subjective experience where we can reflect on, "That's what we did. We did something that didn't make a lot of sense." Then we rationalized, "Oh, I must value that a lot, or I must care about that a lot, to have done that." I think that phenomenon must exist. And then the question is, how big of an effect is it? How commonly does it occur in everyday life, and so on? I doubt the empirical research really speaks to those questions so far.

SPENCER: What about a more modern topic, like growth mindset? In a way, I think of growth mindset as a positive development because it's pointing at some measurable aspect of human psychology that seems like it might be important, but you wouldn't necessarily think of it. And then once you have this research, you say, "Ah, yeah, that is an interesting facet of the way people operate." And then maybe that facet is predictive of some interesting outcomes, even if it's not super predictive, even if it's only a little predictive, that still seems to me like a win. I'm curious what you think about that line of research.

SIMINE: Again, I'd have to refresh my memory on the details, but my general sense is that, directionally, I think it's true. And again, I'm not sure we would need a lot of empirical convincing about that, like, is it better to have a mindset that intelligence and capabilities are malleable or fixed? I just think it's obvious. But then the question is, how important is it relative to other factors? So if we're going to intervene on something, is that the thing to intervene on? How big are the effects? How long lasting are the effects? How much are the effects targeted at the groups that we most want to help, etc? There's obviously a lot of empirical research on that stuff. I don't have a strong sense of how clear the answers are about those downstream questions.

SPENCER: Now let's jump into talking about how we improve the field. And I think of you as someone really at the forefront of this. Now that you're the editor-in-chief of Psychological Science, you can put things directly into action, which I think is fantastic. What are some of your thoughts about the next phase in psychology as a field, and how we do better?

SIMINE: I think it's been interesting. The first (say) ten years or so of psychology's introspection and trying to improve, a lot of it was focused on transparency and open science and trying to get back to our roots as scientists — not specifically psychologists, but scientists — that we are committed to opening ourselves up, opening our claims up to scrutiny, verifying each other's work and so on. And we lost sight of that, and I think a lot of that is because the technology just wasn't there in previous decades to share all of our data, all of our materials, and everything; that just wasn't possible. And so we had to catch up to what the technology allowed and reconnect with our core values from third grade, about showing your work and all of that. And so we spent a lot of the last decade, I would say, reconnecting with those core values. And I think we made a lot of progress on that. I think that the challenge is going to be then moving the goalposts a bit and saying, "Great job. Good that we're being transparent. Now let's think about the next step. What is that transparency for? The transparency is so that we can check each other's work and catch errors," and so on. And I think people are not quite completely bought into that yet, even though it seems obvious that that's why we value transparency. But we sold transparency as good for its own sake, and now we're changing the rules and saying, no, actually, we wanted it for this other reason. I remember sitting on a committee once where we were talking about whether researchers across the sciences should be required to share their data and other details of their methods and so on. And one of the scientists on the committee said, "Well, we can't ask people to share all that. That's asking people to give their critics ammunition." And I was like, "Yeah, that's exactly why we want people to share all that." But it's true that we hadn't made that part of the marketing of transparency and open science in the beginning.

SPENCER: Yeah. It's so fascinating to apply the lens of psychology back to psychology as a field, and say, okay, psychological analysis of what is actually going on here. Because I think about this metaphor of, imagine one day you learn that all the gold mining in the US, it's not really mining gold; you're just like, "Oh, wait, no, actually, they've been mining copper all along." And then you're like, "What the hell? How could that possibly happen? What have they been doing?" They were taking copper and they were painting it gold colored. And then everyone thought they were buying gold, but nobody was really buying gold. It would be so shocking, and you'd be like, "What happened?" And so I'm wondering, if we think about the field of psychology, it sounds like you think that the field didn't make progress for a long time, or maybe (let's say) a small amount of progress for a long time. It had to go through this reckoning and go back to its first principles. But what was the field doing all that time, publishing literally thousands of papers?

SIMINE: To use your metaphor, I guess I think one spin we could put on it is we were learning how to mine, and that's good. We just weren't finding the right thing. But we still learned a lot of valuable lessons. I think a lot of people feel so much loss for the specific findings they cherished — either their own or other people's — that they no longer can easily believe in or whatever. But I don't know; for me, the counterfactual is not, "Oh, what if we could still believe all these things?" The counterfactual is, "What if we continued to believe them when we shouldn't?" Another positive spin on it is, yeah, we learned some good things and some bad things about how to do studies and behavioral research. It's hard. Why should it be easy? How could it possibly be easy to understand what motivates people, or how to be a good leader, or how to develop good relationships, or any of these? It can't possibly be as easy as we had fooled ourselves into thinking it was, so I still think all that work wasn't wasted. We did learn a lot about what to do and what not to do.

SPENCER: I wonder if part of the issue here is generalizability. If you look at early psychology researchers, they seem to be making these pronouncements about universals, about human nature, like cognitive dissonance. And maybe we have to just set our sights a little smaller and say, "Okay, can we at least learn some genuine facts about people in one country that really holds for that country, and is totally robust?"

SIMINE: Just to get really, really pessimistic or cynical for a second here, I think one of the things that was so mind-blowing about the replication crisis in psychology is that it started with replicability. It started with saying this signal that you thought you saw in your data, it's not even there in your data. That's not even getting to generalizability. It's saying you didn't even observe what you think you observed. And if we repeated it exactly without trying to generalize to anything new, we wouldn't be able to see the signal again, because it was never really there. Some people are saying that's so fundamental, and that's what we need to fix. And then other people are pointing out, no, but even if it was there in your study, you can't then extrapolate from that to this broader phenomenon you're trying to claim you can address. You don't know why it happened in that study. Maybe it just happened because it was a Tuesday, or because it was in Michigan, or because you used that particular measure of happiness, but it wouldn't hold for another measure. There's so many potential things that could have made the signal real, but for uninteresting reasons. That's going one step beyond replicability. You still have the problem of generalizability. And then there's other people pointing out, "No, no, no, but even if all that's true, you're not actually measuring what you think you're measuring at all. You don't have construct validity." Or there's other people pointing out that you don't have the strength to make causal inferences here; even if it's true and it would generalize to other contexts, it's not causal. Basically, I feel like the replication crisis turned into this crisis of everything, and we realized... I haven't even touched on theory or whatever else. There's so many angles from which you could say it doesn't even matter if the other part holds. Let's say you achieve replicability, or you achieve generalizability, but you don't have the causal inference, so you don't have the theory or whatever. And so it starts to feel completely overwhelming because all of these things are necessary conditions for the research being valuable and achieving the aims of the researchers. But I do think that one of the solutions, just like for generalizability, I think you're right, that maybe the first step is just to make more circumscribed claims. And the same can be said for many of these other challenges. Like, okay, we don't have the design yet to make strong causal inferences. Let's just make descriptive observational claims, and that could be the building blocks for a causal theory later on. So let's not try to solve the problem in one set of studies, one article or one person's career, even. Let's focus on building good blocks that could be then put together later to achieve those goals.

SPENCER: Right. Because if you go back 15 years and you randomly sample papers from top journals, my best guess is something like 40% of them wouldn't even replicate, which means that you're trying to build a house on a foundation of sand. You don't even know what paper you can read and trust, so how do you build on top of that? You first need the building blocks to be secure, to have actual facts that we can all agree on are correct, and then you can start building up from there. I don't know about you, but I think that actually, replicability may have improved a lot over that time, and I've mentioned this to you previously, but we've started to see the signs of this with our own project, Transparent Replications, where we replicate new papers coming out in top journals shortly after they're released. And to our shock — we did not expect this — we have not had much problem with replication. In our first (I think it was) 11 papers we replicated, only one failed to replicate really significantly. And even if we look at that one, the reasons it failed to replicate are pretty subtle, and I don't think that they're the reason that papers 15 years ago were not replicating. It was more like a complicated, confounding effect, not like p-hacking. So this has given me hope that, wait, maybe replication has actually improved a lot. And I'm wondering, do you think it has?

SIMINE: Yeah, probably, yeah. I'm willing to believe it has. I haven't seen a lot of empirical data, but I wouldn't be surprised if it has. One of the things that some of us were saying when the low replication rates were coming out in the early days was that part of the reason that's so shocking and concerning is that this is a really low bar. Replicability does not mean it's true; it just means there's a signal there and I can find the signal again. It doesn't mean the signal means what you say it means, or is interesting, or whatever. But at the same time, it's not a given that we would have fixed this very simple problem, so it's good that we at least did that. I think some of the other problems — like generalizability, theory, etc. — are going to be much harder to fix than replicability. Fixing replicability is very, very straightforward if you want to do it; we know exactly what steps we need to take.

SPENCER: Maybe you just want to walk through a few of those steps. Let's say we want to solve that problem, really nail it. What do we do?

SIMINE: One is that we plan everything ahead of time and follow a detailed plan. If you're not in a position to do that — you don't know exactly what analysis you want to do, you don't know which measure is going to be the right one — then you're not at the stage yet where you're trying to detect a reliable signal. You're still trying to develop the design of your study. That's fine. So you don't have to plan everything out; you keep doing studies until you get to a point where you think you know exactly what you need to do to get the effect. Then you write out that plan — that's called pre-registering — and try to make it as detailed as possible, and try to stick to it as much as possible. If you end up realizing it wasn't a good plan and you need to deviate from it, then great: write down the new plan, do the study again, and show that you can get the effect when you make all the decisions ahead of time. That recipe involves directly replicating stuff, because we're not going to get the plan right the first time; you keep tweaking it, but then eventually you do it again exactly the same way once you think you have the recipe down. So: direct replication, pre-registration, and being skeptical of findings that have hallmarks of flexibility, things like deviating a lot from the plan, having a very vague plan to begin with, or results that are close to the threshold of 'is there a signal there or not?' If you're using p-values, that's results that are barely significant. We apply that kind of post hoc skepticism as well because we can't always tell if the plan was detailed enough; it's useful to use the results a little bit as a cue to whether we achieved this direct-replication-plus-pre-registration ideal. If you use all those, I think it is fairly easy to tell what's replicable and what's not.
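
[Editor's note: a minimal sketch of the "write the plan down first" idea, assuming a simple two-group design; the field names and the specific test are made up for illustration and are not a real pre-registration template.]

```python
from scipy import stats

# Frozen before any data are collected or inspected.
PREREGISTERED_PLAN = {
    "hypothesis": "treatment group scores higher than control",
    "n_per_group": 150,
    "exclusion_rule": "drop participants who fail the attention check",
    "test": "two-sample t-test, one-sided, alpha = .005",
}

def run_planned_analysis(treatment_scores, control_scores):
    """Run only the pre-registered test; anything else gets labeled exploratory."""
    t, p = stats.ttest_ind(treatment_scores, control_scores, alternative="greater")
    return {"t": t, "p": p, "confirmatory": p < 0.005}
```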

SPENCER: Critically, the pre-registration is submitted with the paper so everyone can see, "Oh yeah, they planned to do this analysis that they actually did. They didn't just change their mind after the fact, based on what made their result look good."

SIMINE: Yeah, although we now know that you have to actually look at the pre-registration because authors will sometimes write the paper as if everything was planned, but if you look at the pre-registration, it wasn't. Either the plan was vague or so on. And I don't think authors are doing this maliciously, but they misremember or, in their head, it was a really specific plan, but to a reader, there was a lot of room for ambiguity. So yes, it is critical to actually go look at the pre-registration.

SPENCER: And you also mentioned p-values and being skeptical if the signal is just barely detectable. We took a look at 325 psychology replications, cases where there was an original study, and people went and replicated the study. And what we were interested in is, if the p-value had different sizes in the original study, did that affect the replication rate? And we found that indeed it did. It would probably vary based on what database of studies you look at. But when we looked at it, we found that if the p-value was 0.01 or smaller, 72% replicated — which is pretty strong — whereas, if it was bigger than 0.01, only 48% replicated. That was just a sign that, yes, the p-value really, really does matter.
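
[Editor's note: a rough simulation of the pattern Spencer reports. The mix of real and null effects, the effect size, and the sample size are invented; the point is only that, among "significant" original results, smaller p-values are more likely to come from real effects and so replicate more often.]

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
n, n_studies = 50, 20_000
results = []

for _ in range(n_studies):
    real = rng.random() < 0.5                      # assume half the tested effects are real
    d = 0.4 if real else 0.0                       # hypothetical true effect size
    orig_p = stats.ttest_ind(rng.normal(d, 1, n), rng.normal(0, 1, n)).pvalue
    if orig_p < 0.05:                              # only "significant" originals get replicated
        rep_p = stats.ttest_ind(rng.normal(d, 1, n), rng.normal(0, 1, n)).pvalue
        results.append((orig_p, rep_p < 0.05))

results = np.array(results)
strong = results[results[:, 0] <= 0.01]
weak = results[results[:, 0] > 0.01]
print("replication rate, original p <= .01:", round(strong[:, 1].mean(), 2))
print("replication rate, original p > .01: ", round(weak[:, 1].mean(), 2))
```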

SIMINE: I'm glad, because if you had told me the opposite, I would have had to rethink everything I thought I knew about statistics. [both laugh]

SPENCER: It makes sense from a statistics standpoint. The smaller p-values are harder to force to get the result you want. But it's nice to just see empirical validation of that.

SIMINE: That might change. Honestly, I wouldn't necessarily say it's an absolute necessary truth about statistics, because people's behavior matters. And as it becomes, I think, more and more understood that a p-value above (say) 0.01 means, "Eh, keep collecting more data. Do another study," or something, because it shouldn't convince anybody. As that becomes the case, then people will — not necessarily intentionally — p-hack to below 0.01, so then it won't be as diagnostic anymore. There's this arms race between the statistics and the behavior that isn't totally conscious or intentional. But looking at past research, I think that's a pretty good heuristic.

[promo]

SPENCER: If we think about how it's really possible that so many papers published by competent people in top journals could have been essentially just noise, false positives that wouldn't hold up if you redid them, one thing I think about is the sample sizes. So many psychology studies used to have too few study participants, and so you were in this realm where you didn't even have enough data to test the claim that you were making.

SIMINE: Yeah, but everything we were taught about statistics — and I used to teach it this way, too — is that low sample size is going to lead to false negatives. So if you have a small sample size and you're able to extract a significant result, in the classic teaching of statistics, you do not have to worry about false positives. And in fact, technically, it doesn't change the false positive rate. So if you followed all the rules of statistics and you did your p-values correctly, you followed an a priori detailed plan, which is what a p-value assumes... The p-value assumes that the researcher did not look at the data, made their plan ahead of time, followed that plan exactly, and then the p-value has its statistical properties. In that case, it doesn't matter what your sample size is. The chance of a false positive, if there's no effect there, is five percent or whatever your alpha is. It doesn't matter.

SPENCER: Only under the null, though, right?

SIMINE: Under the null.

SPENCER: I'm gonna push back a little there, because if we consider that you don't know whether the null is true or not, that analysis doesn't hold.

SIMINE: Right. So a small sample does affect the false discovery rate, which is, out of all the significant results, how many of them are true discoveries versus false discoveries? The smaller the sample size, the worse that is, because you're having fewer true discoveries. Because you have low statistical power, more of your studies are failing to detect something that should be there, so you don't have that many true discoveries. You still have the five percent false discoveries when the null is true. And so then, out of all the significant results, more of them are false discoveries. But the five percent error rate that the p-value is supposed to protect you from... it's supposed to hold so that, when the null is true, when there's no signal, you should only detect an effect five percent of the time; that is independent of sample size. So researchers were taught that a small sample is not really a problem for false positives. It's a problem for false negatives. And of course, there's the complication of the false discovery rate issue, but that's hard to understand. I teach it every year, and I still catch myself thinking about it wrong sometimes. It's complicated, and it's easy for even a very competent researcher to come away from their stats training thinking, "Well, small sample size plus significant result means I'm good to go," like there's nothing to worry about there.
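A small simulation makes the distinction concrete. This is only an illustrative sketch: the effect size, the share of studies with a real effect, and the sample sizes are assumptions, not figures from the conversation. It shows the false positive rate sitting near alpha at any sample size, while the false discovery rate among significant results gets much worse when samples are small, because power is low.

```python
# Illustrative simulation: the false positive rate stays near 5% regardless of
# sample size, but the false discovery rate (share of significant results that
# are false) is much worse with small samples because power is low.
# All settings here are assumed.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
alpha, effect, prob_real, n_sims = 0.05, 0.3, 0.5, 5000

for n in (20, 200):
    real = rng.random(n_sims) < prob_real                  # is there a true effect?
    sig = np.empty(n_sims, dtype=bool)
    for i in range(n_sims):
        x = rng.normal(effect if real[i] else 0.0, 1.0, size=n)
        sig[i] = stats.ttest_1samp(x, 0.0).pvalue < alpha
    false_positive_rate = sig[~real].mean()                # should hover near alpha
    false_discovery_rate = (~real[sig]).mean()             # false alarms among "hits"
    print(f"n={n:3d}  false positive rate={false_positive_rate:.2f}  "
          f"FDR={false_discovery_rate:.2f}")
```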

SPENCER: Yeah, it's really interesting. And I think part of what's going on there is that it doesn't take into account the behavioral aspects. If you analyze it from a pure statistics standpoint, okay, yeah, maybe it doesn't have that much effect. But from a behavioral standpoint, if you give me a sample size of 30 participants, it's so easy to just remove one outlier and change the result. If you give me 300 participants, it's so much harder to have the result change based on minor deviations.
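As a rough illustration of that sensitivity (all settings here are assumed; this is not the analysis from any paper discussed in the episode), one can simulate pure-noise studies and check how often dropping the single most extreme observation moves the result across the 0.05 line, at 30 versus 300 participants.

```python
# Rough illustration: with pure-noise data, dropping the single most extreme
# point moves the p-value across 0.05 far more often when the sample is small.
# All settings are assumptions made for illustration.
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)

def flip_rate(n, n_sims=4000, alpha=0.05):
    flips = 0
    for _ in range(n_sims):
        x = rng.normal(0.0, 1.0, size=n)                # null data, no real effect
        p_full = stats.ttest_1samp(x, 0.0).pvalue
        trimmed = np.delete(x, np.argmax(np.abs(x)))    # drop the biggest "outlier"
        p_trim = stats.ttest_1samp(trimmed, 0.0).pvalue
        flips += (p_full < alpha) != (p_trim < alpha)
    return flips / n_sims

for n in (30, 300):
    print(f"n={n:3d}  significance flips after removing one point: {flip_rate(n):.1%}")
```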

SIMINE: There's a great paper that came out last year by Angelika Stefan and Felix Schoenbrodt looking at the influence of sample size and other factors on how easy it is to p-hack or to get false positives. It's a really fascinating paper. And actually, a lot of the great stuff is buried in the appendix, where you can see they modeled all these different scenarios: what if researchers' behavior is more like this, or what if it's more like that, and all the effects on the robustness of the results. It's a great paper.

SPENCER: Yeah, it shows the danger of being overly formalistic, and when you don't take into account human behavior, you can really mismodel things pretty badly.

SIMINE: Yeah, it's fascinating how many of these open science and reform and replicability issues are really psychology issues about changing norms or understanding incentives, or so on. And so many people, including scientists, are so naive about scientists that they think that we're not susceptible to all the biases and everything else that humans are. But of course, we're human. I remember one time the editor-in-chief of Nature — I don't remember who it was at the time — came and spoke at my university, and in the Q&A, I asked him, "Have you considered doing masked reviews so that the authors' names and institutions are not known to the reviewers, at least?" And he's like, "Oh, no, we're not worried about that because we've desk rejected papers by Nobel Prize winners so we know we're not biased." I think he was some kind of natural scientist. And I was like, "Oh, he doesn't have Psychology 101-level understanding of humans, and that scientists are human."

SPENCER: Well, yeah. [laughs] You'd think that psychologists would be the best at this, at identifying the fact that, oh, wait, psychology is probably influencing the way we practice our own work, right?

SIMINE: Yeah, I've heard psychology editors say things just as naive as that. So yeah, we're not much better when applied to ourselves.

SPENCER: One thing I've been thinking a lot about lately is trying to categorize the different problems in the field of psychology. If you try to create mutually exclusive, collectively exhaustive categories for the research being done, the best that I've been able to come up with, thinking about this in a simple way, is four buckets. The first bucket of research is fraud, which I think is pretty rare. And by fraud, I don't mean using subtle gray area statistics to get a statistically significant result. I mean making up your data, lying about what your experiment actually was, never conducting your experiment in the first place. Although that gets a lot of attention, and obviously it's very shocking when it happens, would you agree that's pretty minor, probably less than a few percent of papers?

SIMINE: I don't think so. Mostly, I'm just outsourcing my beliefs to people who know better and, from the people I trust who look at this closely, I think it's a much bigger problem. I think it's an order of magnitude bigger than we think.

SPENCER: Oh, wow. Obviously it's hard to make these estimates but what's your best guess at what percentage of papers might contain fraud?

SIMINE: I think it would be closer to ten percent.

SPENCER: Wow, oh my gosh.

SIMINE: It depends how you count. It depends what the denominator is. I think if you include papers that never, ever get cited, it's higher.

SPENCER: Let's just say top journals though, top ten journals, top 20 journals.

SIMINE: I think it's a little lower then, a bit lower. I also think it does depend how you define fraud, and I agree with you: a lot of practices that, up until now, have not been considered fraud, I won't count. But there's a question about whether, going forward, that should continue to be the boundary that we use. If there's a practice that we really understand is misrepresenting, and everyone should know that, do we continue to not count it? But yeah, let's put aside those practices. I still think it's a little naive, given how good we've demonstrated we are at rationalizing things that we really ought to know better about, to think that not even one in 20 people slides into rationalizing things that we would count as fraud. I don't know. But honestly, that's not my intuition; it's just what I've learned from people who have a lot more experience with this.

SPENCER: Yeah, and I suspect it's a lot more papers than it is people. Because if you're committing fraud, your papers have a huge competitive advantage over other research papers, right? My best guess would be something like one percent of researchers, three percent of papers, or something. But yeah, maybe it's higher than that. Okay, so that's one bucket. A second bucket would be things that straight-up just wouldn't replicate. You do a really faithful redo of the experiment, try to get the same result on a similar population, really high-powered and well-done, and you just don't get the same effect. Fifteen years ago, I would have put this at 40% that don't replicate. That's me trying to triangulate across the different large-scale replication projects. Would you have put a similar number on it 15 years ago, or would you...?

SIMINE: Yep, I'm happy with that number. I would have a big confidence interval around it, but I'm happy with that number.

SPENCER: Okay. And then, what's your best guess now? Just papers coming out in the last six months, if they were all replicated, let's say the top ten psych journals, what do we think we would find in high-powered replications?

SIMINE: That's a good question. I would say a lot higher. Maybe it's twice as good now. I can't remember if you said 40% were irreplicable or replicable. Either way, that's within my confidence interval.

SPENCER: I would say, 15 years ago, 40% don't replicate. So you're saying maybe now, maybe only 20% wouldn't replicate, or something like that?

SIMINE: Yeah. Again, I don't have a precise estimate, but I think it's gotten substantially better.

SPENCER: And of course, the kind of 'right' replication rate is not 100%. That just doesn't make any sense because there's flukes, and there's bad luck, and there's all kinds of things that can happen. But yeah, I think we can agree that 40% is way too high. Not everyone agrees with that.

SIMINE: Well, I'm not sure but I think, for some journals, it could be fine to say we are the journal of really cutting edge stuff.

SPENCER: Sure. The journal of non-replicable results. [laughs]

SIMINE: Yeah, but I don't think it would be a bad thing for there to be some journals like that. Honestly, that's probably how we see Science and Nature and PNAS in the field of social sciences. I think that's the niche it fills, really exciting, big-if-true stuff. It just needs to be marketed that way. And there's real value to that kind of research. But the authors and the journal should market it as really preliminary, 'this needs to be followed up on,' and there should be a good reason why it wasn't already followed up on, like it's really, really hard to do, or something like that. But there's a place for that.

SPENCER: Yeah, that makes perfect sense, as long as everyone knows that's what it is. [laughs] And then there's another bucket which I think has gotten a lot less attention. It's the bucket I refer to as "importance hacking," and it's results that would replicate. If you redo the same study on a similar population, you'd get the same result, but it doesn't warrant publication. And that is a little bit of a fuzzy, subjective category. There could be a lot of reasons it doesn't warrant publication. One example would be that it's completely trivial. It's just an obvious result. It only seemed to warrant publication because of the way it was talked about, which misled reviewers or tricked them into thinking it meant something it didn't. Or it could be that it replicates, but for a really dumb reason. Or it's actually a finding that everyone already knows, but now they just give it a new name, and people didn't realize it. So yeah, there are a lot of different things there. What were you gonna say?

SIMINE: Yeah, can I ask, would you put in that bucket stuff that maybe should have been published if it was framed really differently? And I ask this partly selfishly because, as an editor, I don't know what to do with these papers. The framing makes it not publishable, not even close. They claim that it's evidence for this huge phenomenon, or it's super counterintuitive, or whatever. But there's cases where it's so trivial that, even with the right framing, it shouldn't be published. But would you also include cases where, actually there was something kind of interesting there? It's just not nearly as interesting as they say it is, or it's interesting for different reasons than they say. Would you put that in the importance-hacking bucket, or is that just stuff that, if it were framed accurately, no one would care?

SPENCER: It's a good question because, with any of these categorizations, there's going to be gray area stuff. Even between fraud and not fraud, there's a gray area, like, "Is that really fraud?" But I guess, from the point of view of this categorization, what I would say is the importance-hacking bucket is things that actually did not warrant publication. If the reviewers had truly, deeply understood what was actually done and was actually demonstrated with the statistics, they'd be like, "Yeah, that shouldn't be published." The majority of reviewers would be like, "Yeah, it's not worth publishing," but it does get published.

SIMINE: In that case, then, the authors were right to importance hack because, if they had framed it accurately, it wouldn't have gotten through peer review. I think that's an interestingly different case from situations where, "No, there's something that could be valuable there, and if you had sold it more accurately, we might have considered publishing it, but you just oversold it, and that killed your chances." And of course, usually there's at least some degree of, 'if you had sold it more accurately, there's a much worse chance that reviewers would find it interesting,' so there's always some risk. But I think it's interesting from the authors' side. How do authors negotiate that? And usually they're not conscious of it. But let's pretend authors could accurately see the actual value of their work, and the actual value is at the threshold of maybe being interesting enough to publish. They could importance hack and try to get through that way, or they could be honest. And, yeah, that gray area is fascinating to me; could we ever create an incentive system where authors are going to be accurate about the actual value when they could oversell it and have a chance of getting away with it?

SPENCER: Right. Well, if there was genuine punishment for overselling your work, then that might shift the incentive structure where they actually know that it reduces the probability of publication, not increases the probability of publication. But I do think it's worth separating out those cases.

SIMINE: Yeah, I think journals are guilty of messing up the incentives here, because there's this attitude among journal editors that, if there's a version of it that would be publishable, then we should give the authors a chance. And I actually think that creates really perverse incentives to start with overselling, because then, worst case, you'll get asked to reframe it, and that's fine; you still get through. One of the things I think editors should do is be willing to reject outright on the basis of fixable things when the authors were responsible for the problem. So if the authors are overselling, to me, that's a rejection-worthy thing if it's extreme enough. And yes, of course, it's fixable. And of course, the authors can and will argue with you and say it's fixable: "Why aren't you giving me a chance?" "Well, you had the choice, and this is what you submitted. It reduces my trust in you as a researcher, first of all, that you oversold this much. But also, it's not our job to give you another chance." If we treat overselling as a fixable problem that journals should not reject for, then overselling first will always be the right strategy, and that creates a really perverse incentive.

SPENCER: It's like the salesman who tries to tell you that the hair dryer they're trying to sell you could also make you a cup of coffee. And then you're like, "Wait, no, but it doesn't make a cup of coffee." They're like, "Oh, it's okay. It will dry your hair really well." And you're like, "Screw you."

SIMINE: Yeah. "I'm not supposed to hold that against you somehow?"

SPENCER: Yeah, exactly, exactly. Let's talk about that bucket just taking into account things that the majority of reviewers actually would not be willing to publish if they really understood what had been proven, what exactly the study had done. What percent of papers do you think fall into that nowadays, in the top ten journals?

SIMINE: It's hard because I want to value a lot of things that other people don't value. So there's a lot of these papers I see where I'm like, "You way oversold it. If you had sold it to me accurately, I actually would have liked it," and maybe other reviewers wouldn't. Let me give a concrete example. There was a paper I received as an editor where there was a really simple, basic finding that was interesting and important, although it wasn't novel. We already knew it, but I thought it was important to show it again; it wasn't like we knew it absolutely for sure. And the authors had submitted it with a bunch of fancy extra interaction effects and things like that that I didn't think were replicable and real, and they didn't really have the right design to establish those effects anyway. So I said to the authors, "Hey, if you take all that stuff out and just show the main effect, I think that's an important enough and good enough contribution that I would send it out for review and be open to publishing it." So they said, "Okay, okay." They submitted it again, and there was still some of that extra stuff that I thought was overselling in there. And so I went back to them saying, "No, no, I'm serious. You have to take out all this other stuff and just frame it around the main effect." And then they came back to me and told me the truth about what was stopping them. They were like, "Well, we think that reviewers won't like it if we do that." And I was like, "Okay, you might be right, but I'm the editor, and I'm telling you I won't reject it for lack of novelty or lack of flash," or whatever. It turns out we were both right. They were right: the reviewers did ding it for that and didn't want it to get published, but there weren't any other serious concerns, so I accepted it anyway. Ironically, it turned out to be one of the most cited papers we published that year. So everybody was wrong about it not being interesting to readers. It's hard for me to answer the question of how many are uninteresting to reviewers, because I find that a lot of the stuff that is uninteresting to reviewers should be valued more. Only for a pretty small percentage would I say, even if it were framed accurately, that it should never have been published, because I think many things are worth publishing if framed accurately, even though they're pretty boring. But I think most of the literature should be pretty boring.

SPENCER: Right. So you might have a little bit of an unusual view on this that we just need a lot more of these reliable results even if they're a bit boring, to just start building up this set of empirical findings.

SIMINE: I think the problem is when they're things like artifacts. Here's another example. This was a study; I don't know anything about the authors or even where it was published. I just saw a news headline saying that some really high percentage of Australians had experienced child abuse. And it was such a high percentage that I just didn't believe it, so I dug into the method. You had to get into the supplemental materials before you found this information. But the piece I found that explained this weird result — I'm not sure that this fully explains it, but to my mind it does — was that they had a representative sample of people to whom they texted a link to the study. Of course, then there's self-selection, so it didn't end up being a representative sample, even though they did a good job of trying to reach a representative sample of Australians. But the link was something like childabusestudy.au.org or something like that; it wasn't that exactly. And I was like, "Oh, they texted people with that information in the text. So then the selection effects on who participates are totally skewed." And that's the kind of problem. Let's say I'm right that that's what caused the really high estimates; then it shouldn't be published. There's no way to salvage that. There's no way to frame that as, "When you have really biased selection effects, you get a really high estimate." That's not interesting. So I do think the percentage that are like that, where the real explanation is some methodological artifact or some damning problem with the design, is still at least 15 or 20% of papers in top journals, I would guess.
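Purely to illustrate the selection-effect mechanism Simine describes (the prevalence and response rates below are invented, not taken from the Australian study), a toy simulation shows how a representative contact list can still produce a badly inflated estimate when people who have had the experience are more likely to respond to a topically labeled survey.

```python
# Toy illustration of selection bias: the contact list is representative, but
# people with the experience respond more often, so the estimate is inflated.
# All numbers are invented for illustration.
import numpy as np

rng = np.random.default_rng(3)

population = 100_000
true_prevalence = 0.20
has_experience = rng.random(population) < true_prevalence

# Assumed response rates: those affected are more likely to open a survey
# whose link names the topic.
p_respond = np.where(has_experience, 0.30, 0.10)
responded = rng.random(population) < p_respond

print(f"true prevalence:                    {has_experience.mean():.1%}")
print(f"estimate from self-selected sample: {has_experience[responded].mean():.1%}")
```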

SPENCER: Got it. I think I would put it higher than that, but we might have different bars for what we think should be in the literature. I guess the question I ask myself is, "Do I feel glad I read this paper?" And assuming that I'm interested in everything — because, obviously, there's going to be topics people aren't interested in — but assuming I'm interested in the topic of the paper, am I glad I read it? Do I feel like I learned something worth reading? I would say maybe 40% of papers don't meet my bar for like, "I felt like it was worth reading."

SIMINE: If I take the perspective of a reader, I think it's at least 40%, probably more. The problem with that perspective is that you're not responsible for the incentive structure that you create by setting that filter. The problem is that we need there to be things in the literature that are not that interesting to most readers, or just aren't big enough to leave you feeling like, "Oh, I'm glad I read that paper," because, otherwise, the incentive is to importance hack. So in order to create the right incentive structure, we need to allow, even into top journals, things that are just incremental and solid, and we're glad they're there, but we kind of feel like we wasted our time reading them.

SPENCER: Well, I don't know, because if we could tighten up the checking for importance hacking, where you genuinely knew, "Okay, so people are going to compare my actual statistical result and the actual methodology to what I'm claiming," and you knew that you were going to get penalized a lot if there was a big gap between those two, then maybe we could set a bar at: for someone actually interested in the topic, it should be worth reading.

SIMINE: I think it depends how tight that connection could ever be. I feel like there's just always going to be some element of trust that — partly the authors have more expertise, partly they have more information — there's always the chance that they're withholding something, and you won't be able to detect that they're withholding it and so on. So to me, there's always room for authors to get away with something. It'll never be impossible or even that hard for authors to importance hack, so I feel like it creates bad incentives if we have too high a bar for publication.

SPENCER: I might have a bit of a different perspective, because doing this replication project, we're not experts in the specific topics that they're running their research on. But we're going through the paper very carefully. We're examining every bit of material. We're rewriting their statistical code, then we're going and rebuilding their study — which is a lot of work — but it usually becomes pretty clear at some point in that process what exactly they did. And then we literally just compare: what exactly did they do, what exactly do the statistics say, and what did they say about what they did? And when you put those side by side, I find it often is pretty clear...

SIMINE: But you don't know about any misses. You don't know about any papers that you would call that, and you don't realize they're importance hacking, but they still are.

SPENCER: That's true, but we find a lot where we're like, "Oh, wow, this is importance hacking." [laughs]

SIMINE: I just worry about the people that are sneaky enough to get away with it. And then if we create even more incentive for that, I don't know; I worry about that.

SPENCER: Yeah. It's an interesting question: okay, the better you get at detecting it, yes, the more incentive it creates to make it even less detectable. But I still feel like, on average, you improve things. Hopefully, you push those that are doing it into being less gray-area about it. There's gonna be the bad actor. Look, if they just made up their materials and lied about their experiment, that's actually really hard to catch. But they're committing an act that will literally get them kicked out of the field.

SIMINE: But what about just being a really, really good storyteller? Yes, hopefully, as readers and reviewers, we're critical enough, but surely it influences us. And I worry that this rewards overselling... that it gives a lot of incentive for that. I want people to be good writers. I don't mind incentivizing that, but there's a version of it that gets uncomfortable.

SPENCER: Well, yes, and I think we're all susceptible to stories. My solution: when I read papers, I don't read the story part initially. I start with the methods, then I read the results. So I've read essentially nothing about what they claim about the result. And I also check the materials; that's also a key aspect. My suspicion is that if reviewers were trained to always look for the materials, to say, "Hey, I can't review this. There's no material. I want to literally see the survey you put in front of your participants," whenever the materials are missing, and then to read the methods and results first, they would actually catch a bunch of this stuff. But if they start with the storytelling aspect, by the time they get to those parts, they're gonna gloss over it a little bit, especially if they're not looking at the materials.

SIMINE: Yeah, yeah, that makes sense.

SPENCER: All right. Before we wrap up, I thought it might be fun to do a rapid-fire round where I quickly ask you a bunch of difficult questions, but don't give you much time to respond. What do you think about that?

SIMINE: I'll try that. Okay.

SPENCER: All right. Cool, cool. There has been some interesting movement to try to get rid of p-values, or to push for a Bayesian way of looking at research instead of a frequentist way. I'm curious to hear your quick reaction to those approaches.

SIMINE: I'm pretty skeptical of most of them. I don't think the problem is the specific tools we're using. It's the human factor in it, which I think all tools are susceptible to.

SPENCER: Yeah, I think I agree with you there. The issue is not so much the tool, but the people who will find ways to misuse the tool. [laughs]

SIMINE: I do think Bayesian stats is a good example of a tool we're underusing. I think there's a lot of value there that goes untapped. I'm not at all against those tools. I just don't think they're the silver bullet that they're sometimes framed as being.

SPENCER: Another thing that's come under fire is peer review. And people will point to studies showing, for example, that papers resubmitted to the same journals get rejected even though they had already been accepted. And people complain about peer review a lot. Overall, do you feel like peer review is doing a good job? Do you think it needs to be reformed?

SIMINE: I think it's put on a pedestal that it doesn't deserve. It's absolutely crucial and should exist, and getting rid of it would be much, much worse. But it does about 10% of what people — including scientists — assume it does. So the problem is not so much the peer review, but that it can't possibly be enough. We can't stop at just pre-publication peer review; that's not nearly enough. But that's not a problem of peer review. It's kind of like expecting your hair dryer to make you coffee kind of thing. I do think we can improve it a lot also, though.

SPENCER: Another topic that gets discussed a lot is incentives for academics, and how there's just this incredible grueling pressure to publish, and if you don't get enough publications in those top journals, you're squeezed out of the field. Do you think that we need to reform in that aspect of the field as well?

SIMINE: Definitely. I feel like it's one of the things that people often say as if it's some deep truth that solves everything, "We need to reform the incentives." But that's just a restatement of the problem. I don't know how. Yeah, I'm not the best person to think about how to get traction on that. There's people like Brian Nosek and others who I think have much more vision about that. But it is a fundamental problem; although, I guess I will add that I think journals play a big role in that. They are a huge part of the incentives, and so part of my motivation for being a journal editor is that we can control and shape some of the tokens that are really, really valuable in the incentive structure. Even though I don't know how to change the structural stuff, there are some really obvious points of leverage, like editing a journal, that people care about.

SPENCER: As the expectations for science change, as we ask people to do better science, to pre-register, to do their work more carefully, one would expect that there will be fewer interesting findings, but hopefully they'll be more real, more robust, more reliable. Does that mean that we need to shift our expectations for what an academic does early in their career?

SIMINE: Absolutely, and later in their career, too. I think that we should shift from celebrating people who make tons of discoveries all the time and have so much to deliver to policymakers and everyone else, to being very, very skeptical of people whose careers are just, every month, they have a new thing to tell the public about. How can that be, especially in a field as hard as psychology? The little I know about some of the other fields, it's considered a big deal if someone, once or twice in their career, has something newsworthy or policy-worthy or whatever. And I think it really makes psychology look bad that we expect and celebrate that people have these kinds of breakthrough-level discoveries routinely. I think we're addicted to that feeling, and we need to stop.

SPENCER: As a mathematician, I love math. I love fancy analyses. But when I see a fancy analysis in a psych paper, it often makes me nervous. And one thing we found in our replication work is that sometimes they'll do a really fancy analysis without doing the obvious analysis. And they might even pre-register this, that they'll do the fancy analysis, but not do the simple analysis. So we came up with this idea we call the 'simplest valid analysis' where, regardless of what the original paper did, we always want to do the simplest valid way to analyze the data, just to see if that agrees with the fancy math they might be using. And I'm wondering, do you see this as an important tool, or not so much?
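Here is a hypothetical sketch of what a 'simplest valid analysis' check could look like in practice; the data, variable names, and models are invented for illustration and are not drawn from any study discussed in the episode. The idea is simply to run the most basic valid test of the headline claim alongside whatever fancier model the paper reports, and see whether the two agree.

```python
# Hypothetical sketch of a "simplest valid analysis" check: alongside a fancier
# model (OLS with a covariate and an interaction), also run the simplest valid
# test of the headline claim (a plain two-group t-test) and compare the two.
# Data and variable names are invented for illustration.
import numpy as np
from scipy import stats

rng = np.random.default_rng(4)
n = 400
treated = rng.integers(0, 2, size=n)                   # 0 = control, 1 = treatment
age = rng.normal(40, 10, size=n)                       # an arbitrary covariate
outcome = 0.3 * treated + 0.01 * age + rng.normal(0, 1, size=n)

# Simplest valid analysis: does the outcome differ between conditions at all?
simple = stats.ttest_ind(outcome[treated == 1], outcome[treated == 0])
print(f"simple t-test: t={simple.statistic:.2f}, p={simple.pvalue:.4f}")

# Fancier analysis: OLS with an intercept, treatment, covariate, and interaction.
X = np.column_stack([np.ones(n), treated, age, treated * age])
coef, *_ = np.linalg.lstsq(X, outcome, rcond=None)
print(f"OLS treatment coefficient: {coef[1]:.2f}")
```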

SIMINE: I can only speak for the areas that I'm most familiar with. In personality psych, I think absolutely; I think personality psychologists have the double-edged sword of being quite good at fancy stats, which can be really, really useful. There's some kinds of things, like with longitudinal data, where establishing causality is super tricky. It turns out that, whether you do a cross-lagged panel model or random intercept cross-lagged panel model matters a lot. I don't understand the difference between those, but I know that it matters a lot. But there's also many, many cases, more cases, where the fancy thing is not that much better than the simple thing, and skipping the simple thing is a red flag. So I think that, at least in that area, I would say absolutely, you should never skip over the really simple stuff, or hardly ever. But I think there might be other areas of psych where that might not be true, but I'm not sure.

SPENCER: Have you seen a culture change in the younger generation of researchers who've come of age during the replication crisis and during all this discussion?

SIMINE: Yeah, I think so. I have a really biased sample, so it's hard to say, but that's my impression.

SPENCER: Maybe that's a point of optimism that maybe the newer generation of researchers will just have this stuff more in their bones, like, "Hey, we need to make sure our research is more robust."

SIMINE: Yeah. And in some ways, I think it shows that it's not so hard, actually, to have the right intuitions, and it's consistent with my experience when the scales fell from my eyes. I felt like I was going back to my undergrad-level, first-year-grad-student view of science, which had been beaten out of me with all this 'you have to be strategic, blah, blah,' not consciously or intentionally. But I think what I see in the next generation is a little bit more of holding on to those ideals. So it's not actually that they're learning new things, but that they're not unlearning the things that were there to begin with.

SPENCER: Last question for you: what do you see as the role of technology in improving psychology as a science?

SIMINE: I think there's a lot of potential, and I think that we're not great at taking advantage of it, psychologists at least. Some areas of psychology might be better than others, but in the areas I'm familiar with, we're a little bit Luddite — or maybe I'm projecting because I am — and aren't taking full advantage. But we're still susceptible to some of the potential harms of the technology. So it's the worst of both worlds, but hopefully we will catch up.

SPENCER: Are there specific technologies on the horizon, or ones that people are beginning to use, that you think have a lot of promise?

SIMINE: I'm going to sound super naive here, because I can only answer at a high level, but, from what I understand, I think generative AI will be useful for some aspects of peer review. There are a lot of aspects that don't require human judgment, or where the human judgment can be added on, the human-in-the-loop model. A lot of time can be saved that way. I'm not even sure you need generative AI; even simple algorithms might be able to do a lot of the stuff that we currently have to do through human labor. And that would free up the human labor for the, again, I'm going to sound naive, so far inherently human aspects of critical thinking and judgment that I wouldn't want to outsource to a machine anytime soon, though I won't rule it out forever.

SPENCER: Simine, thanks so much for coming on. This was a fascinating conversation.

SIMINE: Thanks for having me. It's been fun.

[outro]

JOSH:" A listener asks: what are your thoughts on tyranny of the majority and other tragedies of the commons?

SPENCER: Democracy has a great deal to be said for it. I certainly want to live under a democracy. Maybe one day humanity will invent an even better system of governance. But it's hard to point to better systems of governance that have been used at a large scale in modern society. But it does have its drawbacks. What if 51% of people decide to vote to screw over 49%? A democracy says, okay, they can do that, right? They just put in their votes and the majority wins. Or you get this even more pernicious issue where, if lots of people don't bother to vote, you could have 20% of people decide to screw over the rest of society, or screw over a large chunk of society. This seems like a really big fundamental flaw with democracy. How do you get around that? Well, one thing is the constitution in the US, I think, helps a lot with that, where you try to put some fundamental restrictions. There's limits to what the majority can do to the minority. Now, of course, in the US, we have a dark history where the majority has often screwed over the minority, so we have to be very wary that it hasn't worked in many cases. However, I think a democracy, together with a strong constitution, at least that's promising; at least, you can see how the Constitution could help protect everyone from a majority that's willing to cause harm.
