July 25, 2024
How much should we trust social science papers in top journals? How do we know a paper is trustworthy? Do large datasets mitigate p-hacking? Why doesn't psychology as a field seem to be working towards a grand unified theory? Why aren't more psychological theories written in math? Or are other scientific fields mathematicized to a fault? How do we make psychology cumulative? How can we create environments, especially in academia, that incentivize constructive criticism? Why isn't peer review pulling its weight in terms of catching errors and constructively criticizing papers? What kinds of problems simply can't be caught by peer review? Why is peer review saved for the very end of the publication process? What is "importance hacking"? On what bits of psychological knowledge is there consensus among researchers? When and why do adversarial collaborations fail? Is admission of error a skill that can be taught and learned? How can students be taught that p-hacking is problematic without causing them to over-correct into a failure to explore their problem space thoroughly and efficiently?
Daniel Lakens is an experimental psychologist working at the Human-Technology Interaction group at Eindhoven University of Technology. In addition to his empirical work in cognitive and social psychology, he works actively on improving research methods and statistical inferences, and has published on the importance of replication research, sequential analyses and equivalence testing, and frequentist statistics. Follow him on Twitter / X at @Lakens.
SPENCER: Daniel, welcome.
DANIEL: Hi.
SPENCER: People who used to read a lot of social science ten years ago, I think they generally felt that they could trust that science, and there were all these really interesting findings, and people were constantly learning about them. And now, a lot of people, their opinion has shifted, and they feel they can't necessarily trust what they read in social science. So my first question for you is, when you read a new paper coming out in a top journal (let's say in "Psychology") how much do you trust it?
DANIEL: Well, it depends greatly on the paper in question. Sometimes things are pretty good nowadays. We've learned a lot in a decade, and sometimes I still see some old practices that linger on. So maybe not everybody is completely up to date, but yeah, I would say it really greatly depends. So no overall judgment, but I'm an optimistic person in general, so I would say pretty good most of the time nowadays.
SPENCER: Okay, so you think we've seen real improvements. What are you looking for to help you decide if a new paper is trustworthy?
DANIEL: I take a look at the methodological procedures that people have used to make sure that they can prove themselves wrong. So I take a look at whether they set up clear hypotheses that are falsifiable, and whether they made a good data analysis plan. If there's a lot of data, that's nice. It's not necessary, but the more data, the more accurate the estimates are. So I would say, in general, these sorts of methodological components, but also theoretical aspects: is it a solid, worked-out theory, for example? Those things all combine into a total judgment. It's a weird thing to do, but you sort of make an overall judgment of all these components.
SPENCER: Now, on the sample size point, when the study has a lot of data, you say it tends to be more reliable. Why is that? Can you break down the reasons for that?
DANIEL: What we've learned over the last decade is that, if you have small samples, then estimates are very variable, and that means that choices you make when you analyze the data move things around a little bit more. This can still happen with a large data set; it's not that it completely prevents the issue. But with small data sets, you really have a lot of variability in estimates. There's just not a lot of data so estimates move around a lot from every additional data point that you collect. In very large sample sizes or data sets, that doesn't happen anymore. So we have much higher accuracy and that's, in general, desirable if you can afford it.
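(A minimal simulation sketch, not from the episode and with purely illustrative numbers: it shows how the spread of a study's estimated mean shrinks as the sample size grows, which is the accuracy point Daniel is making.)

```python
# Illustrative sketch (assumed values, not from the episode): simulate many
# "studies" at different sample sizes and see how much the estimated mean varies.
import numpy as np

rng = np.random.default_rng(0)
true_mean, sd, n_simulations = 0.5, 1.0, 5_000

for n in (20, 200, 2000):
    # Each row is one simulated study of size n; take each study's mean.
    estimates = rng.normal(true_mean, sd, size=(n_simulations, n)).mean(axis=1)
    # The spread of those estimates shrinks roughly with 1/sqrt(n).
    print(f"n = {n:4d}   SD of estimates = {estimates.std():.3f}   "
          f"(theoretical standard error = {sd / np.sqrt(n):.3f})")
```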
SPENCER: When we think about p-hacking, it's a set of techniques by which people, either deliberately or implicitly by accident, end up getting false positives. Maybe they throw away a data point because it seems like an outlier, and that changes their numbers so they fit their hypothesis better, so they keep the exclusion. Or maybe they analyze the data five different ways, but then they only report the one that makes it look the way they want. Do you think that large data sets make p-hacking a lot harder, and that's part of why the results become more reliable?
DANIEL: That's a very good question. Actually, recently, I was arguing with somebody that it is not necessarily the case that a large data set prevents this. It depends on what kind of flexibility people can afford themselves. So if they can really change their whole statistical model, or the dependent variables that they enter, or things like covariates (basically the whole way that they try to ask a research question), if there's a lot of flexibility there, then you can still quite strongly get what you want and move the data around, basically by changing your analysis in all the ways that you describe. If the theory is a little bit stronger and makes a more constrained prediction, that becomes more difficult. That's also why it's often important to take this theoretical component along. If it's a good theoretical prediction, you know that they didn't have that much space to move things around. But as I said, it's not completely solved by large data sets. If there's flexibility, people can still engage in this practice of p-hacking, and that will still happen. I think you'll still see it sometimes.
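(A minimal simulation sketch, not from the episode and with assumed parameters: even with 5,000 participants per study, an analyst who tests ten unrelated outcome variables and reports only the smallest p-value gets a "significant" result far more than 5% of the time, which is the flexibility point Daniel is making.)

```python
# Illustrative sketch (assumed setup): analytic flexibility inflates false positives
# even when the sample is large. There is no true effect in these simulated data.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n, n_studies, n_outcomes = 5_000, 1_000, 10
false_positives = 0

for _ in range(n_studies):
    group = rng.integers(0, 2, size=n)           # treatment vs. control labels
    outcomes = rng.normal(size=(n, n_outcomes))  # ten unrelated dependent variables
    # "Flexible" analyst: test every outcome and keep only the smallest p-value.
    p_values = [
        stats.ttest_ind(outcomes[group == 1, j], outcomes[group == 0, j]).pvalue
        for j in range(n_outcomes)
    ]
    false_positives += min(p_values) < 0.05

# With ten independent outcomes this lands near 1 - 0.95**10, roughly 40%.
print(f"False-positive rate with flexible analysis: {false_positives / n_studies:.1%}")
```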
SPENCER: It's interesting that you mentioned theory as a way to make results more reliable, because one criticism of psychology as a field is that, whereas physics seems to be building something up, a theory that gets better and better over time, and more and more unified, and more and more generalizable, people have argued that psychology maybe isn't doing that so much. Maybe it has lots of micro theories but they don't necessarily get rolled up into a mega theory. What do you think of that critique?
DANIEL: I think that is, in many fields, quite solid. That's a solid criticism of what we're doing, and it's a very undesirable situation. Now, some fields maybe do this a little bit better. For example, in cognitive psychology, there are quite a lot of different models of things, like working memory, for example. They're not infinite, and people there actually sometimes even get together and discuss them among each other to try to figure out how we can exclude different theories and test them against each other. But it's completely true that many fields focus much more on effects and less on accumulation of knowledge that builds into an overarching theory. That's definitely a solid criticism, and in some way, we're trying to fix methods first because it's easier; fixing theories is more difficult. So I think that's why you've seen more focus on the methods, and now hopefully we're in a place where you can also fix those theories.
SPENCER: One challenge, I think, with theories in psychology is that they're often not written in precise mathematics; whereas, a theory in physics, it's literally an equation. Everyone can look at the equation and see exactly what it predicts. A theory in psychology — at least most of the ones I've seen — are written in words, and in words, there's actually a fundamental ambiguity. So if you go do an experiment, can you really say you've refuted the theory? And in addition to the ambiguity, we wouldn't expect a psychological theory to hold 100% of the time in all cases. Maybe a theory in physics, like if you isolate enough variables, you should be able to say every single time you do that experiment, you should get the same result, or 99.9999% of the time; whereas, in psychology, it's like, of course not; of course, any theory is going to have its exceptions. So I'm wondering, do you see it as feasible to actually refute theories in psychology, or do you think that these ambiguities and lack of precision make it really hard?
DANIEL: Well, I'm hopeful that we can get pretty far here, but it's more difficult. One of the quotes that I like on this topic is from Einstein, who said that studying physics is child's play, but studying child's play, on the other hand... The moment that there's free will, things just become really difficult. Now, that doesn't excuse us with our verbal theories. We can do better, and sometimes we do. We have, for example, very good theories of reinforcement learning. That is the way that people are rewarded for certain behavior, and then they start to do it more. And I think everybody who has a weird app on their iPhone or their smartphone that they open every morning just because they get their daily reward is basically a victim of this kind of model, and we can model this quite well with mathematical theories. But for the majority of stuff, you're right; it's this verbal kind of theory. A nice anecdote here is that, when I used to teach a broad psychology course for all sorts of people at our technical university, there were also physics majors. I would teach them about cognitive dissonance theory, which is really more of an effect than a theory. And one person came up to me after a lecture and said, "You keep using this word 'theory,' but I don't think it means what you think it means." We use this word in a very different way than people in physics do. Completely true.
SPENCER: That's funny. But there's also this opposite concern, which is 'over-mathematicalizing' something. This has been a concern in economics where, nowadays, as I understand it, if you do a PhD in economics, there's just a tremendous amount of math. And some of the old school economists are like, "Well, actually, some of the best theories of economics are not these super mathematical ones. They're more like understanding the deep underlying principles." So I'm wondering, do you think that the best version of psychology as a field would have really precise mathematical theories, or is that just not the right tool for a bunch of the stuff that psychology is trying to understand?
DANIEL: I think that's a very nice point. That's a very good thing we should ask ourselves. So our goal is not to make mathematical theories or quantitative models of things. It's lovely if we can do it, but it would also be a shame if we limit our field of study to the things that we can nicely model. Actually, yesterday we had a course on theory building. Noah van Dongen, a colleague of mine, gave it. We have a graduate school. And a person there said, "Yeah, we are very good at explaining the Stroop effect," which is a classic in psychology. And he's like, "We have very good models of this. But actually it doesn't really do much outside of these models and of these experiments. So in the real world, it doesn't do much." So yeah, we also want to, of course, explain things that matter in the world, and maybe sometimes those are — at least for now — way too complex to capture in these more formal models, but then we should still study them as well as we can. So I think it's a nice point. We shouldn't make it a goal in itself.
SPENCER: And the Stroop effect, if I'm remembering correctly, is when you show words that are color words, like the word 'red,' but it might be shown in the color blue. The way people perceive it, they might have trouble reading the word if it's in a different color and vice versa. Is that right?
DANIEL: Yeah, they have to name the color, but the word itself also has a semantic meaning, and that's also a color meaning. So you can have 'red,' but it's written in blue, and then you have to say blue, but it's difficult because the meaning of the word is conflicting. Maybe a fun fact here is that people used this technique during the Second World War to identify spies. If you do this in Russian, for example, and you just have to name the color of the words, and you don't know Russian, then conflicting color words wouldn't slow you down. But if they do slow you down, then you're probably a Russian spy. So that's a very stable finding in our field, and we actually also have good theories about it. Well, that's one nice practical example. But beyond this, sometimes it's difficult to apply it to anything. So it's not the only thing that you want.
SPENCER: And that goes to the idea of these micro theories, a micro theory of how the Stroop effect works, and maybe that theory is correct, but what the heck does it have to do with the rest of human psychology? I'm curious to hear your thoughts on how we make psychology a cumulative science, not just the science of micro theories, although that's obviously a big question; it's probably a difficult one to answer.
DANIEL: Well, I think it's probably very important that we get together a bit more across research fields and lines, because this integration happens across research areas. Everybody in a certain small domain needs to start to talk to other people, but this process is very labor intensive and difficult to do, and generally not rewarded. Eventually, we would maybe even have to do interdisciplinary research, because a lot of the problems we're trying to solve are actually pretty big problems, where we would need sociologists and economists, and the theories would require expertise from all those viewpoints. The difficulty is that this is very labor intensive and not rewarded in the current academic system. So that's probably the first thing we want to solve: making this work more rewarded than it currently is.
SPENCER: You mentioned a couple different theories, like Stroop effect and reinforcement learning. Do you have an example you could point to where you feel like this has gone well in psychology, that it's about something maybe a little bit less narrow and technical, where you feel like a kind of grander theory has been developed that actually has made progress?
DANIEL: An example where I think we've done pretty well is social identity theory. This is about social group processes: things like group identification, but also polarization between different groups. And here, I think, social psychology especially has done a pretty good job developing basic, widely applicable models for how social groups tend to polarize or not. So that's an example of something that's quite relevant, I think.
SPENCER: Does that involve some of those experiments where they'll kind of get the minimal group identification, where they find that, by assigning someone some completely arbitrary label, people will start affiliating with people with the same label and so on?
DANIEL: Yeah, exactly. They have experiments where people are classified as a Kandinsky lover or a Paul Klee lover because of some art that they prefer, and that doesn't really mean anything, but then you find those effects. Now, this is a very experimental paradigm. It is, of course, based on a lot of social processes that happen, like real polarization between groups in society. So luckily, this is something that also nicely generalizes to real-world contexts and is studied there as well, yeah.
SPENCER: Now let's talk about: how do we do better? I'd like to start talking about how we do better as individuals, as project leaders, et cetera. And then we can talk about how we make science better broadly. But where would you start? Let's say someone's doing research, whether it's psychology research or really any type of research. How do you think about getting better at that craft in a way that helps avoid many of the problems that we've seen in social psych?
DANIEL: I think one big thing that I would like to focus on, that I think we need to do a little bit more, is making sure that we get decent criticism on our ideas. Ideally, we would be critical of our own ideas; I think that is the classic idea. But even someone who strongly believed that, if you want to test your ideas, you should try to actively falsify them, a philosopher like Karl Popper, even he would admit that people are not really good at criticizing themselves. Even he would say, "Well, maybe you need to get other people to try to criticize you." So that is, I think, a very good starting point: you surround yourself with people with a decent critical attitude, and you create an atmosphere where people feel comfortable criticizing each other. And that's actually surprisingly often lacking, even in science, I would say.
SPENCER: I was on the board of a nonprofit organization that does research, and the research had gotten criticized for not being rigorous enough. And one thing I proposed to the organization is that, for every new piece of research, there's another person at the organization who's not the one doing the research, who's assigned the role of the chief criticizer. And the chief criticizer is responsible for what happens when the research is published. So if the research comes out and it gets attacked and lots of errors are found, it's actually the chief criticizer who's responsible for that and they're the one who takes the blame. To try to create this role that gets around the social biases of like, "Oh yeah, your work's great and everything's wonderful," it's trying to make it so that there's actually really skin in the game for criticizing.
DANIEL: Yeah, I like it. It's a devil's advocate position, but with real buy-in because, if you mess it up, then you are responsible for it. I think that's an excellent way to create a system where criticism is voiced, and everybody understands it's important, and they're motivated to do it. And it's interesting that people often say, "Oh, lovely idea," and they think that being supportive is really important, and it is, but actually criticizing somebody can be a great act of support because, if somebody is going down the wrong path, you need to point it out. So that's one thing. I've worked a little bit on doing this, but in a larger team. We have this idea of red teaming, which comes from computer security, where people create software and they want the software to not be hacked, for example. So you have a blue team that makes the software and a red team that criticizes it. This is basically something you can do in many environments. You create a team that tries to actively break down what the other group is working on, but you're doing this in a collaborative environment, to help the other group. That's another way you could arrange it.
SPENCER: Yeah, I wonder if, in programming, this is more socially acceptable because there's something a bit more objective about it where, if you put out some code and then it gets hacked, everyone agrees that's terrible. You know that you want to find the bugs in it before it goes out and gets published in the world. And so the red teamer is doing you a favor by finding the bugs. And maybe also there's less stigma because every programmer knows that they constantly introduce bugs. Even the best programmers are introducing bugs; it's just a question of how good you are at finding them. Okay, maybe some programmers introduce bugs at a lower frequency, but everyone introduces bugs, so it's kind of normalized, and then it's like, "Let's make sure we have really robust procedures to find those bugs"; whereas, when you're a scientist or you're working in other fields, maybe there's more ego tied up with there being flaws in the work. I'm curious what you think about that.
DANIEL: I think that's a very important observation, and it's definitely true that, in science — which is my background, of course, but I think this is true in many environments — we create an environment where people have to look smart all the time, and it starts very early on as a student, but actually just keeps going. People feel it's a bad thing to be incompetent, to not know something. For example, I remember when I was a PhD student and my supervisor would tell me about something, I would just always pretend that I knew what it was about. I didn't want to say, "I don't have any clue what this is about." That creates a weird atmosphere where you're not allowed, or you don't feel comfortable, to say that you don't know things, let alone when you make mistakes, of course. Now I don't think that programmers make more mistakes or put more bugs in code than scientists make mistakes when they code — we're probably worse coders — or when we reason about things. It's really more of this environment which you need to stimulate and make it more acceptable that we all make mistakes by discussing them, for example, by making things more transparent so you can't hide them as easily. And that's also something we see now more and more. People are sharing code, which is often scary for a lot of people, because if you make a mistake, it's visible. But it might actually be good for the field that we just see that this is also part of how we work, just like it is for coders.
SPENCER: It's interesting you mention coding because one thing I found shocking in science is how many PhD students are completely self-taught in coding. I came from a programming background, and becoming a really good programmer starting from scratch, fully self-taught, is really hard to do. Software engineering is an entire discipline that takes years of training. So I would just predict a priori that scientific code would be full of problems: full of unfound bugs, sloppy code, code that's really hard to understand. If another scientist had to read it, they'd have no idea what's going on. And I'm wondering, how big a problem do you think that actually is?
DANIEL: Well, you're talking about code, but why would it be different when we develop our own measures? That's also an expertise, and everybody's expected to do it with only a little bit of training. Or take our statistics; it's exactly the same. I think the weird thing about being a scientist is, indeed, we are expected to learn all these things ourselves; whereas, if you're in a professional environment, somebody is specialized in one specific thing, and we acknowledge that this is a very difficult thing to learn that takes a lot of time, and we don't expect computer programmers to also become very good writers, or very good teachers, or very good theoreticians, or very good at all of these other tasks. So one of the things I feel is a bit of a problem in academia at the moment is that people have to do all these tasks, and we don't honestly say, "We're not giving you enough time to become an excellent programmer, so you're likely going to make mistakes." You're completely right; it's a very peculiar system. And it seems logical to me that we do indeed make more mistakes. We're not experts. We can learn a lot from experts, but maybe we should just have more specialization within science. It would be nice.
SPENCER: Now, one thing that people might bring up here is, isn't peer review essentially a form of red teaming? I mean, doesn't this already exist? So why are we talking about this problem?
DANIEL: Well, peer review is as good as the peers; it really depends on the quality of the peers. And I think this is a similar specialization issue. What you often see in peer review is that people will say, "Well, I read the paper and I thought it was very interesting, but I really can't say anything about the statistical analysis because that's not my expertise," for example. So we have a system where the people who are the peers are also limited in certain things, so they will miss things. If you do a red team instead — and we did this a while ago for a paper — you can hire red teamers with very varied expertise, people you would otherwise normally never have as peer reviewers on a project like this. For example, this project that we did, which is published now, was a quantitative meta-analysis about gender discrimination. But as one of the red teamers, we hired a qualitative gender researcher. This person would otherwise normally never comment on a project like this, but they provided extremely useful input. By having greater expertise present in the red team, you can, of course, prevent more mistakes. Now, it's also much more effortful, much more costly. We're basically expanding the number of peer reviewers. In a way, you could say, if we just had six very diverse peer reviewers on every paper, we would have the same thing. But in practice, that is not how peer review currently works. Of course, maybe you can't do it for every paper, but for really important papers, I think we might need to be a little bit more strict in the future.
SPENCER: That suggests that the problem is not so much with peer review per se; it's just that peer review is going to have a more limited set of reviewers and a limited set of expertise. But one thing that struck me about peer review is that people are doing it for free as an extra duty on top of their other duties. Obviously, they want to do a good job, but they're not going to spend a tremendous amount of time on it. And my suspicion is that a lot of peer reviewers — perhaps the vast majority — are not getting to the level of detail that you would need to see some kinds of problems. For example, are they reading the pre-registration plan and comparing it to the actual result? Are they taking a look at the data that was provided — if the data was provided at all — and poking at it? I think probably not.
DANIEL: You're right. That often doesn't happen. I have to say that I'm a bit obsessive sometimes as a reviewer, and I spend quite a lot of time on it; I take it very seriously. As a consequence, I write very long reviews, but I also get quite nice compliments from editors, who say that I clearly spent a large amount of time and did a lot of work. So that makes me think maybe I should do a little bit less, because clearly most other people are doing a little bit less. The question is, why do I do this? Why would you invest a lot of time? In the system that you proposed before, where somebody is held responsible if there's a mistake, that is kind of how I feel. When something is reviewed and I was a reviewer, I feel responsible if there's still a mistake in there. I think many people don't have this. There's really no consequence for me personally if somebody else finds a mistake and I was one of the reviewers; it's just that I would feel bad about missing something. Yeah, peer review can be anonymous, but I almost always — maybe always, but I'm not completely sure — sign my review. You can voluntarily sign, and that's something that I do, because I personally think it's better than an anonymous system. But you don't have to; it can be anonymous, and then, of course, nobody knows how well you did your job.
SPENCER: My PhD was in applied math with a specialty in machine learning, and in that world, typically, the person being reviewed doesn't know who the reviewers are. The reviewers are totally anonymous. And the people who read the paper don't know who the reviewers are, but the reviewers know who submitted the paper, which strikes me as completely backwards. In fact, you want the reviewers to have their names on it, so if there are embarrassing mistakes, the reviewers have skin in the game. And you don't want them to know who submitted the paper. Okay, maybe some people can figure it out because of the topic or whatever, but you really don't want them to know who submitted it, because you don't want all the social biases of like, "Oh, well, I don't want this person to be mad at me," or, "This person seems impressive, so I'd better write a nice review," etc. What do you think about that?
DANIEL: I think we're back at the similar problem as before, that this sounds like a very good system to me, and it's something that you might want to try to achieve. Now, the problem is that this requires real buy-in from the people who become the reviewers, and most of the reviewers just wouldn't do it. They don't want to be held responsible if they make a mistake or something. That's why most people don't sign their name. It depends a bit, but many people don't give their name. And I think that journals would worry that, if they would require this, then they won't get a lot of reviewers anymore. And this is all part of the same problem that we discussed before where criticism — and that also means being criticized as being a bad reviewer — people just don't want it, basically. We're not open for this criticism. So yeah, that's a core of the problem, I guess, somewhere, yeah.
SPENCER: So why do people review papers at all? Their name's not even attached to it most of the time. Okay, you can sign up to have your name attached, but most people don't do that. What is the actual motivation? Is it altruistic motivation to help improve science? Is it to impress the editor? Is it because someone requires them to do it? I remember as a PhD student, my advisor would be like, "Hey, will you review this paper?" And I felt like, okay, my advisor asked me to do something; I should do it. It was sort of a sense of obligation to my advisor. But, yeah, I'm curious why people generally review papers.
DANIEL: I know a little bit about the literature. I'm not an expert, so somebody else might add a couple of reasons, but one is just to keep up to date with state-of-the-art stuff. You'll be the first to read new papers, so that can be good motivation. Another can be a sense of responsibility. There are people like myself, for example; one of the reasons that I review is because I think you have to keep the system going if you care about peer review — and I do because it often improves my papers — I think it's important to get criticized. But then you have to submit as many reviews as you get; you have to keep the system going. Whenever I submit a paper, I get two reviews, so that means I also have to do two reviews. It works out a little bit different if we have multiple authors, but something like this. You have to keep the system going.
SPENCER: But that's just self-imposed, right? Nobody's imposing that on you.
DANIEL: No, no. There's nobody who keeps track and says, "Hey, Daniel, you've submitted so many papers, you have to do three more by the end of this year." No, no, that's self-imposed, and you can really get away with not reviewing at all if you just decline all reviews, and nobody would know, and it would save you time. So in a way, you're really a bit of a sucker; I'm a bit of a sucker for doing these reviews. It's not helping my career very much.
SPENCER: It's kind of remarkable that the system functions at all with this kind of opt-in system.
DANIEL: In a way, it's maybe positive. So many people actually just really seem to care about peer review and try to do a good job. And I have to say that I really get a lot of pretty nice reviews of work that I do. So I'm pretty happy with reviewers almost all the time; I would say maybe 95% of the time. But it's kind of amazing that it works as it does. And I think there are big differences between fields and, again, the quality is sometimes a bit of an issue. That is just still true.
SPENCER: Peer review gets a lot of flack. People often criticize the peer review system, and I think partly that's because bad papers get through peer review. If you think about the replication crisis: tons and tons of papers that don't replicate got through peer review. My view — and I'm curious if you agree with this — is that peer review catches lots of types of issues. It just doesn't catch certain other types of issues. And so if we were to remove the system, I would expect it would actually greatly reduce the quality of what's published. Because I think the peer review will generally catch a really low-quality paper; it's not going to make it through. I think what it's generally not so good for, though, are the kind of subtle ways the paper is really bad, whether it's P-hacked or other things that are not that obvious, unless you kind of really deconstruct the paper carefully and realize what it's doing. Or maybe messed-up statistics, but that would require expertise to realize they're messed up and so on. What do you think about that?
DANIEL: Yeah, I think you're right. It definitely filters out problematic papers, and there are just a bunch of those, for all sorts of reasons. It might be that people don't get enough supervision and they're young and they're trying to get published, but they just miss a lot of literature. So somebody needs to say, "Sorry, but you really need to go back and read all this stuff and incorporate it." Or people make mistakes where somebody else has more expertise. I'm often a statistics reviewer for papers. People ask me, "Can you take a look?" And then I often have to point out, "Look, this is just not state-of-the-art. There's stuff that we've learned in the last 50 years, and now you're supposed to do it like this," if you actually want to do a good job. And the more expertise a problem requires, the less often you will have a peer reviewer who has that expertise, so more of those things get through. Actually, if reviewers haven't learned about a lot of innovations or improvements over time, papers that miss them will just keep going through peer review. So that is indeed a problem. The big mistakes get filtered out, and that is nice in itself. And if you get an expert, you get a good suggestion for improvement. But peer review is as good as the peers. So if the peers don't know about a certain technique or flaw, then it just gets through.
SPENCER: Another thing I'd be curious to get your feedback on: we have this project, Transparent Replications, where we're replicating new papers coming out in psychology in top journals. One of the really interesting preliminary findings we have is that we expected to see lots of p-hacking, so lots of failures to replicate. And in our first eight replications, we've only had one really significant failure to replicate, and even that one, I don't think it was p-hacked. I think that it failed to replicate for other, subtler reasons related to confounders. And so it's kind of shocking because, if you'd asked me to estimate the chance that a paper won't replicate — based on these big replication studies that have been done, where they'll take 100 papers and try to replicate them — I would have said something like 40% wouldn't replicate. And yet, we're seeing almost no failures to replicate. Yet, if I look at these papers and ask, "Well, do I think these are papers that are really, really good?" some of them, I do, but a bunch of them, I don't think are very good. And the reasons I think they're not very good are not due to p-hacking. And so I've been thinking a lot about this, and trying to think about how to describe the way that they seem to be not that good. And the term that we came up with for this is what we call 'importance hacking.' It's kind of an analogy to p-hacking. [Daniel laughs] The basic gist of importance hacking is that the paper would replicate: if you redid the exact experiment on new participants and redid the analysis they did, you'd get the same result. But the result doesn't mean what they said it meant, or it doesn't have the value that they claimed it had. An example of this would be: maybe there's some new, cutting-edge innovation in how to improve education. But if you look really carefully at the result, it turns out it barely moves educational outcomes at all, and the only reason they got a statistically significant result is because they used some crazy large sample size. So it looks like a new innovation, but really the effect is negligible. That would be an example of importance hacking. Another example would be that the result looks really pretty and has this really nice takeaway, but if you really dig into the details, you realize it's actually just a messy mush, and it's really not clear what conclusion to draw. But you kind of get led through the paper to come to a certain conclusion that sounds really nice. My thinking is that, if you're a scientist and you want to publish, obviously, what everyone wants to do is to have a really valuable finding. That's what everyone tries to do. But okay, let's say you can't do that, or it's too hard even though you're trying, and you need to publish, and you don't have all these amazing new, valuable results. Well, what do you do? You could p-hack. That's one option, and there was a lot of that going on. Another thing you could do is commit fraud. Very few people are willing to do that, thankfully. But the other thing you can do is importance hack. You can make it seem like you found something really valuable and significant when, in fact, it's a real result — you could get the same result if you did it again — but it doesn't really have the value or significance that you make it out to have. I said a lot of things, but I'm curious to hear your reaction to that.
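(A minimal sketch of the "barely moves the outcome" flavor of importance hacking, not from the episode and with assumed numbers: a standardized effect of d = 0.02, far too small to matter in practice, still comes out highly "statistically significant" once the sample is large enough.)

```python
# Illustrative sketch (assumed effect size and sample sizes): a practically
# negligible effect becomes statistically significant with a very large sample.
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
true_effect = 0.02  # Cohen's d of 0.02: essentially nothing in practical terms

for n_per_group in (100, 1_000, 100_000):
    control = rng.normal(0.0, 1.0, n_per_group)
    treated = rng.normal(true_effect, 1.0, n_per_group)
    t_stat, p_value = stats.ttest_ind(treated, control)
    pooled_sd = np.sqrt((treated.var(ddof=1) + control.var(ddof=1)) / 2)
    d = (treated.mean() - control.mean()) / pooled_sd
    print(f"n per group = {n_per_group:6d}   p = {p_value:.4f}   estimated d = {d:+.3f}")

# At n = 100,000 per group the p-value is typically far below .05,
# yet the effect would barely move any real-world outcome.
```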
DANIEL: Well, yeah, I very much like what you're saying, and I think that is a good observation, and I had a similar observation a while ago. I'll share it. But first, I think it is maybe not surprising that there's less p-hacking. It has gotten so much attention, and I teach a lot of statistics courses, and I actually often teach about ways in which you can have more flexibility than people think they can have. For example, I'll teach about sequential analysis, where you can collect some data, look at it, decide if you want to collect more data, and if you do, you can look at it again. But you're controlling your Type I error rate. So there's nothing wrong with this practice. Now, if you do it badly, we call it optional stopping, and that is p-hacking. But I try to explain to people, "Look, you can have this flexibility. It is not bad. It's called sequential analysis. You can do it; just do it well." Often we have to teach people that they can be more flexible; they're so scared that they will p-hack that they don't want to do any of this anymore. So yes, let's say people really have this feeling of, "Okay, if I get caught p-hacking, that is so problematic, I really don't want to do this." They still want to publish. And what you describe is an interesting example. I saw a different example. I was emailing with somebody, and I said what I think is going to happen — this is a prediction for the future, but maybe you've already confirmed it to a certain extent — is that people are going to do experiments. They will have larger sample sizes. They will get relatively smaller effects, exactly as you say. And these smaller effects are often actually based on some sort of confound. So there's something in their paradigm, something in the way that they do their study. There is a true effect, but it's not really because of what they are claiming it is about. Some confound in the method, something about (I don't know) differences in attention, or whatever kind of mechanism, causes this small, significant effect again and again and again. And now they have these very large samples, they have this consistent effect, and anybody who wants to criticize this will have to dig into the method and discover this confound, and then run a couple of very large studies only to say, "Hey, there is a confound in this other study." I think people might not even do it. So I really have a bit of a worry that, yeah, we'll have a lot of replicable effects, but they're due to other things than what we think they're about, and nobody is very strongly motivated to figure this out, because we've also increased sample sizes so much that it's difficult to do. This is a bit more of a pessimistic outlook, but I think what you identify fits nicely in the same kind of category of 'it will replicate, but it is not what we're interested in.' It's a new concern. I don't hear anybody talking about it so far except you just now, but I had the same email conversation with somebody. So I think it's something we need to prepare for.
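(A minimal simulation sketch of the distinction Daniel draws, not from the episode: peeking after every batch and stopping at the first p < .05 inflates the Type I error rate, while a Pocock-style constant threshold, roughly p < .0158 for five equally spaced looks (a value assumed here from standard group-sequential tables), keeps it near 5%.)

```python
# Illustrative sketch (assumed design: 5 looks, 20 participants per group per look,
# no true effect): naive optional stopping vs. a Pocock-style corrected threshold.
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
n_sims, batch, n_looks = 5_000, 20, 5

def rejection_rate(threshold):
    rejections = 0
    for _ in range(n_sims):
        a = np.empty(0)
        b = np.empty(0)
        for _ in range(n_looks):
            # The null is true: both groups come from the same distribution.
            a = np.concatenate([a, rng.normal(size=batch)])
            b = np.concatenate([b, rng.normal(size=batch)])
            if stats.ttest_ind(a, b).pvalue < threshold:
                rejections += 1
                break  # stop at the first "significant" look
    return rejections / n_sims

print(f"Naive peeking at p < .05        : {rejection_rate(0.05):.1%}")    # well above 5%
print(f"Pocock-style bound at p < .0158 : {rejection_rate(0.0158):.1%}")  # close to 5%
```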
SPENCER: When I think about this problem, I think of this metaphor of a pipe. Imagine you have a pipe that has three leaks in it, and then you go patch up one leak. What happens? Well, you just get more water pressure streaming out of the other two leaks. And then you patch the second leak, and then you get even more water pressure through the third leak. And this is what worries me because, suppose the standard for what it takes to get tenure stays the same; you have to get a certain number of publications in top journals, and we know there was tons of p-hacking happening before that was helping people. Not everyone was doing it, but there was a lot of it going on, and it was helping people get that number of publications. Okay, so we make p-hacking a lot harder. What happens? Do people just magically start producing lots of new valuable research, or do they find other ways to get published that are easier than that? And my concern is that it's really, really hard to do really good science and discover something new and valuable. That's just fundamentally hard. So how do we get to a place where people are publishing the same number of results without some other kind of shenanigans going on?
DANIEL: Well, to continue your analogy, I think the solution is also clear. We have to reduce the pressure, and there are interesting developments here. I'm from the Netherlands, and we have a labor agreement in academia. We only have a handful of universities; we're not such a big country, 18 million people or something, and maybe 13 universities. We also have other higher-education institutions, but I'm just talking about the universities. So they can get together, create a labor agreement, and reach agreement about how we hire people, for example. And recently they changed the tenure criteria. Now, if you're hired as an assistant professor, after one year you'll basically hear if you get tenure or not. That's a ridiculously short amount of time. So you don't really need to worry about publishing a lot in that one year that you're hired, because that's not going to happen anyway; nobody can really expect this. So all of a sudden, you've reduced some of the pressure, and you give people tenure very early on. It's such an interesting experiment because you're taking a lot of the pressure off people. So what is going to happen? Well, maybe they won't publish at all, but maybe they'll think, "Oh, wait, I actually have a lot of freedom now to do stuff that's maybe more difficult and more important, and maybe I don't need a bunch of single-authored papers about something." And one of the things I would love to see is that we get together a little bit more and say, "How are we gonna do this?" I don't really want a science where somebody publishes something, there is a leak, they find a sneaky way to publish a bunch of stuff, and they're happy with it. No, I want people to get together and say, "Yeah, we could do it like this. But look, there's a confound over there, so let's not do it, because it's just not the best experiment we can do." So a bit more coordination throughout this process would also be good. First you reduce the pressure. That's probably still not enough. But then you get together and you say, "Hey, how are we going to do this?" A bit more collaboration, that would also be very good. So if you want my recommendations, I would say those two things are probably pretty important.
SPENCER: You mentioned coordination which seems really important in science. Where do you see this coming into play, and how do you think we can coordinate scientists better?
DANIEL: There are actually philosophers of science who sort of jokingly talk about this. They call it something like a philosopher monarch. Kitcher is a philosopher who talks about this, and he also says it's probably not going to happen. What you have in other fields is something called consensus meetings. People from a field come together and say, "Hey, what do we actually know and what don't we know? What do we all agree we should know? Well, how about some of us go do this? What other things do we need? Oh, we need a good measurement tool for this. Okay, so a couple of other people are gonna do that." It exists. So when you talk about this, people are sometimes surprised. But if you talk about physics, for example, they have a meeting once a decade where they basically say, "Okay, you know what? North America, you're gonna do this, and Japan, you're gonna do (like) magnetism. And in Europe, you're gonna do particles." They have to, because they need crazy expensive machines. So in some fields, there is a system in place where people get together and coordinate. What makes it necessary? Well, in some fields they have to because of money and limited resources; that is one way it emerges. Or they collectively think, "We really need to (for example) save patients and make their lives better," and they really care about improving things in practice. And then they say, "Let's get together and do one of these consensus meetings in our own discipline." So those are some examples I would love to see in other fields as well.
SPENCER: What are some things that you'd like to see social scientists try to reach consensus or coordination on?
DANIEL: Well, I think that's simple, actually. I would say, first, which things need to be replicated. This is a task, a chore that nobody really wants to do, but we all agree we need a robust basis, especially if there are findings we're building on and we're uncertain about them. That is one of these things. Another thing that I would like to see many fields do is get together to develop measures. There's actually a big diversity of measures. Recently, some papers were published about this. A former collaborator of mine, Farid Anvari, for example, just uploaded a preprint showing there's huge variability in the measures being used in psychology, and it's very undesirable. What you see in these other fields that have consensus meetings is that they get together and say, "Which measure are we all going to use in the near future?" Now, I also want to point out there's, of course, a risk, if you have a lot of consensus and a lot of collaboration, that you don't have any diversity anymore. That's not really my end goal. You can even build in a modest amount of desired variability in measures. But you don't want to go all the way in that direction either; right now, it's just too chaotic. So those are two things, replication and measures, and then I think some important, difficult questions that maybe people wouldn't study because it takes a decade to study them, and everybody's like, "Are we really going to commit to this?" Well, we could also get together and say, "These difficult things, part of our time we're going to spend on them. Otherwise, nobody's going to do them, but we all agree they're worth doing." So maybe those three categories. I'm sure other people could come up with more, but that just seems like a good starting point.
SPENCER: That last category, that's like the Large Hadron Collider for physics, right? These massive undertakings where your scientists will work on it for a decade but the whole field will hopefully benefit as a result. What is the Large Hadron Collider for psychology? [laughs] What are some projects?
DANIEL: Well, maybe it's those bigger theories you've been talking about. If you really want to build one of those bigger theories, you can say, "Okay, we actually first need this and this and this, and then we need to figure out this component and then this component." One reason those are not built is because they require integration of all sorts of small knowledge units that are themselves quite difficult to produce. And if nobody sits down and says, "In a decade, we want to be there," then maybe it doesn't get done. As for the Large Hadron Collider, we don't have one single topic like that. A downside of psychology — it also makes it interesting — is that there's so much stuff to study. Human behavior is just so varied, and things change over time. So there are new things to study: the impact of social media use, for example. It didn't exist two decades ago, so nobody was studying it, but now we do. There are all these different topics, but within those, it would be nice if we worked towards some theories of the more important things that we study.
SPENCER: Coordination also seems relevant to the topic we were discussing earlier, red teaming. Where do you see red teaming as coming into play? Because you mentioned that you were doing this study and you actually hired red teamers, which is really unusual. But if red teaming is going to become something that's standard, presumably there has to be some agreement of when it occurs and who does it, and so on.
DANIEL: Yeah, you can only have that if you coordinate. If somebody decides, "Hey, we're going to do it," and everybody signs up, okay. Because if you're inviting this criticism, you shouldn't just invite the criticism; you should also commit to incorporating it. So it really requires coordination of all people involved: "You'll do the criticism. We are going to say that we will incorporate as much of it as we can." And together, we believe that we're working towards a better paper. That doesn't emerge naturally. Now, you can do it in a small group, that's fine, but you often want to do these things more systematically. I see the role of criticism actually being part of coordination. And again, sometimes, when you talk about coordination, it feels like I'm proposing some sort of kumbaya system where everybody agrees with each other, everybody wants to do the same thing. Not at all. Just like you discussed, you assign one person to be the critic. In coordination, you can have a whole team that does something else. I remember that in CERN, for example, now they have a plan to build a larger collider, another circle — a big circle — but there was a sub-team that had the task to think if they could build a straight collider so you don't have to go around the turn, but it would just be one very long tunnel. It didn't happen. And when I was there, most people said, "Yeah, this is not going to happen," but they still invested in a group of people to explore this possibility. And so you don't have to work together on the same thing. You can also coordinate disagreement and variability, and I think that is actually also very valuable.
SPENCER: So in an ideal form of social science, where do you think the red teaming would come into play? Who would actually be paying for it, and when would it kick in?
DANIEL: Well, it would be nice if people can volunteer for it. I have the feeling that there are a bunch of people in academia who already like to be the critics. Maybe they'll say, "Don't let me do new stuff. That's not where my strength is. That's not what my passion is. What I love is to find mistakes in the work of other people." Nowadays, these people have to do this during the peer review process, when it's already too late. We have currently built in criticism at a very unfortunate point in the research cycle, namely, where somebody already collected all the data, they wrote up their entire paper, then we send it off to some journal, and the journal finds peer reviewers. Well, if you would think about how to best organize this system, this is not the moment to criticize people. Of course, they're going to be super defensive because you're standing in the way between the data they already have and the publication they want. But if you move this criticism process forward, all parties would be a lot happier. Now this happens, for example, in registered reports. This is this publication format where you decide to submit your research proposal — your method, your introduction, your analysis plan — but you have not collected the data yet, and you submit this for peer review. The results are not in but peers can criticize you. They can say, "Hey, this is all nice, but this is a fatal flaw. You need to fix this thing." And then you can fix it before it's too late. The same with the red teamers. They come in wherever you need them. That's also a nice thing about it. Whatever you need to have criticized, they're right there, but they would start much earlier, before data collection, and help you prevent mistakes. And everybody is happier. I think even the critics are much happier if they can criticize at a moment where it still matters. At least in my experience, when I'm a peer reviewer of a registered report, I'm like, "Oh, thank God, I can tell you at this moment you shouldn't do this wrong because, otherwise, it would have been such a waste."
SPENCER: It reminds me of public debates. If you have a televised debate, at that point, all the social incentives say you shouldn't back down and you shouldn't say that the other person was right; whereas, if you could have that conversation privately beforehand, then people would have the flexibility to change their minds and not be embarrassed by it. If the red teaming happens early enough, then they're just doing you a favor. They're preventing you from being humiliated later if your results come out the wrong way.
DANIEL: But we do the same thing with criticism of published papers. We also do this through this weird format of commentary articles, and then everybody gets so super defensive, because you write a long letter, you criticize somebody, you send it to the journal. You want to get your criticism published, the other side feels attacked. They can't really communicate with you. That's such a weird way to resolve a conflict, a disagreement. I would be really in favor of just inviting all these people to the local pub, you just sit there for seven hours or something with a beer, you have a good conversation. I think that would be much more productive than the average criticism we now have in scientific journals, of a back and forth of commentaries.
SPENCER: One technique that I think is really powerful, that we've started to see a little bit more of, is adversarial collaborations, where you'll get scientists that strongly disagree. Instead of just debating it endlessly in the academic literature — one writes one paper, one writes a response, and it seems like nothing ever is concluded, and maybe people take sides — the idea is that they actually work together on a paper where they agree on the methodology, and then they co-publish, and they make it clear through the studies, what they ultimately concluded. Did they end up agreeing in the end, or did they disagree? And why did they disagree? And I just feel like this is actually a much better way to resolve disagreement than trading off papers.
DANIEL: You would say so, because I think it fits your disposition. You want to have a nice conflict and then resolve it. I agree, it's a lovely initiative. What you actually see in practice with these adversarial collaborations, though, is that they often fail. It seems to be quite difficult, especially if you have top experts in the field who've really dedicated a decade of their lives to certain theories, and there's conflict between them. You get those people together, and that doesn't seem to work very well. You still have to agree on resolving your issues in some way. There has to be some good faith, and I think that's sometimes missing. Otherwise, I love these formats, but I think, in practice, we need to make sure that the people who are involved in them are willing to actually reach agreement, because some people seem not to be. It's an interesting observation that they often fail. Otherwise, they're a lovely format, and they would play the same role in an ideal system.
SPENCER: That's really interesting. Can you tell me more about the failures? I know of one that failed where it was people who were researching (I think it was) psychic powers, actually. And the two authors basically wrote this paper where they essentially accused each other of manipulating the data, where one, every time they did the study, they didn't find psychic powers, and the other, every time they ran it, they did find psychic powers. I was like, "What the heck is going on?" That was, I think, a clear failure. On the other end of the spectrum, there's Kahneman's relatively recent adversarial collaboration about happiness research, where they really actually worked together to figure out why their research disagreed and came to a nice conclusion together. So that seems like a really nice case study. But what have you seen in terms of them failing?
DANIEL: This information that I have comes from one journal that exists and that actually publishes adversarial collaborations. It's relatively unknown. It's published by the Association for Psychological Science. The journal is called "Psychological Science in the Public Interest," and it only publishes three papers a year; that's it. So maybe you've not seen the journal because it doesn't publish a lot, but they, I think, invite or organize experts and get teams together. Instead of inviting one person to write a paper about a topic they're an expert on, what this journal basically does is — as I understand it — invite two different groups of people who have been debating things, and say, "Come together. Work it out. Write a paper together that you can all agree on that says, basically, this is the state of the art in the field as we all agree on it." So that's the goal of these papers. It's also a bit of an adversarial system because people disagree, and often they've written papers highlighting different things or maybe just disagreeing with each other. Now they have to write a paper that they all agree on, so that we give society basically a good summary that we are all behind. And what I understand is that those papers often don't make it. People start the process, but they can't reach agreement on the final paper. So that's where this comes from. I'm not an expert on it, but that's what I read about this. Often, they just don't make it. That's an interesting observation.
SPENCER: Yeah, that's very interesting. I wonder if this kind of approach works better on more specific, narrow questions, like, "Oh, you ran this analysis, you got this result. We collected a data set, and we got this other result. Let's come together and try to find a new experiment we can do that settles it." Whereas, if you're (let's say) debating — you've got one advocate of a certain theory, you've got a different advocate of a different theory — they both, for ten years, have been beating the drum of their own theory, there, I would be much less optimistic that they would come to an agreement.
DANIEL: I agree. That's one component. There's another paper I very much like, which is called "The Many Smiles Collaboration," which was led by Nicholas Coles. And this is about facial feedback, which is one of the findings that was also criticized during the replication crisis. And what they did was very nice. Instead of just getting a lot of people together and saying, "Let's do a replication of something that's been done in the past," they said, "Let's get experts together and design the best possible test of this hypothesis." It was a lot of work, and there was a large group of experts involved, and not all of them ended up agreeing in the end. Some said, "I can't really sign off on this." The paper was published, but I think there was still some unresolved disagreement at the end. I also feel that some people are just so bought into a certain viewpoint or a certain theory that maybe you need somewhat younger people who have the expertise, but who don't have decades of their career hinging on a certain theoretical viewpoint. That might be another thing to consider.
SPENCER: It's like we build this pile of psychological weight on not changing our mind. The more times we express an opinion, the more well-known we are for that opinion, the more papers we publish claiming something, at some point, the psychological burden of changing your mind becomes so great that maybe the best among us can do it, but maybe most people can't.
DANIEL: Well, the best among us is an interesting thing. I think nobody is born with the ability to just admit that you're wrong or admit that you made a mistake, for example. I think this is actually a skill we should spend a little bit more time training among scientists. It's difficult for everybody. It should be difficult. But also, we don't really help people with it at all. We never teach people this, which is weird. We just expect people to deal with criticism in general. We expect people to just be fine if they get their first peer review and people criticize your work, and then we just say, "Yeah, it's part of it." We should maybe help them deal with this criticism, not feel too defensive, realize it's part of the system. But the same with people who have a career built on a certain viewpoint or a theory, some people do it, they can give it up. They're just like, "Yeah, it was wrong. This was a decade of my life, and it was wrong." Many people can't and I think that also is understandable, but we should help people with this a little bit.
SPENCER: It's funny how, in PhD programs, they train people in a subset of the skills, but then there's other skills that are really important for being a scientist or being in that field that they don't even attempt to train you in; it's just not considered something that they're going to bother with. An example is, in my math PhD, I had so many classes where I learned theorems, but I had literally zero classes on how to be a research mathematician, like what do you do when you get stuck when you're working on a problem? Or what are some approaches that other research mathematicians use to develop new research ideas? Zero of that. It's like they expect you to magically pick that up somewhere. And we talked about, in a lot of fields, that you're just expected to magically learn to program somehow. Do you think that we should be expanding the sense of what is teachable, or what should be taught?
DANIEL: I think it's definitely a good point to ask our PhD students a little bit more like, "Hey, what are your real issues here? What are other issues that you're dealing with?" And I'm getting too old to remember some of this stuff because my PhD is just too long ago now — 15 years ago — so I'm starting to forget things. But I often go for lunch with the department. We have this tradition; typically, we go to lunch together. Now many staff members are busy, so sometimes I'm the only staff member, but there are a lot of PhD students, and then I listen in to their conversation. "Oh yeah, oh yeah. Of course, yeah, of course you're worried about this thing," or, "Of course, you're thinking how to deal with this." And I kind of figured it out myself over the last decade somewhere, and then (I don't know) we just forget that people don't know how certain things work or how you deal with things. We should listen a bit more and teach them these things. I think it's extremely important, actually.
SPENCER: We've talked indirectly about the importance of sticking to principles in science, doing things well. But often the reward structures in science may work against this, not always, but sometimes. And it's a lot to ask someone to follow these abstract principles when they're actually not rewarded for doing so. Maybe they're even punished for doing so, or they're rewarded for doing something else. So how do you think about reward structures in science and what we have to change in order to make them better so that people can stick to their principles?
DANIEL: First of all, it's the same kind of thing. We don't really teach people that there is a conflict between their principles and the things that they will be rewarded for. We don't discuss it, and again, we hope they just figure it out by themselves. But of course, they are in a position where they're extremely vulnerable to certain influences and to fears or rewards. I don't think that's a very good approach. First of all, I think we should just talk about having a certain set of principles as a scientist. I teach, for example, courses on research integrity, also for PhD students at other universities. And then I always have to say, "Well, P-hacking, it's not so smart to do this anymore, because people can now catch you with certain statistical techniques. And I also just want to say that it is also just not the right thing to do, and it's also fine to just say, 'Hey, I'm not P-hacking, not because I fear that I will get caught, but because I want to do the right thing in science,' and sometimes that even means reducing your probability of publishing a paper, for example. It's the right thing to do." This is probably an important point that we should also seriously discuss with people, and maybe give them the option to think about. What I now find interesting, for example, is that I'm a senior researcher now. People often say, "Well, but I have to do this for my junior researchers. They still need to build a career." Yes, we should improve the system. I agree, but surely we can't do it off the backs of these poor younger researchers. And then I think maybe we should discuss it. Maybe we should ask these people, "Hey, what's your goal here? Are you fine sticking with your principles, no matter the cost, or do you want to play the game?" Because that's basically the trade-off that we're asking about, right? And I think many people would say, "Well, life is short. A career in academia is uncertain enough as it is. You know what? I'll stick with my principles, because at least I will never regret doing this. But if I play the game and then I still do not get a career in science, what did I do it for? I'd feel bad." So this is a point where I really feel we should discuss this a little bit more with people and give them the real option. And in my experience, for people who choose to stick with their principles, it often works out just as well as for the people who don't. I don't see a lot of differences there, so it's an interesting thing to take a bit more seriously: just sticking to your principles and discussing them.
SPENCER: I saw an interesting tweet the other day from a former guest on this podcast named Literal Banana, who's a critic of social science, the way it's practiced, and stuff around the replication crisis, and the way I interpreted their tweet is as essentially suggesting that there may not be that much difference between P-hacking and committing fraud, that it's all kind of the same thing. At least that's how I interpreted it; I don't know if that's how they meant it. And I pushed back because I really think that the psychological mechanisms of the two are very different, in the sense that, if you go back to ten years ago, if you literally made up data points that didn't exist, like you wrote down numbers on a spreadsheet that you claimed were study participants, I think everyone would know that's unethical, and they would also know that, if you got caught, you would be kicked out of the field; whereas, for someone who collects five different outcomes and only reports the one that makes the result look best, there may have been some discomfort ten years ago in doing that. But first, I think many people could rationalize it, explaining why that really was the right outcome to use. Second, I think people would generally feel, and they would have been correct, that you wouldn't be kicked out of the field for doing that, even if people knew you'd done it. Maybe they would think slightly worse of you, but it was more socially acceptable. And third, you probably saw other people do that, too, and it normalized the behavior; you're like, "Ah, well, other scientists I respect do this. Maybe my advisor does this, maybe my colleagues do this." And the social mimicry just kicks in. So I think of it as a very different set of psychological forces between fraud and P-hacking; whereas, maybe now, P-hacking is starting to be thought of as a little bit closer to the way we think of fraud, still not all the way there, but maybe there's more social pressure against doing it, in a way that starts to feel ickier and ickier. What do you think about that?
DANIEL: I completely agree with this reading. And there's a benefit of being a bit old like I am: I was still a starting PhD student before we figured this stuff out. And it is really educational to have lived through some of these times where you were like, "Yeah, we're doing this with our data analysis. It's not really, really perfect. It's not really how you're supposed to do it, but it's also not really that bad, is it?" And I remember having literal conversations like this, and everybody would just look around like, "It's not that bad, right? No, I guess not." And maybe, if you want to blame us, it's for not just simulating the consequences of doing certain things, but we didn't know how at that time, so we were like, "Yeah, that's nice, convenient. It's really not that bad, is it?" Of course, people did these things. You saw it around you, so it was normative, and you didn't challenge it too much. So maybe we're to blame for that, but that is really quite different from fraud, which we all knew was bad. And also now, we definitely know how bad it is. Many of us can now simulate the results of these kinds of actions ourselves. And what you see, interestingly, in the Netherlands, for example: two years ago (or maybe three), there was a fraud case, or (I would say) a research integrity violation case. It didn't just involve P-hacking. There were other problematic practices, like breaking consent, for example. But one of the things the investigation committee noticed was that there was excessive P-hacking at a time when they should have known better. This was a couple of years ago, and it was the first time that I saw a committee say, "And we also blame you for P-hacking now." Before, you could get away with it; it was just like you couldn't know (or maybe you could have known, but most people didn't), so it was okay. And now, for the first time, somebody said, "No, you should have known that this is not okay, and we're gonna blame you for it." I think you're completely right. It's definitely switching into an area where we now know it's bad, and if you ask me, sometimes when I teach junior staff or junior PhD students, it might even have gone a little bit too far, where they're really afraid to do anything that might even remotely be associated with P-hacking, even if it's actually a decent practice. So there's super awareness, and maybe people are almost hesitant to use any flexibility, even when it's okay.
SPENCER: Yeah, it's so interesting to see the norms change like this, because this eminent researcher, a number of years ago, published a blog post basically explaining how to P-hack, as though it's a good thing to do. I imagine you know the case I'm talking about, and it was so blatant that you have to think people thought it was just the way you do science. He wasn't trying to hide it. He was telling people how to do it. And now people are like, "Wow, I can't believe he wrote that blog post. That's insane, and I'm not gonna trust his research anymore." That is an incredible shift in attitudes.
DANIEL: And I can understand the tweet that you mentioned, where somebody says, "Well, basically, I think it should just be research fraud." But it's also very easy to ignore how different norms are, or were. Sometimes I explain to my students that, when I went to university and I would take the train to the university building, I would sit down and light a cigar in the train, and I would smoke for 20 minutes. And people think that's crazy, like, "How can you do this?" But that was completely normal back then, and now you would think it's absolutely crazy to sit down in a train and smoke. Yeah, norms change. It's difficult to imagine how things were if you didn't live through that time. Maybe you can blame people, but you can also cut them a little bit of slack, because norms are remarkably strong and, if social psychology taught us anything, it's actually this. Give us also some credit for figuring out that norms are really strong and can have quite an effect.
SPENCER: [laughs] Good point. It reminds me of bribery. If you go to some countries in the world, bribery is just completely normalized; if you want anything to happen in the bureaucratic government, you're gonna have to pay some bribe. And it's considered just normal, where it's like, "Oh yeah, that's just part of how you get paid as a government bureaucrat, is people give you bribes." Whereas, in other parts of the world, it would be absolutely insane to accept a bribe. Everyone would immediately think you were a bad person. And if you live in a part of the world that doesn't have a bribery norm, it's easy to say, "Well, look at how corrupt those other places are," and it is actually a really, really bad thing. If you normalize bribes, that just muddies up everything. It creates this transaction cost to doing everything. Now, lots and lots of people are involved in small-scale corruption all the time. So I think it's actually a really horrible social norm. But it's different to say, "That's a horrible social norm," than it is to say, "Well, all the people in it are acting really badly." You have to have some sympathy and say, "Well, they're stuck in this shitty equilibrium that they didn't choose. And yes, it would be better if they didn't get themselves involved, but once they're in the system, it's pretty hard to resist."
DANIEL: I think that's a very good summary. You can't really blame them for being in this shitty equilibrium. The other part of this is also that I think (maybe that's a naive thought) that progress typically goes in one direction, so you rarely see a society fall back into a system where there's widespread corruption. Maybe sometimes; there are probably some examples, but not too often. And similarly, I don't really see us slipping back into a system where people are P-hacking, because the awareness is just there, and you'd just feel bad about doing it now; whereas before, you didn't realize. That's also why I'm sometimes optimistic. Science is slow. Change is slow. But I have the feeling it goes in one direction, where things slowly but surely just get better. Again, maybe that's naive, but just let me believe this, because it's one of the main drivers of why I'm doing what I'm doing. Don't criticize me on this one please, okay? Just let me have this one thing, please. But I really think it's maybe true.
SPENCER: [laughs] Thankfully, I do pretty much agree with you, in the sense that it does seem like we tend to get better scientific methods over time, not worse ones. So it does seem like there's progress being made. But you touched on a point which I think is really interesting, which is that this kind of attitude of avoiding P-hacking can actually go too far. And I've actually seen this quite a bit. An example that comes to mind is people saying, "Oh, you shouldn't collect multiple outcome measures. If you're doing your study, just collect one or two and pre-register what they are, because if you collect the others, well, how does anyone know you didn't just P-hack it?" And from my point of view, this is actually completely backwards. The way that you understand the world is, you look at it in lots of different ways, and collecting multiple outcome measures is a great way to do that. Let's say you're running a study looking at the effects of meditation. I would want to look at lots of different effects of meditation. The way to better understand the world is not to measure fewer things. It's to measure more things, but be very cautious in your interpretation. So let's say you measure anxiety in two very different ways, and only one of them shows a finding. Well, the previous bad way to deal with that is, you just report the one that has a finding. That's obviously terrible; that's P-hacking. The good way to do it is to say, "Well, we measured this in two ways. Only one found a significant finding." That's something interesting. Let's explore why that is. Was the one that found a significant finding just a fluke, a false positive? Or does it say something deeper about the nature of anxiety and the effect of meditation on it?
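Spencer's "fluke" worry is easy to make concrete with a small simulation. This sketch is not from the episode; the five outcomes, the sample size, and the use of a simple Bonferroni correction are arbitrary illustrative choices, and the outcomes are treated as independent (real measures are usually correlated, which softens the inflation somewhat):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
k, n, sims, alpha = 5, 50, 20_000, 0.05

best_only, all_corrected = 0, 0
for _ in range(sims):
    # k independent outcome measures, with no true effect on any of them
    pvals = [stats.ttest_ind(rng.normal(size=n), rng.normal(size=n)).pvalue
             for _ in range(k)]
    best_only += min(pvals) < alpha          # report only the "best" outcome
    all_corrected += min(pvals) < alpha / k  # report all five, Bonferroni-corrected

print("analytic 1 - (1 - .05)**5 =", round(1 - (1 - alpha) ** k, 3))
print("simulated 'best of five'  =", best_only / sims)
print("simulated corrected rate  =", all_corrected / sims)
```

With five null outcomes, reporting whichever one happens to cross p < .05 succeeds roughly 23% of the time, which is why reporting everything and interpreting cautiously (or correcting) matters; measuring more things is fine, hiding most of them is not.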
DANIEL: I recognize this completely, and actually, because I work in this field, I get a lot of emails from people who say, "Daniel, I did this thing, and then somebody accused me of this being a bad thing. But is it?" For example, yesterday, somebody said, "Okay, I'm supervising a PhD student, and this PhD student just had a limited amount of data, collected during COVID. This PhD student is exploring a lot in their study, but they're being very transparent about it, and they are saying, 'Look, I can't say that these are error-controlled claims. It might be a spurious thing. I'm just describing everything. You should consider my work input for a future theory.'" And somebody says, "But what you're doing is P-hacking." If you are very naive and don't understand what it is, I can understand why somebody is confused about this, but what you actually see is that we have exactly the same problem we had 15 years ago: people are doing things mindlessly. They don't understand the basic principles underlying why something is right or why it's wrong. It looks like something that they associate with P-hacking, and therefore, it must be wrong; whereas they should just understand what the real issue is based on some basic principle, maybe even (God forbid) a philosophy of science, where people say, "Hey, wait, this is actually the goal of what we're doing." And then they can evaluate, "Oh, no, this is actually fine," and feel more comfortable with a bit more flexibility at times. The worry is, of course, that people still don't understand what they're doing, and then they just become stricter, but not necessarily better. That's a realistic concern but, again, you can think of this as swinging back and forth, people being overly strict and then compensating a little bit. Eventually we will get there. It's, in some way, a logical response, but it's a regrettable one.
SPENCER: I think the example you gave earlier, of collecting data in sequence and then deciding whether to collect more, shows how there's a version of doing that that's P-hacking, and then there's a more sophisticated version that's actually totally fine, statistically. But if you just learn, "Oh, you're not supposed to collect some data and then decide whether to collect more," it's very easy to see why someone would think that's a problem if they hadn't learned the more advanced technique.
DANIEL: And when I teach statistics, I love to start with a case like this. I say, "Look, imagine that you're doing this. You're collecting data. You decide whether to go on. Is this right or wrong?" And all these young first-year PhD students, with all respect, they try their best, they learn, and they're like, "No, no, no, this is wrong." And then I say, "No, it's not wrong. It's actually what you're supposed to do, because not only can you do it by controlling your alpha level, but it's even more efficient than if you don't look at interim analyses." This is what happened when we tested the COVID vaccines. We did this with interim analyses, like it's done everywhere, because it's super efficient. I kind of like that, and it's a nice teaching trick. You can push people and say, "Hey, you think you're doing the right thing now, but you're still not thinking about it, just like we used to do 15 years ago. So nothing has changed." But again, people will learn. In general, there's movement in the right direction. But yeah, of course, there are some moments, or some groups of people, who still miss it.
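Daniel's point that interim looks are fine if you control the alpha level can be illustrated with a quick simulation. This is only a sketch under assumed numbers: a two-look design and a per-look alpha of about 0.0294, the standard Pocock-style correction for two looks at an overall 5% error rate; it is not code from his courses:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

def false_positive_rate(alpha_per_look, n_per_look=50, n_looks=2, sims=20_000):
    """Simulate two groups with no true difference, testing after every batch
    of n_per_look observations per group and stopping at the first
    'significant' result."""
    hits = 0
    for _ in range(sims):
        a = rng.normal(size=n_per_look * n_looks)  # true effect is exactly zero
        b = rng.normal(size=n_per_look * n_looks)
        for k in range(1, n_looks + 1):
            n = k * n_per_look
            if stats.ttest_ind(a[:n], b[:n]).pvalue < alpha_per_look:
                hits += 1  # stop early and claim an effect
                break
    return hits / sims

print("peek twice at alpha = .05 each look   :", false_positive_rate(0.05))
print("peek twice at Pocock-style alpha .0294:", false_positive_rate(0.0294))
```

Naive peeking pushes the false positive rate to roughly 8%, while the corrected threshold keeps it near 5%; the efficiency gain Daniel mentions comes from being allowed to stop early when the effect is real.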
SPENCER: I think that's true of a lot of P-hacking, though: there's a version of it that looks the same but is actually valid. Another example that comes to mind is, if you collected a bunch of data, it's a very natural thing to say, "Let me just look for all the strongest relationships." But of course, the response to that is, "Well, aren't you just going to get a bunch of bullshit, a bunch of things that look statistically significant just because you've tried so many combinations?" And that's true if you stop at that level of analysis, but then you can go up one more level and say, "Well, if I'm going to go run a confirmatory study afterwards, maybe that's actually a great way to come up with hypotheses." Or why not take some of your data, hold it in a vault, don't look at it. Do whatever exploratory analysis you want on the data that you're allowed to look at, and at the very end, test the most promising-seeming hypothesis on the data you locked in the vault and make sure it still holds up. That's a very robust way to handle this, but it still allows you to go wild exploring whatever you want.
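The "vault" idea Spencer describes is essentially a holdout split. Here is a minimal sketch with made-up variable names (20 candidate predictors, one real effect); it is not tied to any study discussed in the episode:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
n, k = 400, 20
X = rng.normal(size=(n, k))              # 20 candidate predictors
y = rng.normal(size=n) + 0.25 * X[:, 3]  # only predictor 3 has a real, modest effect

explore = np.arange(n) < n // 2          # the half you are allowed to look at
vault = ~explore                         # the half locked away until the end

# Exploration: freely cherry-pick the predictor with the strongest correlation.
r_explore = []
for j in range(k):
    r_j, _ = stats.pearsonr(X[explore, j], y[explore])
    r_explore.append(abs(r_j))
best = int(np.argmax(r_explore))

# Confirmation: a single pre-specified test on the untouched half.
r, p = stats.pearsonr(X[vault, best], y[vault])
print(f"picked predictor {best}; held-out r = {r:.2f}, p = {p:.3f}")
```

Whatever fishing happens on the exploration half, the one test on the vault half carries an honest error rate, because that hypothesis was fixed before the vault was opened.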
DANIEL: I completely agree. And I think one of the things we'll see in the future is people are starting to develop methods. They exist already, so maybe I should say educating people about methods where they can gain back a lot of flexibility. I'm a big fan of practices such as pre-registration, but what I too often see people do is tie themselves down too much in their pre-registration; whereas, you can pre-register quite flexible analysis methods if you know what you're doing. I think that's going to be one of the next things, because now people complain, "Hey, I pre-registered, but I often have to deviate from my analysis plan because something unforeseen happens." And then I think we're going to educate people in the next maybe five to ten years saying, "Hey, there are all these methods you could use where you can actually have much more flexibility and pre-register and follow a pre-specified plan."
SPENCER: It's a subtle balance of having enough flexibility in the pre-registration so you say, "Here's how I'm gonna analyze it," and then, depending on the way the data turns out, you can still do something that's valid without deviating, versus just putting something in pre-registration that's so vague that it lets you get away with P-hacking.
DANIEL: Yeah, that's what you see often now, but I think they will become stricter; that's my prediction, because that's what we see in other fields. Medicine also started with very general registries, where you just say, "Hey, this study exists, and this is our dependent variable." But then over ten or 20 years, they also moved towards a system where things have to be specified in much more detail. I think we'll continue to value this, and we'll look back at fake pre-registrations, which are very common now, and say, "No, that wasn't good at all." But then hopefully, the more specific we are, we can still build in this flexibility. And I think people will hopefully be a bit more positive about pre-registration as well when they realize there's still a lot of flexibility left, if you just do it well.
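One way to read Daniel's point about pre-registering flexibility is to pre-register a decision rule rather than a single test. The sketch below is purely hypothetical: the Shapiro-Wilk cutoff, the 30-observation threshold, and the fallback to a Mann-Whitney U test are illustrative choices, not recommendations from the episode:

```python
import numpy as np
from scipy import stats

def preregistered_test(group_a, group_b):
    """A decision rule you could write down before seeing any data.

    Illustrative rule: if a Shapiro-Wilk check suggests clear non-normality
    in either group (p < .01), use a Mann-Whitney U test; otherwise use
    Welch's t-test. Flag the result as exploratory if either group ends up
    with fewer than 30 usable observations after the planned exclusions.
    """
    confirmatory = min(len(group_a), len(group_b)) >= 30
    _, p_norm_a = stats.shapiro(group_a)
    _, p_norm_b = stats.shapiro(group_b)
    if p_norm_a >= 0.01 and p_norm_b >= 0.01:
        test_name = "Welch t-test"
        _, p = stats.ttest_ind(group_a, group_b, equal_var=False)
    else:
        test_name = "Mann-Whitney U"
        _, p = stats.mannwhitneyu(group_a, group_b, alternative="two-sided")
    return test_name, p, confirmatory

rng = np.random.default_rng(3)
print(preregistered_test(rng.normal(size=40), rng.normal(loc=0.4, size=40)))
```

The flexibility lives inside the rule, so following the data down either branch still counts as following the pre-registered plan.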
SPENCER: Before we wrap up, I'd love to do a little rapid-fire round with you, where I ask you about different sub-areas and results in psychology, and just get your quick take on them. How does that sound?
DANIEL: Sure. Well, let's give it a try.
SPENCER: All right, so you mentioned (I think it was) the smile project, facial feedback hypothesis. Can you tell us just briefly, what is that, and what's your view on whether it's real and what's really real around it?
DANIEL: It's a nice example of how our bodily experiences can influence us. And I think what it shows is actually that this is true: smiling, intentionally smiling, can have some effect. And what it nicely shows is how difficult it is to explain why it does this. We love the subtle mechanisms, the implicit mechanisms, but it might end up being quite a strong sort of demand-like effect, or something much more conscious. So that's an interesting research area.
SPENCER: The demand effect, meaning people feel like they're supposed to report that they're happier, and that's why they say they're happier when they're smiling?
DANIEL: Exactly. They are just completely aware of what they're doing and think, "Hey, how would this influence my affective state at this moment?" So it's a much more conscious, aware process.
SPENCER: It's funny [laughs] because my quick take on the facial feedback hypothesis is that, when I smile, it just makes me happier, instantaneously and in a way that's impossible to ignore. My theory is that it's a little bit like colorblindness: imagine that 95% of people are colorblind; they might just deny the existence of certain color differences. I'm one of the weird people where I smile and I instantaneously feel happier, in a way that just seems impossible for me to deny.
DANIEL: Good for you.
SPENCER: Yeah, exactly. It's kind of silly to me. I'm like, "Okay, this is not real? Okay." But I believe that I probably am unusually strong in this, and that many people may not have this. Okay, power posing.
DANIEL: [laughs] Well, I think this is historically a very interesting finding, because there's really nothing special about it in the way that we use it now, which is as one example of an effect that is maybe not replicable. Again, it has a similar issue where the implicit effect, the biological effect, might not replicate, while the more intentional or more demand-like effect might still be there on self-report, by the way. It's an interesting historical case where, due to bad luck, one specific finding became very strongly associated with not replicating; whereas there are hundreds of other studies it could have happened to. So somebody got very unlucky being associated with this one finding. That's why I think this finding will go down in the history books eventually.
SPENCER: And this is about how adopting certain poses with your body can create these effects, like feeling like you're more powerful, but also in the original claims, that you increase risk-taking and change cortisol levels. But, I think you're saying that this was a fluke of history, that this effect got so much focus and critique; whereas, there were lots of other effects that were potentially equally dubious.
DANIEL: Yeah, exactly. And again here, it's like with smiling: some people just do it and they feel better, but it's because they're convincing themselves that doing this will make them feel better. So it's not through this cortisol process at all. Well, if they're happy with it, why not? But, yeah, scientifically, it's not something you might want to recommend to everybody.
SPENCER: Yeah, my take is it's similar to the facial feedback hypothesis where I think it's a real effect, but it's a very small effect, that certain postures will make you feel slightly more powerful, maybe improve your mood slightly compared to other postures. But it may not apply to everyone, and a lot of the other additional findings, like cortisol levels, risk-taking, probably don't replicate. But anyway, that's my take. Okay, ego depletion. What is ego depletion and what do you think is true about it?
DANIEL: Ego depletion is the finding that, if you use up a lot of your cognitive resources to constrain yourself... For example, you're sitting next to a birthday cake but you're on a diet, and you don't want to eat the cake, but after a long time, you become depleted and you give in, because you can't restrain yourself, you can't control your ego anymore, so you go for the cake. That's basically the finding. They try to tire people out and then measure behavior that people didn't want to do. On one level, this is completely obvious. Everybody feels that sometimes you're just so tired and you're just gonna eat this bag of crisps because, what the hell, you're just giving in. On another level, there's just a lot of work on fatigue, for example, where you get tired and you stop controlling yourself. So one part of this is completely obvious. Here again, the weird thing about it is this specific theoretical model: it really was supposed to work via the glucose level in your brain, and there was some sort of resource that could be depleted. All of that was probably completely wrong. There was actually a lot of biological criticism of this. The ego depletion effect is also an extremely important research finding for meta-scientific reasons, not for the finding it was supposed to be. It tells us that we can have almost 200 studies in the scientific literature that are part of a meta-analysis, and we thought, this is great, large effect, super relevant, and it turns out it was all nothing. And we know this because not only have other people tried to replicate it, but in response to this, many of the original authors in this field said, "What the hell, you bunch of losers. You can't replicate our stuff? We'll do it ourselves." And they also failed to replicate it. So it is one of the most conclusive null results we have in the field, even though we had 200 studies showing it. It is the big warning of what can happen if you mess up in science.
SPENCER: Some people might look at that and say, "What on earth is happening? How could you have so many studies showing an effect without the effect being there?" Just what were people actually doing? Is there something mind-blowing about that?
DANIEL: I think there were really just a ton of people studying this, and this is really an extreme case. There's probably some P-hacking in there, but also just a huge file drawer. I know many people who ran studies on this, didn't find it, and then didn't publish it. So yeah, maybe P-hacking, but it's also an equally big warning that, if you don't share your null results, we will just end up with a literature that's filled with flukes, with false positives. So hopefully, if we take this seriously, somebody's finally going to do something about the file drawer, because you get extreme situations like this. Now, we don't know how common they are. Hopefully, they're rare, but they do happen, and it's such a waste of research.
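The file-drawer mechanism Daniel describes can be shown in a few lines. This is a generic illustration, not calibrated to the ego-depletion literature; the sample size and the number of attempted studies are arbitrary:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
n_per_group, n_attempted = 20, 2000

published_d = []
for _ in range(n_attempted):
    a = rng.normal(size=n_per_group)   # true effect is exactly zero
    b = rng.normal(size=n_per_group)
    _, p = stats.ttest_ind(a, b)
    d = (a.mean() - b.mean()) / np.sqrt((a.var(ddof=1) + b.var(ddof=1)) / 2)
    if p < 0.05 and d > 0:             # only positive, significant results leave the file drawer
        published_d.append(d)

print(f"{len(published_d)} of {n_attempted} attempted studies get 'published'")
print(f"naive average of published effect sizes: d = {np.mean(published_d):.2f}")
```

Even though the true effect is exactly zero, the few studies that escape the file drawer average a large effect size, so a meta-analysis of only the published record can look overwhelmingly positive.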
SPENCER: Something that binds these last three effects we talked about is, at least from my point of view, there is a real effect that is to be found for all three of them, I think. But the effect is not necessarily to be interpreted the way that the original research teams claimed, if that makes sense. Yes, people get tired. Yes, people get bored and frustrated when they make lots of decisions. Is that ego depletion? But you could see how, if you're not incredibly careful in your experimental design, you could detect an effect that's just a really boring, mundane effect, but make it out to be something else. Or with power posing, okay, maybe people do report being a little more powerful when they're in certain poses. Maybe that's true, that they do report that, but that's different than saying that it's going to change your cortisol levels and risk-taking and so on.
DANIEL: And one interesting thing that you see in these areas, for example, about ego depletion, there was a researcher very active in this field, Mickey Inzlicht, and what they ended up doing after this is thinking much better about their concepts, really conceptualizing, "Wait, what is it exactly that we're interested in?" and not just taking something that sounds very fancy and exciting from a shelf because somebody else put it there, but it wasn't really thought through. You see people go back to a lot of conceptual work. What is it that we're really talking about? Then a lot of measurement work. How can we actually reliably measure this? And they are now doing great work on this, but I think they will just call it fatigue most of the time. Still a very important topic to study. Sounds less exciting, but it is super relevant.
SPENCER: Okay, what about terror management theory?
DANIEL: Oh, [laughs] you've picked some nice ones. Terror management theory is an interesting one. It was hugely popular somewhere in the 80s. I started my career after this, so it's difficult for me to explain why it was so popular. But the basic finding is, if I remind you of the fact that you're gonna die, compared to a control condition where I show you a neutral video or just ask you to think about what you've been doing today or something, the people who are asked to think about dying will be in terror about this. They will feel insecure, and they will want some sort of certainty. So they will become more conservative, for example, or they'll have a preference for things that give their life some sort of meaning and continuity. An interesting observation. In my view (and I've been involved in a review process; I think the paper is online as a preprint, but I'm not 100% sure, so I don't want to say more about it), this is gonna go in the same direction as ego depletion. It will just end up being one big warning of how you can mess things up. Here also, we know this in part because of a replication project where the original authors were involved. They were supposed to pick a study that they thought would replicate, and they couldn't even do this. They couldn't even pick one themselves that had a high probability of replicating. So this project showed null results as well. So it's in this area. And here, I'm very curious if you're going to say that there's actually some truth in here somewhere, because in this one, I really don't know where it would be.
SPENCER: Yeah, [both laugh] this one, I don't have any deeper truth that I think is underneath. The whole thing just might be nothing, as far as I...
DANIEL: Yeah, I fear you might be right.
SPENCER: That was literally what I was gonna ask you. Is there something to cling to, like fatigue in the ego depletion story? But maybe not. Okay, next one. Grit.
DANIEL: Ah. Okay. If you'll allow me to mention very briefly that I have a podcast myself called Nullius in Verba with Smriti Mehta (she's a postdoc at Berkeley, and she studies this topic), so I am basically channeling things that I've learned through her. This is definitely something where there is a real truth underlying it; at least, that's what she thinks. Grit has also become synonymous with flashy findings, or an interesting, exciting finding that might not be as big as it seems. There have definitely been a lot of studies showing, for example, that very simple interventions don't have a large effect on... well, grit is basically conscientiousness: how well students will keep learning or keep working on something. That is grit, which you need to succeed in academia, for example. What it shows is that very simple, one-off interventions (maybe in early work, people were excited about those) don't really have a big effect, as you would expect: if you go into the classroom and say one time, "Hey, students, it's important you do this or this," it doesn't have a big effect on their grade at the end. But I think that this is an important topic in general. We probably want to manipulate it in different ways. But here, I feel there's much more truth to this being an important thing, and it's actually also practically relevant. Effects might be small, but again, that's because we manipulate it in kind of silly ways, and with a little bit better interventions, I feel this has something going for it. Other people would just say, "Let's just call it conscientiousness and not grit." But that doesn't sell books, so okay, maybe that's the only thing you can blame it for. But here I'm a bit more optimistic, although this is mainly inspired by Smriti Mehta, who basically tells me about this.
SPENCER: Yeah. I think that's the tricky question there of how different is it from conscientiousness? Because if you read the actual items of grit scales, it's not exactly conscientiousness. As I understand it, grit has two subscales. One is about working really hard towards a goal, and I think the other, if I'm not mistaken, is about not switching your goal too often. Should we think of it as a facet of conscientiousness, which maybe, for certain things, could be a better predictor than conscientiousness? Or should we think of it as just a proxy for conscientiousness, and it just behaves exactly the way any kind of big five conscientiousness scale would?
DANIEL: I'm not an expert on this, so I find it difficult to generalize, but I feel that here, there is some underlying truth to a process where some people can just stick with it a little bit longer. It might be much more contextual, also in exactly how we should measure it, different in an academic context and in work context or stuff like this. But here, I'm a bit more optimistic that this is an important thing. And here I would really warn people — again, channeling my co-host Smriti — but I really want to warn people a little bit that we're not throwing something away because it's fashionable to criticize it, but that we give it a real shot and take it really seriously, also the criticism on it, and don't just think that this is no longer important.
SPENCER: Yeah, I wouldn't be surprised if these particular grit scales are a bit better at predicting certain things than conscientiousness broadly, because conscientiousness broadly is going to include things like being organized, being on time; whereas, grit is really honed down on the elements related to goal achievement. And so you could see that, perhaps in some contexts, that would be a better predictor than conscientiousness broadly. But it'd be interesting to see how that plays out in the literature over the coming years.
DANIEL: Exactly. I think it's very interesting to see how, for example, terror management theory will develop in the next decade, and grit. And then I'm putting my money on grit.
SPENCER: Okay, only two more, then I'm gonna let you go. Growth mindset.
DANIEL: Again, I mean, this is very strongly related to what Smriti studies. So again, I'm just channeling what I know from her. But yeah, this is also, again, in one-off interventions, maybe not the best thing, but actually probably an important skill to teach students. And to be honest, I do this. I teach my students in the first couple of lectures that it's important to see your own performance as something that improves over time and that you can become better at this, and stressing this aspect, compared to just becoming the best in your class, or something like this. Yeah, I am very sympathetic to this concept, and I hope it has something going for it.
SPENCER: It makes a lot of intuitive sense that, if you believe that, when you fail, it just shows that you're shitty, that that's not going to be as productive an attitude as if, when you fail, you think, "Oh, I just need to work harder, or I just need to learn more." Clearly, the former is a defective life strategy compared to the latter.
DANIEL: The latter is also true. We just suck at a lot of stuff. And I think just being very honest about this... I often talk to people in academia and I say, "Look, we are just very incompetent at a lot of things." And it's almost like it's a bad thing to say, but we are pretty incompetent. But we can learn, we can try, we can study, and if you invest a lot of time, you can become pretty good at something, but you start somewhere and you're not there yet. So I find this growth mindset actually very much in line with my view on how you live your life as a scientist. You start out just not very good, but you can really become quite a bit better if you invest time. I don't know, maybe I'm too committed to it, but I feel this one has something going for it.
SPENCER: No, I agree with you, and I think most of the critique is around, well, how much does it explain. And so to me, it seems like a useful insight. It's worth measuring, but is it gonna explain more than a few percent of the variance in any particular outcome? Probably not. That would surprise me a bit.
DANIEL: Yeah, but also, if we think this is an important skill to learn, we should really teach it to people with more than a one-off intervention somewhere at the beginning of their academic career or something. We should remind people again and again and train them in this. Then the effect would maybe be a little bit bigger as well.
SPENCER: That makes sense. And I would also just say it depends on the cost of training. But if it's a pretty easy skill to teach, which it probably is, then, if it gives a little bit of a boost, that could be really well worth it, even if the effect size is not huge. And I think most of the critique of growth mindset is just, well, maybe people made it seem like the effect size was much bigger than it is.
DANIEL: Yeah, yeah, exactly.
SPENCER: Okay. Final one for you, the implicit association test.
DANIEL: I find it methodologically a very interesting task. It is a nice paradigm to measure how people group things together. I'm very sympathetic to many of the criticisms of it. I teach advanced cognition, and one of the lectures is on associations. And then I teach people how you criticize somebody in academia by giving them one of the papers critical of the IAT, which is called "Problems with the 'I', the 'A', and the 'T'", where the authors say something like, "Well, the IAT is a compatibility task (they don't even want to call it an association task), which has been proposed by blah blah blah, citation, but has also received some criticism," and then the rest of the paragraph is 30 citations or something of criticism, which is a lovely way to criticize somebody in academia, very subtle. But I agree with a lot of the criticism. I think what it is actually measuring is very difficult to establish, and we know that there are all sorts of confounds playing a role. And it would be nice if this was communicated a bit more clearly, also when people go online and do the IAT on the website, for example, where you could say, "Look, this is not really an implicit association," maybe.
SPENCER: Because I think one of the challenges with it is that it was billed as a way to measure, let's say, how racist we are, how sexist we are, that goes beyond the conscious mind, and so maybe we can pick up on things that we didn't even know we had. Like we didn't think we were racist, but we're actually racist. And if it worked, that's a brilliant concept, right? But then we find that the test-retest reliability is really low. So if you take it now and you take it again in a week, often you get different results and that doesn't really make a lot of sense if it's really measuring your deep implicit bias. And then also, there's some questions about, what does it really mean? What is it actually measuring? But my view, and I'm curious if you agree with this, is that there are some use cases for it still, but they're more like use cases where you're averaging over a population and you have to be very careful about how you're interpreting it, and not necessarily saying, "Well, because we're measuring an implicit bias, that means people are being racist," or something like this. You have to be careful about the claims you're making about what's measured.
DANIEL: Yeah. Well, I really think there are some very convincing papers where they show that you can use this task to create associations between anything, just based on the relative salience of the two categories. For example, words and non-words: words are more salient than non-words or something, and then the other category pair is colors or something. You can create things that map together in this task based on many other things than implicit associations, just all sorts of features of the task. I think that should really be communicated a bit more, because I also feel that for many people (maybe in some countries a bit more than in others), if you do the IAT, you're not supposed to challenge the idea that you have implicit racism, for example, because of what this task shows. And that's just not how it works, regrettably. I feel that this is a task people maybe don't want to criticize as clearly, whereas it deserves the criticism. It really would be nice if this was explained a bit more clearly to people who use the task, or who take part in it online, or even to students. I've heard stories where people in classrooms are taught that this measures your implicit attitudes, without any criticism of the measure itself, and I don't think that's a very good thing.
SPENCER: Daniel, final question for you before we wrap up: what do you hope to see in the next ten years in social science?
DANIEL: Well, of all the topics we discussed, which cover some of the things I think are important to think about, I would say: more collaboration over time. I really feel that we're too isolated in what we do, and if you want to make the contribution that a field like ours can make (because I do feel that we are studying something extremely important, especially with many of the challenges that we face in this world), I think we need to band together a little bit more and get serious about collectively solving some of these issues, because they're so difficult that, if we want to make a real dent in them, we should collaborate much more than we have been doing in the past. So that would be lovely. If you make me the science dictator, then that's what I would push for.
SPENCER: Only for a year, remember.
DANIEL: One year. Yeah, okay, let's do it. [Spencer laughs]
SPENCER: Daniel, thanks so much for coming on. This was a great conversation.
DANIEL: Thanks.