November 9, 2024
How much more robust have the social sciences become since the beginnings of the replication crisis? What fraction of replication failures indicate that the original result was a false positive? What do we know with relative certainty about human nature? How much of a difference is there between how people behave in a lab setting and how they behave out in the world? Why has there been such a breakdown of trust in the sciences over the past few decades? How can scientists better communicate uncertainty in their findings to the public? To what extent are replication failures a problem in the other sciences? How useful is the Implicit Association Test (IAT)? What does it mean if someone can predict how they'll score on the IAT? How do biases differ from associations? What should (and shouldn't) the IAT be used for? Why do replications often show smaller effect sizes than the original research showed? What is the Lifecycle Journals project?
Brian Nosek co-developed the Implicit Association Test, a method that advanced research and public interest in implicit bias. Nosek co-founded three non-profit organizations: Project Implicit to advance research and education about implicit bias, the Society for the Improvement of Psychological Science to improve the research culture in his home discipline, and the Center for Open Science (COS) to improve rigor, transparency, integrity, and reproducibility across research disciplines. Nosek is Executive Director of COS and a professor at the University of Virginia. Nosek's research and applied interests aim to understand why people and systems produce behaviors that are contrary to intentions and values; to develop, implement, and evaluate solutions to align practices with values; and, to improve research credibility and cultures to accelerate progress. Connect with him on Bluesky or LinkedIn, or learn more about him on the COS website.
JOSH: Hello, and welcome to Clearer Thinking with Spencer Greenberg, the podcast about ideas that matter. I'm Josh Castle, the producer of the podcast, and I'm so glad you've joined us today. In this episode, Spencer speaks with Brian Nosek about progress in open science, the value of replication, and trends in psychology research.
SPENCER: Brian, welcome.
BRIAN: Thanks for having me, Spencer.
SPENCER: You've been one of the most influential people in terms of improving science and open science that I've ever followed. I'm wondering, how do you see the fruits of those efforts? Do you feel we've made progress in making science more robust?
BRIAN: I think so, but that is pretty speculative. I think so because we don't really have enough evidence yet to evaluate the impact of the new practices that have been happening in this last decade of reform. I think the early indicators are positive, but there's a lot more work to do.
SPENCER: When we look back at the work that has gone back and replicated papers from 10 or 15 years ago, my interpretation of those results is that they often found that something like only 40% or 50% of the results would replicate. If you try to redo the experiment exactly as it was done, recruiting a new population, you often wouldn't get the same results. Now, maybe if you adjust for false negatives, or for the replication study not following the original study perfectly, maybe we could conclude something like only 40% of the results were not replicable. But is that your understanding of the state of social science as of 15 years ago?
BRIAN: I would say, on average, that's about right. The challenge, of course, is in the details. Why did a replication fail? It's easy to draw the conclusion that a replication failed because the original result was wrong, but that's not demanded by the data. There could be failures in how the replication study was designed, so it's really a false negative, as you say, and it can be that there is something meaningful about the phenomenon that is different between the two, that accounts for why we observed it in the first case and didn't observe the phenomenon in the second case. Sometimes those are known, or at least knowable, and somebody knows why that difference matters. In many cases, and perhaps most, nobody yet knows, and that's really what the value, I think, for replication is in advancing scholarship. It's less about identifying false positives full stop. It's more about prompting new questions and interrogating why we saw a signal here and didn't see a signal there. What might we try next to understand better under what conditions this particular phenomenon is observable?
SPENCER: So for example, maybe the reason it didn't replicate was because the original study was done on college students, and the replication study was done on a broader population. Maybe the finding was something about college students, not about people in general.
BRIAN: Right. Exactly. Oftentimes, we observe an initial finding and presume a very general application of that finding. When I do studies in the lab here at the University of Virginia, I don't think this is going to fail to replicate at Ohio State because Ohio State students are different. I just assume this is more general than University of Virginia students. But that could be wrong, and if it did fail to replicate at Ohio State, then that at least prompts a question: what about this might depend on the sample in ways that our original conception didn't anticipate?
SPENCER: I'm kind of surprised to hear you say that. Maybe I'm just more skeptical than you, but my interpretation of a lot of these results is that indeed there actually were genuinely false positives, not merely that we need to figure out what the hidden confounds were, why we get the results sometimes and not others.
BRIAN: I think many of them can be false positives, but my point is that we don't know that they are. Just failing to replicate once is not sufficient to draw that conclusion, and one has to be open to the possibility that there are conditions. This sort of aligns with my more general philosophy of science that comes from Will McGuire and others called perspectivism, which embraces the radical idea that every claim is true under some conditions. The way we make progress in science is by identifying those conditions under which you can observe evidence for that claim. Now, of course, at the extremes, that can be a challenging perspective, but I think in a field where there are lots of variables, where identifying causation is hard, where many things are likely interacting with one another, I find the perspectivism approach to be really appealing because it assumes that every claim is correct and every claim is incorrect, and the point of making progress is to figure out those conditions. When does this matter and when does it not?
SPENCER: That's really interesting. I would imagine that there are some perspectives that say, "What we should look for in science are things that are as universal as possible. We want to find those underlying patterns that, okay, maybe there can be some exceptions, but especially in something like human psychology, there will always be exceptions, but where the exceptions are minimized." Rather than thinking of it as, "Let's try to find the conditions under which this is true or false." We want to say, "Let's look for stronger, more universal patterns."
BRIAN: I certainly think that it's very useful to have those because they provide anchors and useful constraints on theory development, on understanding nature more broadly. For example, within the psychological sciences, I would think of vision research as being closer to what you identify, which is that there's a lot of consistency in how the visual system works across people. Identifying the features of how the visual system works can sometimes be relatively simple, in the sense that there aren't many factors influencing the phenomenon, though even that gets more complicated over time. There are places where simple effects, main effects, explanations, and expectations of high consistency across different contexts should be anticipated and are observed. There are many more cases that are very conditional. Part of the challenge of being in a field like the social sciences is living in that conditional world.
SPENCER: Do you worry at all that it gives scientists a kind of cop-out, saying, "Well, we didn't really disprove my result. It's just that it only works in this tiny, little irrelevant area."
BRIAN: Yeah, that's exactly right. For me, that's what those replication projects were about. I've been making the pro case: everything's conditional. It's complicated. There are so many factors. It's not a surprise that something might fail to replicate, because there could be unseen factors. But simultaneously, it is not a Get-Out-of-Jail-Free card. The challenge that we have to lean into, I believe, to make real progress, is when something fails to replicate and we didn't have a strong theoretical expectation for why it would fail. We need to take that seriously, and we need to recognize that it is challenging the general claim, perhaps bounding that claim to a much narrower zone of where it is observable, and perhaps ultimately revealing that that wasn't as productive an area of research as we thought. I think we were too much in the mindset of, "Oh, well, you didn't observe it with participants that came in toward the end of the semester. I ran it at the beginning of the semester. Therefore, the time of semester explains why you observed the effect and I didn't." That's too easy to do retrospectively, when really what we should be doing is saying, "I have a hypothesis. Maybe the time of semester matters. Let's test it."
SPENCER: Something I've had happen a few times now, when I have psychologists on the podcast, is I'll ask them a question along the lines of, "What are the discoveries of psychology in the last 25 years that you're excited about?" And they express this attitude of distrust in the field, to the point where they feel like it's hard to really know which new findings they can believe. They often will point to something, but it's sort of a little bit disappointing. I'm wondering how you feel about that. Do you feel like we really have made significant progress in understanding human nature in the last 25 years? Or do you feel like we've mostly learned how not to do science, and now that we're improving how we do science, that will eventually lead to a huge wave of discoveries? What's your thought on that?
BRIAN: Certainly there has been within psychology some degree of reckoning, of reassessment of things that we thought appeared to be settled science, if that's even a reasonable concept. That has provoked exactly the reaction that you described: well, gosh, if I can't trust that, what can I trust, and the uncertainty that comes along with that. I think if we have people in the field open up their intro psych textbook and start to go through it, there are a lot of areas of scholarship in psychology where they'd say, "Well, okay, yeah, of course, that's good learning theory. We've made a ton of progress in learning theory; there are some solid things there. The Rescorla-Wagner Model that was really successful is getting replaced by that model and this model, blah, blah, blah." We're learning more about those principles. There will be areas where they look and say, "Oh yeah, this one I'm not so sure about," and I think that variation is okay. I have worried about fatalism in response to these projects, the thinking that nothing is trustworthy. For me, the first response is, across these various replication projects, yes, we're surprised at the replication rates that are observed. But it wasn't zero. It was 50%, 60%, 30%; it depends on the study and the sample and the context, but many things do replicate, just maybe not as many as we thought would.
SPENCER: When you think about findings in psychology over the last 25 years, what are some standouts to you of new things? So not talking about theories from 50 years ago, but things of the last 25 years that you think might be important breakthroughs that are real and robust?
BRIAN: Partly, this is challenging to answer because I've spent so little time over the last decade actually in the psychological literature. I've been much more meta-scientific oriented in my daily activities. But nevertheless, the areas that pop to mind are the things that I got excited about in research. So I'll name those, or at least one of those, which is the research on moral psychology. I have been quite enthusiastic about the work that's been happening in that domain. I contributed a small amount of work with my colleagues, Jesse Graham and Jonathan Haidt, on moral foundations as a research topic and theoretical orientation. But there are so many interesting things that have been happening in the moral psychology space. Kurt Gray has been doing really interesting work. There was a lot of activity on the trolley problem, which is now comical as something to bring up, but it produced this really interesting set of findings that I think grounded that area of research. "Well, here's a phenomenon. Okay, let's try to unpack it and use it as a paradigm to understand people's moral intuitions." I really like those domains where there is something that anchors the research. It's like, "Here is an observation that everybody is interested in. Here is a methodology, an approach, a paradigm." I think when those get identified as something interesting, it can produce lots of new insights. It can also run aground very quickly, where it becomes research about that particular paradigm and not about the phenomena it's supposed to be investigating. But many times it yields a lot of fruit.
SPENCER: For those interested in moral foundations theory, we actually talked about it in the episode with Caroline Mehl, so you can check that out. It's interesting to think about when research can kind of fold in on itself and just end up being about itself. If you take something like the trolley problem, it can be a really interesting way to tap human psychology. Say, when would you pull the lever to save one person, save five people, etc.? But then you could say, "Well, is that actually what people would really do if they were presented with the real moral challenge? If we can't demonstrate that it's actually close enough to real moral behavior, then what are we really studying? Are we just studying how people model their own moral behavior? What does it have to do with real moral behavior?" There are interesting questions here.
BRIAN: It's often that trade-off between experimental control and ecological validity. You get a paradigm, and you get more and more constrained to the paradigm, and you lose sight of the phenomenon that you're investigating. As for doing the trolley problem for real, the only barrier, I think, is the IRB, because that would be great. Let's get people pulling levers and see who they save.
SPENCER: Someone I know explained to me that she got IRB approval to do an experiment where she would give people electric shocks, and another person in the experiment could pay money to reduce that person's electric shocks. So they could pay some of the money that they were earning in the experiment. They did two versions: one version where they just asked people what they would do theoretically, "How much of your earnings would you give up to reduce the electric shocks of this other person?" And then they did it for real. Obviously, it wasn't really for real; it was with an actor pretending to get shocked. And people gave up way less money than they said they were going to, which was pretty disturbing. So it's kind of like a modern Milgram-type study.
BRIAN: It's a nice twist on that, and I love it. Another area that I think is really interesting is the forecasting work, these predictions of what I say I will do in a simulation versus what I do in reality. I think there have been some fascinating advances in measuring that and evaluating it in different contexts. And of course, that's a chief concern when you start to go across social science boundaries, because a lot of the critique, for example, from economists to psychologists is, "The stakes aren't real in the situations you investigate, whereas we study real stakes, money." And so these forecasting areas of research, I think, provide a lot of grist for trying to figure out under what conditions our forecasts are accurate predictors of our own behavior.
SPENCER: So we talked about how scientists view science, but there's also an interesting question of how lay people should view science. People have heard about the replication crisis, they've seen major results now being disputed, and (I don't know if this has been studied, but just anecdotally) it does seem like there's been less trust from the general public in at least certain aspects of science. How do you think people should engage with it when they read about some new finding coming out in an academic journal?
BRIAN: There is declining trust in science, but the evidence that I have seen suggests that it's declining trust in institutions in general, not just science, and there are lots of obvious, interesting complications and explanations for why that may be occurring. But how should the public engage with science? What I have hoped for, and I think is actually improving, although this is just observation rather than any empirical data I've seen, is a better injection of uncertainty into science communication, in ways that the lay public or any recipient of that information can recognize and understand. I think that is the chief challenge for science communication. Uncertainty is built into the scientific process: this is provisional, this is what we know now, this is based on this evidence, this needs to be examined to see how far it generalizes and whether it holds under other conditions. That's baked into the scientific process. It's not always baked into the scientific communication process, whether that's through journalists and broader media, or even how science is translated for public consumption. A lot of times the default thinking is that the way to make scientific findings accessible to the public is to remove the uncertainty and just tell them the answer. I think that's exactly the wrong thing to do. The uncertainty is a key part of the story and should always be part of the story when we're communicating scientific findings. It's not that the lay public can't understand uncertainty; it's that we have to provide the right information so it can be appreciated in the right way, so people can make decisions about, "What does the current state of evidence mean? How should I think about it? How should I use it?"
SPENCER: There's this odd phenomenon where, if a result is not surprising at all, if it's just exactly what you'd expect, fully in line with your intuition, you might not even hear about it. It's probably not going to get much media attention. On the other hand, if a result is really surprising, and especially if we're talking about psychology, where we actually have quite good intuitions (we deal with humans all day long, every day; we make all these implicit predictions), the more surprising it is, the more likely it is, a priori, to be wrong. It's at least this funny thing where the results that most get signal-boosted, or are most likely to go viral on social media, might actually be less likely to be true.
BRIAN: I think that is true practically on first principles. Newsworthiness is itself a violation of intuitions. To the extent that intuitions have any validity, that means those findings are less likely to actually be true or to hold in the way that they've been presented. That's how it's going to be. That's what newsworthiness means. That's what grabbing attention means: a violation of expectations. The counter to that is making the uncertainty about a finding just as important as the finding itself. Yes, many of the things that get shared in news media because of newsworthiness will turn out to be less robust than initially thought. But there's also a broader way to think about science media and science communication, and that is not just the newsworthy, initial, individual findings, but also the long-form pieces, the reviews of what has been learned over the last 30 years in an area of research. What do we now understand about this work that we didn't understand before? Those may not be exciting in the newsworthy, we-need-to-get-this-out-today way, but they can be deeply informative and important for readers, to bring them up to speed on what is understood about an area of work, even if many of the pieces are things that we would have said, "Oh yeah, I would have predicted that." This is a different source of evidence beyond my intuitions or my personal experience, and that is an important thing to communicate and add to the public discussion.
SPENCER: I don't think that many non-scientists read systematic reviews, but I find them incredibly informative, where scientists will take a whole area of research on something and say, "Well, what do we really know? What is their consensus on? What is uncertain?" It can be just a wonderful way to learn an incredibly dense set of information.
BRIAN: That's right. Systematic reviews aren't written for general public consumption, although it's great when some are, but it's easy to imagine long-form journalism also taking on systematic reviews and saying, "Here, this is why researchers have been studying this for 30 years, and here's what we've learned over those 30 years." That's a very productive and underused approach to scientific communication.
SPENCER: We've been focusing mainly today on psychology and social science more broadly. But how big an issue are these things in other fields of science? I know that you did one big replication study in biology, for example, and you found that many studies there didn't replicate.
BRIAN: The easy answer is, we don't know how broad these issues are. The more elaborated answer is that some of the factors we perceive to be influential in creating conditions of lower-than-desired or lower-than-recognized replicability are present in virtually all scholarly fields. For example, the pressure to publish exists in basically any area of scholarship. If not everything gets published, and certain things are more likely to be selected for publication than others, then whatever bias is in that selection process could have implications for the credibility of that literature. These general factors make it a reasonable hypothesis that there are issues to address across different scholarly fields, but how those issues manifest in a field can be very methodology or topic dependent. We can't rely on the evidence from one field to say this undoubtedly applies in this way across every other field. For the same reason that we have concerns about original studies of whatever phenomenon they investigate, meta-scientific studies have no special status exempting them from evaluating their own replicability or generalizability. The work right now that is useful and urgent is conducting similar types of evaluations of replicability, reproducibility, and generalizability across different disciplinary domains, to see where there's consistency and where the challenges are greater or smaller than in other domains. One of my pet hypotheses for where we will see stronger and weaker replicability is the ease of data acquisition. In my observational experience, in those areas of research where acquiring data to test questions is easy, progress is made reliably and regularly, and replicability is likely to be high. The reason is that when it doesn't cost anything to do a study again, doing it again happens all the time. It's built into the process of doing research in that domain.
SPENCER: Any examples where that applies?
BRIAN: An example from my own field is in vision research and associated domains. Since the visual system largely works similarly across people, it's easy to present many stimuli to any one person testing a question about issues related to vision. Once a phenomenon is observed, you can say, "Oh, we can do that again. We can do that again. We can do that again. You can see if it holds here and make these little tweaks." The process of accumulating knowledge can happen relatively quickly and easily. Dan Simons, my colleague at the University of Illinois, Urbana-Champaign, would say, "Yeah, we would go to vision meetings, and someone would show a new effect in their talk. Everybody would say, 'Oh, that's interesting.' They'd go back to their hotel room, program it up, and do it on themselves to see if they could reproduce it." I love that as an idea of, "Oh, man, okay, that's an area of work where reliability comes naturally in the system." The challenge is that it's not easy to acquire data rapidly and at the scale needed for many types of questions and in many types of domains. I think those are the areas that are going to be more challenged. Again, this is just a hypothesis, but that's where I would look next if I were able to do such a survey.
SPENCER: It seems to me that in different domains, there are kind of rhyming problems, but sometimes the problems differ a little bit based on the type of science. For example, talking to biologists, what I've learned is that there can be really big issues around exactly how to do an experiment, like, "Did you record the temperature in the room? Did you record exactly how the stirring took place?" They'll pick up a result from another lab, and they'll have a hard time reproducing it. But they can't tell if it's because it's a false positive or because the original authors didn't think to include in the methods all the details needed, either because there are so many details and nobody has the time, or because there might even be aspects of what they're doing that they don't recognize as being critical to getting the reaction, but that in fact are required.
BRIAN: I would say yes, and make two additional points. One is that the same arguments have been made by psychologists for the exact same reasons: there are so many conditions here, all of these conditions are underspecified, and you just need to know all the conditions in the lab; look at how the original lab does it, and then you'd be able to find it, even for those things that are unrecognized. That can be true, and it may well be true in many circumstances. The other part is that in the project we did, a reproducibility project in cancer biology, there were definitely occasions where it was novel techniques. I'm not a cancer biologist, so this is me reporting from those more expert than I. There were cases where it was clear that this was technique-driven. There are things happening here that are complicated, and the finding may well be replicable, but you can't even do a good-faith replication. For many of the projects there, we never even got to running the experiment or getting any results, because some of the preliminary steps of just showing that the technique could be performed as it was reported would fail. We also observed failures, as we did in psychology, on very well-worn techniques, techniques so routine that one wouldn't easily attribute the failure to expertise or subtlety effects. I am still undecided whether there is a substantive disciplinary difference on that or whether it is a natural part of a field starting to wrestle with replicability issues. The first thing one says is, "Well, it must be because the replication did something wrong," and that could be true for some subset. The next reaction would be, of course, the overreaction of, "Oh, it must be that the original was wrong." That could be true; it might be true for some subset. The third reaction should be, "Oh, maybe. But let's look. Here's a new question to test."
[promo]
SPENCER: One thing I feel is a real bright spot is it seems that at least in psychology, more and more researchers are making their exact materials available. For our project, transparent replications, where we replicate new papers right after they come out in top psychology journals, we find that more researchers are actually making the materials available. I think that wasn't true so much 15 years ago.
BRIAN: Those norms have definitely changed, in psychology and in many fields. Now there is more open data, open code, open materials, and open protocols. These are becoming more popular across many different fields, and I think that's a great step. A key part of the open science movement, including pre-registration and other activities, is that it creates the occasion to evaluate the research more effectively. There's a difference between whether you can assess it and whether it is correct. Research that is more transparent is more evaluable. We can actually look into it. We can say, "Oh, I see now the process that you went through, and I can critique it more effectively," which will facilitate the self-correction process. The potential irony of open science, initially, might be that more openness is productive in identifying more problematic parts of the research process. That might make us feel like, "Oh, this is actually making things worse," but making things evaluable is the first step toward providing, at broader scale, more rigor, more reproducibility, and, most importantly, better self-correction. To me, one of the key parts of science being productive is to embrace the fact that it is a social system. This isn't ultimately about individual experiments and whether this one finds the same as that one. It's about how we facilitate the social system of science so that others can evaluate what you did, offer alternatives, test extensions, and build on it in real ways, so that the social process of science, with many different actors evaluating evidence and challenging it, can work as effectively as possible.
SPENCER: Besides open materials and more data being released, what other bright spots do you see where you feel there has really been progress?
BRIAN: I am very bullish still on registered reports. This is the publishing model where peer review at a journal happens prior to knowing the outcomes of the research. The reviewers evaluate the research question, the research design, and the methodology that will be used. They might look at the preliminary experiments or exploratory studies that provide a basis to say there is something reasonable here to approach. The journal makes a commitment prior to knowing the outcomes that it will publish it regardless of results if you follow through with what you said you were going to do. What makes me like this model a lot is that it is directly shifting the reward system in a way that I think is more values-aligned with how we think about scholarship. When you don't know the results and you're making a decision that is the key decision for the researchers' reward — whether they get a publication — when that decision is based not on the outcomes they produce, but instead on the quality of the questions they ask and the assessed quality of the methods they use, then you align their reward with what we hope researchers prioritize, which is asking important questions and designing rigorous methods to test those questions, rather than chasing the next oddball finding that would be exciting enough to obtain a newsworthy result.
SPENCER: My own experience in thinking about science, I tend to think that the way science progresses usually happens in a rather messy way, where people are kind of poking at an empirical phenomenon, running lots of studies to understand it in different ways, and for quite a while, they don't really understand the phenomenon. Eventually, with enough studies, they start to develop a sense of how the thing operates. Eventually, this turns into a theory, and then you can do a confirmation of the theory. You can get a nice, "Okay, here's your confirmatory study." However, it seems to me that the visible part of science often is just that last bit; the theory seems to come out of nowhere, and then someone runs a confirmatory study. But there's this huge invisible part, which, to me, is actually kind of the bulk of science, which is that messy, "Let's figure out how this phenomenon works. Let's do lots of different research to figure it out." Curious to hear your reaction to that.
BRIAN: I certainly have the same experience as your perception there. I mean, most of the time in the lab, we spend fumbling around, not really having any understanding of what we're investigating. Lots of false leads, false starts, things like, "Oh yeah, this will work," and "Oh no, okay, no, we don't have that. Okay, now what?" And that's also part of the fun of this. We're pushing at the boundaries of knowledge. We're going into areas we don't understand. That's why we're venturing out there. We're going to be blind for a while. We're going to fumble around. We're going to start to make some sense of things. We're going to have additional false leads, and then it'll get cleaned up over many investigations, often across many different people, rather than just a subset. So that is, to me, ordinary, good, and in fact, necessary. If we were always confirming mature theories, then I think science would get really boring, really quickly. The discovery part is really where the action is in order to identify new possibilities, and then the follow-on work refines it. I find great joy in that as well, but it would feel like a loss to not be in that discovery mode. I think the main challenge for research in general is that the reward system is not aligned with that. The reward system is really rooted, in the scholarly research context, in papers as the currency of advancement, and it's rare that that exploration and discovery process happens at the cadence or pace of the production of papers. But papers are rewarded, in the standard form of evaluation, for checking all the boxes of having completed all of that discovery. So I would think that our progress would be better if we could surface what you said is largely invisible, and reward it. Because even as we were talking before about how it's hard to communicate uncertainty in the public setting, it's also hard in the scientific setting, because when reviewers see the uncertainty, they say, "Oh, you haven't figured that part out," and the easy response is, "Well, then I reject your paper. You need to figure that out before we should publish it." I would much rather have a system where it's evaluated based on whether this was done well, and if it's done well, then we'll publish it regardless of how much has been figured out, regardless of what kind of mess is still there, because that would better align, I think, with how science is actually happening on a day-to-day basis, and not skew the rewards toward a subset of what happens in the lab, as it were.
SPENCER: Do you think there's too much bias towards wanting to publish things that are intellectual achievements, rather than real contributions that may not be intellectual achievements per se? Just to give an example, I tend to find it helpful and interesting if a paper just says, "Look, we found this really unusual phenomenon. Here's what we did to show that it's real. We don't yet know why it's happening." You might say that's not really an intellectual contribution. It's just an observation. Yet, I think that kind of paper could be really important. So, yeah, I'm curious if you think that there's too much emphasis on the intellectual component.
BRIAN: Well, so if I understand your meaning, by "intellectual achievement" you mean, "I figured it out"? Is that what you mean?
SPENCER: "I figured it out," exactly.
BRIAN: All right. I think you're right. I think that is the aspiration. And of course, it's fine that we want people to figure it out, but we should embrace the reality that oftentimes we end with more questions than answers. I'll give you an example of my own, with Jordan Axt, in my area of research, which is implicit bias. We set up a new paradigm for investigating implicit bias in social judgment. The first experiment was a classic investigation of potential race bias in selecting people for an honors committee or something like that. It was a simulated environment, but we found a clear and unambiguous pro-black bias, meaning that people were favoring black candidates over white candidates. In this area of research, most observations are that white candidates tend to be preferred over black candidates when you create subtle expectations or differences. We said, "Well, okay, this is weird. This is against what is mostly observed here. Maybe there's something about our paradigm that produced this reversal." So we did it again. We did it again. We did it again, varying things each time we ran the experiment. We thought maybe people are intentionally being affirmative actors because it's a performance context; maybe they're trying to give additional credit, and they know that they're doing that for black candidates. We gave rewards for them to be accurate; we told them about the bias that might be coming. We could not get rid of this effect in this particular paradigm, favoring black candidates over white candidates. So that's what we published. We said this is interesting. We don't quite know how this is emerging. This is what we tried. We don't know what to do with this in the context of other findings in this domain, but we think this is important, because this is about seeking conditions under which you observe a phenomenon or not. Here is a condition where we see a very reliable pro-black effect that appears to be unintentional. If we're going to understand the theory around bias and how it gets applied in racial contexts, we're going to need to understand these conditions as well as those that show the reverse.
SPENCER: I'm glad to hear you were able to publish that, because to me, that's a real contribution. Saying, "Hey, we're getting this unexpected finding. It goes against other findings, even if we don't yet know why that's the case."
BRIAN: Yeah. In fact, I think we submitted it to Social Cognition; if memory serves, that was the first journal we submitted it to. I could be wrong, but it made it through the review process successfully.
SPENCER: Speaking of the Implicit Association Test, could you just briefly tell the audience what that is? And I'd be interested to hear how your thinking on it has evolved as more has become known about that test and its strengths and limitations.
BRIAN: The Implicit Association Test, or IAT, is a response time task that tries to measure the strength of associations of different categories. For example, imagine that you have a deck of cards, and instead of four suits, you have pictures of young people, pictures of old people, words that mean good, like happy and joyful, and words that mean bad, like terrible and nasty. Those cards are all shuffled up, and you have to sort them into two piles as fast as you can. In one case, you sort the cards, putting all of the good words and young faces together and all of the bad words and old faces together. Then we shuffle up the cards and ask you to sort them again, but this time they are switched. Now you put the young faces and bad words together in one pile and the old faces and good words together in the other pile. The key measure in the IAT is how much easier it is for you to sort in the first case versus the second case, putting young with good and old with bad versus young with bad and old with good. In that particular example, most people, no matter what age they are, find it easier to pair young faces and good words, putting them in the same group, than old faces and good words, regardless of their conscious beliefs and intentions.
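To make that "how much easier" measure concrete, here is a minimal sketch in Python of an IAT-style score computed from hypothetical response times for one participant. The function name and the numbers are illustrative assumptions, not real data, and the published IAT scoring algorithm adds latency trimming and error penalties that are omitted here.

```python
import statistics

def iat_effect(compatible_rts, incompatible_rts):
    """Simplified IAT effect for one participant: how much slower sorting is
    when the pairing is 'incompatible' (e.g., young + bad / old + good) than
    when it is 'compatible' (young + good / old + bad), scaled by the pooled
    standard deviation of all response times."""
    mean_diff = statistics.mean(incompatible_rts) - statistics.mean(compatible_rts)
    pooled_sd = statistics.stdev(compatible_rts + incompatible_rts)
    return mean_diff / pooled_sd

# Hypothetical response times in milliseconds.
compatible = [612, 580, 655, 601, 597, 630, 588, 615]
incompatible = [790, 742, 810, 765, 801, 758, 772, 820]
print(round(iat_effect(compatible, incompatible), 2))  # positive => the 'compatible' sorting was easier
```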
SPENCER: That's a great explanation. And then this has been applied across many different domains. You could do it across race or age or gender.
BRIAN: Yes, it's a very general task. You can basically take any four categories, ideally on two dimensions like good versus bad, self versus other, or tall versus short, and pair them against another dimension like black versus white, young versus old, male versus female, or even non-social categories such as pants versus skirts or cats versus dogs. It has very general applicability, and that's made it a popular research tool because it tries to assess how strongly we associate concepts together without requiring people to introspect to decide how they feel. The measure here is simply how fast you can categorize those concepts together versus the contrary concepts. That indirectness has made it a very interesting complement to just asking people what they think.
SPENCER: Because, in theory, it would be really great if we had a way of assessing bias that didn't require just asking someone if they're biased, where people might be very reluctant to give a truthful answer, or might even be self-deceiving. They might not even be aware that they hold these biases.
BRIAN: Yeah, exactly. It's an interesting history of how implicit measures emerged. They emerged partly, maybe mostly, in the context where researchers thought, "Oh, people have reasons to hide things." In the context of prejudice or stereotyping or other places where people might not be willing to say directly what they think, you have to look for other ways to understand what they think. And of course, that comes with a long-standing interest in behavioral sciences for lie detection. We can get the real truth if we can only avoid people's self-presentation concerns about what they say they believe and instead look directly into their minds. Very quickly, it was clear that it is not a lie detector. It doesn't fit into that framework at all. I could tell you that with confidence based on my own performance on it. Whenever we would develop a new measure, whether it's for racial bias, age bias, or gender bias, I would do it on myself first, and I would show every single bias we would investigate. "Oh, my God, okay, I got that one too." That was always just a really instructive experience of, "Oh, that's really interesting. I don't agree with this, and yet it's coming out of my hands, which appear to be attached to my brain, which means that somehow I am having a harder time associating these things with those things. I wonder why that is." I think what's made it a really interesting area of study is that the experiential phenomenon of it is interesting and sort of demands, "Wait a second. Why is this happening?"
SPENCER: A lot of people have been given these kinds of tests, especially in the context of trying to understand their own racial biases, maybe as part of racial bias training. I think Malcolm Gladwell might have talked about this in one of his books. Then there's the question, though, of whether it really is interpretable at the individual level. Some people have challenged this, saying, "Well, if you give people this test now and you give it to them again in six months, you wouldn't really expect their true implicit biases to change over six months in any given random time period." Yet, the test-retest reliability, as I understand it, is fairly low, maybe a point four correlation or point five, something like that. So, yeah. What's your thought on that as a tool for individual self-understanding?
BRIAN: So this goes back to what we were talking about earlier, which is the uncertainty. I think the popularization of it has been very productive in the sense of getting people engaged in what is a very basic research topic: how we investigate people's minds without requiring them to tell us. The excitement about it as a phenomenon has made it get translated in ways that go way beyond the evidence for its validity. I would argue it's been one of the most productive research tools, and there are tons of areas of controversy that are still active. It's also one of the most overused research tools, in terms of being over-interpreted with too much certainty. You mentioned reliability as one issue, and that is certainly correct. It is not at all a perfectly reliable measure. It's not even clear whether the concepts that it measures could be measured perfectly reliably. There's a constant interest in psychology in what is a trait versus what is a state. Mood, for example, we know changes over time, so it's no surprise that when you ask again an hour later, you might get a different answer than you got before. Personality feels like it's more of a trait, but maybe it's more conditional on circumstances than we thought. What implicit measures assess, whether it is a trait or a state, is itself an active area of controversy and investigation, and we don't even know what the highest reliability we should observe is.
SPENCER: As you're saying, if people's level of implicit bias actually fluctuated from hour to hour and day to day, then even the best measurement tool in the world would potentially show relatively low reliability if you test six months apart.
BRIAN: If it is, for example, like mood, then you wouldn't expect it to show much test-retest reliability. But what's really interesting for me as a research topic is that you could start with the concept of, "Oh, I think it's a state, or more of a state," and you could drive a research agenda with that assumption and try to figure out what that means for how these implicit measures might be relevant in everyday judgment, thinking, and decision-making. Or you could start with a mindset of, "Oh, I think it should be a trait, and therefore you would expect X, Y, and Z." Or you could start in a totally different space, which is, "I want to try to figure out whether it's a state or a trait." People pursuing research in this area have to decide to hold something constant in order to investigate whatever it is they want to investigate. They hold some things as true, and then they investigate the other things. But other people will say, "I don't agree with the things you held constant as even being the right way to think about that measure, so I want to investigate that." And you have this interesting dialogue between fundamental assumptions, for example, about what it measures, and other people investigating what the measure predicts. Because they're approaching it from different angles, it is an interesting challenge to try to figure out what it all means. Let me make this a little bit more concrete. It is still an active area of debate to what extent people have actual awareness of what the IAT measures. There have been some positions, and I have been among them, holding that what gets assessed by these measures is largely stuff that happens outside of awareness. Others have advanced evidence that people have awareness of what they are going to show on the IAT. You describe the IAT to them, you show them all the categories and how it works, maybe even give them practice with a different version, and then ask them to predict what they're going to show. People can predict with relatively good accuracy, in some conditions, what they're going to show, so that challenges the simple notion that it's fundamentally outside of awareness.
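To illustrate the state-versus-trait point above, here is a toy simulation, under assumed numbers, showing that when part of what a measure picks up fluctuates from occasion to occasion, the test-retest correlation stays well below 1 even with an error-free instrument. The weights and sample size are illustrative assumptions, not estimates from IAT data.

```python
import random
import statistics

def pearson(xs, ys):
    # Pearson correlation between two equal-length lists.
    mx, my = statistics.mean(xs), statistics.mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / len(xs)
    return cov / (statistics.pstdev(xs) * statistics.pstdev(ys))

random.seed(1)
n = 10_000
trait_weight = 0.5  # assumed share of each score reflecting a stable trait
trait = [random.gauss(0, 1) for _ in range(n)]

def measure():
    # Each testing occasion mixes the stable component with a fresh, state-like component.
    return [trait_weight * t + (1 - trait_weight) * random.gauss(0, 1) for t in trait]

time1, time2 = measure(), measure()  # e.g., today and six months later
print(round(pearson(time1, time2), 2))  # roughly 0.5, despite no measurement error in the instrument
```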
SPENCER: There is a correlation between implicit bias and explicit bias, right?
BRIAN: Yeah, so I've done a lot of research on that, where sometimes your attitudes are aligned with what you show on these measures, and sometimes the attitudes you report are not aligned, and that varies a lot by the topic you study, for example. If you measure people's implicit attitudes towards Democrats and Republicans in the US, it corresponds around point seven with what people say their preferences are between Democrats and Republicans, but much less so for young versus old. There's a really rich literature about trying to understand under what conditions do you see strong versus weak relationships. But just back quickly to the unawareness question, the fact that people can predict their responses under some conditions might be seen as saying, "Oh, so they're aware of them." But if you think about it for a moment, you realize the awareness question here is a little more complicated. If I had been asked to make that same prediction of my brother, whom I know very well, I might have been able to predict his scores, too. The awareness in that context, if that's true, isn't really about special introspective access. It's that I have more knowledge about myself, so I'm applying that knowledge just like I have more knowledge about my brother than if you told me some random person to predict what they will show on the IAT. What I love about this as a research area and as a paradigm is that there's always the next question. It's like, "Oh, okay, wait, is it unaware or is it aware? Okay, well, what do we mean by aware? Okay, well, then what? How do we unpack that?" Simultaneously, there are people trying to use it to say, "Okay, how is it that this predicts behavior or otherwise in the world?"
SPENCER: There's also a fundamental question about what it is really measuring. As I understand it, there are different perspectives. Some people think, "Okay, this is the race IAT that is really measuring essentially implicit racial bias, which might lead to people mistreating others or not hiring them for roles that they're qualified for, et cetera." Whereas others say, "No, this is just tapping some kind of general association that you have. You can have an association without actually holding a bias against people. You're just aware that your society has that association, etc." So what's your thought on the interpretation?
BRIAN: Yeah, there are a million possible interpretations. Some of them are interesting challenges because they confront what we mean by, for example, a true bias versus an association. Do we mean that the true bias is something that people secretly hold, which is back to that lie-detector explanation, whereas an association is just stuff that exists in the mind through experience? Or do we mean that if it's a bias, it can have some impact on judgment or behavior, and if it's just an association, it's inert with respect to whether it actually has an impact on behavior? So part of the challenge in trying to understand what it is measuring is really getting specific in the conceptualization of what we mean when we say bias. How inclusive are we? How are we differentiating that from association, for example? The other part is that there are many ways to investigate this question, and none of them are very definitive. I would say that the field has made progress but has not resolved these debates about what it is measuring. What we have now is lots of different evidence that constrains potential theories without resolving which theory is correct. One idea is that it measures differences in association strengths. Another is that it measures salience differences: things that tend to attract more attention will, when you put them on the same side, tend to reveal associations because of their common salience rather than something conceptual about them. Another is methodological: the way that the IAT itself is structured may account for some, much, or all of the observed effects. What we have are pieces of evidence that make some types of interpretations more plausible than others. You mentioned before the relationship between implicit measure responses and explicit responses, what people say. The fact that the IAT correlates very strongly for Democrats versus Republicans, or pants versus skirts, or pro-choice versus pro-life, and very weakly for young versus old, and in between for things like black people versus white people, demands that whatever the explanation of what it's truly measuring is, it has to account for the variation in when those responses correspond well with what we say and when they don't. Is it resolved? No, it's not resolved. But we do have a lot more anchors from the last 20 years of research to help figure that out.
SPENCER: At this point, given all the uncertainties in its interpretation, what do you think are really good use cases for it, and what are things that people shouldn't do with it because it's just not a valid use case, even if people want to use it that way?
BRIAN: Well, it is still a productive research tool. That is a good use case. It provides a way to measure the different orientations people might have toward things, a way to measure learning that's distinct from just asking people what they learned or what they've been exposed to, and a way to try to understand people's conscious experience, what they say they believe, versus what they show when you just look at these kinds of response-competition assessments. That's still a very productive area of research. I see lots of research applications that are useful. Where it runs aground, and has never had the validity to be used, is in diagnostic contexts, making decisions about people based on their scores. The reliability you mentioned, the level of validity that it shows, and the amount of knowledge that we have about it are all reasons to oppose, not just be reluctant about, but oppose, for example, the enthusiasm of people who say we can use this to do good things in the world. Wanting to use it for selecting jurors would be, in my view, bananas, because all of the limitations of the measure are not really aligned with what we're doing in a jury selection context, or in employment decision-making contexts. I can't think of a use case in a selection context where I'd say, "Oh, yeah, okay, it's reasonable to use an IAT to make a decision about a person based on that."
[promo]
SPENCER: I put a question out to my audience. I told them that you were coming on the show and asked what they wanted to hear about. Quite a number of people asked about something that I imagine is kind of a sensitive topic, which is the recent retraction of the paper claiming that high replicability of newly discovered social-behavioral findings is achievable. I would just be curious to hear, to the extent you're comfortable sharing, what's your retrospective on what happened there and what we can learn from it?
BRIAN: The short summary is that we published a paper, this is me and I think 17 co-authors, that was the outcome of a many-year project that started in 2013. We published the paper in 2023; data collection finished in 2019, I think. It was a round-robin replication project. Four different labs were doing their discovery research, all that stuff that happens behind the scenes, and when they thought they had made a discovery, instead of just saying, "Okay, time to write it all up," it got entered into this round-robin replication project. The idea was: I think I have found something in the lab. Now, to be in this project, I have to run a confirmatory study, and then each of the other labs, and then my own lab, will conduct a replication of that, and then we will amass evidence of the extent to which we can replicate these initial discoveries. The paper was challenged after it was published in 2023. Some attentive readers were surprised to read a statement in it saying that every analysis reported there was pre-registered, both in the individual experiments (there ended up being 80 experiments, because there was a confirmatory study and then four replication studies for each of 16 different new findings) and in the meta project, the analyses aggregating the results across those 80 experiments. That was a definitive statement, and it is false. It was in our paper, claiming that we had pre-registered all of these things, the very reforms that we've been talking about and promoting. They raised that in a commentary. In addition to substantive critiques ("we don't think you can conclude that you observed high replicability, for X, Y, and Z reasons; we think the design wasn't appropriate for testing that question, for these reasons"), there was lots of substantive critique of the ordinary variety, but it opened with this observation of a fundamental failure: the pre-registration statement doesn't seem to be aligned with what's true. And they were absolutely correct about that. There was follow-up saying that this has deeper problems, that not just that statement was wrong, but it may undermine some of the interpretations of the overall research project. That led to a review of the paper, and the journal ultimately concluded that it needed to be retracted, rewritten, and submitted again to address the concerns that were raised by the critics. It is, in many ways, a sort of fundamental experience, because it's on topics that I care deeply about. A real purpose of the project as a whole was to adopt as many rigor-enhancing behaviors as possible to try to make these findings replicable. The paper makes this overclaim about rigor-enhancing processes that weren't true for the meta part of the project. It's perfect irony. It's one of those elements that's funny and sad: that I, a person who champions this every day, failed to do it in a fundamental way in this paper. That retraction has occurred, and there have been a lot of different reactions to it. One of the challenges is that there are real fundamental problems that I agree need to be addressed in the paper. There's really interesting substantive critique that is, in my perception, of the normal variety. And then there is a set of things that were incorrect inferences about the project's history, which have complicated the assessment of why this paper should be retracted in particular, and of what the problems with the paper are in general.
For me, the interesting learning from this is wanting to really lean in and own that fundamental error, which I think is a very serious one: a false claim about something that we say is an important way to help evaluate evidence. If we say something is pre-registered, a reader should be able to trust that that is, in fact, true. For some of the analyses, it was not true, and readers might have assessed that evidence differently if we had not inserted that false claim. The challenge, while embracing all of that, has been how to manage discussion of the parts that we didn't agree with as a basis for retraction, the parts about the origins and purposes of the project. That's been happening over the last few weeks in terms of a public discussion, but it has been very useful, I think, in the broader sense of amplifying that no one is perfect. Error happens everywhere. Even big errors happen everywhere. A real value of the system of science is that it is open to scrutiny. Starting with implicit bias research in 1998, my colleagues and I created Project Implicit, the website where people can go measure their implicit biases. Since then, I have had the fortune, and it truly is a fortune, to have dedicated critics who are scrutinizing, questioning, and skeptical of the work that my co-authors and I do and report, and that has made my work so much better, in the sense of them finding errors, including the ones reported just in the past two weeks. Since the origins of my work, it has helped me and my lab get better at anticipating what the critics will say: what are they going to point to, where are the weaknesses here? Having that is so helpful for improving the research process and getting more confidence. Even so, I started that in 1998, I got to grad school in 1996, so it's been almost 30 years that I've been doing this, and I still make serious errors in every piece of research that I'm involved in, and I will make serious ones in every future piece of research that I'm involved in. Recognizing that, what it demands from my perspective is to share and be as open as possible, because knowing that there are errors in there, I hope that I catch as many of them as possible. I hope I catch the fundamental ones, but really, we rely on each other to root that out so that we can constantly reduce that uncertainty and make actual progress.
SPENCER: I appreciate your openness in talking about this. I'm sure it's difficult to talk about, and obviously, when people are criticizing you, that can bring up a lot of difficult stuff. I think one concern people have is: was it knowingly said that this was pre-registered when it wasn't, or is this the kind of thing where, at some point over a really long project, someone inserts that claim, and the people who know it wasn't pre-registered don't notice, and so on?
BRIAN: Right. That's exactly the point where there is a difference. The errors occurred; that much is clear. But one of the things that has been offered is: we think what this means is that they planned the research around one question, then changed the research objectives after the results were observed, and the pre-registration claim itself might have been inserted intentionally. That's the part where I disagree. No, that's not the case. We didn't change the purposes of the research. And I have to hold that disagreement while simultaneously trying to make it very clear that what we did is problematic and deserves the scrutiny that it received. That's been sort of an ongoing discussion on social media, of people trying to make sense of that.
SPENCER: Because part of the narrative is that the research project originated as a study of decline effects, where effects shrink as you repeatedly test the same phenomenon, and then somehow that morphed, over many years, into a completely different purpose. So you're saying that's not true, that the purpose of the research didn't change along the way.
BRIAN: So the original purpose was to evaluate the decline effect, which is how we framed it in 2013. What we were investigating is: do effects go away? Is initial evidence more positive than the subsequent evidence when we try to replicate it? So the decline effect, as we represented it, is the complement of replicability. But the skeptical interpretation was that there was a change of goals: the hypothesis was that the effects would go away, and then you switched to a different thing, investigating replicability. Whereas, from my perspective, in doing the research, high replicability is simply what you expect to observe if there isn't decline. But there's a lot more unpacking to do there, because there are different measures. One of the challenges that raised reasonable concerns is that a portion of the meta project analyses were pre-registered, and those were the inferential tests of a model of declining effect sizes. The descriptive analyses that we had evaluating replicability as complements to that, what percent of the replications showed statistical significance in the same direction as the original, what was the effect size of the confirmatory study compared to the replication study, those were not in the pre-registration. And because those were reported, and reported prominently, it provided a reason for people to say, "Well, that's weird. The things that were pre-registered are not as prominent; the things that were not pre-registered are more prominent; ipso facto, it must be that they changed the purposes." I can totally understand why people would generate that hypothesis.
SPENCER: Oh, that's interesting, because some people are framing it almost as though there was a sort of mystical original hypothesis that, if you just repeat an effect over and over again, it goes down. But the way I'm interpreting this, you're just talking about a very ordinary phenomenon: very often, when an original research team gets an effect, if you then have another team try to get it, they get a smaller effect size. And that's presumably due to things like regression to the mean, or the original research being underpowered, so that it couldn't have detected the effect unless it got an unusually high effect size, and so on. So you're saying essentially that what you ended up studying is actually very close to that original hypothesis.
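To make those conventional explanations concrete, here is a minimal simulation sketch; it is purely illustrative and not from the project, and the effect size and sample size are assumptions. It shows that if original studies are only written up when their estimate clears a significance threshold, the published effect sizes are inflated, and unbiased replications regress back toward the true effect.

```python
# Illustrative sketch, not from the project: selecting original studies on
# statistical significance inflates their effect sizes, so exact replications
# tend to "decline" toward the true effect even with no questionable practices.
import numpy as np

rng = np.random.default_rng(0)
true_effect = 0.2          # assumed true standardized mean difference
n = 50                     # assumed per-group sample size
se = np.sqrt(2 / n)        # approximate standard error of Cohen's d
crit = 1.96 * se           # estimate needed for two-sided p < .05

# Simulate many original studies; "publish" only the significant ones.
originals = rng.normal(true_effect, se, size=100_000)
published = originals[originals > crit]

# Unbiased replications of the published studies at the same sample size.
replications = rng.normal(true_effect, se, size=published.size)

print(f"true effect:               {true_effect:.2f}")
print(f"mean published estimate:   {published.mean():.2f}")    # inflated
print(f"mean replication estimate: {replications.mean():.2f}") # near the truth
```

The numbers are arbitrary; the point is only that the apparent decline comes from selection plus regression to the mean, not from anything happening to the phenomenon itself.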
BRIAN: It's both, and that adds the complication here. So Jonathan Schooler, who is at the University of California, Santa Barbara, organized a meeting in 2012 about the decline effect. It was people like me, Leif Nelson, Jon Krosnick, and others from ecology, biomedicine, and parapsychology, all talking about their experiences with failures to replicate and with observing decline, initial positive evidence going away. The discussions at that meeting were really about what the potential causes of this are. Leif and myself, Krosnick, folks like us, would talk about regression to the mean, selective reporting, p-hacking, and small effects or small sample sizes, you can guess who talks about what, and those are what we would characterize as the conventional explanations. But there were also people from the parapsychology world offering explanations that are about the cosmos and the effect of observing a phenomenon on making that phenomenon go away. Those are extra-scientific; they're things that are outside of what we understand of physics. Schooler is one of my favorite people, and one of the features of him that I really appreciate, even though it's quite different from me, is that he is open-minded, more open-minded than anybody I've ever met. He thought these were hypotheses worth entertaining, that it may be that merely observing the phenomenon makes it go away, and while the rest of us did not think that was plausible, the design of the project incorporated features that would enable testing the questions that he thought were worth testing. That was the inferential test that the pre-registered part of the meta project did anticipate: setting up those analyses to test whether observing the effects makes them decline, or whether just having the effects repeated makes them decline, some kind of entropy in the universe such that the effects aren't real anymore. I don't think those explanations are sensible, so I can't actually represent them effectively.
SPENCER: But you felt you could study both the conventional explanations, which you think are likely, and these other explanations simultaneously in the same design.
BRIAN: Yeah. So the design that we arrived at was: let's do all of the conventional solutions that we think would remove the ordinary reasons findings fail to replicate, and then, within that, build in the ability to test what happens once we've removed every conventional explanation. If decline is still observed, then wow, Jonathan has something big to talk about: why is this still happening when you pre-register and you're doing all these things? On the flip side, if replicability is observed when you put all of these things together, then we at least have a proof of concept: we did discovery science, we did all of these things, and high replicability was observed. It's not randomized; everybody got the same intervention. But we offered the hypothesis, the one that came from my orientation, that doing these things will lead to high replicability. It was very useful, especially in the context of 2013. These were all very new ideas, at least in our field, doing these behaviors to try to address replicability. It's at the same time that we were doing projects like the reproducibility project in psychology that we talked about earlier. For me, this was sort of perfect. It's like, "Okay, we have these new ideas. We kind of need something to give us some basis for confidence in whether they are associated with better replicability." This is the pair to the reproducibility project, which is just looking at how things are happening out there in the published literature. It was very appealing to buy into that and participate in the project. I didn't at all mind Jonathan's hypothesis that he wanted to test, as long as the project would involve a way to do this type of demonstration case for achieving high replicability. Actually, I would want to know if we hadn't observed that: "Wait, hold on. Even in this ideal context, we're not seeing high replicability. We need to really look at and scrutinize these solutions," which we should be doing regardless, but that would really have slowed me down.
SPENCER: So, pre-registration issues aside, do you think the paper actually demonstrates that if you use these good practices, you can get a high rate of replicability?
BRIAN: I am more confident than others in terms of what the evidence suggests, and I think there are very reasonable ways to contest that conclusion on a variety of dimensions. One is: what can you conclude from the design? There are very reasonable critiques of that. A second is: what is the role of the work done prior to the research that was included? As I described, every lab was doing its discovery science, and once they thought they found something, that's when it entered this evidence base. The idea of the project was to eliminate every reason for decline that we could think of, and so the evidence base is pure in the sense that a person comes to it saying, "I believe I have found evidence for a phenomenon, but I don't get to use any of that evidence to support it. Instead, I have to confront it with a confirmatory test and the replications that follow." Very reasonably, what happened in that earlier stage is a driving question, and some of the teams are working on that. The last thing, which is a very strong area of contestability, is what the right way to evaluate replication success is. This project in particular had some challenges that we didn't anticipate at the outset. At the outset, we just sort of said we're going to use the reproducibility project's criteria, basically; we don't need to think about that because those are already laid out in what that project is doing. But we should have thought harder at the outset, because the context for this study was different from the reproducibility project. It turns out that there are many choices to make, surprise, surprise, many choices in how you decide to assess replicability, and we had to wrestle with them at the end of the project because we hadn't anticipated them at the outset. The claim that this had all been pre-registered just added fuel to the criticism, in reasonable ways.
SPENCER: Is this where the idea came from that you'd looked at the data before coming to some of the conclusions? Because you didn't have a clear enough idea of what your criteria were going to be until you had collected the data.
BRIAN: Right. So the overclaim was that we weren't planning to look at those criteria at all, and that's not correct. Right from the outset, we had known we would be reporting criteria like the rate of statistical significance and comparing effect sizes of confirmatory studies and replication studies. What we did not do was realize that there are many choices to make about how to actually calculate those. The team wrestled with that during the writing in ways that are ordinary for non-pre-registered, exploratory research. The reproducibility projects, for example, also had no pre-registration and never claimed to have one. In this project, all the individual experiments were pre-registered, but the meta project, the aggregation, is better understood as exploratory in how we decided what replicated or not. That really should have been prominent in this paper, since it was modeled after the reproducibility project in aggregating those results. What I wish we had done, which we didn't do, is this: we debated which of the criteria to report in the paper, given the various ways you could conceive of replicability, and then selected a subset. It would have been better, given the critiques that have come up, which are very reasonable critiques, to just report every single one of them that we could think of and say, "Here's what they all show," and then talk about why we think some of them are more defensible than others if we wanted to make claims about that. But that's not what we did. That was a mistake.
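As a side note on how many defensible choices there are, here is a small hypothetical sketch, not the project's analysis code, of two of the criteria mentioned above; the estimates and standard errors are made up, and each criterion could itself be computed in several other ways.

```python
# Hypothetical sketch of two common replication-success summaries; the
# (estimate, standard error) pairs below are invented for illustration.
import numpy as np
from scipy import stats

# One (effect estimate, standard error) pair per finding.
confirmatory = [(0.45, 0.10), (0.30, 0.12), (0.25, 0.11)]
replication  = [(0.28, 0.09), (0.10, 0.12), (0.22, 0.10)]

def same_direction_significant(conf, rep, alpha=0.05):
    """Count a replication as a success if it is statistically significant
    in the same direction as the confirmatory result."""
    z = rep[0] / rep[1]
    p = 2 * stats.norm.sf(abs(z))
    return p < alpha and np.sign(rep[0]) == np.sign(conf[0])

success_rate = np.mean([same_direction_significant(c, r)
                        for c, r in zip(confirmatory, replication)])
effect_ratio = np.mean([r[0] / c[0] for c, r in zip(confirmatory, replication)])

print(f"share significant in the same direction: {success_rate:.0%}")
print(f"mean replication/confirmatory effect-size ratio: {effect_ratio:.2f}")
```

Whether you use a significance criterion or an effect-size criterion, whether you average ratios or pool estimates first, and what threshold you pick are exactly the kinds of decisions that, here, had to be settled after the data were in.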
SPENCER: My final question about this is just mechanistic: do you know how the false pre-registration claim got into the paper? Did someone put it in there based on a misunderstanding, and then it just never got removed? Or do you have an understanding of that?
BRIAN: I can't say specifically. We posted a statement about what we understand of the history, and we were able to document that the claim was not there in the original submission. The original just said that all of the individual experiments were pre-registered, and it was ambiguous about what was pre-registered in the meta project. Then in the next revision, it says that everything, including the meta project, was pre-registered, so at least I know when it got in there. What is shocking to me is that I didn't notice it. I can't believe it. Wrong claims get into drafts of all of my papers all the time; when you're writing, it's messy, you have lots of people, all kinds of things are happening. But the editing, revision, and verification processes are supposed to root all that out, ideally as much as possible. This one is so fundamental that the fact that I didn't catch it is the most embarrassing part. Reading the paper for the fifteenth time, I should have hit that sentence and said, "Wait a second. We didn't do that. Why are we saying that?" And I didn't.
SPENCER: I appreciate you adding that. Everyone makes mistakes, and it's just a question of how do we correct them? How do we find them? How do we make sure they don't get into the published research?
BRIAN: It is just a matter of that. But also, some mistakes are bigger than others, and that's part of why, I think, this has gotten the attention it has: it was a big error that shouldn't have happened.
SPENCER: Of course, people naturally wonder if it was done on purpose, right? And there's a huge difference between that and an innocent mistake where someone put it in there and didn't realize.
BRIAN: Right, of course. And I totally recognize why that would come up.
SPENCER: So the final thing I want to ask you about is: what are you excited about for the future of open science and improving science? Maybe you want to touch on the Lifecycle Journal project, which looks like an exciting project.
BRIAN: The Lifecycle Journal project is my favorite new idea, something we've been wanting to work on for a long time but have not had the resources or bandwidth to do. Most of our work at the Center for Open Science from 2013 to today has been on tools and services to support the research producer: how do we help the people doing the research make it more open, more transparent, and easier for others to engage with? What the Lifecycle Journal is trying to do, as a pilot, is connect the research producer across their research lifecycle with the research consumer, in this case other researchers evaluating that research. What we're going to do with this pilot project is essentially run peer review and other types of review across the entire life cycle of research projects as they happen. If you're an author, then, much like the registered report model, you get to submit right at the beginning of the project, and the work is evaluated. All of that is transparent, and it occurs during the entire project life cycle, including after you've written your report. If you find errors, as we found in ours, that correction process can be built right into revising the manuscript as needed. The additional features that I think are particularly interesting for this test are, one, that it's not just about the paper; it's about the entire research process and all of its contents, so data, materials, code, the pre-registration, the plans, whatever it is, all of that will be part of the scholarly content of the contribution. It won't make the paper the first thing; it will treat all of these as first-class scholarly outputs. The second part, which I think is the real opportunity for innovation, is that it's not just going to be peer review. It's going to be all kinds of different evaluation services, experimental ones and ones that have been maturing, all interrogating the research at the appropriate place in the research process. For example, that might be registered report review by peer reviewers during the planning stage. Once you've committed to your plan and you're going to do the research, maybe a prediction market opens up for people to make predictions about what's going to happen in your research and what the implications would be. Once you complete the research, maybe a service that evaluates the quality of your data sharing will give you a report on things you can do to improve it. Following that might be AI tools that assess whether your findings are likely to replicate, and likewise another group, like the Institute for Replication, might try to reproduce your findings by taking your data and your code and giving an independent report on the reproducibility of your findings. The idea is basically to create a marketplace that, individually and collectively, provides much more insight for authors on how to improve their work, for readers on how to assess the work, and ultimately for understanding the reliability, credibility, and trustworthiness of findings, going well beyond our current very simplistic criteria, which are whether it is published or not and how prestigious the journal is that it is published in.
SPENCER: The criteria to get good jobs in academia are generally based on your publications in top journals. How do you see this fitting into that kind of incentive structure? Let's suppose that the pilot project goes well and it gets scaled up.
BRIAN: The key part of it is that it needs to meet the current reward system as it is, and then all of the additional things are an addition. The way it meets the current reward system is that people can publish their work in this system across the life cycle. At the end, the author gets to decide whether to make their contribution the version of record in the Lifecycle Journal. The version of record is a standard mechanism in the world of scholarly communication for deciding that this is now a paper: it's finished, it's published. If an author decides to make a version of record from their contribution in the Lifecycle Journal, then it's published there. It's a Lifecycle Journal paper with all of these evaluations. If they decide not to, they can submit it to any other journal they want. The record is still available and public, and it's like a peer-reviewed preprint with lots of different types of reviews. This way, researchers don't have to take any risks on the standard system, but simultaneously we can start to work on gaining credibility for this alternate system so that people don't feel like they need to move their paper to a standard journal at the end of it. They might initially, and over time a success metric would be that they no longer see that as relevant: they're getting the evaluation and the reward from going through the process itself.
SPENCER: I see, so even if it doesn't get published in a standard journal, ultimately, as long as the committees that decide on tenure track jobs and tenure decisions think it's valid and a good sign of someone's research quality, then that would be enough.
BRIAN: Right. If an author assigns a version of record, it will look and act like a standard journal paper in all the ways that matter for putting it on a CV, and it will have all these extra things. It's those extra things that we think will ultimately, in the long term, shift how research gets evaluated, so researchers get rewarded more on the content of the evaluation than on where the work is published.
SPENCER: The extra things seem like really good ways to improve reliability and robustness, but they're also extra work. Does this suggest that if we move to a model like this, and it scales up, we have to lower our expectations about the volume of published work?
BRIAN: It's a great question, and I think it's a reasonable hypothesis, because undoubtedly there is more work involved with more evaluation services. But there are a couple of things that make this uncertain for me at present, until we get more evidence about how much effort is actually involved. One is that in the standard model, you have to submit to journals serially until you get the paper published somewhere. The same paper can be reviewed by four different journals, four different editors, four different sets of reviewers, which is, on its own, a high burden on the system. This model removes that by having review integrated: it's one submission, and as long as you fulfill basic criteria, it's in, and it's really about the evaluations. You don't have to go through that serial process. So I don't know how the trade-off in overall effort will net out, but there will be some change. The second factor is exactly what you're keying in on: if my reward is really moving toward the evaluations I receive, rather than just whether I was able to get through peer review at one journal, then the reward system, the incentives and values that come to researchers for their work, might actually shift from producing volume and move a little closer toward producing quality. I might be willing to spend more time to improve the evaluations of my work because I have the opportunity to do so, and I might spend more time doing that follow-up experiment to move my ratings from these different evaluation services up to the next level. That remains to be seen. We're doing this as a pilot in order to get some initial evidence and see what happens.
SPENCER: An observation we've had in our Transparent Replications project, where, as I mentioned, we're replicating new papers coming out in top journals in psychology, is that we expected to see a lot of failures to replicate, and that's not really what we've been seeing so far. It's still preliminary, but we've only had, I think, two failures to replicate so far, and in both cases we think the cause is not p-hacking; we think it's more subtle reasons, which really shocked us because we thought we would see a lot of p-hacking. For those who are listening, p-hacking is essentially when you use fishy methods, change your outcome measure, or drop outliers in such a way as to get a statistically significant finding and make it publishable when, in reality, it is just a false positive. So we expected to see that, but that's not so much what we're seeing. However, we are finding substantial problems in quite a number of papers, and so we've been trying to characterize what the problems really are. One type of problem didn't seem to have a name, so we actually came up with one: we call it importance hacking, as a kind of analog to p-hacking. This is where what the authors claim they found is not exactly what they truly demonstrated with their statistics. If you put the claims and what they actually showed side by side, they're different, and they're different enough that a reviewer might not have been willing to publish the paper if they really understood what was found. So whereas p-hacking is a way to get published by making results seem significant when, in fact, they're false positives, importance hacking is a way to get published by making a result seem interesting, novel, meaningful, and important when, in fact, if reviewers really understood what you did, they probably wouldn't have been willing to publish it. I'm wondering, do you see a way of evaluating this kind of thing in the Lifecycle Journal project, or does it seem out of scope?
BRIAN: That would be an awesome thing for a service that's entirely focused on that as an assessment: how well do your claims match your evidence? I think that would be so cool as a singular service with a very narrow objective, and it's perfect as something to involve in this. So far we have 20 different groups with a wide variety of ways they're thinking about evaluation that they want to experiment with, and it is open for other evaluation services to join, and they don't have to be long-standing commitments. This is experimental, so for groups that have something like, "Oh, I've always wanted to try evaluating based on X," let us know, and let's see if we can find a way to involve you as an independent evaluation service on X. That'd be super fun.
SPENCER: That's really cool. I really like the modularity of it. If anyone's interested in doing that, you can read our materials on our site about how we developed this rating. We call it our clarity rating. I'm happy to talk to anyone who's interested in learning how to evaluate on this. Do you see that as a significant issue?
BRIAN: Yeah, I think you're onto something. So I'd say keep on going. Awesome.
SPENCER: Brian, thank you so much for this conversation. I really appreciate it. It was really interesting to hear what you have to say.
BRIAN: Thank you for having me, Spencer. I appreciate it.
[outro]
JOSH: A listener asks: "If AI was ever determined or even suspected to be sentient or aware or conscious, what kind of rights might they be granted that humans don't have and vice versa?"
SPENCER: You know, even if it turns out at some point as we make AI they're conscious — and by "conscious" I mean that they have internal experiences, there's something that it's like to be them, they feel something, they're not just like a rock that feels nothing, they're like a human that feels something, right? And there's also the question of what would they feel? What would they experience? Like, it'd be easy to assume that they're feeling what we feel, right? Like, you know, if you chat to a large language model and you ask it how it's feeling and it's not one that's supposed to decline saying that, it might say something like, "Oh, I'm feeling great today!" It doesn't mean it's feeling great. Like, even if it was conscious, that doesn't mean that its words "I'm feeling great" means it's feeling great, right? Maybe it feels something completely different than what we feel. Presumably, it wouldn't have normal human emotions like anger unless we specifically somehow figured out how to give it those emotions. More likely, if it felt anything, it would feel something that's very different than what we feel, potentially extremely alien. It might be really difficult for us to even understand when it feels things or in what context or why. So I think if we were to give AIs rights — which I think if they start becoming conscious, I think we should do because I think they would, at that point, be worthy of ethical consideration — we would have to deeply understand what it is they feel and in what context, right? Like, for example, just to give a sample scenario: with a normal human, you would think, "Ah, well, if I were to make them spend hours thinking for no pay, that would be really cruel and mean, right?" Well, with an AI, we have no idea what the effect of that is to the AI, right? We have no idea, even if it did feel something, what it would feel in that experience. It could be torture. It could be bliss. It could be neutral, right? So it's a complicated question. But ultimately, I think we would have to basically adapt to its own nature, which we don't understand very much about.