March 6, 2026
If you enjoy our podcast, we have some exciting news – we’ve just launched a new membership called Clearer Thinking Plus.
Members get this podcast completely ad-free, as well as two professional coaching sessions every month, access to our advanced cognitive assessment, and seven other exclusive perks.
Clearer Thinking Plus is one of the most affordable ways to get access to a high-quality coach - whether you want to improve your habits, find more effective ways to work towards your goals, or get assistance making difficult decisions. It is also a more affordable and convenient way to get all the perks we offer.
If you're not interested in coaching, you can still get ad-free access to this podcast and the other perks with our explorer plan.
Visit www.clearerthinking.org/plus to become a member today. We hope to see you there!
What makes a piece of research “public property,” and what ethical obligations does that create for critics and authors alike? When a result feels wrong but you can’t locate the “smoking gun,” how should skepticism be calibrated without sliding into cynicism? How can a field avoid mistaking the absence of obvious errors for evidence that a claim is sound? What incentives cause entire literatures to form around fragile findings, and why do they persist for so long? Why do some researchers experience replication attempts as hostility, while others experience them as a gift? What norms would make constructive public criticism more common and less personally costly? How should we weigh a paper’s contribution when its analysis is flawed but its question is valuable? When is it rational to trust “the literature,” and when is the literature itself likely to be trapped in self-reinforcing error? What would it take for scientific communities to treat uncertainty as an honest output rather than a professional liability? Can a culture of open critique exist without amplifying bad-faith attacks or anti-science narratives?
Andrew Gelman, Ph.D., is Higgins Professor of Statistics, Professor of Political Science, and director of the Applied Statistics Center at Columbia University.
SPENCER: Andrew, welcome to the Clearer Thinking podcast. Great to have you.
ANDREW: Thank you. Glad to be here.
SPENCER: Noah Smith once said the scariest seven words in academia may be "Andrew Gelman just blogged about your paper."
ANDREW: I think there are scarier things to be said in academia.
SPENCER: But why do you think he said that?
ANDREW: I guess he was trying to be kind of funny. I'm a very nice person, and I think when I started blogging, I remember writing something about this and saying that, "In real life, I'm kind of mean, but my blog personality is very pleasant, and so everyone thinks I'm nicer than I really am." But I guess it didn't really work out that well.
SPENCER: I think you have it in for bad statistics. Is that fair?
ANDREW: Yeah, I think I used to have it in for bad graphs, and people would send me bad graphs, and then people sent me bad papers. But sometimes people send me stuff and they want me to make fun of it, and I like it. Someone sent us this paper about 15 years ago saying that, "There were more babies born on Valentine's Day and fewer babies born on Halloween. Isn't this silly?" I looked at the paper, and it looked pretty reasonable, but they only had data on 30 days of the year: the week before and the week after each holiday. So that's about 15 days per holiday, times two holidays is 30. I wanted to see all 366 days, so I posted on the blog, and someone made a graph, and then my collaborator, Aki, did some analysis, and it became a big research project, all because I was nice.
SPENCER: Sometimes being nice pays off. How often when you read social science papers do you think, "Oh, this is a load of crap"? Is it 50% of the papers you read? Is it 1% of the papers you read? How common a problem is this?
ANDREW: I don't want to answer that, because there's selection bias in what people send me. The hardest thing is sometimes you see a paper and it doesn't seem right, but you can't pick out exactly what's wrong with it. You have to avoid the presumption that it's correct just because you can't find the smoking gun. There are a lot of cases where something seems reasonable, and then you look at it five years later, and the bad things were obvious, but you didn't notice at the time.
SPENCER: Would you say that statistical problems in scientific papers are really common, even though you have selection bias in the ones you read?
ANDREW: Yeah. I think there are lots of mistakes or poor analyses in my own papers, so I can only assume that other people are even worse.
SPENCER: Well, given you're one of the best statisticians out there, I'm sure.
ANDREW: Exactly. No, but yeah, I don't know. What does it mean, right? So maybe the result can still be true or valid. That's why you're supposed to kind of look at the whole literature and hope that the literature itself is not in some sort of trap where it's reifying itself.
SPENCER: How do people typically react when you point out statistical flaws in their papers?
ANDREW: I don't usually contact them directly because it doesn't work; it's not a very pleasant interaction. And this came up once because I said something like, "I don't enjoy interpersonal conflict," and that made me look like a really bad person. But I'm going to give you my official take on this, which is that if you publish something, it's public, and if you're willing to let people forward your paper and say how great it is without asking you for permission, you should be willing to let people forward your paper and say how it sucks without asking for permission too.
SPENCER: We do this project called transparent replications, where we replicate new papers coming out in top psych journals. We'll use a random number generator to pick a study and go try to replicate it. It's interesting; we get a range of reactions because we always reach out to them and say, "Hey, guess what? You've been chosen. Sorry." Then we go try to replicate their paper, and you get everything from stonewalling to super polite to trying to be polite but clearly really unhappy with the fact that you're replicating their paper. So I see this wide range of reactions.
ANDREW: Oh, yeah, they should want it. I love criticism. A few years ago, we did our Red State, Blue State project and looked at how voting varied by income in different states. People asked, "What about white people?" Everyone wants to know about white people in this country. They're always asking about that, not just about all the voters. So we did an analysis of just the white people and had the maps of just whites and everybody. I posted it, and then some blogger somewhere, some political blogger, just attacked me with all sorts of things. He had about 12 points he made about me, and maybe three or four were completely wrong and three or four were completely irrelevant. But there were some real points, and it turned out, for example, that if you looked at our map carefully, we estimated that 108% of the non-whites in New Hampshire voted for Barack Obama, which can't be right; you can't have more than 100%. So my colleague and I went back, and we spent three months improving our model, doing something that now would be trivial to do, but back then was a lot of work. I was so happy. I didn't care that they were rude. I really appreciated that they found a flaw in my work. So, yeah, I think people should be thrilled if their thing doesn't "replicate." That doesn't mean it's not real because it can be that it's a noisy replication study. There can be real stuff happening, but yeah, they should be thrilled. Criticism is wonderful. People should value it more.
SPENCER: Well, it also points to the fact that statistics is a pretty hard subject, and you hear a lot of criticism of science these days. It seems to be almost like a movement of criticizing science, but the statistical parts of the criticism maybe don't get as much air as the sort of more tangible stuff that most people can wrap their heads around.
ANDREW: There is a balance between the statistics and the theory. An extreme example was the study from 2011 claiming to find that Cornell students had ESP.
SPENCER: Well, everyone knows they only have that at Harvard, right?
ANDREW: Yeah. [laughs] This was published, and it was presented. They played hard and played by the rules, so they deserved to get published. This was a solid paper. You look back later, it had a lot of problems, but also there were some theoretical problems in that they didn't have a very good mechanism. I think these go together; when there's a strong mechanism and there are statistical problems, that's maybe less of an issue, because then you can go and try to do better measurements.
SPENCER: It reminds me of the Daryl Bem paper, where they found that people could predict the future if the future involved pornography.
ANDREW: That was the one.
SPENCER: Oh, is that the same paper?
ANDREW: They actually followed up with a paper with, I think, 82 replications of it, which is pretty impressive, except many of the replications came before the published paper, and one of them was on spiders.
SPENCER: A spider porn?
ANDREW: Just actual spiders. They had a protocol, and they found all the studies that followed the protocol. But that work has had problems too, of course.
SPENCER: Well, when one of the replications failed, if I recall correctly, Bem said that it's because the pornography was not sufficiently evocative.
ANDREW: I hate to laugh at this. Let's take something more serious, which was a study promoted by a prominent political scientist claiming that flashing a subliminal smiley face on the screen could change people's attitudes on immigration by 18 percentage points. I don't think that's plausible, but there's a theory behind it. As you might have heard, it's very hard to change people's opinions on immigration. You have to start shooting people and stuff, then opinions change a little bit, but it takes a lot. So the theory is not unreasonable: it's hard to change people's opinions directly, but maybe you can do an end run around their usual resistance. So it's not impossible. And that's the kind of study where you'd say, "I think the problem is they need better measurements. The real effect is going to be much smaller if it's there." You can't just laugh at it. You can laugh at it, but you can't just laugh at it. There could be something there, and that's where the statistics and ideas of measurement are very important.
SPENCER: One time, a friend of mine had this idea. She was like, "What if you could subliminally prime people throughout the day with just being happy, like, show them things that make them happy all day long, subconsciously?" And I was like, "Wow, that's an amazing idea." So we actually tried to work on it. We worked on it, we worked on it, and we eventually realized that computer monitors don't flash things fast enough to do subliminal messaging. The real studies on it have to use special monitors with higher frame rates. So that turned out to be a real obstacle.
ANDREW: I will say my general feeling is that a direct approach is a little better. If you want to convince somebody of something, it's better to talk to them about it directly. For me, I do the happiness treatment myself. I spend as much time with my family as possible because they make me very happy, and I don't need the subliminal.
SPENCER: All right, let's talk about the replication crisis. Quite a bit less wholesome. Can you explain what the replication crisis is?
ANDREW: There were a bunch of high-profile studies where there would be maybe an entire literature, like hundreds or even thousands of published papers. They'd be in psychology textbooks. They'd be in popular books. But then when people would try to replicate the study, they would fail to replicate. The key is that even if there might have been hundreds or thousands of papers on a topic, each paper did something slightly different. No one wanted to do an exact replication. When people started doing that, they were finding things that weren't replicating. Sometimes you can go back and just do a statistical analysis and say, "This never had a chance of replicating." We call those dead-on-arrival studies.
SPENCER: An interesting example of this would be ego depletion. If you looked in that literature in particular, the basic idea is that they thought willpower is kind of like a muscle, and you can deplete it. Imagine that you've got this one bar of your willpower, and as you use it, it goes down. As it goes down, you won't be able to use it later. If earlier you resisted eating the cookies five times, you won't be able to resist the cookie later on. What's so fascinating about it is there are literally hundreds of studies on this topic, and yet when they eventually tried to say, "Okay, let's sit down. Let's really try to replicate this carefully," a bunch of the big replications completely failed. That raises this bizarre question, "What on earth were they doing? How can you have hundreds of published papers on a topic when there doesn't seem to be a phenomenon under study? It doesn't make any sense."
ANDREW: I don't know about ego depletion, really, because I spent so much effort looking at other studies that I just didn't have anything left. Let me give you an answer, because I have an answer to your question. But first I want to say that I haven't done a lot of psychology research. I published a bit in psychology, but mostly on methods. To the extent that ego depletion doesn't seem like the stupidest thing in the world to study, it seems worth studying. If I were studying it, what would I want to do? I'd want to measure the mechanism as directly as possible. There's a tendency in research for people to do what in economics is called a reduced-form analysis. You do the experiment, and then you look at the outcome and say, "What's the treatment effect?" I don't think that's going to work with something complicated like ego depletion. I think they have to do a lot more qualitative work on it, but then do quantitative work informed by that, which means measuring intermediate outcomes, asking for people's attitudes. I'm not saying it's easy. Part of the problem is what I've called the penicillin model of science, or the take-a-pill, push-a-button model of science, which is that there's this treatment. You do the treatment, and the most scientific thing in the world is to have the treatment group and the control group, put them in a black box, and then see what wins. That's not going to work if your treatment effect is contextual, if it's positive in some places and negative in others. Ego depletion, to the extent it exists, and I have no doubt that it does exist to some extent, sometimes it's positive and sometimes it's negative. Sometimes you get that success from hard work and you want to do more. Other times you've depleted your ego. That's not going to show up in that kind of study. It would be the equivalent of a pill that half the time makes people sick and half the time makes people healthy. You want to figure out when that's happening. Don't just give people the pill and look at the average.
SPENCER: The funny thing about ego depletion in particular is it sounds so much like things we're all familiar with. If you're planning a wedding and you have to make 100 different decisions about table napkins, you probably are not going to be making the best decisions later when you're choosing your snacks. I think we have all these words for that, exhaustion, fatigue, boredom, frustration, and most people realize those do affect your decision making. It's just that they took these common sense notions and tried to define them in a very scientific, technical way. But that technical sounding way doesn't seem to hold up.
ANDREW: Well, two things. One is, I want to avoid the one-way street fallacy, which is the assumption that, "Oh, why not try this? It can't hurt." Well, it could hurt. And again, with ego depletion, there are going to be times when you have anti-depletion, your ego builds up its strength from its successes, and that's there too. I think there's a very naive thing when people have an idea and then they think, "Well, it either works or it doesn't work. So let's study it, and we'll see if it works." But what if it's negative in some settings and positive in others? When I was a kid, I read this book. I can't remember which one it was, with four kids who were doing stuff together, and they had a housekeeper who could always come up with a snappy answer to everything. Any question they had, the housekeeper would manage to say no. The kids said they wanted to find her book. The idea was that for every saying like "Too many cooks spoil the broth," there's another saying in the other direction, like "Many hands make light work." So they said, "We want to find a book that has all the sayings in both directions. So when she says one, we can say the opposite." That's how I feel when you say this is convincing. "Yeah, that makes sense. And the opposite also makes sense." But there is no book. That's the trouble.
SPENCER: To what extent do you think that the replication crisis is really a crisis about statistics in particular, versus, say, other factors of science?
ANDREW: I think that the most important thing about statistics is measurement, and the problem is that people are sometimes picking out the wrong parts of the scientific method. It's not bad to randomly assign treatments or find natural experiments. It's not bad to measure outcomes, but it's also important to measure things carefully. I think people sometimes think they do this, this, and this, and they've done science. I do have an answer, if you want. I have an answer to your earlier question about how this could have persisted.
SPENCER: Yeah, how could it persist?
ANDREW: So it goes like this: you do a study. People sometimes talk about the file drawer effect, that the studies that aren't successful go in the file drawer, and the ones that are successful get published. That doesn't happen; everything gets published. I mean, come on.
SPENCER: Even the null results?
ANDREW: Ideally, you publish a null result because it's useful to know. Forget about psychology. Think about medicine. What if you have a promising treatment idea and it doesn't work? It doesn't work. Damn straight, you should publish.
SPENCER: Are we talking about the real world or what actually happens?
ANDREW: I'm talking about the real world and what happens. They do publish medical studies saying this doesn't work. But what often happens in a lower-stakes setting is that there are enough things you can look at in the data that people find a statistically significant pattern. The statistics part here is that to be statistically significant, the effect has to be two standard errors away from zero. If you have a noisy study, meaning not great measurements, then the standard error, the uncertainty in your estimate, is large. So for your estimate to appear and get published, it has to be large. So when they said that they thought that single unmarried women, during certain times of the month, were 20 percentage points more likely to vote for Barack Obama — which I don't believe — that's because the estimate had to be at least 16%, because anything less than that wouldn't have been statistically significant.
SPENCER: That's important. I want to unpack that for a second.
ANDREW: Okay, then I'll give the rest of the story.
SPENCER: Because I think what you're getting at is that if you have a very noisy measurement, which could happen because of a small sample, you don't have that many participants in your study.
ANDREW: Or noisy measurement.
SPENCER: Or another form of noisy measurement, then basically you can't detect small effects. So necessarily, if you find anything, it's got to be large, which means that the things you see in the literature tend to be big overestimates.
ANDREW: And then the next study gets done. You're designing a study, and you say, "From the literature, it looks like the effect is about 20%, so how big a study would I need to detect that?" You design a study following the statistical rules, which is supposed to give you 80% power, meaning there's an 80% chance you'll get something statistically significant if the effect really is that large. Then you do your new study. Well, it turns out that when you do your new study, you don't see anything, but you know it has 80% power, so you're supposed to find something. But you realize that if you analyze the data just a little differently, you do find it, and so then it can persist. I think this idea that people think everything they're doing is completely kosher, because they're doing random assignment and they have unbiased measurements, leads them to think the effects are real because they're statistically significant, and then that leads them to think they can do future studies of the same sort. It really happens that way.
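To make the arithmetic concrete, here is a minimal Python simulation with invented numbers (a true effect of 2 points and a standard error of 8): the only estimates that clear the two-standard-error bar are large overestimates, and some even have the wrong sign.

```python
import numpy as np

rng = np.random.default_rng(0)

true_effect = 2.0   # hypothetical true effect, in percentage points
std_error = 8.0     # standard error of a noisy study (small sample, poor measurements)

# Simulate many replications of the same noisy study
estimates = rng.normal(true_effect, std_error, size=100_000)

# Keep only the "publishable" results: more than two standard errors from zero
significant = estimates[np.abs(estimates) > 2 * std_error]

print(f"Studies reaching significance: {len(significant) / len(estimates):.1%}")
print(f"Average published estimate: {significant.mean():.1f} (true effect is {true_effect})")
print(f"Significant results with the wrong sign: {(significant < 0).mean():.1%}")
```

Designing the next study around that inflated published estimate, as described above, is what keeps the cycle going.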
SPENCER: So what you're touching on, I think, is sometimes referred to as P hacking. Do you want to explain what P hacking is?
ANDREW: Well, the goal is to get a statistically significant result, a result that is unlikely to occur by chance alone. It has a very small probability of occurring by chance alone, and P is for probability. So P hacking is a term that I don't love, but in P hacking, you hack your data; you keep looking at your data until you find something with a low p-value, something that would be unlikely to occur by chance alone. I prefer the term garden of forking paths for the same thing, but I don't think people are necessarily hacking. I think they see their data, and they see a pattern, and then they do the analysis. People get mad at me when I point this out. Something came up recently about mind-body healing. They had a result that was 11 standard errors away from zero, which is a lot, in statistics jargon.
SPENCER: That's the kind of thing statisticians laugh about, but normal people don't.
ANDREW: Yeah. You're not supposed to laugh; you're just supposed to feel that you're being treated with respect. I'm giving you the jargon. We did our reanalysis, and it turned out, "Well, maybe it was really two standard errors away from zero, which is still statistically significant." But then we said, "Well, there are forking paths. There are many ways of analyzing the data." And they were really mad, like, "How dare you accuse us of forking paths? We know what we're doing." But you can tell, because people in their own papers might have five studies and analyze each study a little differently. It's not a bad thing; I do that too. It's only a problem if it leads people to overstate the strength of their evidence.
SPENCER: I think a lot of people who don't work in science don't realize that this idea of a p-value is so critical to so much of the way science is practiced. It's become this sort of magical thing: you calculate the p-value; if it's less than 0.05, you can publish it. If it's not, well, you can't publish it — or maybe you can go find something else that's less than 0.05.
ANDREW: Yeah, but the funny thing is, people know scientists are not stupid. Even if we set aside people who are trying to cheat, there are those who are doing their best and still do bad work. They know this; that's the funny thing. They don't want to be that guy, and they feel that they have a strong theory and they have a strong literature. There's a principle of mathematics you might have heard of: if a problem is hard, the way to solve it is to embed it in a harder problem. If you want to prove a theorem about prime numbers, you can prove it in the space of ideals, for example, more generally. In statistics, the principle is that analyzing one study is hard; doing a meta-analysis of many studies is easier. People realize that. The trouble is that the meta-analysis they do, the literature they review, is just a literature full of biased results. It's too bad. The trouble is, the answer is not to just tell people, "Hey, don't just follow the p-value. That's just a number." They realize that. They feel like they're doing the right thing, but I think they're still misled sometimes, unfortunately.
SPENCER: I was talking to one social scientist, and when we got onto p-values, he said, "Look, everyone has their cutoff." I was like, "What do you mean everyone has their cutoff?" He said, "Well, you know, for some people, if they get p = 0.08, they'll find a way to get it below 0.05, and for other people, it's p = 0.1." That was the way he talked about it: "This is the thing everybody does." Do you think it just became accepted scientific practice that you can kind of say, "Oh, well, it's probably true anyway," and "I've got this theory to support it," and "Yeah, sure, I didn't quite meet the threshold, but if I take out a couple of outliers to get across the threshold, it's fine"?
ANDREW: Well, I'll say this: if right now you have a promising drug and you do a trial on it, and you get a result, and it's not statistically significant, but it goes in a positive direction, I probably think the drug is more likely to work than not. The question is, what are the costs and benefits? There's nothing wrong with people publishing things that might be wrong, but when they're designing their new study, they should be realistic about potential effect sizes and again, stay away from this one-sided thinking. We do a study, and it seems to work in this setting, and that's great. Maybe in another setting, it'll have a negative effect. You'd want to know that. I think people should be more open about their uncertainty.
SPENCER: Some people say the p-value actually creates this huge problem in science. We should just rip them out, get rid of them. Do you think that they really present a problem, or is it really the way people think about them that presents a problem?
ANDREW: I don't really use them myself. So I think it would be fine if they had never been invented. I sometimes think that actually a lot of science would be better if statistics had never been invented. For example, if you do one of these studies, what if you just plotted the data? You're allowed to plot, just not to compute all the analytic statistics. Of course, there are times when statistical analysis is great. We can learn a lot. For example, when we're doing election forecasting, we have a lot of polls. There was this poll in Iowa that was surprising to people. Statistical analysis is perfect for that kind of problem, for trying to figure out what to attribute the result to. It would be hard to do that in your head. But if you're doing a study and you have some data, you have treatments and controls and pre-tests and post-tests, make a graph, see what the data look like. If the data show a positive effect and there's an overlap between the groups, I don't know if that much is gained by doing the formal analysis. You might as well, but you don't need certainty. Statistics shouldn't be thought of that way. We wrote a paper once called the AAA Tranche of Statistical Uncertainty. The idea was that statistics is viewed as a tool to convert uncertainty into certainty, just like they did with the mortgage things. You keep scraping out the statistically significant results until you find something that looks like AAA value, but it's not really; it's noisy. There's uncertainty.
SPENCER: Taking a bunch of really garbage mortgages and then bundling them up and saying, "Oh, look, if I bundle them just the right way, I get AAA."
ANDREW: Yeah, that's how I feel with a lot of that. But I think it's such a relief to be able to be uncertain, to just say, "Well, I'm not sure. This result wasn't statistically significant. I have some uncertainty about the effect. That doesn't mean I think it's zero." It's a wonderful feeling.
SPENCER: Now, when you read papers with p-values, typically people will say, "Oh, it was p less than 0.05, therefore the effect was there," or "p greater than 0.05, we found no effect," which is basically treating it as a magic threshold. But I would argue that it's much better to not treat it as a dichotomy and just think of it as evidence. Would you agree with that?
ANDREW: Yeah, the p-value is the probability that a result as extreme as what you would see could be seen under a certain model if the world were a random number generator, which it's not. I am interested in measures, although I would prefer to put things on a real scale, so when I said an 18% shift in opinion, or increasing sales by 2%, or reducing the death rate by 5%, that's how I'd rather do it. I don't see the benefit of doing it in terms of what's the probability that we would have seen something as extreme or more extreme than this, that the data really came from a random number generator. The very fact that it's so hard to say it is an indication that I think it's the wrong thing to say.
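For readers who want that definition spelled out, here is a minimal Python sketch with made-up numbers: the p-value is simply the probability that a "random number generator" world with zero true effect would produce an estimate at least as extreme as the one observed.

```python
import numpy as np

rng = np.random.default_rng(1)

observed_estimate = 4.0   # say, an estimated 4-point difference between two groups
std_error = 2.0           # standard error of that estimate

# The null model: a world that really is a random number generator with zero true effect,
# producing estimates that scatter around 0 with the same standard error.
null_estimates = rng.normal(0.0, std_error, size=1_000_000)

# Two-sided p-value: how often does that null world produce something at least as extreme?
p_value = np.mean(np.abs(null_estimates) >= abs(observed_estimate))
print(f"p-value ~ {p_value:.3f}")   # roughly 0.046 for an estimate two standard errors from zero
```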
SPENCER: A journalist I know went around a science conference, and she was talking to people who use p-values every day, and every one of them, she said, "What's a p-value?" Only one of them could give a quick, succinct explanation for it. It's such an unnatural kind of awkward idea, and everyone wants to convert it in their head into something it's not. They want to say, "Well, it's the probability the result is true." "No, it's not that."
ANDREW: So when someone does something that you think is stupid, you want to think, "Why are they doing it?" This hypothesis testing has a role, and people are concerned about what would happen in its absence. Remember, I said I think the world would be better without statistics in some way, that maybe baseball would be more interesting, right? No more three true outcomes. Okay, so imagine a world in which we didn't do statistical analysis; we just made graphs. Then people did what I said, which is they said, "Well, if you see a positive result, then report it as positive, and just say you're not sure, but it looks like a promising therapy. Then maybe that would get approved, you know, as a drug or as an educational intervention or whatever." Then the concern is that it creates a moral hazard by which researchers would have the incentive to do really noisy, poor-quality studies, because all you have to do is get a positive result and publish it. You don't need to reach a p-value threshold. Just do a study with 10 people; 50% of the time, it'll be positive. So there's a need to not have the moral hazard of encouraging people to publish really noisy things. The p-value serves that need. I think there are better ways of serving that need, but we should respect what the need is to understand why people would be doing it.
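The moral-hazard point can be seen in a few lines of Python (an invented setup with a true effect of exactly zero): if all it takes to publish is a point estimate in the right direction, a ten-person study delivers one about half the time.

```python
import numpy as np

rng = np.random.default_rng(2)

# A world with zero true effect: each tiny "study" compares 10 treated and 10 control people.
n_studies, n_people = 100_000, 10
treatment = rng.normal(0.0, 1.0, (n_studies, n_people)).mean(axis=1)
control = rng.normal(0.0, 1.0, (n_studies, n_people)).mean(axis=1)

positive = (treatment - control) > 0
print(f"Tiny studies showing a 'positive' effect: {positive.mean():.1%}")   # about 50%
```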
SPENCER: Yeah, and I think some people have talked about getting rid of p-values, not in the careful way you're describing, but just as, "Oh, p-values are bad. We should get rid of them." Well, then it's like, "What's the bar?" At least there's a bar. Yes, people learn to game the system. But if you just get rid of the bar, does that mean it's going to be better? Or is there just no bar now?
ANDREW: Oh, I think the bar would just be a different thing. You might say, "We might have rules about looking at the graph." People always ask this: "Well, without a threshold, it'll be anarchy." But think of all the studies that are submitted that happen to have a p-value of less than 0.05. They don't all get published. So the journals are still having to decide. I think the same rules they use about plausibility of the finding, importance, quality of the measurement, would still apply. I think it would be fine to reject a paper, setting aside the p-value, by just saying, "These measurements are so noisy that they're not answering the question you want to be answering." Journals already use judgment, so I think that's fine.
SPENCER: A topic that the general public seems to talk about a lot more with regard to science than statistics is the politicization of science. "Oh, well, maybe you know, because scientists tend to be progressive, most academics tend to be progressive. That might create biases in the way they do research. Maybe it's going to be harder to publish something that contradicts progressive narratives. Maybe it's easier to publish results that go with progressive narratives." To what extent do you think that that's true, in your experience?
ANDREW: Yeah, I think it's true. I don't have a great thought on this. I'm still thinking it through, but I do feel that there is certain political content to unreplicable research. For example, a lot of the unreplicable research in political science and psychology has to do with there being hidden forces that drive your actions. Now, I think there are a lot of hidden forces that drive our actions, and a lot of science is trying to find those hidden forces. But what's characteristic of these studies is not that they're saying, "Oh, you're driven by your subconscious, or driven by some enzyme, and you don't know what it is." The studies have this push-a-button feeling. I'm a researcher, I push this button and that causes your opinion to change; or I'm a researcher, and I put this word on the questionnaire, and this causes you to walk slowly, whatever it is. I think, actually, there's political content to that which I disagree with. For one thing, when it's applied to political science, if it's really true that 20% of women change who they would vote for based on what time of the month it is, which, again, I don't think is true because we have direct evidence from surveys, that would suggest that we shouldn't take votes very seriously because voters are a bunch of crazy people. I feel like there's political content, which is neither left nor right. In some sense, it fits into a kind of cynical, extreme leftist view to say, "Oh, politics is all a joke. Everything's run by a few people pulling the strings." It's also kind of consistent with a conservative view, "Well, voting isn't to be trusted," which is an anti-democratic take. I think there's political content, but I don't know that the people doing the studies think of it as having political content. I think they just feel like they're doing research, but they're working within a very political paradigm, I think.
SPENCER: Yeah, I guess at the end of the day, it's part of the competition. Everyone's fighting to get published. It's incredibly hard to get published in these top journals. If they can get a slight edge, whether it's by using slightly fishy statistics or by making their message a little more palatable, maybe that's part of the incentive structure.
ANDREW: Of course, if you can really push some buttons and change people's behavior, that's worth a lot of money. This kind of research can be potentially valuable, although I don't think it's really working. So I'm not really so concerned about its implications.
SPENCER: If we took seriously the worldview of a lot of these failed-to-replicate studies, it would be like, "Oh, constantly, everything around you is altering your behavior to a shocking degree. If you just read the word 'slow' on an ad, then you'd be walking slower," and so on.
ANDREW: Yeah, you'd be subject to many influences, whether your older sibling is a man or a woman, whether the local college football team won their game the week before the election, your various hormone levels, whether you're married or not, etc. It would be a strange world, which I don't think is accurate.
SPENCER: One thing I wonder about is when you criticize science and make valid criticisms of science, some people might say, "Well, look, there's so much science denialism these days. There are people who believe all kinds of bizarre, crazy things that are totally anti-science. Doesn't criticizing science support their worldview or give them fodder?" What do you think of that critique?
ANDREW: A couple of colleagues and I — Eric and I — recently wrote a paper, with some terrible academic title, called A Statistical Case for Qualified Scientific Optimism. What we argued was that although replication rates are pretty low in certain fields of medicine and psychology, most results are in the right direction. Just because something doesn't replicate doesn't mean that the effect is not there. I think a lot of things that don't replicate are in the right direction, but the effects can vary. So the short version is, it depends on what you're studying. I think there's a lot of high-quality science, including work that's not statistically significant, but there's bad stuff too, and the quality of the measurement is very important, and that's not something that we usually talk about.
SPENCER: And have you gotten the critique from people that, "Hey, you're supporting anti-science; you're giving them fodder to attack science with by pointing out all these flaws?"
ANDREW: Well, no one would say anything mean to me personally, but I know that the science reform movement includes a kind of science reform reform movement, and they will argue that the science reform movement is too negative, and so some of our recent research has been trying to explore that.
SPENCER: I've heard you called a methodological terrorist. Is that right?
ANDREW: No, I'm not. I was called a methodological terrorist, but I think that the term terrorism is inappropriate.
SPENCER: It's a slight exaggeration. Shifting topics slightly, if we look at claims about health or even some psychological claims, many of them involve things where it's very hard to randomize people. A classic example would be, you've probably seen studies on how coffee reduces strokes and heart disease. If you think about how you would really prove that, it's pretty difficult. You would probably have to randomize people to have different quantities of coffee, which people really don't want to do. You'd have to do it for years, and you'd probably have to do it with hundreds or thousands of people, and then monitor how many heart attacks they have. That's a pretty hard study to do, not to say that it's impossible. Sometimes people do pull off studies like that, but it's pretty rare. Much more commonly, what you're really looking at is some correlation. Someone has taken a big data set where they ask people how much coffee they drink and whether they've had a heart attack, and they're running a correlation. This is what gets reported in the news a lot. A lot of the health information we see from influencers and from the news is this kind of relationship. So I wanted to ask: to what extent should we completely dismiss this kind of evidence? Or do you think there's something to it?
ANDREW: I don't think we should dismiss the evidence, but I think serious researchers in these fields are looking at intermediate outcomes and trying to understand mechanisms. Epidemiology can tell you stuff, so I think it's fine to report things like that. How do you make your decision? It's based on a number of factors, but such studies are not useless. Even if you had a randomized controlled study, it only applies to whoever's in the study, and it only applies to the past, not the future. I wouldn't hold out for something that wouldn't itself be perfect, anyway. You have to make your decisions now. As the saying goes, not making a decision is itself a decision.
SPENCER: How do you think about the strength of that evidence? There are a lot of things where we don't have super high-quality evidence. All we have is a correlation between eating this food and this health outcome. Do you think of that as, "Okay, I should only slightly update my probabilities that this is healthy?" Or do you think that's actually quite a bit of evidence? If someone finds a pretty strong association?
ANDREW: Yeah. I don't know. In terms of living my own life, I try to follow recommendations that I think are reasonable. I'm not an expert at all on medicine, so I would hope that the people who look at these studies assess the reasonableness of the mechanism. I know that the coffee thing came out, and it is controversial. There are some people who say that it's very plausible, and there are others who say it's not. It's hard for me to judge. It is easier to focus on the really bad studies because then it's super clear. I just don't believe some of these mind-body healing claims because I've looked at the evidence they consider their strongest evidence, and I don't find it very strong. But the dietary things, I have much less of a clear bead on.
SPENCER: There's this really hilarious chart. It's called "Everything Cures and Prevents Cancer."
ANDREW: Oh, I've seen that.
SPENCER: They show coffee and chocolate and whatever, and they show all the studies that prove it causes cancer and those that prove it cures cancer.
ANDREW: There are people who are supposedly reading these studies. Medicine is such a big field that they have medical review journals. There are journals that come out every month, and all they do is publish review articles on various subfields. People are looking into it. I don't know how that's being done. I'm not the right person to ask about how people synthesize literature.
SPENCER: Well, one thing that comes up in those studies, obviously, is the naive correlation. Just correlate coffee with heart disease. There are problems with that. It could be that coffee actually reduces heart disease. That'd be great. It could be that when people's hearts get healthier for some unknown reason, they change their coffee behavior, right? It could be there's some third factor, some unknown factor, that affects both coffee drinking and heart disease. We can't distinguish those from the data. Savvy researchers know this, and they're going to try to address it.
ANDREW: They try to find comparable people. So they try to find similar people who are very similar, except one drank the coffee and one didn't, and look what happened. I have seen some of these studies that aren't done well, and that's part of the issue. Sometimes it's easy to see a problem, but when you can't see a problem, it doesn't mean it's right either. It's unpleasant to feel that way, but I don't know what we can do about it, just try our best.
SPENCER: Yeah, it seems like with a lot of these things, embracing uncertainty is really the only thing we can do.
ANDREW: There are people who research this stuff, and I probably trust their take on it, even if they disagree. They would offer a more informed take, just as I'll be able to offer a more informed take on political science, even though I'm not always right about it. I am more likely to be able to notice issues.
SPENCER: I predict if you dig into the health studies, you're going to be sorely disappointed. But I'd be curious if you ever do. Let's talk about meta-analysis, because it puts you in a weird position: in science, you have all these studies, and they don't always agree with each other. You want to trust the science, but the science says one thing, and it also says the opposite. Sometimes people call randomized controlled trials the gold standard, and sometimes people say, "Meta-analysis is the platinum standard." What you do is take a whole bunch of studies and aggregate them together, because that's better than any one study, and then you draw your conclusions based on that. I know you've done some interesting work looking at some meta-analyses and saying, "Things are not always what they seem."
ANDREW: Well, garbage in, garbage out. We looked at a meta-analysis of about 200 different studies of so-called nudging, and the individual studies themselves were biased in the sense that there was selection on large effects.
SPENCER: So nudging is trying to change behavior through small changes in the environment.
ANDREW: The example that works is defaults. You change the default behavior on just about anything, and most people will do the defaults.
SPENCER: So with your savings account, if it automatically gets invested in the S&P 500, most people will just leave it invested, whereas if it doesn't automatically get invested, they'll just leave it uninvested.
ANDREW: Yeah, defaults pretty much work. The other kind of nudge is the cafeteria one: you put the healthy food near the checkout rather than the junk food, and people still find the junk food, whatever. So there's a lot of literature on this, but it's kind of mixed. Medical meta-analyses can be a little better, because medical studies are often done in a much more controlled way; they have to report everything. But sometimes not. There was a meta-analysis of some controversial COVID treatment a few years ago, and when you looked at the individual studies, it was like, "Oh, there are four people in some study somewhere that a doctor did." They weren't really controlled trials. They had different outcomes, so it was kind of garbage in. But I think meta-analysis is wonderful. In fact, I think you should be doing meta-analysis even if you only have one study, even if you have zero studies. The point is that you're doing it with the other studies that could have been; we have a prior on what the other studies would be. Here's the story. You do a study, you find a result, here's the estimate and its uncertainty. Forget about the p-value; you publish this. This is great, you know so much. Then you do another study, and the result is different, because it's on different people, in different conditions; they have different health conditions, whatever. So then you have a different estimate and a different uncertainty in a different place, and now you have more uncertainty. That's weird: just adding a second study gives you more uncertainty. It shouldn't be that more data gives you less insight. So the answer is that even when you only have one study, you should be doing the meta-analysis and saying, "Well, how much could this effect vary? I don't know, so I have to put in a prior. I have to make a guess. But it shouldn't be zero." So it's a very weird thing, if you think about it, that getting more data increases your uncertainty. But that's because when people have only one study, they tend to sort of forget that there's variation. So it's weird, but that's how it is.
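Here is one toy way to picture the "meta-analysis with one study" idea in Python, with invented numbers: the reported interval reflects only the study's own standard error, but once you put a prior guess on how much the effect varies across settings, the interval for a new setting gets appropriately wider.

```python
import numpy as np

# One hypothetical study: an estimated effect of 20 points with a standard error of 5
estimate, std_error = 20.0, 5.0

# The usual interval, which reflects only within-study uncertainty
print(f"Naive 95% interval: {estimate - 2 * std_error:.0f} to {estimate + 2 * std_error:.0f}")

# A prior guess at how much the true effect varies across people and settings
between_setting_sd = 10.0

# Uncertainty about the effect in a new setting combines both sources of variation
total_sd = np.sqrt(std_error**2 + between_setting_sd**2)
print(f"Interval allowing for variation across settings: "
      f"{estimate - 2 * total_sd:.0f} to {estimate + 2 * total_sd:.0f}")
```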
SPENCER: Because if you have just the one study, and you're like, "Oh, well, this can increase the thing 20%," you just kind of forget the fact that there's a huge uncertainty on that 20%.
ANDREW: Well, not just that there's an uncertainty in that 20%, but there's variation that that 20% is for these people in this situation. A simple example would be, you have an educational treatment, and they tend to work on students in the middle of their range. Because the best students don't need it, the worst students can't make use of it. So if you apply it to a certain population, you can have zero effects because they're not in the middle. So you can have the exact same treatment, but just put it in different kids, and you'll get different results.
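A short Python illustration of that point, with made-up numbers: the same treatment, applied to different populations of students, produces different average effects, and among students outside the middle range the measured effect is zero.

```python
import numpy as np

rng = np.random.default_rng(3)

# Hypothetical educational treatment that only helps students in the middle of the ability range
ability = rng.uniform(0, 100, 100_000)
effect = np.where((ability > 40) & (ability < 60), 5.0, 0.0)   # +5 points for the middle group only

print(f"Average effect across all students:      {effect.mean():.1f}")
print(f"Average effect among top students (>80): {effect[ability > 80].mean():.1f}")
```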
SPENCER: So is the idea of the one study, meta-analysis, basically thinking about the fact, "Okay, I did this study on this one population. If I had broadened it to the whole, let's say, the whole US population, that would actually increase the uncertainty. I can actually think about what effects that would have, even though I've never run that study."
ANDREW: Yeah, that's the idea. You would create hypothetical other studies and make assumptions about them.
SPENCER: When meta-analyses are conducted, you're combining many different studies to try to get a better answer. We talked about garbage in, garbage out. How do you think about what should be included versus what studies you should just say, "Don't even put it in the meta-analysis?"
ANDREW: Well, sometimes people say they'll include stuff, but they'll down-weight it if the data aren't as good. I don't believe in down-weighting things. Rather, what I will do is add a bias term with uncertainty. So I'll say, "Oh, this study could be very biased." Some studies are just unbiased for their local population. They're really clean, but usually there are potential biases, and so you have a sense of how large those biases could be. When we did our election forecasting model, we allowed for the possibility that polls could be off systematically one direction or the other. That gave us more uncertainty about our forecast, and we set our prior for that based on previous data, previous polling errors.
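One simplified way to express that idea in Python, with invented numbers, is to treat an unknown bias of either sign as extra variance: each study's potential bias inflates its uncertainty, which reduces its influence on the combined estimate through the model rather than by ad hoc down-weighting. (This is only a sketch; a fuller treatment could put structure on the direction and source of the bias.)

```python
import numpy as np

# Three hypothetical studies: estimates, sampling standard errors, and a guess
# at how large each study's systematic bias could plausibly be.
estimates = np.array([12.0, 5.0, 9.0])
std_errors = np.array([2.0, 4.0, 3.0])
bias_sds = np.array([1.0, 6.0, 2.0])   # the second study has a much weaker design

# Treat each study's possible bias as extra variance on top of its sampling error
total_vars = std_errors**2 + bias_sds**2
weights = 1.0 / total_vars

pooled = np.sum(weights * estimates) / np.sum(weights)
pooled_se = np.sqrt(1.0 / np.sum(weights))
print(f"Pooled estimate: {pooled:.1f} +/- {pooled_se:.1f}")
```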
SPENCER: Because usually when people combine studies, there's essentially an implicit assumption that they're not systematically biased. They're assuming, "Okay, yeah, this thing's not as good, but that just means it introduces more noise." But if they're biased, then all bets are off.
ANDREW: Yeah, and people know about this. There's a whole field of biostatistics, and they've been writing papers about this for decades. So it's not like I came up with this idea, or like I'm coming in and cleaning up the field. People know about modeling bias. But the other thing is that meta-analysis tools, for maybe the past 10 years or so, have become much more accessible. So you can throw data into a computer program, and it spits out the meta-analysis. So of course, you get people doing a simple analysis that gives what they want. But these ideas fit together; it's not like people aren't working on it. People know about it; it's not a big secret. Nothing's a big secret. That's the secret.
SPENCER: Another thing that comes up when you're combining studies is, can you really combine them? Are they similar enough? Because, for example, let's say you're trying to answer a very reasonable question: does meditating reduce anxiety? It sounds like a reasonable question, okay, but what do I mean by meditating? Because you could have everything from, I used an app on my phone five minutes a day for a week to go on a three-month meditation retreat where I was with a guru and didn't speak to another human. And also, what do you mean by reducing anxiety? There's probably at least a dozen different ways you could measure that over different time periods, using different scales and so on. And then, yeah, then you want to say, "Okay, well, I have a bunch of studies on meditation and anxiety. Somehow I want to answer the question, does meditation help anxiety? How do I even merge all this stuff together?"
ANDREW: Well, you definitely start by not asking the question that way, because I would say, in what settings does it work and for which people? For me, for example, put me in a three-month yoga retreat where I can't talk, and that would make me very upset. I can tell you that would be horrible. Again, in meta-analysis, ideally, you analyze the individual data, and you can say, "Well, there are certain people like me for whom it's going to be counterproductive." Also, you can have study-level predictors. So in your meta-analysis, you have characteristics of the study, and that should be part of the analysis. Part of what you'd be attempting to learn is, where does it work and where doesn't it? Now, the trouble with that is, it's very hard to estimate these variations, so you'll just have a lot of uncertainty at the end. But that's fine, as long as you're willing to accept it.
SPENCER: Let's talk about polling. In the first Trump election, a lot of people were shocked that Trump won. A lot of polls seemed to say that he had almost no chance. What actually happened there, from a statistical point of view?
ANDREW: Oh, well, the polls did very well in 2015 and 2016. When Trump was running for the Republican nomination, lots of people thought he didn't have a chance. I didn't think he had a chance, but I'll tell you, he was leading in the polls. The polls provided a lot of information that people didn't want to hear. In the general election, the polls were off by about two or three percentage points, and I think that had to do with who was voting. Trump attracted a bunch of people who were typical non-voters, and non-voters typically don't respond to political polls either. I think the polls were missing a bunch of people who were going to vote for Trump. Being off by two or three percent is not so bad. People are out there saying, "Oh, our response rates are so low. Survey response rates are really low. It's crazy. People don't respond to surveys." Well, it's silly to respond unless they pay you. Why should you give them your time, right? Back in the 1950s, response rates were high. If you were polled by the Gallup Poll back in the 1950s, I would advise you to answer the poll, because you'd be one of 1,500 people, and your response would have an effect on the news headlines all over the country. Nowadays, so many polls are happening. I once read a paper where they polled people to ask them how often they had been polled, which is tricky because you get an overestimate, since the people who respond are the people who tend to respond to polls in general. But it was still informative. Over the decades, pollsters have drained the aquifer of public participation, but the polling errors now are no higher than they were 60 or 70 years ago, because we're running to keep up. We do a lot more statistics now to analyze the data. Let me tell you a bunch of numbers: 1880, 1884, 1888, 1960, 1976, 2000, 2004. I'll tell you the answer: 2016, 2020, 2024. The close elections. There's a period after the Civil War where there were a bunch of very close elections, and then there weren't for a while. For most of American history, if you could predict the election within three percentage points, that's pretty good. So yeah, Ronald Reagan's going to win. You're off by 3%, but recently we've had a bunch of close elections. The problem is it just happens that in 2000, 2004, and 2008, the polls happened to be very accurate, just coincidence. It's a little bit of luck, because there's variation. They happened to have very little bias. Those were elections that were surprising to people. In 2000, people thought Gore was going to win. He did win the popular vote, but not by very much. The polls said the election would be close, and people weren't always believing it. 2004 was actually a little closer than people thought it would be. Again, the polls were accurate. In 2008, people thought Obama wasn't really going to win. Who's really going to vote for this guy? But actually, the polls were accurate. After that, a bunch of people did poll aggregation, and I think a lot of consumers of polls thought that they were awesome. In 2012, the polls were off by a little bit, but they predicted the winner, so people didn't notice. Then 2016 came along. People had unrealistic expectations. That's my story.
SPENCER: So basically, it's luck that lulled people into complacency.
ANDREW: Yeah, being off by two or three percentage points is pretty good, and you're not going to predict the winner in a very close election. But nobody said you had the right to predict the winner in a very close election. You don't. Nobody owes you that.
SPENCER: Wouldn't it be better if we just didn't even think about it, instead of everyone tracking the polls for the three weeks prior?
ANDREW: Oh, they do too many polls. It's ridiculous. I think it's good to have some polls. I mean, there are election security issues. There are countries where the election result is much different from the polls. That's a concern. Also, most polling is not about the election. Political horse race polls are loss leaders for pollsters. Pollsters make their money from market research, but then they throw some political questions on so they can get in the news: "according to so-and-so pollster." But also, polls about public opinion are valuable. It's probably good to know that something like 60% of Americans support the death penalty. I haven't looked at the data recently, but it's something like that. Obviously, there's been a lot of policy polling recently. I think that can be very valuable.
SPENCER: So I've heard that only something like 10% of people answer their cell phone from random numbers. How do they deal with that? That seems devastatingly bad if you're trying to poll people.
ANDREW: A lot of pollsters use internet panels now. So what you do is you collect people, you pay them to occasionally answer surveys.
SPENCER: You just have your standard set of people that are demographically diverse.
ANDREW: Yeah, you work really hard to do that, and then when you do your survey, you make sure you get the right number of young people and old people and different ethnic groups and so forth. And that's not perfect: in 2024 there were polls that did this and still got it wrong, although there were some that did very well. So how to do it is tricky, that's for sure. How to reach the people is tricky, how to get the list of people is tricky, how to get people to respond even after you've paid them is tricky, and how to do the analysis is tricky. There's a lot of work in it; it's a whole industry.
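To make that kind of demographic adjustment concrete, here is a minimal poststratification sketch in Python. The age groups, population shares, and survey counts are all invented for illustration; real pollsters adjust on many more variables at once, often with raking or regression-based methods.

import pandas as pd

# Hypothetical population shares by age group (invented numbers).
population_share = {"18-29": 0.20, "30-44": 0.25, "45-64": 0.33, "65+": 0.22}

# Hypothetical survey: older people are heavily over-represented among respondents.
survey = pd.DataFrame({
    "age": ["18-29"] * 50 + ["30-44"] * 150 + ["45-64"] * 400 + ["65+"] * 400,
    "support": [1] * 30 + [0] * 20 + [1] * 80 + [0] * 70
               + [1] * 180 + [0] * 220 + [1] * 150 + [0] * 250,
})

# Weight each respondent by (population share) / (sample share) of their group,
# so under-represented groups count for more and over-represented groups for less.
sample_share = survey["age"].value_counts(normalize=True)
survey["weight"] = survey["age"].map(lambda g: population_share[g] / sample_share[g])

raw = survey["support"].mean()
adjusted = (survey["support"] * survey["weight"]).sum() / survey["weight"].sum()
print(f"raw estimate: {raw:.2f}, poststratified estimate: {adjusted:.2f}")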
SPENCER: I feel like it would be letting people down if I didn't ask you at least one question about Bayesianism. You probably don't know this, but he's maybe the most famous Bayesian in the world. So make your plug: why do you use Bayesian methods when very few people do? And what are they, just for the people who don't know about them?
ANDREW: Yeah, so Bayesian statistics is, well, the simplest way of saying it is that you are trying to learn something that you don't know, and you have data on this, and you combine the data with your priors.
SPENCER: So you're kind of combining your prior beliefs with the data, right?
ANDREW: I like to say prior information rather than prior belief, because actually, you're not just combining your priors with your data; you're combining your prior information with the data through a model. So you have to think about it this way: there's something I care about, there's data, and there's a model connecting the data to the thing I care about, a model saying, "This is a good measurement of that, and I did a randomized experiment," blah, blah, blah. And then there's another model connecting the things that you want to learn to your previous knowledge; that's your prior, and that's a model also. And why is it useful? Because we have a lot of situations where we have a lot of prior knowledge. We have a lot of previous elections and previous polls. Even for a medical study on a new topic: if I do a new medical study, we have databases of old medical studies, and we have a sense of how large effects typically are, and from that we have a prior which can help us get more realistic estimates. It's not magic, but if I have an estimate and I say, "This was statistically significant, and it looks like it reduced the death rate by 4%," then I can do a Bayesian analysis and say, "Well, my best estimate is not 4%, it's 2%, and here's my estimate of the probability that the effect is actually negative," and that's a little better. So I think we can do a little better; that's why I like to do it. Sometimes you have very little data, and then the prior is very valuable. For example, if you want to do inference for subgroups and you just don't have a lot of data in a subgroup, then your prior model is going to be more important.
SPENCER: So just to explain this a little bit, normally, in typical frequentist statistics, you're running a study, you're just going to analyze that data and think about the data you just collected. But with Bayesian methods, you can say, "We have all these past studies on related topics." We can say, "We know how often it occurs that we get an effect that's 5%, 10%, 20%, and so on." You can integrate that information to essentially improve your estimate. Is that right?
ANDREW: Yeah. People think it's cheating, because it's like, "Look, I want to prove something works, so I'll just throw in a prior saying it works." You can do that; I think you just have to be transparent. If someone publishes a paper and says, "Here's my data and here's my prior, and my prior is that I'm pretty sure this is already going to work," then I can say, "That's not my prior. I don't believe that." So transparency is a necessary part of it. In a way, that is different from classical statistics, because a lot of classical statistics is set up to be automatic: you just push a button, run the analysis, and the number comes out. Here we really are saying you have to be very explicit about your assumptions. I think that's super important.
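To make the 4%-to-2% example concrete, here is a minimal Bayesian shrinkage sketch in Python using a normal approximation. The standard error and the prior scale (standing in for "how large effects typically are" in past studies) are invented for illustration.

from scipy import stats

# Data: the study estimates a 4-point reduction in the death rate, with standard
# error 2, so it is just "statistically significant" in the classical sense.
estimate, se = 4.0, 2.0

# Prior: effects in this kind of study are usually small, centered at 0 with sd 2
# (an invented stand-in for a database of previous studies).
prior_mean, prior_sd = 0.0, 2.0

# Normal-normal conjugate update: a precision-weighted average of prior and data.
post_precision = 1 / prior_sd**2 + 1 / se**2
post_mean = (prior_mean / prior_sd**2 + estimate / se**2) / post_precision
post_sd = (1 / post_precision) ** 0.5

print(f"posterior mean: {post_mean:.1f}")  # 2.0, not 4.0
print(f"P(true effect is negative): {stats.norm.cdf(0, post_mean, post_sd):.2f}")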
SPENCER: And on the other point you were making: imagine you're trying to model what every single state in the country believes about a thing. In some states you have lots of data, and in some states you have only a little. With normal statistics, you'd just conclude that in the states where you only have a little data, you have really huge uncertainty. But Bayesianism says, "Well, maybe you can use information from the other states to learn to..."
ANDREW: Well, let's be fair to those other people. What they would do is say, "Well, I don't have enough data on this state, so just this time I'll use the old estimate for this state, or I'm going to combine it with other states." If you look at the CDC, when they make maps of disease, for rare diseases they do it by county, but in a state like Texas, which has hundreds of counties, they'll combine small counties. What they do is combine data to get a stable estimate. The trouble with that is that you then lose your specificity. It's a little like Heisenberg's uncertainty principle: you can get a precise estimate as long as you're willing to estimate something that's spread out over the whole country, or you can get a very noisy estimate of something local. If you want a more precise estimate of something local, I think you have to do more modeling.
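Here is a minimal partial-pooling sketch in Python of the kind of modeling Andrew is describing: each state's raw estimate is pulled toward the national average in proportion to how noisy it is. The sample sizes, raw estimates, and assumed between-state variation are all invented for illustration.

import numpy as np

# Hypothetical state polls: sample sizes and raw support for some policy.
n     = np.array([2000, 500, 60, 25])        # big samples down to tiny ones
y_bar = np.array([0.52, 0.48, 0.70, 0.20])   # raw state-level estimates

overall = np.average(y_bar, weights=n)  # national estimate
sigma2 = 0.25 / n                       # sampling variance of each state's mean, p(1-p)/n with p near 0.5
tau2 = 0.03 ** 2                        # assumed between-state standard deviation of 3 points

# Partial pooling: each state is a precision-weighted blend of its own data and
# the national average; states with tiny samples get pulled in the most.
w = (1 / sigma2) / (1 / sigma2 + 1 / tau2)
pooled = w * y_bar + (1 - w) * overall

for n_i, raw, post in zip(n, y_bar, pooled):
    print(f"n = {n_i:4d}   raw = {raw:.2f}   partially pooled = {post:.2f}")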
SPENCER: Before we wrap up, what do you wish people knew about statistics?
ANDREW: I don't know that people need to know about statistics. There are other things they should be aware of. They should be aware of uncertainty and variation. The psychologists Tversky and Kahneman wrote a paper in 1971 called "Belief in the Law of Small Numbers," and there is no such law of small numbers. It's a belief, and the belief is that the part is representative of the whole. If I do a survey of 100 people, I should get roughly 52 women and 48 men. But if I get a small sample or if I have noisy measurements, the part won't necessarily look like the whole. People tend to think that all the evidence goes in the same direction. When they write a paper, it's like, "Look at all the evidence." But in real life, the evidence doesn't go in the same direction. Things happen. You will see things that go the wrong way because there are things you hadn't thought of. When thinking about science or life, people should be aware that the evidence doesn't go in the same direction, but also that it's very natural not to realize that.
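That small-sample point is easy to see in a quick simulation; the population share and sample sizes below are arbitrary.

import numpy as np

rng = np.random.default_rng(0)
p_women = 0.52  # population proportion

for n in (10, 100, 10000):
    # Draw 1,000 surveys of size n and see how much the observed share of women varies.
    shares = rng.binomial(n, p_women, size=1000) / n
    lo, hi = np.quantile(shares, [0.025, 0.975])
    print(f"n = {n:5d}: middle 95% of observed shares runs from {lo:.2f} to {hi:.2f}")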
SPENCER: Before recording today, I learned a mind-blowing fact from you, and it wasn't about statistics. It's my final question for you. You told me that you don't have a cell phone and you never check your email before 4 PM. Is that the secret to your incredible productivity?
ANDREW: I think I was productive before cell phones were in common use, and at that time I also checked my email before 4, so I don't think that would be the source of my productivity.
SPENCER: Andrew, thank you so much for coming on.
ANDREW: I really appreciate it. Thank you.
SPENCER: All right, let's just do a few questions from the audience, and we'll be sure to repeat them just so we can get them on the recording. Yeah, Jonathan?
AUDIENCE 1: I read something over the weekend where all the top frontier models, LLMs, were asked if you have a car wash a hundred yards from your house, should you walk there or drive there if you need to go to the car wash. I think all of them except one — I'm not going to name it because I can't remember which one got it wrong. This is just a probably somewhat irrelevant way of asking, how do LLMs do at statistical problems? Are they an entry into the world of statistics that changes things dramatically? I'm sure in most areas they express a very high degree of confidence in the quality of their answers. But what is it? How is it useful? Is it a useful tool in ferreting out fraud or mistakes? What is your experience with LLMs? How do you think LLMs are sort of impacting our world?
ANDREW: I've never used a chatbot myself, actually, so I can't speak from experience, but I have some answers to this because I've talked to people about it. The answer is that chatbots can be super useful in a lot of ways. First, it's a version of googling stuff, a way of searching. I use Google because I'm an old person, but this is what people do. You could do what you do in Google: when you find something in Google, you have to click on the reference and look; similarly, if you use the chatbot, you can look stuff up, and then you can try to find it elsewhere and check it. In that way, it's like having a really accessible textbook. Even if the chatbot is doing nothing but rearranging words that are already on the internet, if you ask something about statistics, you get a lot of technical material, and that's probably pretty good. You're not asking about a conspiracy theory; it's as good as the quality of what goes into it. I think when it's a technical subject, or how to repair your fridge or whatever, it's probably pretty good, because there aren't a lot of fake owner's manuals out there. Beyond that, it can be used for coding. I have not done this, but I've been told that if you want to make a graph, and you're willing to work a little bit with the chatbot, you can get it to make really nice graphs. You give it some data. If you just say, "Make a graph," it might make a horrible graph, but if you give it the data, then you can say, "I want to write a program in Python that does this, that does a grid of this graph," and so forth. Apparently, that's very effective. Could I just throw data in and say, "Please perform a statistical analysis," or throw a paper in and say, "Please find the flaws in this paper"? I guess it would do something, but I doubt it would be very useful to me. It's a little bit like when people say, "Well, you could use a chatbot to write a scientific paper." Could I use it to write my scientific paper? Maybe sometimes. I'm asked to write a review article about a subfield, and I could maybe have the chatbot do it and then clean it up, but most of my scientific papers are research: I don't know what I'm trying to do, so there's no way it would work for that. But that doesn't mean it's valueless. We shouldn't judge a tool by what it can't do.
SPENCER: Other questions?
AUDIENCE 2: So when you're looking at clinical trials for drugs, are there any common signs that you see where you look at the study and think, "Oh, this is clearly trying to massage an outcome here?"
ANDREW: Ideally, you wouldn't have to worry about what they're trying to do. In the ideal world, all the data would be available so you could reanalyze it. The hard thing is that most studies are just more complicated than you'd think. It's very rare, even in a clinical trial, that you're just comparing two groups. You might have time series involved, or multiple measurements; the analysis can be kind of complicated sometimes. Sometimes you do see problems, like they're not adjusting for age, or they adjust for age linearly. That might work in some settings, but it wouldn't work for something like COVID, where there's a big nonlinear pattern. I like to see people show the raw results and then show what happens after you do the adjustment.
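Here is a minimal sketch of why the form of the age adjustment matters. The data are simulated so that risk rises sharply and nonlinearly with age while the treatment truly does nothing, and older people are more likely to be treated; the numbers and the use of statsmodels are illustrative assumptions, not anything from an actual trial.

import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(1)
n = 20000
age = rng.uniform(20, 90, n)
treated = rng.binomial(1, np.clip((age - 20) / 100, 0.05, 0.95))  # older people treated more often

# True risk rises steeply and nonlinearly with age; the treatment has zero effect.
logit_risk = -9 + 0.09 * age + 0.0006 * (age - 20) ** 2
death = rng.binomial(1, 1 / (1 + np.exp(-logit_risk)))

df = pd.DataFrame({"death": death, "treated": treated, "age": age})
df["age_group"] = pd.cut(df["age"], bins=range(20, 100, 10), include_lowest=True).astype(str)

linear = smf.logit("death ~ treated + age", data=df).fit(disp=0)
grouped = smf.logit("death ~ treated + C(age_group)", data=df).fit(disp=0)

# With a purely linear age term, leftover nonlinearity can masquerade as a
# treatment effect; a more flexible adjustment should land closer to zero.
print("treated coefficient, linear age adjustment:", round(linear.params["treated"], 3))
print("treated coefficient, age-group adjustment: ", round(grouped.params["treated"], 3))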
SPENCER: That's always an interesting one: when they report a more complex analysis but don't report the simpler analysis. That's something I've noticed when we replicate papers; it's a point of suspicion. We're like, "Why didn't they report that? You'd expect them to report the simple thing, so let's look deeper." We'll redo the simple analysis and make sure it gets the same answer.
ANDREW: I want the explanation: this looks like it worked, but actually there were more old people in the control group, or whatever it is. You want to have some story, some understanding. Not that the story is always right, but at least it can guide your thinking.
AUDIENCE 3: I'm interested in your final point, which I found very compelling: that in the real world, we should expect the evidence to stack up in different directions, not all go the same way. Taking that as true, how would you change either the academic process, in any field that you think suffers in this way, or just the general discourse, to take that complexity into account, other than just calling for more meta-analysis?
ANDREW: I have some ideas that are already being done in different places. One idea is that you can publish the design of the study first, and then when the data come out, you publish the results no matter what.
SPENCER: Like a registered report.
ANDREW: Yeah, like a registered report. I don't think it should be required, but it should be allowed. Sometimes people say, "I don't want to do a registered report because I don't want to restrict what I'm doing," but the registered part is just a subset; you could do other analyses too. Another thing is to divide things up. If someone does a great study, they could publish just the data. You shouldn't have to show that the result is interesting in order to publish; you should be able to just publish a paper that is the data, and then other people can analyze it. So yeah, I guess moving towards publishing everything.
AUDIENCE 4: Sometimes people create big data sets. I think I remember one that was a collection of genetic information from large populations, but they didn't associate any studies with it, so anyone can trawl through these big data sets. How should someone approach such a giant data set to make sure they're not just finding spurious, outlier-driven significance, given that they didn't collect the data themselves but do want to see if there's anything interesting in it?
ANDREW: Sometimes, if it's a huge data set, you can break it up into training and test sets. That's not going to work so well in economics and political science; we kind of know the data is out there, you know, we have some number of elections, and so you can't always do that, but if you can, that would be good.
SPENCER: Yeah, it's interesting, because in machine learning you see that a lot, but in fields like psychology they almost never do it. It's a powerful technique, where you split some of the data out and say, "This is the data I'm allowed to do any analysis I want on. I can go wild on it. But I'm going to put some in a vault, and only at the very end double-check that what I found still matches the data I haven't looked at yet."
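A minimal sketch of that vault workflow in Python; the data here are pure noise, which makes the point that whatever looks most "significant" during open-ended exploration tends to evaporate in the held-out set.

import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(2)
X = rng.normal(size=(500, 20))       # 500 subjects, 20 measured variables
y = rng.binomial(1, 0.5, size=500)   # an outcome that is, in truth, unrelated to all of them

# Lock 30% of the data in the vault before doing any exploration.
X_explore, X_vault, y_explore, y_vault = train_test_split(X, y, test_size=0.3, random_state=0)

# Go wild on the exploration set: check every variable and keep the best-looking one.
corrs = [abs(np.corrcoef(X_explore[:, j], y_explore)[0, 1]) for j in range(X.shape[1])]
best = int(np.argmax(corrs))
print(f"best variable in exploration set: x{best}, |corr| = {corrs[best]:.2f}")

# Only at the very end, check that one finding against the untouched vault data.
vault_corr = abs(np.corrcoef(X_vault[:, best], y_vault)[0, 1])
print(f"same variable in the vault set:   |corr| = {vault_corr:.2f}")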
ANDREW: Yeah, I think ideally, that's what the scientific process is. So I write a paper, I analyze the hell out of my data, and publish the results which seem reasonable to me. Then other people can follow up. People don't always want to follow up on my ideas. They have their own ideas. I'd be thrilled if people would replicate my studies, but people don't typically replicate their own studies because it's work and it's expensive.
SPENCER: I've heard an estimate that in one subfield, less than 4% of papers ever get replicated. That might even be an overestimate.
ANDREW: Sometimes people replicate their studies and the replications fail, but they do an analysis to make it look like it succeeded. So that's another problem.
SPENCER: All right, let's do two more questions. Anyone else just grab the mic and pass it along?
AUDIENCE 5: I guess it's more of a comment and a reflection, as a scientist doing not clinical trials but the discovery side of science, and having been a reviewer for academic journals as well. I see so much bad statistics being done. People aren't using the right type of statistics; I pick it apart: N equals three, and they use parametric tests with tiny sample sizes. So we do what we can, but it goes along with the comments about hiding data in bar graphs. I appreciate the comment where you said we shouldn't use statistics in certain kinds of science, and I think that would help solve the issue of trying to force out a significant p-value, or hunting for whichever statistic will give us one, because that's the only way we can publish in these big journals. It would be awesome to figure out a way to push the idea of not needing statistics for these smaller discovery studies and then using them more in clinical trials.
ANDREW: I think one of the paradoxes of a lot of medicine, also policy analysis, is that big decisions get made based on small studies, and once the study is out, then things are just done. It's uncontrolled. In an ideal world, I think our experiments would be more realistic, and I think the real world would be more experimental. There's no reason you couldn't keep randomly assigning treatments even after they've been approved, because there are multiple treatments. Now it's a little weird. Do I want my doctor randomly assigning a treatment? Well, you could randomly assign encouragement. There are things that could be done. I'm not saying it would be easy.
SPENCER: There was one trial where, when you went to the doctor, if there were two treatments and it was literally unknown which was better, the doctor had the option to push a button that randomized the choice, because nobody knew. I think that's a pretty cool example.
ANDREW: I think just in general, if you think about the idea with doctors, nurses, teachers, police officers, and various people who are implementing policies, we as researchers, academics, or whatever we are, should respect those people. The idea that a medical treatment, I mean, sometimes it really is just a pill, but typically it's a therapy of some sort, and the doctors and nurses who are implementing it should be active. They should be part of it. It's not like we figured out what works and now you do it. The same with educational interventions. It's not like, here's a kit and the teacher pushes the button. The teacher has to be involved. Police officers have to be involved and committed and so forth. That's the big picture.
SPENCER: All right, last question, who wants the mic?
AUDIENCE 6: To what extent should intuition inform your calculations? In the sense that you have an intuition about some causal chain, or some effect where A just has to cause B. You kind of see it, you feel it. As humans, we have so many bits of information that we experience, but when you do these tests, you get 20 bits: you test 20 people for one binary thing, and whatever. It's not much information. What is the interplay between the very small amount of data you get from your samples versus your lived experience that informs everything else?
ANDREW: Well, I guess that's kind of the point that we often have a lot of prior information. If you think about a drug trial, they do all sorts of experiments to figure out what's a good dose, and then they do the big clinical trial, and everything is frozen. It's kind of weird if the clinical trial has thousands of people, but they're designing the dose based on some small amount of information, so obviously they have prior knowledge. The easy answer to your question is, you should be able to use your intuitions or prior knowledge to design your study. If I believe strongly that there's a certain pathway, then that would suggest I should take measurements along that pathway. That also might suggest doing a control that shouldn't go through the pathway as a negative control. At the very least, you should be able to use that to design your study. In the analysis, it's a little more controversial, but the design is more important than the analysis anyway.
SPENCER: Andrew, thanks so much for coming.
ANDREW: Thank you for the questions.