Clearer Thinking with Spencer Greenberg
the podcast about ideas that matter

Episode 057: Statistics Intuitions and Social Science Reproducibility (with Stuart Buck)

July 8, 2021

How should math be taught in primary and secondary schools? How much is science denialism caused by statistics illiteracy or lack of statistical intuitions? What do p-values actually mean? Under what conditions should null results be published? What are some of the less well-known factors that may be contributing to the social science reproducibility crisis?

Stuart Buck was Vice President of Arnold Ventures (a $2 billion philanthropy) for nine years and led its nationally-renowned work on improving research quality and reproducibility. He has advised DARPA, IARPA, the Department of Veterans Affairs, and the GAO on rigorous research, and he was recently asked by Nature to publish a commentary on reproducibility. You can find more about him at, follow him on Twitter at @stuartbuck1, or email him at

JOSH: Hello, and welcome to Clearer Thinking with Spencer Greenberg, the podcast about ideas that matter. I'm Josh Castle, the producer of the podcast and I'm so glad you've joined us today. In this episode, Spencer speaks with Stuart Buck about developing conceptual models, and logic and statistics, the implications of publishing null results, and open science and experimental reproducibility.

SPENCER: Stuart, welcome. It's really great to have you on.

STUART: Thanks, Spencer. Great to be here.

SPENCER: The first question I want to ask you is about what math you think they should be teaching in schools, because I think both you and I probably share some frustration with [laughs] what's currently taught.

STUART: This came to a head recently, because I have kids who are in high school. And let me just preface this by saying that I myself have studied probably more math than a lot of high school students have. I went through calculus in high school. Then, in my PhD program, I studied linear algebra and econometrics and statistics. I have a research job at a foundation, where I advise on academic research and use that statistics knowledge all the time. But I find myself kind of struggling to remember some of the aspects of high school math, especially geometry. That, I guess, was a pet peeve of mine: being faced with geometry homework and trying to remember, "What on earth does some of this mean?" and, "Why are we having to study this anyway?" If I have trouble helping my kid with homework, despite my educational background and the fact that I use some aspects of math on a daily basis, then how much more are other people, who never went to college in the first place, struggling with helping their kids do this? That caused me to think, "Well, why are we teaching this in high schools in the first place?" Not that there's anything wrong with geometry, trigonometry, calculus (for those who want it or for those who have the aptitude and desire to learn those subjects), but how much of it should be mandatory? How much of it is a hurdle for some kids, or a lot of kids, maybe the vast majority of kids, who will never end up actually needing to use most of it? I wonder if we should restructure the high school curriculum more around some basic issues of statistics and data analysis that could be more useful to a lot of people in just understanding the world, understanding the news, understanding basic issues about whether a medicine is effective or not.

SPENCER: I really like your framing in terms of, "What actually helps you understand the world?" To that point, I always was really bothered by the emphasis on two-dimensional geometry in high school. It seems to me that there are certain basics of it, like you should know what a triangle is and what a rectangle is and be able to calculate the area; that actually comes up in life occasionally. Sometimes you need to plan something in a house or whatever. But then beyond that, there's so much esoteric stuff involving calculating complicated areas, or even certain kinds of proofs. I think sometimes the justification given there is, "Well, you know, we're training the students' minds by giving them these geometric proofs," and so on. What's your reaction to that kind of justification?

STUART: There's something to that, but that argument has no logical stopping point. We could say we're training students' minds by teaching them all to memorize a bunch of famous chess games. That also involves a lot of skill and an understanding of the geometry of chess, the ways and the logic of it, and so forth. It trains your memory. But if someone proposed doing that for the first time today, you would get a lot of reaction saying, "Well, wait a minute. Why is that the one thing that you pick to try to train students' minds and memory and logic, as opposed to anything else?" We could try to push all students to learn, like, real analysis and differential equations. But would that be useful to most people? I just don't know where that argument stops. We can train students' minds in lots of different ways. And maybe we should train their minds using something that is more likely to be useful to at least a substantial percentage of the students one day.

SPENCER: Picking up on a couple of threads that you mentioned there: one is opportunity cost. It's not enough to argue, "Oh, this thing has some value to teach." You have to say, "Well, but does it have more value than the other things we could teach?" "Some value" is just not a high enough bar to jump over. The other thing is, I think a lot of times when defending standard educational practices, people will talk about spillover effects. They're like, "Well, sure, this thing doesn't really matter and we don't really expect them to remember it, but we're enhancing their ability to do this other, secondary thing. We're giving them proofs in two-dimensional geometry because we want to turn them into logical thinkers." With those kinds of arguments, I immediately say, "Well, isn't there a better way to just train that skill directly? Why are you doing it so indirectly?" It seems to me it'd be much better to actually teach kids logic, actual logic, rather than hope that they learn logic [laughs] through proving things about triangles.

STUART: I absolutely agree with that. To the extent I'm familiar with it, there's a whole educational literature on the transfer of learning, transfer meaning you learn subject A and it transfers over to subject B. You'll find those arguments commonly in educational contexts. People will say, "Well, if you study music, that will translate into not only knowing music but being better at math, because guess what, there's a correlation out in the world between people who are good at music and people who are good at math." But correlation is not causation. Just because those skills might be correlated some of the time does not mean that training people in music turns them into better mathematicians. In fact, that's not true. It doesn't hold up when studied in a rigorous way. If you want people to learn math, teach them math; if you want people to learn music, teach them music, and justify it on its own terms, not because of some benefit to something else that you could have been teaching directly. I couldn't agree more.

SPENCER: On the point about the transfer of learning to other domains: I think most of those results are kind of depressingly negative, and people seem to barely generalize the pattern. If you teach them X, they're probably not going to generalize it even to slightly different stuff, unless you actually help them do that. I think that means it's not even enough to just say, "do some math," or even to just teach them proofs. It's like, "Well, what kind of proofs do you actually want to have them do? Because they're probably not going to automatically generalize that to all different kinds of proofs."

STUART: That makes a ton of sense. You used to hear arguments about things like, for example, teaching Latin: that teaching Latin will somehow produce these extra cognitive benefits. Again, I don't think there's any evidence for that, either. Memorizing lots of forms of nouns and verbs might build your memory as to those things, but it doesn't mean that you're now any more logical or have better memory as to anything else.

SPENCER: Well, to be fair, Latin is definitely the best language to cast magical spells.

STUART: Exactly. [laughs]

SPENCER: But no, I totally agree. People will be like, "Oh, Latin helps you learn Romance languages." Well, you know what helps you learn them much faster and better? Actually just studying the language you want to learn. [laughs]

STUART: Exactly. If you want to learn Spanish, just go learn Spanish.

SPENCER: Exactly. Tell me a little bit about what would be in your ideal high school math curriculum.

STUART: I think some basics of probability and statistics. When I say statistics, I don't mean just hardcore mathematical statistics. I don't think you need to be able to derive the formula for a standard error and why you might divide by n minus one instead of n. I'm not talking about that. I'm talking about a more conceptual understanding of the world through statistics. People need to learn about, for example, why a randomized trial of a vaccine (such as we're now seeing with COVID) is a useful tool for gaining knowledge about whether the vaccine works. People need to understand the difference between correlation and causation. People need to understand, ideally, some sort of Bayesian interpretation of probability. There are some famous studies where you give even doctors a scenario, and you say, "Well, look, only 1% of people have this type of cancer, and your test is 90% accurate. So if someone turns up and has a positive test, what's the chance that they actually have cancer?" People usually get that sort of problem wrong because they don't have an intuitive understanding of how to take into account the base rate at which something happens. Those kinds of concepts are incredibly valuable in just understanding the world, understanding the implications of a medical test that you just took, and understanding how to think about your own health and nutrition. So I think those ideas are much more relevant to a baseline understanding of the world than (to pick on high school geometry again) geometry and some of the more esoteric concepts in trigonometry, etc.
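
The cancer-test scenario Stuart describes can be worked through with Bayes' rule. The sketch below is illustrative only. It assumes that "90% accurate" means the test catches 90% of true cases and also correctly clears 90% of healthy people, which the conversation does not spell out. The function name is made up for this example.

```python
def chance_of_disease_given_positive(prevalence: float,
                                     sensitivity: float,
                                     specificity: float) -> float:
    """Bayes' rule: P(disease | positive test)."""
    true_positives = prevalence * sensitivity
    false_positives = (1.0 - prevalence) * (1.0 - specificity)
    return true_positives / (true_positives + false_positives)


# Assumption: "90% accurate" is read as 90% sensitivity AND 90% specificity.
posterior = chance_of_disease_given_positive(
    prevalence=0.01, sensitivity=0.90, specificity=0.90
)
print(f"Chance of cancer after a positive test: {posterior:.1%}")
```

Under those assumptions the answer is about 8%, far below the 90% many people guess, because the healthy 99% of the population produces many more false positives than the sick 1% produces true positives.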

SPENCER: I completely agree with that. I would also add to that list things like the mean, which a lot of people kind of have an intuitive sense of. It's just such an important statistic, though, that it's worth really understanding it. And then the median, and understanding how that differs from the mean, and when you would use one and when you would use the other. Those to me feel really critical. Also, I think just a basic understanding of probability distributions: just the idea that you can have a probability distribution and that it can have different properties, that it can be wider, with a higher standard deviation, or narrower. To me, having these concepts is just the bread and butter of making [laughs] sense of the world.
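
A small sketch of the distinctions Spencer lists, using Python's standard `statistics` module. All the numbers are invented for illustration: one list has a single extreme value to separate the mean from the median, and two lists share a center but differ in spread.

```python
from statistics import mean, median, stdev

# Made-up incomes: one very large value drags the mean but not the median.
incomes = [30_000, 35_000, 40_000, 45_000, 1_000_000]
print("mean income:  ", mean(incomes))
print("median income:", median(incomes))

# Two made-up samples with the same center but different spread.
narrow = [48, 49, 50, 51, 52]
wide = [10, 30, 50, 70, 90]
print("narrow: mean", mean(narrow), "stdev", round(stdev(narrow), 2))
print("wide:   mean", mean(wide), "stdev", round(stdev(wide), 2))
```

The median is the better summary of a "typical" value when a few extreme points dominate. The standard deviation is what separates the two samples that share a mean.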

STUART: Exactly. It applies to so many aspects of the world, to so many subjects. I absolutely agree with what you just said.

SPENCER: It's really critical for understanding science: why science does what it does and why, insofar as it can answer questions well, it can answer those questions well. I do wonder whether science denialism is connected to people not really understanding why science is the way it is.

STUART: That's a good question. That's a bigger topic, I guess. [laughs] People need a basic understanding of just how scientists reason about the world. And I mean, science (the word "science," to go back to Latin) is really just knowledge, knowledge and understanding. So anyway, I think we need to get away from the idea that there's some magical category of human beings called scientists who have some special understanding of the world that no one else has. We all can participate in scientific-type reasoning about the world. It's just through knowing these sorts of principles, whether it's mathematics or probability or just principles of reasoning, that we all can participate in better ways of reasoning and gaining access to knowledge about the world. It's not just the prerogative of someone else with the label "scientist."

SPENCER: It's just so hard to imagine a world where everyone was trained in this stuff, where in a presidential debate they would say, "Well, you know, this thing followed this thing. But of course, you know, correlation doesn't imply causation, and I'm only 80% confident that they're causally related." The first reaction to that possibility is just, "No, there's no [laughs] way that could ever happen." But if you think about it, if you had a population of people that were all trained in this bread-and-butter, how-to-think stuff, you can imagine people actually being like, "Oh, yeah, I actually respect that candidate more, because he assigned a probability to his statement." [laughs]

STUART: So many people on both sides of the aisle are all too willing to confuse correlation and causation and to assume, "Well, you were president at the time that something happened, so therefore you caused it to happen." If we can just get people to understand that that's not always true, or not even very often true, that would be a huge advance in public discourse.

SPENCER: I've heard people say that often presidents are blamed for whatever the last president did. And so it's just like, a sliding door of like, who gets the blame, basically.

STUART: That's another thing. I mean, presidents maybe do have some effects on the economy, eventually. But it surely isn't immediate. It's probably with a long lag time behind it. So to the extent presidents do have an effect on that, you're probably assigning blame and credit to the wrong person.

SPENCER: So let's talk about p-values, and I want to just mention [laughs] a fact about you that I love which is — do you wanna tell the story about the conference and p-values?

STUART: Oh, sure. Yeah. So I was at this conference at Stanford several years ago. It was a conference of people interested in what you might call metascience: people interested in thinking about how to improve science. A lot of people were there from different disciplines. There were statisticians, there were people from medicine and epidemiology, and there was a science journalist there, Christie Aschwanden. She was going up to people and saying things like, "Well, hey, I'm just asking folks some questions, you know, about the conference, about metascience. Do you mind if we step out in the hall and I ask you a few questions?" She said that to me, and I said, "Sure," so we walked down the hall.

SPENCER: Did you know it was a statistical trap she was putting you in?

STUART: Right, exactly. It was. The first question, she turns the camera on, the first question was, "So in layman's terms, explain what a p-value is?". Then she created this kind of hilarious video of all these top professors — professors from Stanford, etc, — kind of just laughing sheepishly as they realize, "Okay, how do I explain that in layman's terms?". So yeah, it was pretty funny the way she ambushed people.

SPENCER: I saw her article, and I was going through it. And [laughs] I believe she said you were the only person, of those she interviewed, who gave a correct explanation of the p-value. So I thought that was very impressive.

STUART: Well, maybe someone else gave a correct answer. I guess that I gave it in a way that she thought a layperson could understand. Basically, my answer was something like this: imagine that you're flipping a coin, and you wonder if it's a fair coin or not. Leave aside the question of whether there is such a thing as a fair coin. If it's a fair coin, then if you flip it 100 times, you would expect around 50 heads and 50 tails. It might not be exactly 50:50; it might be 49:51 or 48:52. But the further away you get from 50:50, the less likely that result would be if it's actually a fair coin. So the logic of a p-value is: you flip the coin 100 times, you assume it's fair, and you assume that it should be 50:50. But then if the data turn up 60:40, one way or the other, you ask yourself, "What is the probability that I would get this result or something worse from a fair coin, where the true result should be 50:50?" And that's what a p-value is basically telling you: what's the probability that you would see data this extreme, or more extreme, if in fact there is no effect there to be found?
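
Stuart's coin example can be computed exactly. The sketch below counts, under the assumption of a fair coin, how often 100 flips would land at least as far from 50:50 as a 60:40 split, in either direction ("one way or the other"). The function name is hypothetical, and the two-sided reading is an interpretation of his wording.

```python
from math import comb


def two_sided_p_value(heads: int, flips: int) -> float:
    """Exact chance, under a fair coin, of a split at least as lopsided as the one observed."""
    observed_gap = abs(2 * heads - flips)  # distance from an even split, in integer units
    extreme_outcomes = sum(
        comb(flips, k)
        for k in range(flips + 1)
        if abs(2 * k - flips) >= observed_gap
    )
    # Every sequence of fair flips is equally likely, so divide by the number of sequences.
    return extreme_outcomes / 2 ** flips


p_60_40 = two_sided_p_value(60, 100)
print(f"p-value for a 60:40 split in 100 flips: {p_60_40:.3f}")
```

The result is roughly 0.057, so a fair coin would produce a split this lopsided a little more than one time in twenty.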

SPENCER: That's a great explanation.

STUART: That's the best I could do. But it's still something I find a little frustrating, because it's not all that intuitive. Given the absence of an effect, what is the probability of data this extreme or worse? That's not an intuitive explanation. There have been surveys I've seen of even statisticians who get the definition wrong in some technical way. It's also not clear that it's the most useful question to be asking about data when you're trying to make decisions about the world. I think p-values are a little frustrating.

SPENCER: But the funny thing is, I mean, almost [laughs] every paper in social science has a p-value somewhere, at least if it's quantitative social science. It's just used constantly, across all of these scientific findings. As you point out, there's something very awkward about the definition of it. So the human mind just wants to keep mapping it onto things that are less awkward, like, "Oh, what's the probability that the result is true?" or things like that, which, in fact, is not quite what it means, because it's saying, "Well, if there's no effect, how likely is this to be?" So it's conditional, whereas your brain wants to just say, "How likely am I to be right?" So I think what the brain wants to do here is use a more Bayesian way of thinking; it's sort of more natural for our brains to ask, "What's the probability of this?" Would you agree with that?

STUART: Yeah, I think in fact that what we typically want to know about the world or about a study is the complete inverse of the p-value. The p-value is telling you, “Given no effect, what's the probability of this data?” And what you really want to know is, “Given this data, what's the probability of an effect?” You can't just flip around probabilities like that. So you can't take a p-value and convert it into what you really want to know, which is, “Given that I did this experiment or this study, and I have this data before me, how confident should I be that the effect is really there, and it's really meaningful?” That's the kind of natural interpretation that people want to take from it, but it's actually the flip side of what a p-value tells you.

SPENCER: That's a really interesting point. I also want to add just a little bit of nuance there, which is that it's not even really evaluating the probability of the data; [laughs] it's finding the probability of getting a result this extreme or more extreme, right? So it's not even just flipping it to the other side of the equation. There's something even weirder going on.

STUART: Yeah, it's asking about the probability of data that is more extreme than what you have before you, which is kind of weirdly abstract, and that's probably not the basis on which you would want to make a decision in terms of your actions.

SPENCER: So now there's this whole p-value war. Some people think that journals should make you not report them. Other people say, "Oh, instead of using the standard cut-off of 0.05 to publish, we should lower that cut-off, maybe make it 0.01 or whatever." It's kind of a flame war, going back and forth. But from my point of view (and I'm curious to get your reaction to this), I think the p-value is just one tool in the toolkit, but a very useful tool. Basically, what it tells you is whether you can reasonably rule out sampling error. In other words, let's say you measure something on a population. If you did it on a small sample, you might get a lot of variability just based on which particular people happen to enroll in your study. If you do it on 20 people, you pick 20 people at random, and you happen to get those 20 instead of another random 20, maybe that will change your results. The way I think about it is, if your p-value is really low, you can for the most part rule out sampling error; in other words, rule out that your result is just due to the particular participants, the particular data points, you got by luck. And really, that's it. [laughs] That's all it really tells you. What do you think of that interpretation?

STUART: I think p-values are one analytical tool. But let me add to what you just said, because I'm not sure I agree wholeheartedly, and so maybe there's something to clarify. The way I see it, the problem with small sample sizes is that you can have both a lot of false negatives and some false positives. If you study a sample of 20, then for any sort of typical human-level effect, whether it's medicine or psychology or whatever, a sample size of 20 probably has too little statistical power to get a low p-value and to say that there's an effect. There are probably a lot of true effects that you'll miss because it's just too small a sample size, and it's going to be too noisy. But sometimes, because of sampling error, you might end up with fluke results in your small sample of 20. Then you find a low p-value because you've got this outsized, fluke-ish result, and then you run to the presses and publish something and say, "Wow, huge effect, low p-value, must be true." In fact, I've even seen Jim Heckman, the Nobel Prize-winning economist, literally make the argument that because a couple of studies of preschool had super small sample sizes but also found a large effect, we can be more confident that they're true. I think that's the exact opposite of what you should conclude; you should completely flip it. If it's a very small sample size and a large effect that produces the small p-value, you ought to be a little suspicious and go back and try to replicate it.
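
Both halves of Stuart's point (many misses, and inflated "hits") can be illustrated with a rough simulation. Everything here is an assumption chosen for illustration: a true effect of 0.3 standard deviations, 20 people per group, normally distributed outcomes with a known standard deviation of 1 (so a simple z-test applies), and a hypothetical function name.

```python
import random
from math import sqrt


def simulate_small_studies(true_effect=0.3, n_per_group=20, n_studies=5000, seed=0):
    """Return (share of studies reaching |z| > 1.96, mean absolute estimate among those)."""
    rng = random.Random(seed)
    std_error = sqrt(2 / n_per_group)  # SE of a difference in means when each person's SD is 1
    significant = []
    for _ in range(n_studies):
        treated = sum(rng.gauss(true_effect, 1.0) for _ in range(n_per_group)) / n_per_group
        control = sum(rng.gauss(0.0, 1.0) for _ in range(n_per_group)) / n_per_group
        estimate = treated - control
        if abs(estimate) / std_error > 1.96:  # two-sided p < 0.05 under the z-test
            significant.append(estimate)
    power = len(significant) / n_studies
    typical_significant = sum(abs(e) for e in significant) / len(significant)
    return power, typical_significant


power, typical_significant = simulate_small_studies()
print(f"share of studies with p < 0.05: {power:.2f}")
print(f"average size of the 'significant' estimates: {typical_significant:.2f} (true effect: 0.30)")
```

With these settings only a minority of studies detect the real effect. Any estimate that clears the threshold has to exceed about 0.62 in absolute value (1.96 times the standard error), roughly double the assumed true effect. That is why a large effect from a tiny sample is a reason for suspicion, not extra confidence.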

SPENCER: Those are all important points. I agree with you on all of them. So I was making a more kind of discrete technical point, which is that if you get a very small p-value, you basically — on just a single result, like you did one calculation, that's it, you get a really small p-value — then you can basically rule out sampling error as an explanation. So you can basically rule out it being due to noise.

STUART: Okay, yeah.

SPENCER: That's all I meant. But in practice, if someone reports a p-value of 0.04 on a sample size of 20, I think you and I both agree that there's a substantial [laughs] probability that the result is not going to hold up if you don't kind of look into it further, or try to replicate it.

STUART: That makes sense. Yeah.

SPENCER: I find the idea of throwing them away very silly, because I'm like, "Well, this is actually a really good tool for ruling out sampling error." So that's how I try to use it in my work. But of course, just to make sure we cover our bases, there's absolutely nothing special about the 0.05 cut-off that is so often used in science; that's completely arbitrary. That being said, I think there is some value in having a cut-off. Because imagine there was no cut-off whatsoever and people could publish a p = 0.20. I can't imagine that would actually make things better, rather than making things worse. So what do you think about that?

STUART: That's a good question. I mean, we're hitting on publication bias and other issues here. Another issue is how often you should be able to publish null results, which is a bigger question, too. But yeah, I'm very wary of the idea of cut-offs. Andrew Gelman is a statistician at Columbia. He has this great, really short article (if I can get the title right) called something like “The Difference Between “Significant” and “Not Significant” is not Itself Statistically Significant,” and that's kind of a mouthful. For the listeners, it's probably hard to picture what he's saying. But what he's saying is basically this. One example would be if you have one study with a p-value of 0.049 and another study with a p-value of 0.051, studying the same topic. Those two findings are not actually significantly different from each other, so there's no reason to privilege one over the other. He even goes further and says that with arbitrarily small differences in your population and in your sampling, and so forth, you can end up with effects that look like they have p-values that differ even more than that, but that are not actually significantly different from each other when you weigh the two effects against each other. So, going by what he says, I would say the logic of drawing a bright-line cut-off or threshold just fails on its own terms. If there's no significant difference between findings on either side that are close to that cut-off, then the argument for having a cut-off kind of collapses; at least in that area, it gets really fuzzy. But then, should there be a cut-off for publishing? I just don't know. I mean, I think we need more null results in the literature. I think maybe we can debate about what you consider a positive result versus a null result.
If the p-value is 0.20 rather than 0.01 or 0.05, is that null, or does that just mean you have less reason to think that the data are all that different from chance, since evidence is on a continuum? So I don't know. I'm uncomfortable saying that in some cases we know that the data are different from chance, and in other cases we know perfectly well that they are from chance. I mean, I feel like it should be more of a continuous sort of gray area, really, most of the time.
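
The point Stuart attributes to Gelman's article can be made concrete with two hypothetical studies. The numbers below are invented for illustration, and the sketch assumes independent studies whose estimates are approximately normal, so a simple z-test applies. One study looks clearly "significant," the other clearly does not, yet a direct comparison of the two shows no significant difference between them.

```python
from math import erfc, sqrt


def two_sided_p(estimate: float, std_error: float) -> float:
    """Two-sided p-value for a normally distributed estimate against a true value of zero."""
    z = abs(estimate) / std_error
    return erfc(z / sqrt(2))


# Hypothetical studies of the same question (made-up estimates and standard errors).
est_a, se_a = 25.0, 10.0
est_b, se_b = 10.0, 10.0

p_a = two_sided_p(est_a, se_a)
p_b = two_sided_p(est_b, se_b)
# Compare the studies directly: the difference in estimates and the SE of that difference.
p_diff = two_sided_p(est_a - est_b, sqrt(se_a ** 2 + se_b ** 2))

print(f"study A: p = {p_a:.3f}")
print(f"study B: p = {p_b:.3f}")
print(f"A versus B: p = {p_diff:.3f}")
```

Sorting study A into the "worked" pile and study B into the "didn't work" pile would treat them as contradicting each other, when the data are consistent with both measuring the same underlying effect.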

SPENCER: Yes, you bring up so many interesting issues. So first of all, just for those who don't know, the phrase "statistically significant" refers to a result that has p smaller than 0.05. That phrase, I think, has created a lot of problems, because it's come to mean in many people's minds, "Oh, if it's significant, it means I found something, and if it's not significant, it means I didn't." I think you and I would both agree that the way evidence works has nothing to do with thresholds. It's a continuous thing. From the point of view of actual evidence, p = 0.04, p = 0.05, and p = 0.06 are all essentially just about equivalent, because evidence comes on a continuum. And by introducing this phrase, "statistically significant," unfortunately, the way the human mind works tends to push us into this dichotomy of whether it worked or didn't work. If it's just below the line of p = 0.05, we say it worked, and if it's just above, we say it didn't work, and that's silly. But then there's a separate question of what the standards in scientific publication should be. I certainly agree that it would be great to publish certain kinds of null results. And you and I have actually discussed this a little bit before. As for the general call for publishing all null results, I don't really buy the value of it, because if you just pick random ideas out of a hat, almost all of them will be false. If you say, "Does X work for Y?" well, the answer is almost certainly "No." If you publish a whole bunch of random ideas that researchers came up with, where our prior probability on them is really low anyway because they're just random, then I don't think those publications really are valuable.
What I find really valuable is if someone finds a null result for something that is believed to have a positive result. If they tried intervention X for outcome Y, and it's generally believed that it works, and they find it doesn't work, then that's a really important result, and that should definitely be published. Any reaction there?

STUART: I agree. I would not suggest that we publish all null results, because you're quite right: there's probably a near infinity of possible null results, either because you're asking a stupid question, or because you might have studied a reasonable question in an incompetent way and just failed to find the results that should have been there. I wouldn't want to publish all null results. But I agree that publishing null results where there's been a prior positive result is good. I guess I would also extend it a little further: I would say publishing null results could be valuable anytime you're asking a theoretically interesting question, a question where the answer would help advance the field whether the answer is yes, or no, or "hard to tell, we need more exploration."


STUART: There are a lot of psychological studies that were looking at questions that might not have been all that theoretically interesting, such as (I mean, just to exaggerate a little bit), "Does being in a room with the color purple make you more or less anxious?" Is that an interesting question? If you find a null result for that question, I mean, people would say, "Well, why would it?" But if you find a positive result, then the psychological literature says, "Oh, wow, we found some priming effect on the human brain," and then people come up with some theory after the fact as to why the color purple has this effect. But if it had been some other color, who knows what the result would have been? I mean, that sort of line of inquiry doesn't have a strong theoretical basis in the first place. Again, I'm not a psychologist and I don't mean to be dismissive, but it strikes me as throwing spaghetti at the wall, seeing what will stick, and then publishing the stuff that stuck. It seems to me much, much better if you build a science on asking important, substantive theoretical questions where the answer needs to be known one way or the other. Then if it's a null result, and you did a good job (you didn't just mess up the experiment in some way), that null result should be informative to the field about what does or doesn't work to answer that particular question, or what might be a dead end.

SPENCER: That makes sense to me. I would also add that sometimes there are really applied results, like "Does purple reduce anxiety?" If that is actually true, [laughs] that's actually pretty useful. I do think there's a role for that, but I guess the way that I break that down (and I may have mentioned this on a previous podcast; I'm not sure, I've done so many at this point) is that I think about the most useful science worth funding as generally falling into three categories. The first is theoretical results that are very fundamental, which I think is what you were talking about: an important topic in that field. Just as an example, in physics, that might be something like gravity. If we understood gravity better, that would be interesting because it's very fundamental. Or in psychology, identity is a really important topic; if we understood it better, that would be good. The second area I think about is applied areas where you're like, "Well, if we had this result, we could just go do something very useful with it right now." Curing cancer would be the classic example, or in psychology, it might be making people less anxious. Even if it's not interesting theoretically, if we can make people less anxious, great. And then the third area that I think is really important to fund in science is stuff that makes science go faster: tool building or new methods, or things like this that could support better science in the future. That's not to say that there aren't any valuable scientific enterprises outside of these three areas; I think there are. But I think these are the three most important from my perspective.

STUART: So going off of that, in what area, if any, would you say null results are important to include, or perhaps less important to include?

SPENCER: Great question. With the first category, fundamental results, often those null results are actually important, because it's like, "Hey, we thought this thing about gravity might be true, but it's not." [laughs] "Oh, cool." People might be really interested in that. And even if your theory was somewhat esoteric, maybe that helps guide other theories away from that idea. In applied stuff, a null result is most useful when it concerns something that other people are already using in practice. If you can say, "Hey, you know, I think what you're doing actually doesn't work," or it's something that's generally believed to work, then I think that's when null results should be published. But if it's an applied result that nobody cares about, nobody's using, and nobody believes (as an example, just picking a random color and seeing if it does a random thing), those don't seem very valuable to me to publish. On the tool building and methods side, I would say mainly if you discovered that a method doesn't do what it claims.

STUART: Right. I think I agree. The first category might be where null results are the most valuable. I guess what I was trying to say earlier is that in those other two categories, null results would still be valuable if what was being tested was thought to rest on some sort of strong theoretical basis. When it comes to curing cancer, for example, you could have millions of possible null results. You could test every one of thousands of different foods and spices in an RCT against every possible type of cancer.

SPENCER: Have you ever seen that chart that says everything [laughs] both cures and causes cancer?

STUART: Yes, that's an amazing chart. But that's what you would end up with, basically: a lot of just random noise. Those would not be substantively interesting or theoretically grounded questions. Publishing the thousandth null result that says, "Hey, guess what, this random ingredient also doesn't cure this random type of cancer," doesn't seem to be a great advance. But suppose there's some strong theoretical reason for thinking that, I don't know, turmeric has some sort of curative or preventative properties as to cancer. Not that people only think so for whatever random reason they have, but there's some strong reason: maybe somebody has done some work on the molecular mechanisms, tested it in the lab, and found that it does kill cells. That's when it would be interesting to have a null result on that.

SPENCER: Right, because it kind of feeds back into the theoretical.

STUART: Right, exactly.

SPENCER: So let's talk about open science and reproducibility. Do you want to say a little bit about how you got interested in that, and what your role has been?

STUART: Sure. I don't know if I've said much about my role, generally. I'm at Arnold Ventures, formerly the Laura and John Arnold Foundation. It's a foundation that's been around technically for about 10 years, and I've been here for about eight of those years as director and then VP of research. We got interested in this problem of reproducibility really early on, back in 2012 or so. The reason is that, as a foundation, we're super interested in questions about evidence-based policy: using evidence in government, or just using evidence to improve people's lives. And then we started noticing, and John Arnold in particular started noticing, that there were these rumblings about reproducibility problems. Psychology is where I think he and I both first noticed it, around summer 2012.

SPENCER: Can you define reproducibility for those who aren't familiar with it?

STUART: Sure. There are lots of different definitions, and you can get really technical, but the most basic, simplistic definition is this: somebody does an experiment (let's say in psychology) and publishes it. Other people try to do the same experiment and try their best to see if it works. It turns out they can't get it to work, and so they say, "Well, look, that experiment wasn't reproducible, for some reason." You can then argue about what the reason is. Maybe the second set of experimenters just didn't do a good job. Maybe the first experimenter was publishing false results for some reason, or engaging in research practices that might have skewed the data in one way or another. Maybe there's some mechanism or factor that just isn't well understood. Maybe there's something different about the population that the second group studied compared to the first, no one knows what that is yet, and you need to figure out how different populations react to different psychological tests. So there are a number of possible reasons, all of which try to explain why the first result was not reproduced in the second study.

SPENCER: I'll just add that there's sort of a hierarchy of replication, reproducibility, whatever you're going to call it. At the very bottom of the hierarchy: if you use the exact data and exact code that the original research team used, can you get the same result? That might mean just running their code on their data. We should hope so, right? Then you could say, "Okay, what if, instead of using their code, I write my own code to do the analysis, but I'm still using the exact data set that they used?" Then you've got a level above that: "What if I go recruit people who are just like the people they tested, and then I run their code on this new data that I've collected?" And so on, all the way up to the very top of the hierarchy, where you've got conceptual replications: "I'm recruiting new people, they're not the same people, and I'm not doing exactly the same study, but it's conceptually related. So if their study produced a real effect, I'd expect mine to also produce a real effect, even though it's actually testing something different." And so I guess we could be talking about anywhere on this whole chain.
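
[Sketch for readers, not part of the conversation: the bottom rung of that hierarchy, rerunning the same code on the same data and checking for the same result, can be illustrated in a few lines of Python. Everything here is an assumption made for illustration: the data values are invented, `analysis` is a stand-in for whatever a real study computes, and `fingerprint` is a hypothetical helper. The one design point it shows is that any randomness has to come from a recorded seed, or even the original authors cannot reproduce their own numbers.]

```python
import hashlib
import json
import random

# Invented measurements, for illustration only.
DATA = [4.1, 5.0, 3.8, 6.2, 5.5, 4.9, 5.1, 4.4, 5.8, 4.7]


def analysis(data, seed):
    """Stand-in analysis: the sample mean plus a rough bootstrap interval."""
    rng = random.Random(seed)  # all randomness flows from the recorded seed
    n = len(data)
    boot_means = sorted(
        sum(data[rng.randrange(n)] for _ in range(n)) / n
        for _ in range(200)
    )
    # Positions 4 and 194 of 200 sorted values: roughly the 2.5th and 97.5th percentiles.
    return {"mean": sum(data) / n, "ci_low": boot_means[4], "ci_high": boot_means[194]}


def fingerprint(result):
    """Hash a result dictionary so two runs can be compared exactly."""
    payload = json.dumps(result, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


first_run = fingerprint(analysis(DATA, seed=57))
second_run = fingerprint(analysis(DATA, seed=57))
print("same code + same data + same seed reproduces:", first_run == second_run)
```

[Publishing the fingerprint alongside the data and code would let a second team confirm they are at least starting from the same computed result before moving up the hierarchy.]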

STUART: Yeah. That's what makes it hard. When you said "define reproducibility," I was thinking, "Okay, at which level?" Because there are so many different ways that people talk about it. They'll use terms like replicability and reproducibility sometimes interchangeably, sometimes to mean one versus the other, and sometimes the reverse in different fields. There's no universal terminology here. But yeah, the very most basic is just to rerun the same code on the same data and hopefully get the same results. And even at that, most fields often fall a little short, [laughs] unfortunately.

SPENCER: Is that mainly just because you don't have access to the data or code or because [laughs] even if you get it, you can't get the same result?

STUART: Often, it's because they don't have access to the same data and code. There are some fields, or some journals, actually; I think the American Journal of Political Science is one that started a program a while back to rerun the data and code for the empirical articles it published. They found that most of the time, at least for the first few years, they did find some errors or mistakes that had to be corrected. I don't know how many of them were actually huge, major errors. But it's sometimes really hard to rerun other people's data and code on a second computer, because coding practices in academia are not the same as in industry.

SPENCER: Well, it's funny you say that, because I work with a lot of software engineers, and I'm a programmer myself. I've been programming since I was a kid. And I assume that even a good programmer will accidentally introduce a bug every 10 to 30 lines of code; that's just a given. It's not about not creating bugs. It's about having really rigorous processes to catch the bugs before [laughs] they cause a problem, especially before you get your final result, or it goes into production, or whatever. A lot of good software engineering is about doing things that help reduce bugs but also catch bugs when they occur, because they are sort of inevitable. One thing I see in academic programming is that it feels like they don't have a lot of those best practices for how to catch bugs.

STUART: Yeah, of course there's a huge variety in academia, some people in academia are great, and they're aware of these principles.

SPENCER: Absolutely.

STUART: Yeah, I think to a lot of academics, the idea of testing your code for bugs, or doing code review with somebody independent, would be foreign; they've never heard of that. So the best practices still need to make their way into a lot of academia.

SPENCER: It reminds me of that example, if I recall correctly: it was some kind of economic study related to the debt of countries, and how it can be problematic if countries have a large amount of debt. It was being cited in Congress or something like this. And then [laughs] someone discovered that they'd literally used Excel, Microsoft Excel, and there was one of these issues where they hadn't copied the formula to all the rows, or something like this. An incredibly basic mistake, and it just completely overturned the finding when someone fixed the Excel error.

STUART: Yeah, you're talking about the Reinhart and Rogoff issue. Two scholars at Harvard, a supposedly prestigious university. They wrote this famous article in the late 2000s, around the height of the economic crash, and published a book that followed up on these claims that if you have debt that's too high, that kills your economic growth. I think it was either a grad student or a postdoc who somehow got them to send him the Excel file [laughs] that they used. And yeah, you're exactly right. They were analyzing, I forget what it was, something like 21 countries, and they just left out some of them because they didn't copy the formula into the right rows. So yeah, kind of amazing.

SPENCER: But you know, I have a hard time being tough on them about that, because as a mathematician and a programmer, I make mistakes like that constantly. It's really just about trying to make sure I catch them. But inevitably, occasionally one is going to slip through. Almost nobody is so good that they never accidentally let something like that slip through. I think they were really unlucky that this happened with a really big finding, and then they published a book and all this stuff. That's disastrously unlucky.

STUART: I don't know, I guess I might be a little harder on them. Obviously, it's easy to introduce bugs into code. You write something in R, and you're missing a quotation mark somewhere, or there's a comma in the wrong place, and it spits out some completely inscrutable error message at you. So yeah, obviously, that can happen to anyone. But come on, this was an Excel chart that was pretty simple. This is like working in Python and not being able to get your first little "Hello, world!" command right.

SPENCER: Well, [laughs] then maybe you're a much more precise operator than I am with this stuff. I've made so many little mistakes when I'm doing calculations.

STUART: Fair enough.

SPENCER: But on the other hand, if you're gonna go publish a book about it, double check your work, seriously, double check your work.

STUART: This is not looking through 1000s of lines of code. I mean, it's just an Excel chart. It was basically on one screen. I mean, how hard is that to double check?

SPENCER: Yeah, and with great power comes great responsibility, right? You've got to up your game and double check. Everyone's going to make mistakes, so you've got to double check. But one thing that I think people don't realize is that if you write some code to do an analysis, and it gets the result you expected, you're probably not going to double check that code. But if it doesn't get what you expected, you're going to be like, "There must be a bug somewhere; I'll find it." So I wouldn't be at all surprised if there are a lot of these results where they got what they thought they were going to get, and the code [laughs] actually doesn't work at all.

STUART: That's a good point. To broaden that point out a little bit, I'm thinking of Annie Duke's book, Thinking in Bets. She made this great point in there that we often don't question our successes. We do something that succeeds, and we're all too happy to say, "Yeah, we did a good job." If we do something that fails, then we go back and say, "Well, okay, post-mortem time, let's see what happened. What did I do wrong?" As I recall, she said that, sure, if you lose a hand, you go and question what you could have done differently. But the best poker players would also take a successful hand, one they won, and instead of just feeling good about it, they'd say, "Let's go back and look, because maybe I made a decision that actually was not a good decision at the time, and I just got lucky." So you need to be questioning your successes as much as your failures, and in effect do a post-mortem on your successes to try to figure out if your decision-making process was actually going well. I would say that should apply to science and to code as well: if you do something that "works," then maybe you were lucky, maybe it's a fluke. Maybe you should still double check and make sure that you are making the right choices.

SPENCER: Totally, yeah. And adopt unit tests, which is one of the best practices that engineers do, and we try to use them on our engineering teams — and I really think scientists could benefit from implementing unit tests where you basically write code that's designed to test your code. It's sort of indispensable.
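
[Sketch for readers, not part of the conversation: a unit test is just a second piece of code that checks the first. The Python below is a minimal illustration under stated assumptions: the function names and the numbers are invented, and `mean_growth_truncated` is a hypothetical stand-in for the kind of spreadsheet formula whose range stops a few rows early. It is not a reconstruction of any real analysis.]

```python
def mean_growth(rows):
    """Average the growth figure across every (country, growth) row passed in."""
    if not rows:
        raise ValueError("no rows to average")
    return sum(growth for _country, growth in rows) / len(rows)


def mean_growth_truncated(rows, last_row):
    """Buggy variant: behaves like a formula whose range stops at last_row."""
    return mean_growth(rows[:last_row])


# Invented numbers, for illustration only.
ROWS = [("A", 2.0), ("B", 1.0), ("C", -1.0), ("D", 3.0), ("E", 5.0)]


def test_uses_every_row():
    # Hand-computed: (2 + 1 - 1 + 3 + 5) / 5 = 2.0
    assert abs(mean_growth(ROWS) - 2.0) < 1e-9


def test_truncated_range_is_detectable():
    # Dropping the last two rows changes the answer, so a test that
    # compares against a hand-computed value would catch the omission.
    assert abs(mean_growth_truncated(ROWS, 3) - mean_growth(ROWS)) > 0.5


def test_empty_input_is_rejected():
    try:
        mean_growth([])
    except ValueError:
        return
    raise AssertionError("expected ValueError for empty input")


if __name__ == "__main__":
    for test in (test_uses_every_row, test_truncated_range_is_detectable,
                 test_empty_input_is_rejected):
        test()
    print("all tests passed")
```

[The tests are written as plain functions with `assert` so they run either directly or under a test runner such as pytest. The useful habit is the first test: compute a small case by hand, then make the code prove it agrees.]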

STUART: Yeah, totally.

SPENCER: But let's go back to Open Science, so I derailed you. [laughs] Do you want to keep telling your story? Yeah.

STUART: In terms of our foundation, we looked around and said, "Wait a minute, we want to be evidence-based in philanthropy and in what we believe about public policy, from criminal justice to education to health, etc. What do we think about this fact that maybe a lot of research isn't all that reliable or reproducible? Maybe just the positive stuff is getting into the literature, and maybe some of it isn't true, or won't turn up again, for whatever reason. And maybe there are a lot of null results or unexciting results that actually were true, but no one wanted to publish them or bothered to publish them." To take an example, if you wanted to support education reform or curriculum reform in schools, and there's a lot of publication bias and a lot of bias in the way that experiments are run and analyzed, then you could be completely misled as to what actually works to improve kids' outcomes in schools. The stuff that didn't work, people didn't bother to publish. The stuff that supposedly works, they published, but not so much the failed replications. So how do you know what to fund as a philanthropist if you're trying to base your decisions on evidence, and you're not sure what to trust in the evidence? That's the origin of our interest in this broader meta-issue: if we don't know what to trust in science, how do we improve the processes of research, and how do we get the academic journals to prioritize things that might improve the reproducibility of research?

SPENCER: You would think that many large foundations and other institutions including ones like the US government would have a huge vested interest in making sure science is reliable because so many decisions and fund allocations are effectively based on studies.

STUART: Yeah, you would think so. It's not totally absent. There are some other foundations, the Sloan Foundation, the Templeton Foundation, and a few others, that have had some interest in this. There are certain small areas or small grant-making projects; NSF (the National Science Foundation), for example, had a call for papers on replication at one point. I don't know if that was a ton of money compared to their overall budget. NIH (the National Institutes of Health), which is the main biomedical funder here in the United States, is interested in this issue; for example, they just released a rule on data sharing for the studies that they fund in biomedicine. So yeah, funders have some interest in this, but it is kind of a meta-issue. I've talked with people who are high up in leadership at NIH, and they will candidly say off the record that it's tough to get much funding for this kind of inquiry. What's seen as successful is NIH pushing out lots of money to support biomedical research at labs. It's really hard to pull back and say, "Wait, let's give less money to biomedical research labs and more money to questioning the work of what [laughs] the other biomedical research labs are doing." That introduces a level of controversy, and it seems less exciting than pushing for the next big new thing in whatever field you're in.

SPENCER: It just seems so valuable because we now have abundant evidence that substantial portions of findings are wrong. In the social science realm, you've had these big attempts to replicate lots of different studies in top journals. My takeaway from that is that if you randomly sample from top journals, probably something like 40% of the main findings will not replicate. Is that kind of roughly the sense that you have as well?

STUART: It's hard to say. I don't know what would happen with random sampling. One of the earlier projects that we funded in this area was the Reproducibility Project in psychology, which was published in Science in 2015. They took 100 original studies that had been published in a handful of top psychology journals in 2008. When they tried to rerun the experiments, with hundreds of scientists around the world collaborating with them, they found that only about 36% to 39%, in their view, were successfully replicated, as in, they found the same effect in the second experiment. The rest are a little harder to interpret, but one way of interpreting it is that another 30% or so were kind of inconclusive, and for another 30% or so of the experiments, you would say the result just really didn't stand up: the first experiment just didn't replicate in the second experiment. But as to your question, "What's the average across science?", that's a really difficult question for me to answer. It really varies depending on the context and on the field. So I don't know what you would get from a random sampling; it's very context dependent.

SPENCER: I agree. It's very hard to tell, and there are so many complicating factors. In that study, there were those results that were sort of inconclusive, on the one hand. Another thing is that not all of them were perfect replications, because it's just not always possible to do a perfect replication. So you always have a danger that the replication attempt didn't faithfully reproduce the correct concept or the correct design or whatever. That's a muddying factor. Then you have the fact that there's a selection bias, where people are choosing which studies to replicate. They're volunteering to do it, so maybe they're picking ones that are easier to do, which means that maybe some of the harder-to-do studies are being left out altogether. I don't know how to make sense of all this to get one number, but I think a lot of people were disturbed by the percentage that didn't replicate in this trial.

STUART: Yeah, absolutely. There are a lot of reasons that something may or may not replicate. I think, at a minimum, what you can take away from this is that we often don't have a very strong theoretical understanding of what's going on in science. One of the most common explanations for why something in experiment A didn't replicate in experiment B is that, well, something must have changed. "Okay, well, what was it that changed? Why did that matter? And if that is indeed the factor that matters, what does that tell us about our understanding of what's going on there?" In the Reproducibility Project in psychology, in one case, an excuse was made that the original study was done in Florida and the replication was done in Virginia, or something like that, and it's just a different college town that's laid out a little differently.

SPENCER: No, man, [laughs] if psychology is that brittle, we're screwed.


STUART: If you can't generalize from Florida to Virginia, or from one time to another, then that really calls into question, "How reliably can you generalize your science?" But it also calls into question, "Well, how much do you really understand about why certain psychological effects take place or not?"

SPENCER: Some people might say, "Well, maybe it's just social science that's messed up and has this problem." Do you want to talk a little bit about the sort of biological results as well?

STUART: There's something that I mentioned to you once before, which is that even biology experiments, which social scientists might look at and say, "Well, that's hard science, maybe that holds up better," don't necessarily hold up better. One example: Marcia McNutt, who was the Editor-in-Chief of Science and who's now the president of the National Academy of Sciences, is fond of telling a story at conferences, when she's talking about reproducibility, about some lab that was studying mice. I'm not sure what it was about; she doesn't really say. There was a study in mice in a lab in Baltimore, and they got a particular result. Then other labs, maybe in Ohio or somewhere else like that, tried to replicate it, and they found a different result. Then another lab in Baltimore, probably at Johns Hopkins, did the study again and found the original result. She jokes at conferences that people in the field were starting to wonder, "Is there some sort of Baltimore effect on this question?" But then, through collaborating and talking to each other, they ended up finding that the wood shavings at the bottom of the cages that the mice or rats were kept in were a different type of wood in Baltimore than in the other place; maybe one was cedar and one was some other kind. It turns out that affected the way the mice or rats reacted. So there are these kinds of subtle differences. There was another case that I mentioned to you earlier that involves a study of breast cancer cells. A scientist wrote in (I think it was Nature, or one of the Nature journals) about an attempt to collaborate with another lab across the country on this study of breast cancer cells. They kept getting different results, and they could not figure out why. This went on for literally like a year. They tried to make sure that everything about the experiments was the same.
And they were just basically beating their heads against the wall, trying to figure out what was going on. Finally, when they traveled back and forth and tried to observe what exactly was going on in each lab, they found out that at one particular point in the experiment (and look, I'm not a molecular biologist; if any of your listeners are, they're going to be able to tell that I don't really know what I'm talking about here), they were processing the tissue by mixing it with something called a collagenase digest, to try to separate the cells from the tissue. At one of the labs, the one in Berkeley, the tissues were digested in tubes that had half the concentration and were stirred on a platform at a very slow speed for a long time, 24 hours. Whereas in Boston, where the original team was, the tissue was stirred with this collagenase mixture, whatever that is, at a higher rate for about six to eight hours. So that was the one difference between the way these two labs were doing the experiments, just one particular technical point: they had to stir something with a solution, and one lab stirred faster for a shorter period of time, while the other stirred more slowly for a longer period of time. That was it. Once they fixed that, they got the same results. But to go back to what I was saying, the fact that such a technical, one might think minute, difference in the experiment in fact has a dramatic effect on the outputs is something that I think needs to be built back into our scientists' theoretical understanding of how the world works. It's hard to produce generalizable knowledge if everything you do can suddenly be thrown off by something you didn't even anticipate or know was important, so you need to broaden your theoretical understanding and explanation of what is important to experiments and why.
But that feeds back into the reproducibility problem, too, because if the excuse for every result that doesn't replicate is, "Well, it must have been one of these unknown factors," then (I mean, it could just be an excuse) at best, it means that we just don't understand what we're doing here. There could be any number of unknown factors that we've never measured and never thought of that make everything unpredictable. That's not very comforting as an excuse, when you think about it.

SPENCER: It also speaks to the great importance of material sharing. When a lab publishes a paper, often there's not enough detail in the paper to fully redo their experiment. You can try to approximate their experiment, but at the end of the day, if it's a biological study, you don't know exactly how long they put that thing in the machine, and if you put it in for a different amount of time, you might get a different result. In psychology, you often find that you can't find the exact protocol. So you don't know, "Well, what wording did they use? How exactly did they present this information?" You just get this coarse approximation of it.

STUART: Yeah, I think fully disclosing the methods is super important. It is difficult because, again, I think sometimes what's important about the methods is something that people didn't even think of until the rubber hits the road and they realize, "Oh, something doesn't match here. What now? We need to figure it out." Obviously, in any case, I think methods sections could be really strengthened and made more detailed. I'm just saying I don't know that that would fix everything, because sometimes the biggest methods problems turn up unexpectedly. It's something that no one thought was important until, all of a sudden, it was.

SPENCER: I feel like there are two different issues here that can be easily conflated. One is that reality can just be really complicated. It can actually matter how long you stir a solution or something like that, and that can actually change your result. If you're trying to develop a drug, you might actually need to know how many minutes to stir a solution for, things like that. That's sort of an irreducible complexity of reality that makes doing science harder, and we just have to account for it. On the other hand, there are kind of bullshit excuses where it's like, "Well, maybe that's because in Florida, psychology works differently," or something like that, which I don't really buy at all. So sometimes the world truly is really complex, but sometimes scientists might hide behind that complexity and blame what are actually just bad research processes on the inherent complexity of everything.

STUART: Yeah, I think that definitely can occur. Absolutely, the world is very complex. But I think that, at the least, a scientist whose research fails to be replicated should start questioning, "Okay, how much do we really know about this subject?" and should approach it with fewer grandiose pronouncements (especially in certain fields of psychology) that we now know how humans think about X, Y, or Z. You studied a class of undergrads at your university, so you know how that class of undergrads thought. Maybe you don't even know that, because maybe your research techniques aren't very good. But to generalize from that to all of humanity? Maybe we should be a little less grandiose about those types of claims in the first place. Then we wouldn't have to back off later and say, "Well, actually, I guess it is more complicated, and people do think differently, or have different cultural expectations, or whatever it is that might influence how they psychologically respond to something."

SPENCER: I think the generalizability issue is a really big one, where you get a result on a particular population with a particular intervention. Then you draw a generalization from it, and you say, "Ah, well, now I've learned this general principle, and that means I could go apply it to any population, and even use it in a very different format," and that doesn't necessarily hold. One way that I've tried to deal with this in my own work is by actually building tools and studying them on populations that are very similar to the real populations that are going to use them, in the exact format in which they'll be used. If we build a digital intervention for anxiety, for example, like our Mind Ease, we can go study it on a population that's very much like the real users, and it's the exact same intervention. So instead of solving the generalizability problem, we're kind of skirting it, so that we don't actually need to solve it.

STUART: That's interesting. So you're saying that just changing how you sample your experimental subjects kind of circumvents the generalizability problem?

SPENCER: Yeah, because if we can study an intervention on people who are very similar to the people we're actually using it on, and in the exact same format that we're going to use, then all we need to know is that it [laughs] works on the population we studied. We don't necessarily need to know that it works for anyone in the world when administered through any format.

STUART: Yeah, that's interesting. I wonder how broadly that sort of idea would be applicable or useful across different fields, let's say education or international development. In international development, when you do a study, whether it's of cash transfers, or bed nets for malaria, etc., you're not dealing with a random sample of the world, or even a random sample of developing countries. It's typically done in some particular context. So a basic question is: does a study from Kenya apply to Rwanda? Or does a study from India apply to Bangladesh? I think those questions are always going to be with us.

SPENCER: I'm really interested in this idea of integrating lightweight but relatively high-quality studies into doing other things. It's not always logistically easy; sometimes it might be really logistically difficult. But imagine that you're going to do cash transfers in a particular country, and you can actually weave a lightweight randomized controlled trial into the process, for example by adding some randomization into the normal process of giving out that money. Suddenly, you're collecting hopefully high-quality data on that exact population to see whether your intervention is actually working, rather than doing a study on one population with some amount of money and then hoping it generalizes to this other area with a different amount of money and different sorts of people.

STUART: So what you're saying is to try to do an RCT basically everywhere?

SPENCER: Weave them into whatever your general processes are. We use this concept in some of our work, where we'll just have a never-ending randomized controlled trial. Whenever someone uses the product, we might do a randomization, and then we're just collecting follow-up data. So it's not like, "Oh, you go do the study once and then you generalize from that." It's like, "No, your intervention itself is also a study simultaneously."

STUART: So would this mean things like program improvement, so to speak? I assume you're not talking about randomizing the program (or in your case maybe an app or something) against nothing, or against a placebo, or against some other form of treatment that probably doesn't work but might kind of be the equivalent of a placebo. It's hard to use placebos in the real world. But instead, are you talking about offering one variation of the program versus another, just to see if it can be improved?

SPENCER: Right. So early on, you can often do it with a placebo, as well. But you might have to recruit, get people to consent to say, "Okay, you might be getting a placebo." So an example of that is, if you were doing cash transfers, and you only have enough money to give it to, let's say, 1,000 people, you could enroll 2,000 people, but tell them, "There's only a 50% chance you're gonna get this payment", right? And then after you do that, if you have shown that in that population you're getting good results, then you can move to the other type, where now you have different versions of your intervention. Now, maybe half the people get this amount of money and half get that amount of money, and now you're starting to study more nuanced questions, or half of them get money plus an educational intervention, and half of them just get the money, and so on.
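
The lottery Spencer describes (enroll 2,000 people when there is money for 1,000, so each person has a 50% chance of payment) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `assign_by_lottery`, the integer participant IDs, and the fixed seed are all hypothetical and do not come from the conversation.

```python
import random


def assign_by_lottery(participant_ids, n_treated, seed=None):
    """Randomly choose n_treated participants to receive the payment.

    Everyone else becomes the control group. Hypothetical helper,
    written only to illustrate the enrollment lottery described above.
    """
    ids = list(participant_ids)
    if n_treated > len(ids):
        raise ValueError("cannot treat more people than were enrolled")
    rng = random.Random(seed)  # a seeded generator makes the draw reproducible
    treated = set(rng.sample(ids, n_treated))
    return {pid: ("treatment" if pid in treated else "control") for pid in ids}


# Enroll 2,000 people when there is only enough money for 1,000 of them.
assignment = assign_by_lottery(range(2000), n_treated=1000, seed=42)
n_treatment = sum(1 for arm in assignment.values() if arm == "treatment")
n_control = len(assignment) - n_treatment
print(n_treatment, n_control)
```

Drawing exactly 1,000 names, rather than flipping a coin for each person, keeps the payments within budget while still giving every enrollee the same 50% chance. The later stage Spencer mentions (two payment amounts, or money with and without an educational component) would use the same draw with different arm labels.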

STUART: Right. Or I mean, it could matter that maybe some people get it once a month and some people get it split up over four weeks; maybe people have budget constraints that are better served by one or the other. There's all sorts of questions about program improvement that matter. But I'm still wondering, "How does that tie back into the generalizability question?" Because if you do that study in, let's say, Kenya, and you start doing all sorts of program improvement type questions, you might end up with a formulation of cash transfers that works really well for the people that you're giving it to in Kenya. But then if you want to do it in Nigeria, don't you still have that same question? In fact, maybe it's even worse, in a way. I don't know, I'm just thinking out loud. Maybe it's even worse, because now you've come up with a version of the program that is very specifically tailored to one group of people, or one population in Kenya in this example. And maybe now it's going to be even more unlikely that it generalizes to somebody somewhere else. Because instead of just the basic, bare-minimum, lowest-common-denominator version of the program that probably works a little bit everywhere, maybe now you've got a highly tailored, highly specific program that just is a mismatch somewhere else. I don't know. I'm just making it up. But couldn't that be the case?

SPENCER: Well, the idea is that you don't need to generalize. If your intervention is designed to target a specific population, you're making sure it works in that population. And then if you were going to move it to another area, ideally, you're doing exactly the same thing. You're weaving in some form of randomization, to also be doing a trial there. The ideas of running a study and of delivering the intervention are actually intertwined. And so you're always running a study whenever you're giving an intervention.

STUART: I mean, yeah, if you introduce any program or intervention or policy in a new place by saying, "We're going to start out with a kind of baseline RCT to see whether it works or not here, and then if it does, move forward with the other, more improvement type questions," then yeah, I agree, I guess you're avoiding ever needing to generalize in that sense. But I guess the cost of it is you have to have the capacity and willingness to keep doing RCTs forever.

SPENCER: That's why the key aspect there is: can you make it lightweight enough and just integrated into the deployment of the intervention itself? So rather than thinking of it as this two-stage process, where you go run this really expensive RCT to see if that thing works, and then you roll it out, it's like, no, rolling it out and running the RCT are so intertwined that it's just kind of baked into the process. So I'm really interested in that approach. Because I think it's an interesting way to solve the generalizability problem, and also to iterate to make interventions better and better. So they're not this static thing. They actually keep improving, because you can effectively be continually A/B testing, as occurs with sort of any software product.
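
One way to picture a trial that is "baked into" deployment is a small object that randomizes each person as they arrive and keeps a running comparison of outcomes. The Python sketch below is an assumption-laden illustration, not a description of any tool Spencer's team uses: the class name `EmbeddedTrial`, the arm labels, and the numeric outcome scores are all hypothetical.

```python
import random
from statistics import mean


class EmbeddedTrial:
    """Randomize each new participant and track outcomes per arm.

    Hypothetical sketch: delivery of the intervention and data
    collection happen in the same step, so the trial never "ends".
    """

    def __init__(self, arms=("current", "variant"), seed=None):
        self.arms = tuple(arms)
        self._rng = random.Random(seed)
        self.outcomes = {arm: [] for arm in self.arms}

    def assign(self):
        """Pick an arm uniformly at random for the next participant."""
        return self._rng.choice(self.arms)

    def record(self, arm, outcome):
        """Store a follow-up outcome (any numeric score) for an arm."""
        self.outcomes[arm].append(float(outcome))

    def difference_in_means(self, arm_a, arm_b):
        """Return mean(arm_a) - mean(arm_b), or None if either arm is empty."""
        a, b = self.outcomes[arm_a], self.outcomes[arm_b]
        if not a or not b:
            return None
        return mean(a) - mean(b)


trial = EmbeddedTrial(seed=0)
arm = trial.assign()            # decide which version this person receives
trial.record(arm, outcome=7.0)  # later, store their follow-up score
```

The sketch only tracks a raw difference in means. Deciding when that difference is trustworthy (sample size, repeated looks at the data) is a separate statistical question that it does not address.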

STUART: Right. Well, look, I mean, I think what you're saying makes a tremendous amount of sense, because one of the big problems is that generalizability is really hard. [laughs] Do you know Eva Vivalt?

SPENCER: I do. Yeah, she's written some really cool stuff about generalizability.

STUART: Yeah, well, she's analyzed several hundred papers on a bunch of different types of interventions in international development. And her findings, basically, to be simplistic, are that, as I recall, if you know one RCT on a particular type of program, and you want to try to predict what the findings of the next randomly picked RCT are going to be, you might as well be flipping a coin, in that knowing any one particular study's results doesn't really give you much information about the next study's results. Because sometimes results that are significant in one place are insignificant in another place (not to draw a bright line), but sometimes the sign flips. Sometimes the program that works in one place turns out to have a negative impact somewhere else. So if all you knew was one of those studies, you'd have a hard time generalizing to a different place, and actually figuring out what might happen. That just speaks to the need to continually test and improve in all those places, rather than just taking knowledge from somewhere else, and hoping it works.

SPENCER: I once tried to make a list of all the reasons that studies don't tend to generalize. And it's quite fascinating. There really are so many different reasons. I think I came up with 12 of them [laughs] for why they can fail to generalize, ranging from, like, "Oh, the original result was just noise, you know, it just wasn't even real in the first place." Two, you can have a quality issue. Maybe the study tests a really high quality version, but then when it gets rolled out somewhere else, it's much lower quality, because as you touched on very briefly before, a bad implementation of any intervention will always fail. And I think that people don't always recognize this. They're like, "Oh, you know, someone tested meditation for anxiety, and it didn't work. So meditation, for example, doesn't work." Well, okay. But a study of a bad version of anything will find no effect. It can be hard to generalize that way. Another one is that often the dosage differs. So let's say it's an educational intervention. Maybe the study did a two-hour intervention, and then when they rolled it out, it was half an hour. Okay, well, that's a big difference. Or maybe it was two days, and then they made it into a three-hour thing. So there's just so many of these different issues. And then of course, there are cultural factors, right? There's certain things that just wouldn't fly at all if you try to introduce them. Let's say you have some kind of condom use intervention that you're trying to put in an area where people are actually morally opposed to using condoms. Like, "Okay, that's probably gonna be a problem".

STUART: Yeah, absolutely. There's so many reasons. And I feel like that's something where there's a growing area of literature, but I think that it's something that deserves a lot more study to understand when and why something might generalize or not. Because I mean, as much as I would love it if you could do RCTs everywhere, on everything, there's going to be a lot of places where you can't, where there isn't the political will, or there isn't the capacity to do it. I mean, there's just a lot of reasons. So I think it'd be good if we had a more nuanced understanding of when, and how reliably, we can take knowledge from one place and use it somewhere else.

SPENCER: And I feel like we might have let some other fields off the hook a little bit too much. Because I really think that these kinds of issues of reproducibility cut across many areas of science. And I remember from an article you wrote, which was really interesting, called Escaping Science’s Paradoxes, you talked about Amgen and Bayer. Do you remember the result there?

STUART: Yeah, absolutely. Those are pharmaceutical companies that eight or 10 years ago both published articles in which they claimed that when they tried to reproduce academic experiments in their own labs, hoping to find that they worked, most of the time they wouldn't replicate. I mean, these are pharmaceutical companies that are not out to get anybody at this stage of their research. It's not like they're trying to be the gotcha police. They're hoping something works so that they can take it further and develop it into a drug. And most of the time, they said it wouldn't replicate. And there's nothing they could do to get it to work.

SPENCER: I think it was, like 70% or 80% of the time, they couldn't get the result to work in their lab.

STUART: I've happened to meet folks from other pharmaceutical companies as well, such as Pfizer, who say the same thing. This is anecdotal, but they'll say, "Yeah, about two thirds of the time, when we try to replicate an academic experiment, we just can't get it to work". And again, it could be that there's always some mysterious factor, like what kind of wood shavings are in the bottom of the cage of the mice that you're studying. It could be that the rate of stirring matters. I mean, it could be. But again, if the academic literature in those areas is so full of those unknown hidden factors that no one understands or can predict when they'll matter and when they won't, then it's really hard to have a kind of progressive adding to knowledge in a way that actually benefits humanity, rather than just adding to someone's list of publications. I mean, what we should want is not just more publications. What we want is more publications, and particularly in that area (cell biology, etc.), replications that lead to greater understanding of how the human body works, and ultimately greater ability to prevent diseases or cure diseases or address aging and issues like that. And you can't do that if the whole enterprise is so infected by either bad research practices or hidden factors that no one can identify, except in the abstract, in retrospect, when something doesn't replicate.

SPENCER: Well said. Stuart, thanks so much for coming on. This is really fun.

STUART: Yeah, absolutely. Glad to do it.



Host / Director
Spencer Greenberg

Josh Castle

Audio Engineer
Ryan Kessler

Uri Bram

Janaisa Baril

Lee Rosevere
Josh Woodward
Broke for Free
Quiet Music for Tiny Robots

Please note that Clearer Thinking , Mind Ease , and UpLift are all affiliated with this podcast.