October 22, 2022
How can we change the way we think about expertise (or the trustworthiness of any information source) using forecasting? How do prediction markets work? How can we use prediction markets in our everyday lives? Are prediction markets more trustworthy than large or respectable news outlets? How long does it take to sharpen one's prediction skills? In (e.g.) presidential elections, we know that the winner will be one person from a very small list of people; but how can we reasonably make predictions in cases where the outcomes aren't obviously multiple-choice (e.g., predicting when artificial general intelligence will be created)? How can we move from the world we have now to a world in which people think more quantitatively and make much better predictions? What scoring rules should we use to keep track of our predictions and update accordingly?
Peter Wildeford is the co-CEO of Rethink Priorities, where he aims to scalably employ a large number of well-qualified researchers to work on the world's most important problems. Prior to running Rethink Priorities, he was a data scientist in industry for five years at DataRobot, Avant, Clearcover, and other companies. He is also recognized as a Top 50 Forecaster on Metaculus (international forecasting competition) and has a Triple Master Rank on Kaggle (international data science competition) with top 1% performance in five different competitions. Follow him on Twitter at @peterwildeford.
JOSH: Hello, and welcome to Clearer Thinking with Spencer Greenberg, the podcast about ideas that matter. I'm Josh Castle, the producer of the podcast, and I'm so glad you've joined us today. In this episode, Spencer speaks with Peter Wildeford about expertise, prediction markets, and precision and forecasting.
SPENCER: Peter, welcome.
PETER: Yeah, thank you. It's good to be on the podcast.
SPENCER: Yeah, it's great to have you on. So I know that you have this idea that we should be thinking about expertise differently, that we should be structuring how we decide who [sic] to trust and how much to trust them differently. So why don't we start there? Want to jump into that?
PETER: Yeah, definitely. Let's go.
SPENCER: So how can we change the way we think about expertise, using the idea of forecasting?
PETER: Yeah, so my idea, basically, is that when deciding who we should trust about a topic, rather than using the traditional markers of prestige, like who went to a fancy university, or who was on the television on CNN, or who does talks about this, that we should mainly use people's prior track record. Basically, who [sic] would you trust more? Someone who went to Harvard but the last five things they said were completely wrong? Or someone who didn't go to college at all, but when you look at the last five things they said, they made a bunch of predictions, and every prediction came completely right. And maybe they're also pretty wealthy, because they do really well buying stocks or something. So they have this strong track record, even though they don't have the same level of prestige because they didn't go to college at all, and the other guy went to Harvard.
SPENCER: So how broad do you consider this idea? What are the different domains where you would want people making forecasts, and then kind of tracking the track records of them?
PETER: I think I would want this done in basically every domain that we could do. At least I'd want more of it than we have now. I'm not necessarily saying that we should fire all the doctors at the hospitals and then replace them with forecasters or something. But I think when people are opining about politics or geopolitical events, or making important decisions, we really should give more voice to people who have a track record of making good decisions in the past and are thinking kind of quantitatively about things, and articulating precisely about what they mean, and have a clear demonstrated track record of doing well with the things they say.
SPENCER: Is there an area you see this being done already, that we can use as a model?
PETER: I think the best area that this is being done well right now is thinking about COVID and how to respond to COVID. There's definitely a lot of misinformation out there on the internet. But there's also a lot of bad information coming from traditional sources that I would have normally expected people to trust. And I think that lay people that don't have epidemiological background, that haven't gone to fancy colleges to get fancy degrees for studying this stuff, are applying more typical forecasting techniques, and seem to be outperforming the experts when it comes to making clear and accurate statements about what's coming up next in our pandemic, and how we can best respond to it to keep ourselves safe.
SPENCER: Are these forecasts being aggregated somewhere or just sort of more informal that you're tracking them?
PETER: I think some people check them informally. But one place I'm really excited about is Metaculus.com. That's M-E-T-A-C-U-L-U-S, Metaculus. They’re a website, where there's a bunch of people right now, making a ton of predictions about everything – COVID included, but everything from politics to sports to culture to who's gonna get a Grammy – and people have live track records. And (Metaculus) aggregates people's track records and comes to overall conclusions about what's going to happen. And I think it's a really great source of information to figure out what might be happening in the future, better than any newspaper I've ever read.
SPENCER: How would you compare that to prediction markets? And could you explain, for the listener, what a prediction market is?
PETER: A prediction market – you might see, like predictit.org – there are websites where you can go there, and you can bet real money on events, and you're basically buying a binary option, where you own this piece of paper (it's not a real piece of paper, of course, but) it will become worth $1 if the thing you predicted actually happens, and it becomes worthless if the thing you predicted doesn't happen. And then you buy this at a fraction of $1. So maybe you would buy it at 20 cents or something. And so if you're buying something at 20 cents that is going to be worth $1 if it happens and worth zero if it doesn't, you're basically implying that you think there's 20% odds on that happening. And then prediction markets just aggregate a bunch of information like this through the buying and selling of these options, and then you're directly financially incentivized to be more accurate than the market. And that's really cool, because then that pushes the market towards more accuracy. And the people that do poorly, they lose all their money, and then they don't bet anymore. And the people that do really well gain a ton of money and bet a lot more, and you just get to more accuracy over time.
SPENCER: So just to give an example, before the recent US presidential elections, there were some of these markets where you could bet on, will Trump win the presidency? And basically, that'd be trading at some price, like maybe trading at (I don't know) 50 cents or something. And then you could say, well, if you think there's more than 50% probability that he will win, you could buy that option. Or if you think that he's more likely to lose, I think you can short that option to bet against it. And then you make money based on that.
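The arithmetic behind this exchange can be written down in a few lines. The following is a minimal Python sketch, not anything from the episode or from a real exchange's API: the function names are made up for illustration, fees are ignored, and "shorting" is modeled as buying the opposite side at one dollar minus the price, which is an assumption about how a given market handles it.

```python
def implied_probability(price_cents: float) -> float:
    """Read the price of a $1 binary contract (in cents) as a probability."""
    if not 0 <= price_cents <= 100:
        raise ValueError("price must be between 0 and 100 cents")
    return price_cents / 100.0


def expected_profit_yes(price_cents: float, my_probability: float) -> float:
    """Expected profit in dollars from buying one 'yes' contract, ignoring fees."""
    price = price_cents / 100.0
    # Win: the contract pays $1. Lose: it pays nothing. Either way we paid `price`.
    return my_probability * 1.0 - price


def expected_profit_no(price_cents: float, my_probability: float) -> float:
    """Expected profit in dollars from betting against the event, ignoring fees.

    Assumption: betting against costs (1 - price) and pays $1 if the event fails.
    """
    price = price_cents / 100.0
    return (1.0 - my_probability) * 1.0 - (1.0 - price)


if __name__ == "__main__":
    # A contract trading at 50 cents when you believe the true chance is 60%.
    print(implied_probability(50))
    print(expected_profit_yes(50, 0.60))
    print(expected_profit_no(50, 0.60))
```

Under these assumptions, buying is attractive exactly when your own probability is above the market price, and betting against is attractive when it is below.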
PETER: Exactly. So that's a good way to make money with prediction markets. And then Metaculus that I was talking about earlier, it's not really a prediction market. You're not betting money. You're just instead making predictions straight up. And I guess you're sort of betting your reputation. Because if you're wrong, it's kind of a visible track record, but you're not actually losing cash. And so there's definitely different pros and cons to each. But I think they both do a really good job of aggregating information and giving you overall likelihoods on outcomes.
SPENCER: Yeah, I think it’s worth talking about the pros and cons, for a second, of prediction markets versus this other method. Does the other method have a name that Metaculus uses?
PETER: I don't think it has a name. I've been calling it reputation markets.
SPENCER: Okay. But you’re basically logging your prediction for essentially internet points, right? If you do well, points on a system. And if you don't do well, you lose points, but there's no money on the line, just your reputation. And you might say, “Well, that's kind of silly. Why do people care about points?” Well, one of the cool things about it is it means that they can produce better predictions by weighting the predictions differently based on how good the person's track record is. So someone who has a really good track record of making good predictions counts more; someone who has worse predictions counts less. And so okay, maybe some people don't care about the points. But that's actually fine because the people that do care about that are the ones that are motivated by them and work really hard to make good predictions. They're going to count more.
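The idea of counting good forecasters more can be illustrated with a weighted average. This is a hedged sketch only: the weights, the function name, and the use of a weighted mean are assumptions for illustration, not a description of the algorithm Metaculus actually runs.

```python
def weighted_forecast(predictions, weights):
    """Combine probability forecasts, counting each in proportion to its weight.

    predictions: probabilities in [0, 1], one per forecaster.
    weights: non-negative numbers standing in for each forecaster's track record
             (hypothetical; how to derive them is left open here).
    """
    if len(predictions) != len(weights):
        raise ValueError("need exactly one weight per prediction")
    if any(w < 0 for w in weights):
        raise ValueError("weights must be non-negative")
    total = sum(weights)
    if total <= 0:
        raise ValueError("at least one weight must be positive")
    return sum(p * w for p, w in zip(predictions, weights)) / total


if __name__ == "__main__":
    predictions = [0.9, 0.2, 0.3]
    # A newcomer with no track record gets weight 0; two proven forecasters get weight 1.
    print(weighted_forecast(predictions, [0, 1, 1]))  # 0.25
    # Equal weights reduce to the plain average.
    print(weighted_forecast(predictions, [1, 1, 1]))
```

Setting a weight to zero removes that forecaster's influence entirely, which is the most extreme form of discounting people who have no track record.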
PETER: Yeah, definitely. And there's the leaderboard, where you can see who are the 100 best predictors of all time. Not to brag or anything, but I'm currently clocking in at number 40. I feel pretty good about that. And so basically, if you have this clear track record, these are basically the people I want to be listening to more so than the people in the news or the latest Axios article or something like that.
SPENCER: So let's talk a bit about the pros and cons of these different systems. As I understand it, one of the challenges with prediction markets is that you need enough liquidity or enough trading volume. If I'm gonna go buy an option that says a certain person is going to win an election, there has to be someone, a counterparty on the other side, who's gonna take the other side of that bet. Now, some approaches try to get around this, like you can have automated market makers and things like that. But this does really present a problem because it means that if there aren't enough people interested in making that exact bet on that exact platform, then you just can't make the trade.
PETER: Yeah, that's definitely an issue with prediction markets. You definitely really want there to be a very liquid market where you can buy a lot of shares at a lot of different prices. And sometimes in markets that are really thin, where there's not that much liquidity, prices can bounce around a lot. And also, there's really kind of limited opportunities to make a profit, which then limits the financial incentive of being right.
SPENCER: Furthermore, it doesn't do the aggregation we want, because you really want the price to represent the aggregate view of lots and lots of bettors. But if it's just a small number of bettors, then it's actually just the opinion of a few people, right?
PETER: Yeah, definitely.
SPENCER: So I think historically, this has been a big problem for prediction markets, because it means that the things you can bet on are kind of boring. You can bet on maybe big sports games, and you can bet on who's going to win major elections. But with Metaculus, you have a much wider range of things being predicted because you can lodge a prediction, regardless of how many other people are predicting. You don't need someone on the other side of that transaction.
PETER: That's definitely right. Prediction markets have tended to orient towards more flashy headlines, mainly political ones, though I have been seeing some more interesting markets. We've talked about PredictIt, but there's also Polymarket, which is kind of a crypto-based prediction market, though you still trade real dollars; you don't trade Bitcoin or anything like that. And they've been diversifying some of their markets. And then there's this new prediction market, Kalshi, K-A-L-S-H-I, that launched in the middle of 2021. And they've been adding a lot more diverse markets, not just about politics, but also about economics and economic indicators, and various others, like climate and weather, and world events and things. But yeah, I definitely think Metaculus has the largest diversity of topics because they're not limited by having to make a market.
SPENCER: Right. Another kind of advantage or disadvantage of these different systems is the ability to bet on things that won't necessarily resolve, or that won't resolve for a long time. Let's say there was a bet you wanted to make about technology in 10 years, like how fast will AI technology advance? It's very hard to do that in a prediction market, because 10 years from now, if you think about how right you need to be, in order to make it worthwhile to lock your money up for 10 years in that bet, you have to have an incredibly huge payout to make it worth it. Whereas with something like Metaculus, they can throw on things that aren't going to resolve for 10 years. And they can even put on things that nobody will ever know for sure. But we can use kind of the track record historically of the bettors to try to make an inference about what might be true even though we will never know for sure.
PETER: Yeah, I definitely like that. I definitely personally value long term predictions a lot because I want to know how the state of the world will unfold over my lifetime and after my lifetime. And PredictIt, Polymarket, Kalshi, they tend to only really be useful on a one- to two-year time horizon because of that annual rate of return issue you were mentioning.
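The rate-of-return issue can be made concrete with a small calculation. This is a minimal sketch under simplifying assumptions that are not from the episode: the contract is certain to pay out, there are no fees, and the function name is invented for illustration.

```python
def annualized_return(price: float, years: float) -> float:
    """Annual return from paying `price` now for a contract that pays $1 after `years`.

    Assumes the contract does pay out; the risk of being wrong is ignored.
    """
    if not 0 < price < 1:
        raise ValueError("price must be strictly between 0 and 1 dollars")
    if years <= 0:
        raise ValueError("years must be positive")
    # Total growth factor is 1 / price; spread it evenly across the holding period.
    return (1.0 / price) ** (1.0 / years) - 1.0


if __name__ == "__main__":
    for horizon in (1, 2, 10):
        rate = annualized_return(0.80, horizon)
        print(f"{horizon:>2} years: {rate:.1%} per year")
```

The same 80-cent contract that returns 25% when it resolves in one year returns a little over 2% a year when the money is locked up for ten, which is the effect both speakers are describing.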
SPENCER: Another difference is transaction costs. In order to trade on a prediction market, you have to actually give some fees generally to the prediction market; that's how they make money. And so what that means is you can get in a weird situation where you think the price of an asset on the market is wrong. In other words, you think people are misestimating the probability according to the current market price, but it's actually not worth it to trade it because if you take the amount of fees that you'd incur to make that trade, it's like you're not gonna make enough money.
PETER: Yeah, I found that to be a particular issue, like betting on PredictIt on outcomes for the 2024 political elections. I think that a lot of the prices seemed skewed by my intuitions, but it's just not worth it to bet on it after the fees and after, mainly, the long time horizon.
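A short sketch shows how fees can turn a trade that looks profitable into one that is not. The fee structure here (a fraction taken from winnings only) and the 10% rate in the example are hypothetical parameters chosen for illustration, not a statement of any particular platform's fee schedule.

```python
def expected_profit_after_fees(price: float, my_probability: float,
                               profit_fee_rate: float) -> float:
    """Expected profit in dollars from buying one $1 contract at `price`.

    profit_fee_rate: hypothetical fee, charged as a fraction of the profit
                     on winning contracts only.
    """
    win_profit = (1.0 - price) * (1.0 - profit_fee_rate)  # profit kept after the fee
    loss = price                                          # the whole stake is lost
    return my_probability * win_profit - (1.0 - my_probability) * loss


if __name__ == "__main__":
    # The market says 50%, you believe 52%.
    print(expected_profit_after_fees(0.50, 0.52, profit_fee_rate=0.0))   # positive
    print(expected_profit_after_fees(0.50, 0.52, profit_fee_rate=0.10))  # negative
```

With no fee, the small edge has positive expected value; with the assumed 10% fee on winnings, the same trade has negative expected value, so the mispricing goes uncorrected.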
SPENCER: So we talked about some disadvantages of prediction markets. What do you think the disadvantages of these reputation markets like Metaculus are?
PETER: I guess the core advantage of a prediction market that Metaculus doesn't have is, in a prediction market, you're actually betting on things that matter, like you have a direct stake in the outcome, a direct financial stake. Whereas in Metaculus, if you make a prediction, and it goes wrong, you can just kind of quietly disappear, and nothing. You don't really face any consequences. So I think in prediction markets, everyone has skin in the game. And so then you know that they're taking the outcome a lot more seriously. And also, you know that resources are being directly distributed from bad predictors to good predictors. Whereas on Metaculus, you don't have that skin in the game phenomenon. I think like (Nassim Nicholas) Taleb had the book “Skin in the Game” where he talks about the importance of being able to wash out bad predictions, by making people kind of go away if they don't predict well, and there's less of an effect of that on Metaculus than there is when you're directly losing money. I think one big problem I saw on Metaculus once was this question about something about Elon Musk. I forget exactly what, but it somehow got really popular. And there were like thousands of new predictors piling on that question. And they were making really bad predictions because they weren't really used to predicting, and it was just really skewing the stated outcome. And those bad predictors don't really face too many consequences, though I suppose you could then just weight all of them to zero because they have no track record or something like that.
SPENCER: And I think it's important to note…doesn't Metaculus actually release two different forecasts? They release one that's sort of publicly available. And that's just sort of combining all the predictors, but then they have an internal more accurate one that does this weighting by historical track record and accuracy?
PETER: Yeah, definitely. That's important to know. So when you visit Metaculus.com, when there's open predictions right now (right now, I'm looking at the Metaculus website), I see one thing that says the average is 20%. But that's just the median of all the predictions on the website weighted toward recent predictions; it does not adjust for track record. Metaculus also has this second thing called the Metaculus forecast, which is their own algorithm that not only accounts for more recent predictions, but also weights the predictions by users’ track records, and that's been more accurate historically. But they don't release that information until after the question has closed and no longer takes more predictions, because they don't want to skew the predictions while the question is open, where people just kind of take the Metaculus prediction and then predict it, creating a negative cycle.
SPENCER: Right, you don't want the negative feedback loop.
SPENCER: So how can people use this in their own lives? Can you give me a couple examples where you wanted to know a thing that seemed important, and then how you use these systems?
PETER: Yeah, well, I guess at the time we're recording this podcast, right now, there's this Omicron variant of COVID. And maybe by the time people are listening to this podcast, we'll have a lot more information about Omicron. But right now, it's a fast evolving event. This variant looks pretty serious. And it looks like it might kind of damage some of the vaccines and other precautions that we have against COVID right now. And I kind of want to know, just basic questions about what kind of risk am I taking when I go outside? And what kind of precautions should I be taking? And how is that different from when it was three weeks ago when I never even heard of Omicron. And so I went on to Metaculus and I could see information about the likelihood of the Omicron variant coming to the United States where I live and displacing the current variant, and also the likelihood of it being more or less lethal than the Delta variant that's currently dominant. And also maybe questions about how well existing vaccines will work and things. And this can help me plan out my future very quantitatively — where if I read newspaper headlines, they're all so vague, and I don't know whether to be scared or whether to be calm; there's a lot of alarmism. Whereas when I can see a direct…like, “oh, this has a 20% chance of being more lethal.” That lets me kind of plan more accordingly.
SPENCER: So would you advocate when people have a question about the future, that they, first thing, go on to Metaculus, see if there's a forecasting tournament for it, and then before going to read a bunch of news stories, they just see what the market predicts for it?
PETER: Yeah, I think the typical news reader would probably do a lot better by reading what's on Metaculus than reading what's on the news. And sometimes the news just has outlandish things. I was reading a recent headline from a website (it was a political website, and it was trending on Twitter), and they basically said that Kamala Harris is too unpopular to run for president in 2024, and that they're gonna nominate her to the Supreme Court instead, according to some unsourced rumors in the White House. But I think these rumors are ridiculous and the chance of Kamala Harris being nominated to the Supreme Court is incredibly remote. And when you look at Axios’ track record, they've made a lot of crazy statements in the past, attributed to whispers. And I wouldn't trust a lot of the things that they say. Whereas, if you had a prediction market or a market on Metaculus about whether Kamala Harris would be appointed to the Supreme Court, you would get much more accurate predictions. And they probably would be much lower percentages, because people who are wrong actually face consequences and get washed out of the market.
SPENCER: Yeah, it’s a really nice example. I wish I could impress on people the extent to which news has misinformation. Because people talk a lot about fake news, and it is a problem in the sense that there are essentially fake news websites that just write complete nonsense. And their business model is just faking articles. But I actually think the bigger problem is misinformation in the trustworthy news. In other words, misinformation in the sources where people go and assume they can trust it. And I think the level of it is just much higher than people realize, on both left wing media and right wing media. Not to say that all news sources are equally valid, they're certainly not. But I think it's a much bigger problem on all news sources than is generally acknowledged. Do you have thoughts on that?
PETER: Yeah, I think I agree. It's definitely, really a complicated, nuanced picture to paint. But I guess, as smart, informed people, we want to be able to read the news and make informed predictions about things. There's definitely a lot of journalists that do a lot of really good work. But at the end of the day, news headlines end up vague, and it's hard to make conclusions about them. They also end up kind of misleading or wrong in many instances, and journalists that print information that ends up being incorrect (I can't emphasize this enough) face basically zero consequences. I mean, of course, if you print an outright lie, you would get fired. Also, there's definitely a lot of people trying to fact check claims and things, but a lot of pundits in their opinion columns and things can basically say whatever they want, and they still kind of continue to get listened to and paid no matter how wrong and wrong again they've been. Whereas if you were a forecaster on one of these prediction markets, you would go broke or, on Metaculus, your points would go to zero and no one would listen to you. Whereas if we were just trying to track the records of New York Times opinion columnists, I would think our current version of who [sic] we should listen to and who's most prestigious would go away overnight.
SPENCER: Right, because even within a single news organization, you can actually have quite a big difference in reliability.
PETER: Yeah, I guess there's a question of, “Should I listen to Paul Krugman or should I listen to some equivalent conservative economist?” I'd be really curious to see what concrete predictions Paul Krugman has made versus, let's say, Gregory Mankiw, as my respectable conservative economist. I'd love to compare their track records on how they think about the world and what kind of predictions they've made, and just see who's been more right. But there's just not enough information to do that. So I don't really know who [sic] I'm supposed to trust.
SPENCER: Well, I see one potential criticism of this idea that we should be tracking everyone's forecasts and measuring things quantitatively is that there can be wildly different difficulties in forecasting different things. And this can be a problem on Metaculus as well. There are probably (at least I’ve heard) ways on Metaculus to rack up lots of points by making relatively easy predictions. Whereas some people might be trying to predict super hard things and actually get fewer points. And similarly, if we were to track, let's say, pundits, some of the pundits are making relatively easy-to-make predictions and others are trying to do these really difficult ones. They may be more valuable. It could turn out to look like the person who's just doing first base hits is much better, when in reality, they're actually worse predictors.
PETER: Yeah, I think that is an important critique. I guess my response to that would be that I think, when people make predictions, they really ought to take into account the difficulty of what they're predicting. I want to clarify that forecasters aren't supposed to be correct every single time. They're supposed to make probabilistic statements like, “I think there's a 70% chance that this event will occur.” And maybe sometimes I say there's a 70% chance that X happens, and X doesn't happen. And, “Oh my gosh, I was wrong, because I said it was 70% likely to happen, and it didn't happen.” But that's not exactly how it should be. Instead, it should be that, when I make these statements that X happens 70% of the time, 70% of the time I make a statement like that, I should be right and 30% of the time I make a statement like that, I should be wrong. And so I want to track my calibration over time. Do the things I say happen 70% of the time actually happen 70% of the time? Because maybe they happen more, and I'm actually underconfident and I should be more confident in my claims. Or maybe they happen less, and I'm being overconfident. But I think when you're making a lot of claims over time, you can account for the difficulty by staying closer to 50-50 or something where you don't know something will happen. And, hopefully then, you can be correctly calibrated even if you don't really know what you're talking about. You also could just refrain from opining on things that you don't know or don't understand and just not make a prediction. And that also might be nice if we get people to stop making claims about things they don't understand and stop making confident predictions about things that are too difficult to actually predict and just admitting ignorance. I think that would be a really good thing.
SPENCER: Yeah, and if you're interested in becoming more calibrated or being a better predictor, on ClearerThinking.org, we have a project that we collaborated on with the Open Philanthropy Project, which allows you to practice your predictions and calibration. It's called Calibrate Your Judgment. So definitely check that out if you're interested in getting better at this.
PETER: Yeah, definitely. I really wish that all the pundits in the world that opine on things had taken the Clearer Thinking calibration quiz. And the next time they say that Kamala Harris is on the Supreme Court, they make an actual…“Oh, I think this is 80% likely, or I think this is 2% likely” or something like that. And then we can check that over time and see who's right and reward the people who are best calibrated.
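Checking calibration as described here amounts to grouping past forecasts by the stated probability and comparing each group with how often the events actually happened. The sketch below is one simple way to do that; the function name and the choice to round stated probabilities to the nearest 10% are assumptions made for illustration.

```python
from collections import defaultdict


def calibration_table(forecasts):
    """Group resolved forecasts by stated probability and report observed frequency.

    forecasts: iterable of (stated_probability, happened) pairs, where
               stated_probability is in [0, 1] and happened is True or False.
    Returns {bucket: (number_of_forecasts, observed_frequency)}.
    """
    buckets = defaultdict(list)
    for stated_probability, happened in forecasts:
        bucket = round(stated_probability * 10) / 10  # nearest 10% (an arbitrary choice)
        buckets[bucket].append(1 if happened else 0)
    return {
        bucket: (len(outcomes), sum(outcomes) / len(outcomes))
        for bucket, outcomes in sorted(buckets.items())
    }


if __name__ == "__main__":
    # Ten forecasts made at 70%: seven came true, three did not.
    history = [(0.7, True)] * 7 + [(0.7, False)] * 3
    for bucket, (count, frequency) in calibration_table(history).items():
        print(f"said {bucket:.0%}: {count} forecasts, happened {frequency:.0%} of the time")
```

A well-calibrated forecaster's observed frequencies sit close to the stated probabilities; frequencies above the stated value suggest underconfidence, and frequencies below it suggest overconfidence.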
SPENCER: So another critique one might have of this way of looking at things is that when you're making probabilistic predictions, it can be really easy to slip out of them. If I say there's a 70% chance of this, even if the thing doesn't happen, I can say, “Well, I said there was a 30% chance it wouldn't happen.” So, it might actually take quite a large number of predictions to start being able to say that one person is doing a bad job. I don't know how many predictions the typical opinion writer pundit actually makes per year, but you could imagine it actually could take quite a while to get enough data that the probabilities can't just be used as a free pass to slip out of any bad prediction.
PETER: Definitely. I think that's a fair critique. And I definitely would like people to start assembling a public track record over time. And I think if you're writing like an opinion piece a week, you can maybe have 50 opinion pieces a year and three predictions per opinion piece for 150 total predictions a year. And then over two years or so, you could start to assemble a good track record. So maybe when you're just starting out as a journalist, you can make predictions for the first year, and maybe after the first year, see how you did. And maybe just avoid all of that, and don't count it toward your track record, because you're just practicing. In my first year of forecasting, I was also very atrocious and made a ton of bad forecasts that I'd prefer everyone just forget existed. So just wipe the slate clean. And then in your second year, actually do something more serious now that you're more practiced, and start building a multi-year track record. I've been doing this on my Twitter account (twitter.com/peterwildeford), where I've been assembling predictions I've made over time. I'm not even a full-time journalist, and I've made 255 predictions since the start of the year of 2021. And at the time of the recording, 149 of those predictions have resolved and I know whether I'm right or wrong, and I've been able to make a calibration curve. And I can see that my calibration curve is pretty good. And then similarly on Metaculus, I’ve made maybe some 100 predictions, I think that's been enough to see how well I've been doing and see what kind of errors I make. I think it definitely takes maybe a year of practice, and then a year of assembling a real track record. It's not something you can just do overnight. But some of these people have been journalists for decades. And I think that there's definitely a lot of time in your career to be assembling a forecasting track record.
SPENCER: I would say this isn't just about knowing whom to trust; you also just become a better predictor the more you do it. As you were saying, it took you a little while to get the hang of it. Making predictions forces you to think in a different way. First of all, it forces you to think probabilistically, which I think is just a really useful mental tool. Instead of thinking in terms of, “this will happen or won’t,” you think, “there's a 70% chance”. It also grounds you in reality. It's very easy to claim something. But then if you're actually going to make a bet publicly, you put on a different hat, like, “Hmm, is this actually going to happen or do I just feel like I want to argue it's going to happen or I want it to happen?” You start distinguishing between these things that are sort of believing it's gonna happen but not quite, really wanting it to happen, or thinking it's a way of attacking the other side, or whatever.
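A rough way to see why a track record needs many predictions is to look at how noisy an observed hit rate is for a given sample size. The sketch below uses the standard error of a binomial proportion; it assumes the predictions are independent and all carry the same true probability, which real forecasts will not exactly satisfy, and the function name is invented for illustration.

```python
import math


def standard_error(true_probability: float, n_predictions: int) -> float:
    """Standard error of the observed hit rate over n independent predictions."""
    if not 0.0 <= true_probability <= 1.0:
        raise ValueError("true_probability must be between 0 and 1")
    if n_predictions <= 0:
        raise ValueError("n_predictions must be positive")
    return math.sqrt(true_probability * (1.0 - true_probability) / n_predictions)


if __name__ == "__main__":
    # 50 opinion pieces a year times 3 predictions each gives 150 a year.
    per_year = 50 * 3
    for n in (10, per_year, 2 * per_year):
        se = standard_error(0.70, n)
        print(f"{n:>3} predictions at 70%: hit rate typically off by about {se:.1%}")
```

With ten predictions the observed rate for a perfectly calibrated 70% forecaster can easily land near 55% or 85%, while a year or two of predictions at the pace described narrows that to a few percentage points.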
PETER: Yeah, I think that's definitely right. I think predictions really do force you to have clearer thinking. And there's just a lot of times when I'm in arguments and someone says something and I react to it, and then they’ll say, “Okay, but what do you actually predict would happen?” And then I realize I have to think and my track record is on the line here. And actually, many times I realize, like, “Actually, sorry, I don't really know enough about what I'm talking about to make an informed prediction that I want to actually go on my track record.” So I'm just going to respectfully bow out, and then people realize that I don't really know what I'm talking about and I've kind of been pushed out of the conversation. And I think that's really a good thing. I think we should be rewarding people for saying they don't know stuff, instead of people who just confidently proclaim things about things they don't understand.
SPENCER: I think that's an interesting question, because some people think it's actually better to always have a prediction, even if the prediction is, “I think there's a 55% chance of that”, as opposed to bowing out like that. There's a value in the mental maneuver of forcing a prediction.
PETER: I think forcing a prediction is almost always a good thing. But I think we should allow people to bow out because there's a lot of cognitive effort in making a prediction. And I don't necessarily want to force people to do that every time. But I think the important thing is when they bow out, they actually have to leave and you shouldn't be listening to what they're saying about that topic anymore. Whereas I think a lot of times people bow out and then continue to spout nonsense, even after they've bowed out. And that seems like kind of a faux pas in this prediction forecasting track record system of thinking that I'm advocating for.
SPENCER: I think another related interesting and difficult question is how to think about forecasts that aren't just among a fixed set of options. If you're thinking about some major election, for the most part, you know who the options are. But let's say you're predicting by what year a certain technology will occur. There's no obvious default prediction. If you have two candidates, you're trying to bet between them, you're gonna say, “Oh, 50-50 is my default.” I can be calibrated with no effort by just betting 50-50 every time. But if I'm (predicting) what year something will happen, you can't do that. There is no simple way to be calibrated or show no default prediction. So how do you think about those cases?
PETER: I think that can be difficult. I think Metaculus has a good solution where they let you forecast over a continuous range, and you build a probability curve where you assign probability to continuous units of time. Usually, the way that this works in a lot of places is to make time buckets. This is what the prediction markets do when we want to know when something will happen. It will be, “Does it happen between January and March of 2022? Does it happen between March and July of 2022? Or July through December of 2022? Or does it happen at some later date?” And there's this last bucket that’s kind of open-ended at the end. And even the Metaculus continuous predictions I was talking about always have a range between one point in time and another point in time. And then there's this last bucket you can assign probability to, where it's just later than the end, a kind of open-ended range. And I guess, if you're predicting buckets, you could then assign uniform probability to each bucket. But then I guess there's still a question of how much you want to assign to that open-ended end bucket there. So I guess that is kind of an unsolved problem.
SPENCER: It's kind of a fascinating, difficult mathematical problem. There are these ways of trying to resolve these kinds of questions with these kinds of priors that seem to contain less information. But it's kind of a thorny one. So I have a few different ways that I've tried to incorporate prediction in my life. Obviously, playing with our own Calibrate Your Judgement app has been good. Also on Twitter, I tried to predict how people respond to my polls, so before each of my Twitter polls, I put a prediction of how I think the buckets will turn out in terms of this way of thinking about the world in terms of forecasting. I started thinking more and more like, “Okay, if I study human psychology and I get good at it, I should be able to predict what people are gonna say, obviously not perfectly.” Humans are complicated. But a sign of knowing things about psychology is the ability to predict human psychology. And I want to get better at it. I started doing that. I also use this great little website called Predictionbook.com. And on Predictionbook, when you have something happening in the future that's important, you can lodge a prediction. So you can just use it for your own life. You can do it publicly, but I just do it in private mode.
SPENCER: So yeah, I have about 117 predictions I've made and so far, I've been quite calibrated. My 50% confidence bucket has happened 47% of the time, my 60% confidence bucket has happened 67% of time and so on. So it's been pretty good. What I used to do on my Facebook posts, which are usually essays, is I would make a prediction in advance of how many likes they would get, and like with a confidence interval. And one of the things I learned from doing that is your confidence intervals have to be much wider than you think.
SPENCER: I think this is like one of the general crash…almost any prediction practice, you learn that we tend to think that we are able to predict things better than we can. And there's actually a lot more uncertainty in everything.
PETER: Yeah, that's definitely one of the lessons I've learned, too, where I've tried to make these 80% confidence intervals where there's an 80% chance of the value being between x and y. And I've learned through looking at my track record that, in practice, these actually ended up being 70% confidence intervals instead. And so that means I'm systematically too narrow, and I need to start widening my ranges a lot. For your Twitter poll predictions, have you been keeping track of your track record on those?
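The check Peter describes, comparing a stated confidence level with how often the true value actually landed inside the interval, can be sketched in a few lines of Python. This is only an illustration: the function name and the sample intervals and outcomes below are made up, not anyone's real track record.

```python
def interval_coverage(intervals, actuals):
    """Fraction of actual values that fell inside their predicted (low, high) interval."""
    hits = sum(low <= actual <= high
               for (low, high), actual in zip(intervals, actuals))
    return hits / len(actuals)

# Hypothetical 80% confidence intervals and the values that actually occurred.
intervals = [(10, 20), (5, 15), (100, 200), (0, 1), (30, 60),
             (2, 8), (50, 70), (1, 3), (40, 90), (12, 18)]
actuals = [12, 16, 150, 0.5, 61, 4, 55, 2.5, 95, 15]

coverage = interval_coverage(intervals, actuals)
print(f"Stated confidence: 80%, observed coverage: {coverage:.0%}")
```

If the observed coverage comes out well below the stated confidence, as it does with this sample data, the intervals are too narrow and should be widened.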
SPENCER: I plan to write a script at some point, but I haven't gotten around to doing that. It's a pain. [laughs]
PETER: Yeah. That's cool.
SPENCER: And I think it would just be fascinating if more different groups actually started thinking in terms of forecasting. Whether it's psychologists who can predict things about human psychology, like for example, which study will replicate? There've been some interesting forecasting tournaments, where psychologists were trying to predict whether this thing is gonna replicate or not. And it seems like there's something to the idea that, obviously, there are ways you can know about psychology that don't make you better at forecasting what's gonna replicate, but there's a certain type of psychological knowledge that should help you predict that better. Or doctors predicting whether someone will recover from this treatment. That'd be super cool if the doctors had a feedback loop where they're like, I gave that patient that thing, and they didn't recover. And over years and years and years of actually honing their predictive ability, they will be able to have more accurate predictions about what's going to help and give better information to patients saying, “I think about 70% of patients I give this to will get better. And if you're in the 30%, we're gonna have to try a different treatment,” and that kind of thing.
PETER: I think that's really cool. I'd love to see more quantitative thinking like that. I think a lot of people use these kinds of fancy algorithms, but then they also have squishy human intuitions. And I think if you could practice giving calibrated statements, and that you have a good track record over time, that lets you successfully integrate a lot of different types of knowledge into one overall prediction that's hopefully more accurate than anything else. I think also another interesting thing with these, like predicting whether studies will replicate, if you have someone with a really great track record of predicting whether studies will replicate, you almost don't actually need to bother replicating the studies, or you could replicate a few of them just to make sure that guy is not spouting nonsense, or being bribed to kill certain studies. But a lot of times, you could maybe just disbelieve a study without waiting for it to be replicated just because it seems extremely likely to not replicate. Or you even could use maybe some of that upon journal submission or something. There's a lot of different ways that forecasts could be successfully integrated that would be super interesting. I mean, basically, isn't that the basis of all of science to kind of try to predict what things will work and then experiment and see if it works, and build up a track record and kind of went [sic] from there?
SPENCER: Well, I feel like science is more based on confirmation rather than forecasting. You could have a lot of people that are not that good at forecasting, as long as they're – some of them – are coming up with correct new ideas and then we will confirm; maybe that's okay. But it's fascinating to think about science moving faster. If you could say, “Before we invest $10 million in that study, let's bring on our expert scientific forecasters who have a proven track record of predicting on this type of topic. Oh, actually, they think there's only a 5% chance it's going to actually work. Okay, what about this other one? Oh, they think it's a 50% chance. That's a much better bet.” I mean, you could imagine accelerating science by kind of allowing really valid proven predictions to influence where money is invested.
PETER: Definitely. I think this kind of thing happens at hedge funds and stuff a lot, because they're directly financially incentivized in getting things right. Whereas maybe in academia and science, the incentives aren't quite there to deliver accurate information and accurate forecasts. I definitely would love to see forecasts kind of pushed more like the way you're saying.
SPENCER: One of my insane life goals is to have a lot of psychology studies I want to run be such that I can just predict in advance what would have happened so I don't actually have to run the study. And then I just have to run the ones where I can't predict.
PETER: Yeah, or just run like a 10% random sample of the ones and verify your predictions. And then you can assume the conclusions of the other 90% you don't even run.
SPENCER: If you get accurate enough, or at least you get accurate enough about certain types of ones. I think also, human assistance with automation can be really interesting. Have you worked on that or thought about that?
PETER: Yeah, I think some of this is similar to what I was talking about, where you combine algorithm and intuition. Or is that something else?
SPENCER: Yeah, exactly. For instance, you could imagine that a person is paired with an algorithm. The algorithm is forecasting things, but then the person gets to use the results of the algorithm as part of one of the inputs into their own thinking. Or maybe the algorithm summarizes things (or something like that) to speed up the process of them assembling the information. You can imagine a lot of different kinds of pairing of person with machine to try to improve forecasting.
PETER: Yeah, I think that's really cool. I think forecasting tends to shine really well in areas where there is just not really enough data to successfully make an algorithm, or you can only kind of make basic algorithms. Also, maybe you want great information from experts or multiple algorithms. Again, I think COVID was really... I mean, obviously, COVID was terrible for the world, but it was pretty good for forecasting, because it gave us a really cool, though morbid, case study to test our forecasting skills. And a lot of people kind of tried to make these algorithms to predict how cases would increase or decrease over time and in response to certain events, like mask mandates or something. And I think a lot of these complex algorithms were kind of beaten by sort of more simple intuitive forecasting or through combinations of algorithms or things. And it's been, I think, really successful to be able to kind of integrate a lot of disparate sources of information when making these forecasts on fast-emerging topics.
SPENCER: Yeah, I was really fascinated to see some people come out of the woodwork and do amazingly cool things that didn't necessarily have a background in epidemiology. For example, there’s this website COVID19-projections.com, which I think is not being updated anymore, but do you know about that? Do you want to explain that?
PETER: That was really cool. So there was this guy, Youyang Gu. (I don't know if I'm pronouncing his name right so apologies to him if I mispronounced that.) But he was basically this data scientist. He had never really worked on disease data or epidemiology before. And he kind of launched his own machine learning model, taking in all this case data, and doing really good forecasting techniques where he kept track of his track record over time, and tried to modify it to become more accurate over time. And even though he had no prior experience with disease data, all his prior experience with making accurate projections from data sets, and keeping track of forecasts over time and becoming better over time, allowed him to actually outperform a lot of top forecasts made by flashy experts, and even the model being used by the United States government. And his model ended up being the most accurate for a long time until he ended up shutting down his website in March 2021 because he wanted to move on to other things, and it was just consuming way too much of his time.
SPENCER: And I think also, there was a catch up effect where a lot of models improved and the difference in accuracy became less because people kind of learn from each other.
PETER: Definitely. I think that was really great. And I think he was also doing a lot to keep track of the track records of other models as well that were spitting out data every day. So you could scribble down what they spit out and compare it to actuality, but they weren't really doing as much work to keep track of their track records. And I think keeping track of track records is important in deciding who [sic] to trust and how much to trust them.
SPENCER: One thing that seems to be a big challenge — when you get into a discussion of, can we use forecasting to vet experts and all these kinds of things — is that it can be surprisingly difficult to make a forecast precise. Do you wanna elaborate on that a little bit?
PETER: I guess one problem is, let's say that I want to predict whether this new Omicron variant of COVID will be a “big deal” or not. Will it be something I need to react to and be concerned about? It can be kind of hard to decide what it means for something to be a “big deal”. And if you want to have forecasts, you actually need to be able to specify whether they became true or not. So if I want to say Omicron will be a big deal by the end of next year, then by the end of next year, I need someone to be able to look back and say, was this a “big deal” or not? And a lot of times, it can be hard. Well, what did I mean by “big deal”? And then does the world match what I meant by that? Maybe I might instead want to define something more precise, like, will the United States have more than 200,000 COVID cases per week or something like that. Or will we see X percent of hospitals run out of ICU capacity? These are more precise things that you could actually go out in the world and measure and know, specifically, yes or no, whether it happened. And if these things happened, it may or may not be a “big deal”, but you can, at least, actually verify whether they come true. And I think that can be one of the hardest things in creating forecasting questions: figuring out what you actually mean and writing them down very specifically in a way that will give you a concrete yes or no without any edge cases.
SPENCER: Yes. My suspicion is, if you look at the way that people make “forecasts” and opinion pieces on TV and this kind of thing, you'd realize that very few of them actually work as real forecasts, because it's too hard to tell what specifically would match the thing that they're predicting over, like what time period did they mean it? And what exactly counts and what doesn't count? And therefore, it would actually be really difficult to even grade them if someone were to go through all of their articles and stuff.
PETER: Yeah. Say that you have a technological optimist, and they say, “Oh, it won't be too long until we put a man on the moon again, or it won't be too long until we all are using driverless cars.” There's a question of what they meant by “too long”. Do they mean one year, five years, 10 years? Maybe even 20 years isn't “too long”. And then also, there's no probability assigned to that. So maybe you follow up and there's no driverless cars after 20 years. And you're like, “Well, you were wrong about that.” And then they can kind of weasel their way out of it. “Well, I wasn't completely sure. I just said that it probably won't.” But what did you mean by probably? You can't actually make a track record out of a bunch of probablys, wills, might-be’s, could-be-the-case-that's; you need to actually say 10%, 50%, 80% or something like that.
SPENCER: There's a wonderful chart you can find if you Google “perceptions of probability” that shows the way people disagree about what these words mean. And so for example, on this chart, the word “almost certainly”, there's a decent amount of agreement. Most people would put that about somewhere between 90% and 100% confident. But then you take a word like “we doubt that” or “it's improbable”, and there's some huge spread from, let's say, 1% up to 50% that people could mean by that. There's just very little agreement. And so you might think that by adding a word like it's improbable, you're making a prediction, but actually, what are you really predicting? It's extremely hard to know.
PETER: Basically, you need two elements for prediction. You need a fixed specific percentage like 20%, not “unlikely” but actually say 20%. Then you need fixed, specific criteria, like a certain date range, like “by the end of the year” and like a certain thing. Instead of saying there probably will be driverless cars soon, say that by the end of 2025, I expect that 20% of taxi rides will take place in driverless cars. And that's something that you actually could go and quantify, and then make a track record of. And if I make 100 statements like that, you could then see whether 20% of my 20% claims come true, and then judge me accordingly. But almost no one ever does that because it takes effort. And people don't want to be wrong. And there's not really enough pressure to get people to do that. And so I'd like to amp up the pressure and make more of that happen.
SPENCER: So how do we go from the world we have today to a world where lots of people are making forecasts that are being tracked because it seems like there's this really bad incentive issue, which is that people who are top experts now, people who are top pundits, people who are writing the big opinion pieces, they're already in that position of power. And they're doing well without making forecasts and forecasts present risk, because they can then actually be proven wrong.
PETER: Yeah, of course, they won't want to make forecasts; they’d probably do a bad job. And I think if all the top pundits made forecasts, a kind of prestige system of punditry would just evaporate overnight, because you would be able to clearly see who's wrong and who's right. And the people who are right are not going to be the people you expect. So I think the best thing we can do is, just as discerning listeners to this podcast, obviously, you have good taste in some of your sources. So I've just kind of tried to consume sources that make more probabilistic claims. And when you see people make claims that sound dubious, press them for an actual forecast. And I think maybe you yourself, try practicing forecasts. If you're someone who likes to spout claims about the future, try to make them specific and probabilistic and keep track of your track record over time. I'm just experimenting doing this on my Twitter account. I found it to be really fun. Maybe other people would find that fun, too. And maybe you might be more inclined to listen to me based on my forecasting track record, and less inclined to listen to some random pundits.
SPENCER: How convenient is it that the guy who's good at forecasting says we should trust people who are good at forecasting?
PETER: That is pretty funny. Obviously, anyone on a podcast is kind of here to promote themselves a little bit. So I'll be transparent in that. But just seek out people who are willing to make these claims and build a track record and look at their track record and see whether it's actually good.
SPENCER: Let's talk for a moment about how to actually measure how good someone is at forecasting because I think usually people just think about accuracy, like what percentage of time were you right? But it doesn't really work in all of these cases when we're in a probabilistic environment. So for example, if you say you're 99% confident of something happening, and it doesn't happen, you're kind of much more wrong than if you said you thought there was a 52% chance of happening, and it doesn't happen. So do you want to talk about how you think about the right way to measure performance?
PETER: Yeah, I agree that measuring accuracy of forecasts can be difficult, but I think any sort of measure is better than no measure at all. Right now, we don't measure the accuracy of anything. But I guess when you get down to it, I agree, you don't want to just measure the number of claims that come correct because you're judging probabilistically. And like you said, you should be punished more for overconfident claims that come wrong than weakly held 60% claims that come wrong. And there's basically two ways to do this. One is what's called a Brier score. And this is a measure of the distance between your confidence and the actual answer. So if you say something is 98% likely to happen, and it happens, you have a really low Brier score. But if you say something is 98% likely to happen, and it doesn't happen, you have a really high Brier score. And the lower Brier scores are better. It's like golf in that respect. It also magnifies the distance between so you’re really punished for these more extreme claims that come wrong, rather than weakly held claims that come wrong. And then you can take your average Brier score over a bunch of predictions where lower is better.
SPENCER: And the Brier score is basically like squaring sort of the difference between the probability that you predict and whether the outcome came true or not.
PETER: That's exactly right. It's the square of the difference between your confidence, as a decimal between zero and one, and a binary outcome: one if it's true, and zero if it's not. Yeah. And the other way of measuring that I also like is just measuring calibration, where you look at different buckets of predictions you made. Of all the predictions I say are 60% likely, do 60% of them come true and 40% of them come wrong? And of all the predictions I say are 98% likely to happen, did 98% of them come true and 2% of them come wrong? While, like you mentioned earlier in this conversation, it can be a little bit difficult to compare people's Brier scores because of the difficulty factor, it's a little bit easier to compare people's calibration over time. But there's also other issues with calibration that we could discuss if you want.
SPENCER: I'll just point out one, which is that there are certain types of predictions in which it's extremely easy to be calibrated. For example, we talked about this earlier, a binary true-false prediction. You can be calibrated. You just always predict 50% every time and then you're perfectly calibrated. That's really kind of boring. [laughs] So the Brier score is effectively combining accuracy with calibration. There's two different things that matter, and so it's kind of combining them together.
PETER: Yeah, if you predict 50% on everything, and you're right 50% of the time, you're perfectly calibrated. But your Brier score mathematically is .25, and there's definitely lower Brier scores than that. Right now, my average Brier score is 0.159, which is lower than .25. So obviously, I'm doing better than predicting 50-50. But maybe someone has an even lower Brier score. And maybe you should listen to them instead of me; though, again, I guess I hesitate to compare Brier scores to each other unless they're predicting on very similar types of questions.
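To make the difference between calibration and the Brier score concrete, here is a minimal Python sketch of both measures as they are described in the conversation. The function names and the two hypothetical forecasters are assumptions made for illustration; real scoring tools may bucket predictions or weight questions differently.

```python
from collections import defaultdict

def brier_score(forecasts, outcomes):
    """Average squared gap between each forecast (0 to 1) and its outcome (1 or 0)."""
    return sum((p - o) ** 2 for p, o in zip(forecasts, outcomes)) / len(forecasts)

def calibration_table(forecasts, outcomes):
    """For each stated probability, the fraction of those forecasts that came true."""
    buckets = defaultdict(list)
    for p, o in zip(forecasts, outcomes):
        buckets[p].append(o)
    return {p: sum(hits) / len(hits) for p, hits in sorted(buckets.items())}

# Hypothetical outcomes: 1 means the event happened, 0 means it did not.
outcomes = [1, 0, 1, 0, 1, 0, 1, 0]

# Forecaster A says 50% on everything.
always_half = [0.5] * len(outcomes)
# Forecaster B leans the right way each time, with moderate confidence.
informed = [0.8, 0.2, 0.8, 0.2, 0.8, 0.2, 0.8, 0.2]

print(calibration_table(always_half, outcomes))  # {0.5: 0.5}, perfectly calibrated
print(brier_score(always_half, outcomes))        # 0.25
print(brier_score(informed, outcomes))           # lower (better) than 0.25
```

Forecaster A is perfectly calibrated but carries no information, so the Brier score stays at 0.25. Forecaster B is not perfectly calibrated (the 80% claims all came true, which is underconfident), yet scores much better because the forecasts actually discriminate between what happens and what doesn't.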
SPENCER: Right, I just want to mention, there's a whole mathematical field of how you measure predictions in a kind of a probabilistic context. And actually, when I was developing the Calibrate Your Judgment tool with Open Philanthropy, I did a lot of thinking about this. And I ended up writing a white paper, called “Calibration Scoring Rules for Practical Prediction Training”, which digs into this topic very deeply if you want to learn more about it. But I'll just say there's an infinity of rules you could use in these contexts. And so then, there's the question of which one is actually the right one. Just to mention another one, there's a logarithmic scoring rule which has certain desirable properties; it has certain natural probabilistic properties. But it also has some downsides. You can end up with a kind of bonkers loss in your score, where you make one really bad prediction, and you just lose like an insane number of points. And then it just ruins the rest of your prediction career. So that can be an issue.
PETER: Yeah, my brother actually was tracking some of his predictions. And on the January 6 riot that happened (obviously it was a very sad event), my brother predicted with extreme confidence that they would not successfully get into the Senate building, because that's literally never happened since the War of 1812 with England. I think he had a good reason for making such a strong claim there based on historical events. But it ended up happening; they did get into the Senate building. Then, a logarithmic scoring rule basically ruins his entire forecasting career for the rest of forever, because he kind of missed one extreme event.
SPENCER: Right, but it does call into question a little bit what we really are optimizing for here, right? Because there are certain kinds of predictions, where actually, if you ever said you're 99.999% confident and you're wrong, you should be penalized. Let's say, we're talking about the chance of nuclear war. You don't want to be 99.99% confident it's not gonna happen, then it happens. And with other things, we kind of want it to be more forgiving. Do you have any thoughts about how we choose between different scoring rules?
PETER: Yeah, that's difficult. I don't actually really have any great thoughts about different scoring rules. Like you said, there's kind of a whole field of debate about this. I think my main thing is just keeping track of your predictions and using any sort of reasonable scoring rule seems better than not doing it at all. And 99% of people are not doing it at all so I think that's the real problem. I guess I haven't really thought enough about the merits of different scoring rules and things because I guess you definitely do want people to be sensitive to extreme events. But I guess you wouldn't want to end their forecasting career over missing one question.
SPENCER: I think people don't quite get why there is such a difference between saying something has a 1% chance of happening versus, let's say, a one in 10,000 chance of happening? Why would it be that a scoring rule could give you such a different penalty? And I think it's a little bit hard to get the intuition here. But basically, people's natural way to think about things is often to subtract them. If you subtract one in 100 from one in 10,000, they're not that different from each other in a subtraction sense, right? Because they're both close to zero. But if you think multiplicatively, they're really, really different. One in 100 is like 100 times more likely than one in 10,000. And so that's really where that huge penalty is coming from, because of this multiplicative effect of probabilities.
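The gap between a 1% and a one in 10,000 forecast can be seen by scoring both after the "impossible" event happens. The sketch below assumes the plain negative-log form of a logarithmic rule and the single-question Brier penalty; the exact rule in any given tool or in the white paper mentioned above may be scaled differently, and the function names are made up for this example.

```python
import math

def log_penalty(p_given_to_outcome):
    """Logarithmic rule: -ln of the probability you gave to what actually happened."""
    return -math.log(p_given_to_outcome)

def brier_penalty(p_given_to_outcome):
    """Brier rule on one binary question: squared gap between that probability and 1."""
    return (1 - p_given_to_outcome) ** 2

# You said the event had probability p, and then it happened.
for p in (0.01, 0.0001):
    print(f"p={p}: log penalty {log_penalty(p):.2f}, "
          f"Brier penalty {brier_penalty(p):.4f}")
```

Under these assumptions the Brier penalty barely moves between the two forecasts (it can never exceed 1), while the log penalty roughly doubles and keeps growing without bound as the stated probability approaches zero. That unbounded growth is what lets one extreme miss swamp many good predictions.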
PETER: Yeah, the way probability works is definitely multiplicative instead of additive. And this might be one reason to think in terms of odds instead of in terms of probabilities. Odds are things like four to one or 10 to one instead of, say, 20%, or 9%. And I think then it's very clear, like 99% is one to 100 odds, whereas 99.99% is like one to 10,000 odds. And you can see the very clear difference there. But unfortunately, psychologically, I just have a lot of trouble thinking in terms of odds and prefer to think in terms of probabilities. But there's definitely some downsides to thinking in terms of probabilities.
SPENCER: Another advantage of odds is it's much more natural for Bayesian thinking.
PETER: Yeah, I remember listening to a previous podcast with you where you talked about that.
SPENCER: Yeah, on the 80,000 Hours podcast, there's an episode where we spent a bunch of time doing real time Bayesian thinking on what's the chance of a war with North Korea or things like this (and you can hear me go through that there talking to Rob Wiblin). But basically, let's say you have some prior odds: you think it's three to one that your preferred politician is going to win the election. And then you get some new evidence; let's say the evidence is a Bayes factor of five to one, then you could just multiply them together. Three to one times five to one, that's 15 to one. Boom, you now have your Bayesian update; you have your new odds.
PETER: That's so cool.
SPENCER: If you try to do the same thing with probabilities, the math is just way wonkier. You can do it. It's all like odds and probabilities. They're just the choice of how you talk about things. There's no fundamental difference. But it's just that the math works out so much simpler in an odds world. You just multiply the odds, whereas in the probability world, you have to do a complicated calculation on a piece of paper.
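Spencer's election example can be worked through both ways to show that the odds form and the probability form give the same answer. This is a small sketch; the variable and function names are illustrative only.

```python
def odds_to_probability(odds):
    """Convert odds expressed as a single ratio (3.0 means three to one) to a probability."""
    return odds / (1 + odds)

prior_odds = 3.0     # three to one that the candidate wins
bayes_factor = 5.0   # the new evidence is five to one in favor

# Odds form: just multiply.
posterior_odds = prior_odds * bayes_factor                # fifteen to one
posterior_prob = odds_to_probability(posterior_odds)

# Probability form of the same update, which takes more arithmetic.
prior_prob = odds_to_probability(prior_odds)              # 0.75
posterior_prob_direct = (prior_prob * bayes_factor) / (
    prior_prob * bayes_factor + (1 - prior_prob)
)

print(posterior_odds, posterior_prob, posterior_prob_direct)
```

Both routes land on 15/16, or 93.75%, but the odds route is a single multiplication.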
PETER: Yeah. And usually when I do that, I convert to odds then convert back, but that’s very difficult to do.
SPENCER: For those who might be confused about the relationship between odds and probabilities, if you have, let's say, two to five odds, how do you think about it in terms of probabilities? Well, it's two divided by the sum of five and two; that's how you get the probability. So you take the numerator and divide by the sum of the numerator and denominator, so you can just convert back and forth between them. It's a fairly simple calculation to switch.
PETER: Yeah, then you can do the inverse to turn the probability back into odds.
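The conversion in both directions can be written down exactly using Python's standard `fractions` module. This is a sketch with made-up function names; odds are represented here as an (in favor, against) pair.

```python
from fractions import Fraction

def odds_to_probability(in_favor, against):
    """Odds of in_favor:against -> probability in_favor / (in_favor + against)."""
    return Fraction(in_favor, in_favor + against)

def probability_to_odds(probability):
    """Probability -> (in_favor, against) odds in lowest terms."""
    p = Fraction(probability)
    return p.numerator, p.denominator - p.numerator

print(odds_to_probability(2, 5))                    # 2/7
print(probability_to_odds(Fraction(2, 7)))          # (2, 5)
print(probability_to_odds(Fraction(99, 100)))       # (99, 1)
print(probability_to_odds(Fraction(9999, 10000)))   # (9999, 1)
```

The last two lines show why odds make extreme confidence easier to compare: 99% and 99.99% look close as probabilities, but as exact odds they are 99 to 1 versus 9,999 to 1, roughly a hundredfold difference.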
SPENCER: Probably my favorite way to see why probability is sort of inherently multiplicative instead of additive is to think in terms of betting. Let's say there was a thing where you thought there was a one in 100 chance of something happening: think about how much money you would bet against it from your personal bank account, versus if there was a one in a million chance. If there's a one in a million chance, it's sort of clear that you can bet way more money against it. Whereas with one in 100, you have to be more cautious. So even though one in a million and one in a hundred are both close to zero in a sort of additive sense, in a multiplicative sense they're not close at all. And the amount of money you should be willing to bet that a one in a million event won't happen is much, much, much bigger than the amount that you should be willing to bet that a one in 100 event won't happen.
PETER: Yeah, definitely. I think that's an important point.
SPENCER: So before we wrap up, I want to cover one more topic with you, which is this idea of how do we do more effective research, especially on really complicated, difficult topics? Did you want to take a moment to talk about the research org that you run? And then I'd love to get into the weeds a little bit on how to do better research?
PETER: Yeah, definitely. So I run an organization. It's called Rethink Priorities. We're basically a research organization. We conduct critical research. We try to inform policymakers and major foundations about how to best help people and also non-human animals, both in the present and the long term future. So we're one part think tank, one part research organization, one part consultancy. And we do a lot of that to just conduct research, try to give our research to foundations, see what they do with that research and hopefully they make better decisions. And then we use that feedback to decide what to research. So we're kind of really interested in a lot of these clearer thinking and forecasting type things to use research to help grantmakers and policymakers make better decisions.
SPENCER: So how would you differentiate the kind of research you do and the way you do research from what a typical consultancy would do?
PETER: I think one of our main differences is that we're really interested not in getting a lot of money from clients but, instead, in getting a lot of social return in the form of impacts. We're a nonprofit, US 501(c)(3), we don't try to make money off of clients other than to cover our costs. And instead, we take our research topics based on what we think will make the world a better place, and what policies matter most, what decisions matter most, and how can we best improve that? So there's an inherent optimization to our approach.
SPENCER: So you're trying to make the world better through better research, taking an effective altruist perspective on this. Is that right?
PETER: Yes, we definitely follow the principles of effective altruism, as you mentioned. We're really dedicated to using our resources to have the greatest impact on the lives of people and animals, mainly through identifying important decisions and seeing how we can make them better.
SPENCER: It does raise a question though: why can't people interested in this topic just hire a standard research consultancy? Take a group like the Open Philanthropy project. They have a lot of interest in getting answers to difficult questions. And so they could go and hire a, let's say, more standard research consultancy that might be writing white papers on behalf of companies or nonprofits. Or they could hire you. I'm wondering, by hiring you to do this, how does the research differ? And how does your process differ, so that it might produce different results than if they were to hire any random research consulting firm?
PETER: I think one of the main differences is just that Open Philanthropy is taking this Effective Altruism approach. And we're also taking this Effective Altruism approach, so we, as researchers, have a lot of detailed understanding of Open Philanthropy. Their aims are to explicitly optimize their giving on some sort of metrics of how to make the world better. Whereas I think a lot of research consultancies are more wishy-washy in their claims, or more unwilling to say when things are below our standards, or not willing to compare outcomes to each other. And I think there's a lot of great work out there. And I'm not trying to say it's worthless or anything. But what I am trying to say is that we're kind of aiming to achieve a certain amount of impact per dollar. And some things are very good, and we're supporting them. But they don't meet this bar that Open Philanthropy has in mind with their giving and we're willing to think that way and be ruthless about it and a lot of other organizations and consultancies aren't. So it's really about that optimizing mindset, which goes through everything we do. But whereas like other consultancies might not quite get that or might not do it as well.
SPENCER: I can see this making a big difference in the details. Because these research projects, I imagine, have a huge amount of detail that goes into them that is not easy to observe in the final product, like which sources you use and how you vet those sources. And so, imagine there's a company that's hired to make candy bars. But the people that buy the candy bars from them don't get to eat them. All they get to do is see what they look like on the outside. You can see how you could end up with candy bars that look really nice, but don't taste very good, right? And it's like you want the maker of the candy bar to be really invested in making them taste good, even if you can't inspect them, even if you can't describe how they taste. And so I think that this is an underestimated point: throughout all kinds of research (whether it's social sciences studying human psychology or businesses studying physics or whatever), there's so much you don't see in the final product. Whether it's an academic paper, or a research report, or whatever, you want the person doing it to have the right incentive and right mindset throughout the whole process so that they're still trying to hammer all those hidden details out in the way that you'd want, even though you're not going to get to inspect them.
PETER: Yeah, that's a really good point. That's probably better than I would have put it myself. I think that there's really a lot that we do in terms of clearly itemizing what it is that we're claiming. And maybe we're making really specific probabilistic claims, like we talked about in our first section. I think a lot of consultancies aren't doing that, but we're skilled at doing that and are willing to do it. And also, we're clearly stating why we think the things that we think and what we did. It really goes beyond a typical methods section, and it's what Open Philanthropy has been calling “reasoning transparency” (they have a great article about that, that maybe you can put in your show notes). It talks about not just making claims, but also indicating how confident you are in those claims and why you believe them. And I think a lot of consultancies don't really clearly state how confident they are in things and also don't really clearly state why they think the things they do. But we're willing to do that. And I think we're valued for it.
SPENCER: Could you give a few examples of research topics that you get hired to work on?
PETER: Yeah, we get hired to work on a couple different areas, I would say kind of maybe three main areas. One is work on topics related to animal welfare, especially related to animals farmed for food or killed for other products used by humans. We’re trying to have fewer animals farmed, and also the animals that are farmed to be farmed in higher quality ways. I guess, obviously, we'd ideally like a final system where no animals are harmed at all. But I guess we're kind of willing to take more incremental steps towards getting to that where we're willing to improve welfare, in addition to reducing the amount of exploitation that takes place. More specific topics within animal welfare that we might be working on would mainly be: what kind of policies would we be able to lobby for in the state of California to try to improve animal welfare there, and maybe what kind of ballot measures might we be able to consider passing. Because that state allows ballot initiatives, and they already passed Proposition 12, which banned the sale of eggs coming from caged hens. And so I think that's a tremendous welfare improvement and super exciting. Our second topic that we work on, in no particular order, is global health and development. There are 700 million people or more that live in extreme poverty globally, and we're trying to find better interventions to improve their economic outcomes and also their health outcomes. Because I think if we can improve their health, we can empower them more to improve their own economic outcomes. And so we've worked with a bunch of different organizations that tried to identify and promote these sorts of interventions that we can fund to improve the lives of humans in lower and middle income countries. And then lastly, we work on issues with the long-term future for humans and animals. I think that we could have a bright and long future ahead of ourselves that may span many millions of years. But in the process, we have to navigate various risks and disasters. This may be nuclear war, or future pandemics like COVID or worse. We want to make sure that the next time we face something like that, we have a much better, more sophisticated response or else we might risk severely curtailing our long term potential.
SPENCER: So, what’s a more specific topic you might be asked to research on that?
PETER: One thing might be, I think that we're still trying to figure out is, how do we actually reduce the risk of nuclear war? What levers might we have? What might we actually be able to fund? Is reducing the number of nuclear weapons a good thing because then there's like less nuclear weapons to use to kill everybody? Or is maybe reducing the number of nuclear weapons actually a bad thing, because then you don't have as strong of a deterrent and there's not as much mutually assured destruction? One might argue that a main reason there's not a nuclear war is there's just too much fear of reprisal, that if you launch a nuclear weapon, you'll get nuked yourself. And maybe actually having a bunch of nuclear weapons is actually promoting peace. So a little counterintuitive. I don't know which kind of way of thinking about it is true. And so I'd love to just research that question more. And that's something we're working on.
SPENCER: So these strike me as really difficult problems, like you're going to inevitably bump up against tricky empirical questions where there may not be good data. Or you might have to sift through a lot of information. You're gonna bump up against maybe philosophical considerations, like how bad is it for a chicken to live in one environment versus another? And what is the experience of a chicken even like, right? We're starting to get into stuff philosophers might debate, like what can a chicken experience? So I'm wondering, just what are some of your thoughts about how you approach these really thorny questions?
PETER: I think that we're definitely practicing a really interesting form of philosophy. We've actually hired professional philosophers to work with us on philosophical questions. I think that is probably the first time I've ever heard of philosophers being employed to work on philosophy outside of academia. I find that super fascinating, in contrast to a lot of philosophy, which is kind of famous for debates that have literally been raging for thousands of years without any end in sight. We actually do need to get work done and make some grants and make some policies. And we don't have time to endlessly debate philosophy, so we've hired these philosophers to work with more of our scientists to synthesize information and just come up with their best philosophical judgments about what we ought to do. And then we just commit to actually doing it. And there might be some decision-making under uncertainty. And we might be making forecasts like I mentioned in the first half of our podcast. But yeah, we're actually trying to come up with our best idea of what animal welfare actually is, and how we might improve it. And I think another important question is this: let's say you could help 10 chickens or help one cow. How would you make a decision between that sort of thing? So we're actually trying to make some of these thorny trade-offs based on understanding chicken cognition and cow cognition and that sort of thing.
SPENCER: A concern that someone might have with all this is, if you have to bake in too many assumptions into the analysis, do you get something in the end that actually can be trusted? Or at some point, is the answer just going to be we don't know because we'd have to make too many assumptions to have an answer.
PETER: I think that's a really important issue. And something that kind of keeps me up at night, too. And I don't think we have a great solution to that. I think two things that we do is, first, I think we're very clear about what assumptions we're making and why we're making those assumptions. And that lets people be able to clearly see that, “Oh, I don't share that assumption so now I know what follows might not make sense and maybe we should argue about that assumption some more.” And then we tried to make a reasonable set of assumptions that grant-makers and other clients share, so that we can still align on things. I think the second approach that's also important is trying to come up with ideas that work across a range of assumptions. So you might think, “Oh, actually, this is robust to a variety of ideas that you might have on this topic.” And so the debate kind of doesn't really matter as much because the outcomes we pick work under a number of different assumptions.
SPENCER: Do you think you can avoid baking in the answers to important philosophical questions by trying to make them robust? Or does that only go so far, and at the end of the day, you have to make a judgment on some of these thorny questions?
PETER: Yeah, I think that works to some extent and for some problems, but on a case-by-case basis. But I think in a lot of important problems, we do definitely just need to make some final judgment calls on various assumptions. But we're lucky, I think; at least within the Effective Altruism approach, there's some broad alignment on certain assumptions that other approaches might cast in doubt. For example, we're willing to take an explicit optimization approach, we're willing to think specific things about how animal and human sentience work. I guess if you're not really willing to make any assumptions, I can't even know whether, like you Spencer, whether you're conscious or not; there's really nothing I can specifically observe to see that but I kind of assume that, because you act and react in similar ways I do, that you're conscious. And then I kind of extend that sort of analogy to animals as well based on their behaviors. And then we can kind of make assumptions like that from there.
SPENCER: Just certain premises you can take for granted. But given that you take them for granted, you can move forward and actually draw conclusions.
PETER: Yeah. And that might be a domino effect, where if you take certain assumptions for granted, that makes it easier to then take other assumptions for granted. And then eventually, you can cohere a workable framework. And I think, as long as you're clear about what sort of framework you're working in, you can then actually get stuff done.
SPENCER: How quantitative do you think you should be about these questions? Let's say you're trying to figure out which proposal to back to help animal welfare. Are you trying to put numbers on that? And how do you approach that?
PETER: Yeah, we are trying to put numbers on that. I think pretty explicit calculations are a very helpful part of making these sorts of decisions. But we're not necessarily just making a spreadsheet and then choosing whatever comes out on top. I think it's like when I mentioned forecasting earlier, you want to take in inputs from algorithms, but you also want to take in inputs from experts and be able to integrate that into an all-things-considered judgment. So I think quantitative outputs are an important factor in our decision making but they're not our only decision-making factor, if that makes sense.
SPENCER: How do you think about uncertainty? Are you explicitly trying to model the uncertainty in those predictions like creating a confidence interval in the prediction or credibility interval?
PETER: Yeah, we're always trying to create these sorts of confidence and credibility intervals. And then also, a lot of times, we're trying to create explicit forecasts that we then can track over time as well, when these are all probabilistic outcomes.
SPENCER: It seems to me that one of the trickiest things that can happen with this kind of work (you work on these really difficult research questions and you're trying to make it numerical) is you do this sort of uncertainty estimate and you realize that the uncertainty estimate is ranging over three orders of magnitude or something like this. Does that happen a lot? And what do you do if it does?
PETER: Yeah, that does happen a lot. I think there are a couple of things you can do from there. Sometimes even if it ranges over orders of magnitude, there's still enough confidence to be able to compare it to something else so that you can then make a decision. And other times where you can't, you can figure out which inputs your model is most sensitive to. And is there anything you can do to narrow that and get more confidence on that? So then you might just be funding additional research to look into a particular input in more depth.
SPENCER: And then that will tighten the model up?
PETER: Yeah, exactly.
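One simple way to find the inputs a model is most sensitive to is a one-at-a-time sweep: hold every input at its central estimate, move a single input from its low to its high value, and see how far the output swings. The sketch below assumes a toy multiplicative model; the function `toy_model`, the input names, and every number are hypothetical placeholders rather than anything from Rethink Priorities' actual models.

```python
def toy_model(inputs: dict) -> float:
    """Hypothetical impact model: people reached x success rate x benefit each."""
    return inputs["reach"] * inputs["success_rate"] * inputs["benefit_per_person"]


# (low, central, high) estimates for each input. All values are made up.
ranges = {
    "reach": (1_000, 2_000, 4_000),
    "success_rate": (0.001, 0.01, 0.1),
    "benefit_per_person": (5, 10, 20),
}


def output_swings(model, ranges: dict) -> list:
    """Rank inputs by how much the output changes when only that input moves."""
    central = {name: values[1] for name, values in ranges.items()}
    swings = []
    for name, (low, _, high) in ranges.items():
        low_output = model({**central, name: low})
        high_output = model({**central, name: high})
        swings.append((name, high_output / low_output))
    # Largest swing first: these are the inputs worth researching further.
    return sorted(swings, key=lambda item: item[1], reverse=True)


ranked = output_swings(toy_model, ranges)
for name, swing in ranked:
    print(f"{name}: output varies by a factor of {swing:,.0f}")
```

In this made-up case nearly all of the spread comes from `success_rate`, so that is the input where additional research would tighten the overall estimate most.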
SPENCER: That's really interesting. It's a little bit like my approach to making certain types of decisions where I'll make a spreadsheet and I’ll weigh different factors, and then look at the final model, and then say, “Okay, what's really driving the final outcome?” And then feed that back in and find out that's being driven by this factor. Do I really think that's the most important factor because it seems to be driving this decision? And then that will allow me to say, “Oh, well, maybe that's not as important as I thought”. And you kind of use the model, not just to make a final prediction and trust it, but use the model to force yourself to consider all the factors and then also to help you introspect about what matters in the situation.
PETER: Yeah, I think that's very similar to the kind of approach that we're taking here.
SPENCER: So what do you think of people who say that trying to quantify some of these things actually creates problems? I guess I'm trying to steelman that point of view. I think maybe one thing they could say is that it can create problems in that it kind of nudges you towards the quantifiable. So maybe it nudges you away from trying to analyze things that are not quantifiable. But maybe another thing they would say, is that in attempting to quantify, maybe you lose sight of important things. One way of thinking about a difficult problem is to involve every single factor and try to get a weight on each of them and combine them in a way to get an estimate. And another way to think about it is to say, can we look at this from a different perspective where it becomes simple? Or can we cut through? Can we say, the real question here is just this one factor, and everything else is noise. And you're trying to build a much simpler model that you can hold in your mind and work with explicitly, rather than trying to build a big numerical model that combines everything.
PETER: Yeah, I think that is a really important issue. Granted, when I was first starting, I was super excited about algorithms. And I was always trying to build a really complex model that captured everything perfectly, then trust that model completely to be better than my own decision making. But I think, as I've learned more, I think that kind of approach doesn't really work very well. And the best thing to do is to make a couple different simple models that everyone can understand and critique well, and don't trust any of the models all that much. And try to have multiple people integrate the information, and come to some all-things-considered judgments. And then have those people debate and discuss with each other and come to some final conclusion. And then just try to trust that conclusion, though still maybe doubt it to some extent.
SPENCER: I see. So you're trying to attack it from different perspectives with different models. And then even the outputs of those models are gonna be scrutinized. And then a discussion of how do we combine them? And how much to weigh the different models?
PETER: Yeah, exactly. I found that to be a more accurate approach. I think on a lot of these topics, models can be wrong for all sorts of reasons. Sometimes the reasons can be incredibly simple. Like one time, I made a complex model, and it turned out that I had misplaced a parenthesis. And then when I corrected that parenthesis, it actually changed one of my calculations so much that my ultimate answer increased by seven times. And so that was tremendously embarrassing. Also, sometimes models can be perfectly well coded, but then go wrong for more complex reasons. Maybe there's some really important consideration that you failed to capture, or some other thing that you weren't thinking of that actually changes everything.
SPENCER: Yeah, I'm of the opinion that software has a lot more bugs than people realize.
SPENCER: We were working with a Python package for one of our projects, and we were trying to use this package. It was doing weird stuff and we were very confused. And then after two months of wrestling with this package, one of our team members uncovered that there was a fatal flaw in the design of it, where it was just giving the complete wrong answer. And who knows how many people have relied on this package, because it was an open source one that people were using. And we emailed the creator of it, “Look at this line of code. We think this is just doing the complete wrong thing.” And he responds, “Oh, yeah, if you remove that line, does it work?” And we were like, “Oh no.” So we just abandoned that package, and we just wrote it from scratch. But I actually think lots of software that people rely on has bugs and like, with really, really simple things, you can often tell if they're broken, but the more complex they get, the harder it is to tell. And this can be really, really devastating if you're relying on a big model.
PETER: Yeah, that's really scary. It's important to think about. Yeah, I worry about how much bug-ridden software is influencing millions or billions of dollars right now.
SPENCER: Yeah, I really hope the climate models don't have some major bug, like some Fortran package written 30 years ago [laughs] that nobody knows how it works. Because that would be bad. So that's one problem that can happen with complex models. But there's also just the issue of model uncertainty, where you can have uncertainty that's baked into the model. So the model itself has randomness in it, and then you get a confidence interval on your prediction or whatever. But then there's the fact that the model itself is not perfectly aligned with reality. And then that creates extra uncertainty. How do you think about that?
PETER: I think that's an important issue. And some of this kind of comes with what I like to call the many models project (similar to the things we were talking about with forecasting, and something I've personally been really interested in). Basically, I get a bunch of different people to try to make the same model and output a 90% confidence interval. And if these models are perfect, they should all be outputting really similar intervals. But apparently, when you get multiple people making the model, and they make different initial assumptions, and maybe calculate things differently, and maybe some people have bugs in different places, you actually end up getting very different intervals. And so maybe, you then might want to have a metamodel or something, or whenever you see a 90% interval, pretend it's actually an 80 or 70% interval instead. I think that is a good rule of thumb.
SPENCER: So what's the example you would apply that to?
PETER: The example I was thinking of was deciding how important tobacco taxation is in lower and middle income countries, where there's still a lot of smoking and not really that many tobacco taxes, and even small taxes could really disincentivize a lot of cigarette purchases and have a lot of positive health outcomes. We were trying to calculate more precisely how valuable a dollar spent on a tobacco taxation lobbying campaign was, and it really differed a lot, depending on what assumptions you made about how expensive lobbying is, how successful lobbying is likely to be, and how people might respond to taxes and things. I think another clear example that might be easier to think about is in political polling. Let's say you want to know who's going to win Florida in the 2024 election. You might see a political poll that says Biden is up by 2%. And there's like a 2% standard error. So then you think that that means that he's pretty unlikely to lose then, because it's within the standard error. But actually, polls underestimate their standard error all the time, because they're only talking about the statistical standard error from their sampling. But that's assuming that their sample is perfectly representative of the population. And that ends up rarely being the case, because you don't get a perfectly random sample when you call people for polls. So actually, a surprisingly good rule of thumb is just to double the standard error. Instead of seeing it as two, see it as four, so that it's actually 2% plus or minus four, and that ends up being a more accurate view of how good polls are these days.
SPENCER: Right, sort of like the uncertainty in your uncertainty. And if you don't take that into account, you're going to tend to think that things are more sure than they really are.
PETER: Yeah, so there's not just uncertainty in the final output, like when you output a 90% confidence interval. There's also all sorts of uncertainties around fundamental assumptions you're making, that ultimately can lead to much more wild events than if you just take your model output at face value.
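The effect of doubling a poll's standard error can be sketched numerically. The assumptions here are not from the conversation: the quoted 2% is treated as the standard error of the lead itself, the error is taken to be normally distributed, and `prob_leader_ahead` is a hypothetical helper name.

```python
import math


def prob_leader_ahead(lead: float, std_error: float) -> float:
    """P(true lead > 0), assuming a normal error around the polled lead."""
    z = lead / std_error
    # Standard normal CDF written with the error function.
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))


polled_lead = 2.0    # candidate is up by 2 points
stated_error = 2.0   # standard error as reported by the pollster
doubled_error = 2 * stated_error  # the rule of thumb described above

p_stated = prob_leader_ahead(polled_lead, stated_error)
p_doubled = prob_leader_ahead(polled_lead, doubled_error)

print(f"chance the leader is really ahead, stated error:  {p_stated:.0%}")
print(f"chance the leader is really ahead, doubled error: {p_doubled:.0%}")
```

Under these assumptions, the same 2-point lead goes from about an 84% chance of being real to about 69% once the standard error is doubled.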
SPENCER: I'm a mathematician by background. I also do a lot of data science and machine learning and stuff like that. And I have to say, to put my cards on the table, I'm more skeptical of big numerical models than you are. And I'm wondering why that is. Let me make my pitch for why I'm more skeptical of this. I think that what has happened in practice, when I've tried to investigate some of these things, someone has a really long analysis when estimating some numerical outcome that, on page 14 of the analysis, I'm thinking, I don't know about that choice they made here. And then the rest of the thing is now influenced by this choice that I don't know if I agree was a good idea. And now, I don't know how to propagate that decision through the model. You know what I'm saying? It feels like there's too many things for me to get off the bus with the way they're doing things. And then that puts me in a really precarious position in terms of believing the output.
PETER: Yeah, I definitely agree with you on that being a big issue. I'm also pretty distrustful of big models and prefer smaller models. I also really like transparent models where you can personally get a copy of the model, and it's simple enough that you can understand how it works. And you can change the assumption. You can be like, that's not 14, that should be nine. And then you type in nine instead. And then you can see how that influences the final calculation.
SPENCER: That's very cool.
PETER: Yeah, I did this one thing a while ago, where I made my own political poll. But I actually let everyone have access to all the raw data and all the modeling I was doing. And then also even had slider bars where you could adjust a lot of the fundamental assumptions I was making, and see how that changed my final poll number. And I thought that was really cool to see all the hidden assumptions that go into polling.
SPENCER: Yeah, I saw a really cool example of this for estimating impacts of animal suffering on different animals, where they have choices. Do you weight animals by the number of neurons, the square root of the number of neurons, or like all animals are equal? We just feel funny, because it's like, holy shit, that's a difficult question. But at least you can see how it changes things, right?
PETER: Yeah, definitely. And then I think, actually, it could be the case that maybe neuron count doesn't even matter at all for tracking moral weight, or matters a lot less than we think. There might even just be a fundamental assumption of whether neurons are something to pay attention to, let alone what function you apply to them.
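How much the choice of weighting function matters can be seen with a toy comparison. Everything in this sketch is hypothetical: the two animals, their neuron counts, and the helper `relative_weight` are placeholders chosen for round numbers, not real measurements or anyone's actual moral-weight model.

```python
import math

# Hypothetical placeholder neuron counts, NOT real measurements.
neuron_counts = {"animal_a": 100, "animal_b": 10_000}

# The three options mentioned above: weight by neuron count, by its
# square root, or treat all animals as equal.
weighting_schemes = {
    "linear": lambda n: float(n),
    "square_root": lambda n: math.sqrt(n),
    "equal": lambda n: 1.0,
}


def relative_weight(scheme: str, animal: str, reference: str) -> float:
    """Weight of `animal` relative to `reference` under the given scheme."""
    weigh = weighting_schemes[scheme]
    return weigh(neuron_counts[animal]) / weigh(neuron_counts[reference])


for scheme in weighting_schemes:
    weight = relative_weight(scheme, "animal_b", "animal_a")
    print(f"{scheme}: animal_b counts {weight:g}x as much as animal_a")
```

With a hundredfold gap in the placeholder counts, the relative weight is 100x, 10x, or 1x depending on the scheme. Making that assumption adjustable lets a reader see exactly how much of a conclusion depends on it.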
SPENCER: Yeah, you've actually looked into this question to some extent. What were some of your conclusions around it?
PETER: So we were thinking, for neuron count specifically, there's definitely some good reasons to think that neuron counts don't actually track how much we ought to value an animal. And that's mainly because there's a lot of neurons in brains and a lot of them go to things other than sentient experience, like they just need to manipulate muscles or other things about the animal. And you can think even an elephant typically has more neurons than a human. But they're not necessarily smarter than humans, or at least they don't seem to be. So I'm pretty skeptical myself of whether neuron counts are really the specific thing to latch on to when trying to compare animals to each other.
SPENCER: Okay, so before we wrap up, I want to get a little optimism here. Paint me a compelling vision of what the future world would look like, if we start using forecasting more.
PETER: Yeah, kind of combining both what I was saying about forecasting, and what I was saying about my research org, maybe a compelling vision of the future is one in which we have a bunch of researchers, and they're researching a lot of interesting things. And they're making very specific quantitative claims that we then aggregate into figuring out the world. And we have a good idea of how confident we are in everything. And when we're consuming news sources, we're getting information that is quantitative, and we're listening to the people that have the best track records of predicting things in the future. And then when lawmakers are making laws, they're listening to these people with really strong track records of being right about things. And they're not really listening to charlatans, or people that don't know what they're talking about, or people that have the most lobbying dollars. And we're getting a lot of really great policies and really great research that's oriented more towards impartial, social good and oriented less towards people just spouting out nonsense, because they want to look cool on Twitter and get a lot of Twitter likes.
SPENCER: That reminds me of this idea of, imagine we lived in a world where once a policy is proposed, investors can bet on whether it will achieve its goals if implemented. And basically, if they bet that if it's implemented it will achieve its goals, they get paid out if it achieves its goals, but they lose money if it doesn't. And they can also take the opposite of that and so the idea would be then, like a new policy is proposed and suddenly all these bettors line up making bets. And we actually get like a probabilistic estimate of this policy achieving its goals, just kind of a wild idea.
PETER: Yeah, that is a wild idea. But I think I'd like to see that at least experimented with. I guess you'd want to make sure certain things are true, like the market’s very liquid and kind of safe from manipulation and other things. But I think that would be really cool to at least have as one input to decide what policies are good. I'd love to see more of that.
SPENCER: Peter, thank you so much for coming on. This was a super interesting chat.
PETER: Yeah, thank you so much. It was really great to be on the show. Thanks for everything you do.
JOSH: A listener asks: what parts of your math background help you the most in your work right now?
SPENCER: One thing I think is cool about having studied math is that when I read papers in psychology, I have a lot of math background to draw on in terms of how I would think about analyzing the data, how I would think about designing the study. And I think that gives me a somewhat unusual perspective, because I'm not necessarily rooted in the exact method that everyone learns or the exact method that's standard in the field. I can think of it and say, “Well, what method would I use?” And it's not always the same as what other people use. I also use my background in math and machine learning in some of our product development. For example, with Mind Ease, our app for anxiety, we apply machine learning to try to help predict what technique is going to help someone calm down when they need it. And so that was a nice machine learning application. In other work of ours, we're applying certain mathematical ideas. One of the products we're building helps people analyze datasets. And so there's a lot of math that goes into that.
JOSH: Do you think math is the skill set that helps you the most? Or would you say, computer science is like a close second, or philosophy or any other particular field?
SPENCER: I think it depends so much on what you're trying to do. And for me, one of my projects might have a whole bunch of math and another project, none of it. Or philosophical thinking might come up in one area, but then in another, it's just not useful at all. Let's take a topic like designing questionnaires. You might think that that has nothing to do with philosophy, but then suddenly, philosophy comes up unexpectedly, because you're like, “Well, what are we really trying to measure here? What do we mean by personality?” And then you have to, like a philosopher, temporarily think about what it even means to measure personality. Or you might then suddenly pull into math skills and say, “Well, if our goal is really to make a prediction from these personality questions, then we might need to think about that in a mathematical way and think about what's the way to make the most accurate prediction here.” So yeah, I feel like these things come up in unexpected ways, and it's super project-dependent.
Please note that Clearer Thinking, Mind Ease, and UpLift are all affiliated with this podcast.