with Spencer Greenberg
the podcast about ideas that matter

Episode 108: Taking pleasure in being wrong (with Buck Shlegeris)


June 9, 2022

How hard is it to arrive at true beliefs about the world? How can you find enjoyment in being wrong? When presenting claims that will be scrutinized by others, is it better to hedge and pad the claims in lots of caveats and uncertainty, or to strive for a tone that matches (or perhaps even exaggerates) the intensity with which you hold your beliefs? Why should you maybe focus on drilling small skills when learning a new skill set? What counts as a "simple" question? How can you tell when you actually understand something and when you don't? What is "cargo culting"? Which features of AI are likely in the future to become existential threats? What are the hardest parts of AI research? What skills will we probably really wish we had on the eve of deploying superintelligent AIs?

Buck Shlegeris is the CTO of Redwood Research, an independent AI alignment research organization. He currently leads their interpretability research. He previously worked on research and outreach at the Machine Intelligence Research Institute. His website is

JOSH: Hello, and welcome to Clearer Thinking with Spencer Greenberg, the podcast about ideas that matter. I'm Josh Castle, the producer of the podcast, and I'm so glad you've joined us today. In this episode, Spencer speaks with Buck Shlegeris about maintaining rational beliefs, asking the right questions, and machine learning and AI alignment.

SPENCER: Buck, welcome.

BUCK: It's great to be here.

SPENCER: So one reason I was excited to have you on my podcast is that I think that you think differently than anyone else I've met. I'm fascinated by the way you think about things. And so today, I want to go through a handful of different topics and just get your ideas on them.

BUCK: Sounds great.

SPENCER: All right. So the first is a topic related to rationality broadly. You have this blog post about the distinction between other people being wrong versus you being right. So can you tell us what that is and why you think that's important?

BUCK: Yeah. So I think I used to draw this false distinction between these two worldviews. One of these worldviews is that lots of people have lots of crazy beliefs, and it's easy for me to have more accurate opinions than what most people believe. This is the kind of belief that you have if you're a reasonably smart 16-year-old who's read some economics textbooks and thinks that a lot of people are dumb. And then there's this other perspective, which is like, "Oh, no, I think in general, when a lot of people think something, it's often hard to beat the market and have accurate ideas." And I think that, in hindsight, I made a lot of mistakes by phrasing the disagreement in those terms, because both sides are missing out on some important nuance (which I'll try to articulate). So here are the two sides: One side is, "It's pretty hard to have beliefs that differ from the majority of people and be right about that," and the other side is, "No, it's pretty easy, if you think about stuff, to end up with pretty weird beliefs that are correct despite the fact that most people don't hold them." There's a sense in which one of the core questions of rationality is: How hard is it to be right about things? And to what extent should you be persuaded by looking into stuff a bit on your own? That's a question that I've historically found quite interesting.

SPENCER: Right, because on the one hand, you could say, "Well, if all these different people believe things, and you're just another person, why should you be more likely to be right than anyone else?" You're just another person among many, and the average person is not going to be more right than the average person. And on the other hand, you can say, "Well, if we look around the world, it seems like there are all these crazy beliefs. Some people believe that the earth is flat, and some people seem like they have no understanding of economics, and so on. So it just seems not that hard to point out how other people are wrong." So how did your thinking on this develop over time?

BUCK: I guess the key difference between how I think about this question now and how I thought about it five years ago is that it now seems to me like it's not very hard to notice ways in which almost everyone has been pretty dumb about things, or ways in which widespread beliefs are quite false. But it's a lot harder to come to confident, specific, and correct beliefs about the world. An example of this: I think it's not very hard to correctly notice that, on a lot of questions about economic policy, all the mainstream opinions are pretty different from what actually good economic policy would be. However, I think it's much harder to say specifically what your favorite economic policy would be.

SPENCER: Do you have any more specific economic policy examples?

BUCK: I think I want to go for a different example than the economic policy one actually. I think that a key example of this to me is that I've been interested for a long time in the question of whether really powerful AI might pose an existential risk. And back when I first started thinking about this in 2010 or whenever, the mainstream opinion seemed to be that this was pretty crazy and definitely wouldn't happen. And I think I was correct to be skeptical of this mainstream opinion. But I think I was quite wrong about most of my specific beliefs about how AI would go down and the mechanisms by which it might pose an existential risk. And so, I think I made this incorrect jump from "The mainstream opinion on this topic seems quite clearly false," to "My understanding of this topic is probably pretty crisp and accurate." I think that my attitude should have been much more like, "Huh, there's this really interesting and important point that most people seem to be not thinking about. I'm probably very confused about it. But it sure seems worth more time to try and dig into more and get more accurate opinions on."

SPENCER: That example resonates with me, because I find that often people who dismiss potential risks from AI, their arguments seem really weak in the way they dismiss it. And of course, it doesn't mean they're wrong. No matter how weak someone's argument is, it doesn't imply that there isn't a strong argument to be made, but it's just shocking that people don't make better arguments against it.

BUCK: Yeah, that seems right to me.

SPENCER: So how do you view this asymmetry? What enabled you to identify that there was maybe something more interesting going on with AI, but also convinced you that you kind of knew the answer, simultaneously, when you may not have?

BUCK: This was not particularly my situation. But you can imagine the smartest 16-year-old in their class at school might end up having a bunch of dumb opinions, because it's easy for them to win arguments. If you're just better than other people at arguing for arbitrary policies, you might end up picking some contrarian direction. And then, in fact, whenever you argue with someone about which policies are good, you end up winning a bunch of arguments about how communism is great or whatever. If you're arguing against people who are as good at arguing as you are, then I think that arguing with people and seeing if they have more persuasive arguments than you is reasonably good at allowing you to come to true beliefs. But in cases like the AI one, it kind of felt like people weren't really trying. It wasn't like there were a bunch of people for whom it was really, really important to understand whether AI was an existential risk, and they all believed that it in fact wasn't. The situation was more like: Lots of people didn't really take this question very seriously, and dashed off some arguments that didn't seem very good at all about how AI probably wasn't an existential risk. When I looked at this, it was pretty easy to notice, without very much thought, that these arguments seemed pretty bad. But I wasn't practiced at the kind of intellectual labor where it's not just figuring out which of two sides of an argument is right, but trying to go further and get more accurate beliefs on your own, and I guess I just hadn't learned how to do that at all. And I failed to do so.

SPENCER: Yeah, it's interesting, because when there's two sides, essentially, all you have to do is take the evidence of one side, compare it to the evidence of the other side, say which is stronger. And that's not always easy, but it's more straightforward. Whereas if you're like, "Oh, there's two sides in front of me. But there also are thousands of other possible perspectives one could have, and somehow I have to come up with the right one." It's much harder to specify what that even is. It seems to involve some aspect of creativity and sort of an open exploration and so on.

BUCK: And in particular, if one of the sides is you, and you already have a set of arguments that are, in your opinion, stronger than the other side, you're unconstrained now, and you can take your beliefs in a whole bunch of different directions. And you'll be able to set it up so that your beliefs are still more persuasive to you than what the other side thinks. And maybe you'll even be right — that your beliefs make more sense than the other side's beliefs — but it turns out that making more sense than other people is not the bar for being accurate about things. And you need to hold yourself to a higher standard.

SPENCER: Right, it's like comparing two chess moves and being like, "Oh, that one's better than this other one, but maybe they both cause you to lose." [laughs] You might have just missed some other really much better move. Well, you bringing it up also reminds me of this failure mode that high IQ people fall into sometimes, where they're used to just being so much better at being coherent in a very complex way that other people can't reply to, that they can overflow attack other people's minds. And they convince themselves that they're right much more often than they are because other people just struggle to argue against them. They can form five paragraphs on the fly that make a really complex argument — that's wrong for several reasons — but almost nobody in real time can find the flaws in the argument.

BUCK: Yeah, I think I made a notable mistake of this type yesterday, talking to someone who is very smart, but not quite as speedy at thinking on their feet as I am. And I think I led our overall discussion toward more false conclusions by rapid-firing some bad arguments, in a way that I regret.

SPENCER: Yeah, I've seen politicians do this almost as it seems like as a strategy, where they'll get asked a difficult question, and they'll respond. At first, it sounds like they're going to answer the difficult question, but then they just keep talking for a long time. And I think what literally happens is that people forget what the original question is. And so by the time they finish speaking, a lot of people don't recognize that they didn't answer the question, because they just literally ran out of short-term memory to keep track of what they'd been asked.

BUCK: Yeah, for sure. One way that I tried to get around the problem — where you get used to being better at arguing about things — is sometimes I try to study subjects in which I'm an amateur, and then talk to people who are relatively expert in those fields, as practice in repeatedly making claims about things in a context where someone else can really easily shut you down. And I find this really enjoyable actually. I was studying organic chemistry for a while last year, and one of the things I love about studying organic chemistry is that it's extremely cheap to find people who will just answer organic chemistry questions instantly, enormously better than I can. I'm used to thinking of myself as pretty good at intellectual tasks. But I feel like organic chemistry majors who know an enormous amount about this subject are a dime a dozen. And it's really satisfying to practice a certain type of not being overconfident, just through the other person being able to immediately think five steps ahead of me, like, "And then the proton comes off over here and blah, blah, blah." And I have found that this is a useful exercise for maintaining humility.

SPENCER: So I think most people would find that pretty unenjoyable. Why do you think you enjoy that?

BUCK: Uh, I don't know. I don't know if it's enjoyable, per se. It's somewhat enjoyable. I just think it's a really crucial thing to practice. I think being wrong and remembering what it's like to be wrong in cases where there's no way that you could miss being wrong is a really core skill. And one of the surefire ways to completely waste your life is to not have that skill when you need it. And so it seems really crucial to practice it whenever you can. And this is a relatively cheap way of doing it. It's funny in the same way that having extremely awkward conversations can be enjoyable. Sometimes you're just like, "Oh, geez, this is really about as bad as it could be or something." I enjoy playing a trick on myself or putting myself in a situation where the only way to get through it is to do something which is amusingly unpleasant or something. I think that's why I enjoy it.

SPENCER: You seem like the sort of person that might enjoy Curb Your Enthusiasm, where I'm the sort of person that cannot watch that show because I squirm in embarrassment.

BUCK: Yeah, I find a squirm comedy or whatever you call it pretty intolerable, but perhaps I should spend more time trying to get over that.

SPENCER: One way I try to do this is by posting essays publicly, often on Facebook. And a lot of times I'm really happy with how things are received. And every once in a while, [laughs] I get my ass kicked by people who know way more about it than I do. But I find that really helpful, because it helps calibrate me to be like, "Oh, am I gonna get my ass kicked on this topic? Hmm, let me think about it. Let me try to poke flaws in my own argument before I post it." And over time, I build up more confidence as I get my ass kicked less. But one time I posted a thing about philosophy, and some philosophers really took me to task [laughs]. It was really obscure stuff like John Rawls, too. And I was like, "I think that's a good life lesson."

BUCK: Yeah, this happens to me, sometimes I really enjoy the experience of posting things and having basically that experience. I have this experience in private Google Docs most of the time. Some people who I particularly enjoy getting harsh criticism from (because they're very smart and disagree with me on a bunch of stuff) will write very nicely. It's a good time; it's very hard though. I'm pretty interested in this question of the extent to which rationality is about pain tolerance and the extent to which — like the core problem of rationality — is that doing things that will enable you to find out whether you are in fact right is just deeply, deeply unpleasant, a lot of the time or at least some of the time. And if you aren't willing to endure substantial discomfort for the sake of having accurate beliefs, you just won't. You'll just eventually accrete false beliefs that it would be painful for you to update away from. So I have a constant vigilance approach where it feels like it's fine for me to get a lot of things wrong in my life, it's okay if I file my taxes late, and then the government gets annoyed at me and fines me or something. But it's really not fine for me to become too afraid of the pain of writing down my thoughts and getting criticism on them. I think a variation on this which seems really important to me is that a lot of the time when I write things down, I'm writing them down with the goal of being falsified easily rather than with the goal of being defensible. And I think this is quite different from how people write things a lot of the time. But I think it's, in fact, very useful. So for example, suppose I have some new opinions about how you should go about learning machine learning. And I think that a lot of times when people write up their thoughts on complicated topics, they try to hedge. 
And I try my best when I'm writing about things like this, to do the strong-opinions-weakly-held thing or whatever, where you can say at the top of the document "I don't know, maybe I'm wrong about a bunch of this." But then throughout the document, you try to state things as bluntly and directly as possible with the goal of maximally demonstrating the difference between what you're claiming and what people normally think, so that it's as easy and obvious as possible for people who think you're wrong to jump in and say where they think you're making a mistake, or that they think you're making a mistake.

SPENCER: I like that approach because I think that there's often this form of writing people do where they're not saying anything concretely enough or strongly enough for you to really disagree. They're like, "Well, this thing happens. And of course, sometimes it doesn't happen. But still, we should take it into account." And you're like, "But there's a way of reading that where you're not really making any claim at all." And you always have a defense like, "Well, I said that it sometimes doesn't happen, you know." I really like that. I go into writing essays with a different mode, which is just me trying to figure out what I believe is true, as opposed to the mode of trying to convince the audience and the mode of trying to allow the audience to falsify me. So maybe there are three interesting different modes of writing.

BUCK: I think that seems right. I think there's a version of this that you should do internally as well. When I'm trying to figure out what I believe about things, a lot of the time, my approach is to make up some extremely oversimplified frame. For instance, an opinion I decided that I believed this week upon thinking about it is: When I first started doing machine learning, I thought it was about having intuitions. But now I think that the hard part about machine learning is executing on the engineering required to do your experiments. And when thinking through this, I found it really helpful to try and exaggerate this as much as possible inside my own head and to temporarily behave as if my belief was that this dichotomy is 100% accurate. And then whenever anyone said something, I would first think about it. Whenever anyone said something relevant to this question, I would first think of it in that frame with the hope of basically like, "I have this shiny new frame and the frame is probably not 100% correct." But in order to figure out how correct it is, it seems really healthy to start out by trying to interpret everything through it and see how often it needs to make concessions.

SPENCER: That resonates with me too because if you think about a physicist who comes up with some new theory, they need to be able to cleanly say whether the theory does or does not predict each new experiment. They can't suddenly combine it with the other theories when they're checking the evidence, right?

BUCK: Yeah, sometimes you do want to combine multiple theories. But I think that a really useful step is to check whether, in fact, your new theory does explain everything first.

SPENCER: Right? When you're trying to make your most accurate prediction, you want to combine multiple models together. But when you're validating a theory or trying to improve it, you want to use it in its pure form to see where it's getting things wrong.

BUCK: Another example of this is, when I'm thinking about things related to economics, I often try to, as far as possible, model everything with everyone being totally rational, and then see what that says (which has a bunch of really crazy conclusions). For example, I was thinking about wage theft recently. Wage theft is where, for whatever reason, you're obligated to pay wages to your worker, and then you don't pay the wages to the worker. If you try to think about this from a cartoonish Econ 101 perspective, you get these pretty wild sounding claims. For example, you start believing that wage theft is plausibly good on average, basically, or it's efficiency-increasing on average a lot of the time for reasons I could get into.

SPENCER: But that's basically you working through standard Econ 101 assumptions, and that's the conclusion you seem to get.

BUCK: That's right. Or maybe another example I was trying to think about this week was the improvements in factory safety over time. There's this question of, "Why is it that factories are safer now than they used to be?" and, "In as much as there are cheap ways that factories were able to improve safety, why is it that factories didn't compete for workers by doing the safety mechanisms? Given that workers — you would have thought — are strongly incentivized to not want to get injured or killed." I think it's not very parsimonious to try to explain the history of factory safety by pretending that everyone is entirely rational all the time. But it still seems really important to me that when you're starting to think about factory safety, you try thinking about it with the lens of everyone being rational all the time. And you think about no behavioral economics at all, until you have figured out which things were in fact not predicted by everyone being rational. And then you have to start thinking about behavioral economics or whatever. But I think that people are often too inclined to jump to behavioral economics — which is a much less predictive theory — much too quickly. And they should have started out by really checking which things actually needed to be explained.

SPENCER: When you say behavioral economics is a less predictive theory, do you just mean that there's so many more parameters to tune that it's harder to say what it predicts in a particular case?

BUCK: Yeah, that's right.

SPENCER: Right, because it's sort of a grab bag of like, "Oh, people fall for the sunk cost fallacy. And they also have loss aversion." And it's very hard to know, in any particular case, "How strong is each of these forces?" If the two of them come into competition with each other, "Which way is it gonna go?"

BUCK: Yeah, exactly.

SPENCER: I see. The way I think about it, when you're trying to model the way people behave, you've got these bottom-up and top-down ways of doing it. The top-down way is, as you say, to assume rationality and then do an analysis, and then you can start deviating one assumption at a time from rationality. You're like, "Well, maybe people don't instantly have access to all the information. How does that change things?" The other way is to go bottom-up: just start with these heuristics and biases and build up from there. I think they can both be really fruitful, and I like to try to appreciate both ends. But I do see your point that if you start top-down (assuming rationality), sometimes that will just fully explain the thing. And then you're like, "Oh, okay, that's why this thing exists, because even rational agents would behave this way. It may seem like a strange behavior, but it actually makes perfect sense when you think about it."

BUCK: Exactly.

SPENCER: Another topic that this conversation reminds me of: I've been thinking about this idea that some fields require tons of factual knowledge in order to understand them, whereas others require much less, or the factual knowledge they require is the sort a lot of people know anyway. A contrast here would be mathematics. It's really hard to do anything in mathematics until you've learned a bunch of notation. You just can't do it in normal English words; it doesn't really work, it falls apart. Physics would be another good example. And in organic chemistry there's probably a huge number of facts you have to know before you can be fluent in it. Then there are other things like psychology, where I feel like a lot of people who've never done any kind of psychology could pick up a psych paper and basically understand it. Maybe they wouldn't understand some of the statistical terms, but they could basically understand what happened. I'm wondering if you draw that distinction between different fields like this. I find that in fields that don't require a lot of specialized knowledge, I feel very comfortable wading into them and even debating people, even though I don't know the field very well, because I feel like I can rely on a lot of other knowledge I have, and a lot of other heuristics and ways of analyzing things. Whereas in very fact-based fields, it's extremely hard to do that, and you're gonna get your ass kicked almost immediately, because it's like trying to argue about math when you haven't studied math. You're totally out of your depth.

BUCK: That seems right. I think that to some extent, it's easy to wade into psychology because psychology is kind of a shallow field (or I wildly hypothesize this). I'm interested in management or running large organizations, as an example of a field, which — to some extent — doesn't rely on specialized knowledge. When I describe things that I find challenging about running a small organization or running my part of a small organization, it's not like any of them require particular jargon to understand, but I still feel like I would get my ass kicked if I talked to someone who really knew what they were doing with running technical research organizations or whatever.

SPENCER: Right? So maybe there's two different distinctions here: one is about the level of depth of knowledge, and the other is about just having to know a lot of facts or a lot of terms. Like with history, it's extremely hard to discuss history unless you know a lot of history. What are you gonna do, talk about complete generalizations otherwise? So maybe what you're saying is management is not very heavy on facts and terms, but maybe it is heavy on depth of experience.

BUCK: It feels to me that I can imagine there being a better management book, where the format is, basically, every page is a description of a problem that a manager had, and then a description of how the author of the book, or the manager in question, reckons you should handle it. It's just a book of case studies in management that are four pages long each (I guess I've gone up to four pages from one). And I don't know; it feels to me, in hindsight, that most of the things I know about management (not that I know an enormously large number of them) could be taught to a substantial extent by just reading a large number of case studies. I would love to read that book. I read a book of case studies from the British public mental healthcare system a while ago. My inspiration for reading it was: I feel like a lot of the time when I read a book, I have to worry that the authors are somehow biasing it, for some reason, like they're trying to sell me on some claim about the world being true. But if I read some book of case studies from the British mental healthcare system, it feels like I'm totally not the audience for this book. They're just writing it for people in the British mental health care system. And so, I'm actually gonna get something a lot closer to random, representative examples of what it's like for people who interact with the British mental health care system. And I think I learned a lot about mental health from that. But I wish I could do this more often. I've also wished this for relationships. I feel like if I were 16, and I read a book of a hundred case studies of relationships between people who are 18 to 24, or whatever, that would have been extremely valuable for me. And I wish that such books existed more often.

SPENCER: This method sounds a lot like the method that Harvard Business School uses to teach the students. They call it the case method where essentially they give you a case, it's generally drawn from a real life situation, and you're supposed to ponder it. And then they have a class discussion about what you do if you were the CEO, or if you encountered that.

BUCK: That sounds great. Do they? It feels like the way to do this is going to have to include what actually happened in the real story so that you can get some kind of feedback.

SPENCER: Yeah, and it does. But I think my main concern about the case study method is that people generally don't agree on what you should do. And it's hard to know the counterfactual, maybe if you could do hundreds of examples (and they weren't biased) then you knew what happened. In each case, maybe you'd like to start to build this intuition, especially if you really tried to really say what you would do before you found out what happened. But I worry that what happens is (in at least some of the cases) the person writing it comes in with a preconception about what the right way to behave is and they write the case to push their perspective. And then, in fact — and people debate these in class — there's like five different answers about what you should have done for different reasons. So maybe it's just not that clear what the right update is on this.

BUCK: Yeah, that makes sense. There's this question of how much of the value comes from getting better at the different hard parts of the problem; in different domains, you get different proportions of the value from each. Or, to state that more concretely: I feel like a lot of the mistakes I make are really basic, rather than things that are pretty complicated and difficult. The decisions that I most regret, a lot of the time, are things where I don't think you could have plausibly argued that the thing I did was the right decision, such that improving feels more tractable. I think what you're saying is that it seems difficult to use that method and get actual feedback on what you should do in really difficult cases. But it feels like a lot of the time, it's worth your time to consider how to get the easy cases right reliably. If I were trying to hire a manager fresh out of business school, I think I would be pretty satisfied just believing that they weren't going to really, really mess things up, or just thinking that 95% of the time they get the easy questions right, and I'm fine with them not getting the hard questions right. So it also seems healthy to ponder the hard cases a bit, as long as you think of the activity as pondering, rather than as trying to come to opinions about how to actually handle those situations.

SPENCER: Yeah, maybe there's a bias towards picking certain more exotic cases, because it just seems more interesting and novel. But really, a lot of the things we have to do are boring things like, "Oh, you have to hire a person, you have to choose between these two candidates, who do you choose?"

BUCK: Yeah, you could actually get a bunch of unbiased samples from that particular question, but then evaluating them is hard.


SPENCER: We've transitioned to talking about learning, so let's go deeper into that. You have a lot of interesting thoughts on learning. So you have this idea of drilling small scale things, do you want to explain that?

BUCK: I think that, in general, when you're trying to learn some skill — and the one which is on my mind most right now is programming — it's really worth getting fast at the stuff which takes less than a minute. And one of the reasons for this is I think that humans are just systematically much better at learning things which are very fast to do than at learning things that have much slower feedback loops.

SPENCER: Just because of the amount of reps you can get?

BUCK: Yeah, basically, I think that in cases where there's some skill, and there's a sub-skill which allows you to get really, really fast feedback loops, like many reps of practicing that skill, you should be strongly inclined to practice that sub-skill a whole lot, and get really good at it before trying to get good at the other parts of the skill. For example, a lot of software engineers seem to me to be bottlenecked on being fast at writing five lines of code or something. When you're programming, you spend a whole lot of your time trying to solve pretty trivial problems that are only going to be five lines of code. You've got your list of whatever it is, and you need to flip it around, and filter by the something something, and then sort by this and then put it over here. And I think that being really, really fast at doing this noticeably increases your total productivity for a couple of reasons: One of the reasons is that you spend a non-trivial proportion of your total work time doing that set of activities. And so getting faster at it is going to make you better. But the other reason is, I think that people find it easier to remember something for one minute than for two minutes. So if you're in the middle of solving a problem, and then you have to write these five lines of code, I think you lose a lot more of your state if it takes you two minutes, rather than one minute, to write those five lines of code. So it seems to me that whenever you can, you should focus on trying to get really fast at the small skills. An example of this (which is particularly on my mind at the moment) is people trying to learn about machine learning, where a lot of the time the advice they're given is to try and replicate some of these big papers, which I think is maybe a week of work. But it's hard to know whether you've succeeded.
A lot of the time you have to run these experiments that take many hours to succeed at or to fail at. And even after the experiment doesn't work, you aren't really going to know why. It seems plausible to me that people in this kind of situation should spend more time drilling the basic skills.
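
The "five lines of code" drills Buck mentions look something like this sketch (the data and field names are invented): take a list, flip it around, filter it, sort it, and put it somewhere else.

```python
# A typical trivial subproblem: reverse a list of records, keep only the
# ones passing a filter, and sort the survivors by a field.
records = [
    {"name": "ada", "score": 91},
    {"name": "grace", "score": 47},
    {"name": "alan", "score": 73},
]

flipped = list(reversed(records))                      # flip it around
passing = [r for r in flipped if r["score"] >= 50]     # filter by the something
by_name = sorted(passing, key=lambda r: r["name"])     # sort by this

print([r["name"] for r in by_name])  # ['ada', 'alan']
```

The point of drilling is to make a pipeline like this take seconds of thought rather than minutes, so you keep the surrounding problem in your head.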

SPENCER: So what would that look like in a machine learning context?

BUCK: Coding up the small-scale components of neural nets, rather than trying to train neural nets. And then the way you tell whether you've succeeded is by comparing it to a reference implementation. You try to make your big transformer language model like GPT-2, and then you check that it does the same thing as the publicly available code that does the GPT-2 stuff.

SPENCER: But you're talking about just one little part of it, not building the whole model?

BUCK: Actually, I think building the whole model is not that hard. Training it is very hard. But I think plausibly, people should just build up the code required for the whole model, which to be clear is like 150 lines of code or something. It's not very hard. It's kind of absurd, really. OpenAI released the code and it's this tiny file.
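
As a toy version of the check Buck describes — comparing your own implementation against a trusted reference — here is a vectorized single-head attention checked against a slow loop-based version. The real exercise would compare your transformer code against OpenAI's released GPT-2 file; this is just the shape of the test, with made-up sizes and a made-up "student" implementation.

```python
import numpy as np

def attention(q, k, v):
    """Fast, vectorized single-head attention (the implementation being tested)."""
    scores = q @ k.T / np.sqrt(q.shape[-1])
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v

def attention_reference(q, k, v):
    """Slow, obviously-correct loop version used as the reference."""
    n, d = q.shape
    out = np.zeros_like(v)
    for i in range(n):
        scores = np.array([q[i] @ k[j] / np.sqrt(d) for j in range(n)])
        e = np.exp(scores - scores.max())
        w = e / e.sum()
        out[i] = sum(w[j] * v[j] for j in range(n))
    return out

rng = np.random.default_rng(0)
q, k, v = (rng.normal(size=(4, 8)) for _ in range(3))
assert np.allclose(attention(q, k, v), attention_reference(q, k, v))
```

The feedback loop here is seconds long: either the outputs match the reference or they don't, which is exactly the property that makes the drill learnable.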

SPENCER: That's amazing. So you also have this idea of seeking out the smallest question. Do you want to tell us what that is?

BUCK: I think that when you're learning subjects, something that you should be really keeping your eye out for is the simplest question in the field that you don't know the answer to. I think a lot of the time people try to learn physics or something, and their approach is to try as quickly as possible to answer hard questions about complicated subjects (and I think this is what I thought was cool when I was younger): delighting in questions that were at the limit of fanciness that they could possibly answer. And it feels to me now that it's a lot more productive to really seek out questions that are as simple-sounding as possible, while still being really hard to answer, or still demonstrating that there's something you don't understand about this subject. Here's a classic example: Spencer, can I get an air conditioner that doesn't have an exhaust vent? It just sits in my room and it cools down my room. Can I buy that, Spencer?

SPENCER: I would assume not but I don't know.

BUCK: Why can't I buy it?

SPENCER: I assume that something has to be expelled in order for the air conditioning process to work, but I don't understand it well enough to know.

BUCK: Yeah. Why is it that something has to be expelled?

SPENCER: You hit the limit of my knowledge but fortunately, I'm a mathematician, so I'm not too embarrassed.

BUCK: So it goes. All right. That's totally reasonable. But the air conditioner is connected to the outside world. Specifically, it's connected by the power cord. The thing I love about this point (which is due to my friend Peter Schmidt Nielsen) is a lot of the time people will answer that first question like, "Well, you know, the second law of thermodynamics, there's no way to expel, you'd have to be reducing entropy if you don't expel your heat somewhere." And then I love being like, "Okay, but what about the power?" [laughs] which I think, in fact, is just a really excellent argument, a correct rebuttal. In fact, it's possible to have an air conditioner that expels warmth from the room just using the power cord, though it's gonna be a very slow and bad air conditioner. But the thing I loved about this question was it feels super basic, in that if someone had asked me this question, I wouldn't have been in the 'well, maybe I don't know what's going on here' mode. I think I would have been like, "Well, of course I know how thermodynamics works." And this question demonstrated that, in fact, I was missing some core details about how this works. I think that seems really healthy. It seems like we should be seeking out these most basic questions in the hope of finding holes in the foundations of our knowledge, because it seems really valuable to me — if you want to understand things well — to have the foundations really, really solid.

SPENCER: This reminds me of the — I'm probably gonna butcher the name of this — the Mpemba effect. Have you heard of this?

BUCK: No, what's that?

SPENCER: It's a claim that sometimes you can have warmer water that freezes faster than colder water.

BUCK: Yeah, I have heard of this.

SPENCER: Right. So if you just say, "How quickly does it get to freezing?" it seems totally implausible that this is possible. My mental model says, "There's no way this could possibly be right, because for warmer water to freeze, it has to eventually get to the temperature of the cooler water." So there's a time it takes to get to the temperature of the cooler water, then it has to freeze from there, whereas the cooler water just starts at that temperature and then it freezes. So it seems impossible. But then if you really start thinking about this, you realize there are actually all kinds of simplifying assumptions that you implicitly made in that description of why this can't be possible. For example, you're assuming that the water is well-mixed, because — just as one possible example — suppose that your cooler water had ice crystals in it that could somehow block the water from mixing properly, whereas the warmer water maybe is better mixed due to being warmer. So that's just one consideration. But you can have a lot of different considerations like this that complexify your model and make you realize that there's way more going on than you thought.

BUCK: Yeah, for sure. My friend, Beth Barnes, who is a researcher at OpenAI, spent a bunch of time collecting counterintuitive physics problems for some AI alignment research she was doing. And so she's the absolute master of asking me questions like this where I get them wrong initially [laughs]. And then realize how many simplifying assumptions I've made and that these, in fact, do not hold.

SPENCER: How do you think about what a simple question is? Because in math, there are some questions that seem simple but that we know (because mathematicians have tried to prove them) are secretly ridiculously complicated.

BUCK: Yeah, I think one thing that I'm interested in is trying to build up decision procedures. By a decision procedure, I mean some way of taking some class of question, and then formally specifying the set of questions that I claim I'm going to know how to answer. For example, suppose that you want me to do arithmetic, and you're like, "Let's say an expression involves addition, multiplication, parentheses, subtraction, and division." I'm like, "Look, I fundamentally understand how to evaluate these expressions." Whether it's "three times eight plus two" or "23 times 784 times 5703," I believe that I will never get conceptually confused about how to answer these questions. They might be too long and too hard for me to do, especially if you require me to do them reliably, but I just do not think I will get conceptually confused. And the reason for this, basically, is I know how to answer the question in terms of decomposing it into subproblems. I know that the way you evaluate an addition expression is you evaluate the expression on the left, and you evaluate the expression on the right, and then I know how to add together two numbers. So one way that I think about simple questions is trying to expand the spaces of questions where I feel like I have decision procedures. I'd love to be able to claim that there is no question about basic mechanics that will ever confuse me (this is definitely false for me today). But I think it might make sense to aspire to there being no question about basic mechanics, involving only some fixed set of four concepts, that I will ever be confused by. I think this is not a crazy aspiration.
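
Buck's arithmetic example can be made literal: a decision procedure for the class "expressions built from numbers, `+ - * /`, and parentheses," which answers by decomposing into subproblems exactly as he describes. This sketch leans on Python's `ast` module for parsing, and refuses anything outside the class it claims to handle.

```python
import ast
import operator

# The formally specified class: binary + - * / over numeric literals,
# with parentheses handled for free by the parser.
OPS = {ast.Add: operator.add, ast.Sub: operator.sub,
       ast.Mult: operator.mul, ast.Div: operator.truediv}

def evaluate(expr: str):
    def walk(node):
        if isinstance(node, ast.BinOp) and type(node.op) in OPS:
            # Decompose: evaluate the left subproblem, evaluate the right
            # subproblem, then combine the two results.
            return OPS[type(node.op)](walk(node.left), walk(node.right))
        if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
            return node.value
        raise ValueError("outside the class of questions this procedure claims to answer")
    return walk(ast.parse(expr, mode="eval").body)

print(evaluate("3 * 8 + 2"))        # 26
print(evaluate("23 * 784 * 5703"))  # 102836496
```

The interesting property is the one Buck names: the procedure can be long and tedious to run, but it never gets conceptually confused, and it can tell you when a question falls outside its claimed class.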

SPENCER: For some set of four concepts, you're saying?

BUCK: Yeah, for some set of four concepts. For instance, I think that objects moving around in a vacuum exerting forces on each other is a fairly conceptually easy part of classical mechanics, where you can basically just think about Newton's laws and you get the right answer. Things get way harder if you're allowed to have objects that are stiff, or things that are spinning around and stiff. There's a whole suite of things that makes physics way conceptually harder. But your question was, "What do you mean by a simple question?" and I think the way I approach this is trying to define classes of problems such that I believe I'm always going to be able to answer questions about them. And perhaps I should also say, it counts as a success if I'm always able to either answer questions correctly, or note that a question is not inside my space of questions that I always answer correctly.

SPENCER: I find things a lot more interesting if I can't come up with a decision procedure to do them. If I can come up with a decision procedure, even if I can't do the thing, it's like, "Okay, well, I know how you could do this. I know how you would do this on the computer." So I'm not that interested in actually doing it, right?

BUCK: Yeah, the whole fun is coming up with the decision procedure, and then you're bored of that class of problems. And now you have a new class of problems.

SPENCER: It seems like in physics, some of the simpler questions are some of the hardest to answer, like stupid sounding questions like, "What is gravity?" And a physicist is like, "Oh, shit, I don't really want to have to answer that question." Or, "What is time?" Or, "What is time made of?" Or, "What is space made of?" Like the kind of questions a child might ask, that are profoundly dumb in some sense.

BUCK: I kind of want to be like, "It's not a physicist's business to know what space is." I'd claim that's a confused question. I feel like asking what an apple is, is just a much better-defined question than asking what space is. Among other things, it's not clear how to operationalize the question at all. So I don't know if I quite buy your thing.

SPENCER: Okay. What about, "What is gravity?" Is that a fair question?

BUCK: I feel like I am more personally drawn to questions of the form, "Am I able to accurately make predictions about gravity in a large set of situations?" And, "Am I able to notice when I'm actually in one of those situations that is outside the things I know how to model?" And I feel like maybe we could ask, "What are the simple laws that, in fact, describe every situation?" And that's a reasonable question. I don't know. I can imagine a good answer to that question, I guess. One of my friends is currently working on trying to improve the bounds on the busy beaver function. So she's trying to figure out exactly which five-state Turing machines halt, that is, which of these short computer programs halt instead of running forever. And one reason I think this project is really cool is that it's like trying to do all of math in order. There's a sense in which knowing whether a Turing machine halts is the purest form of having a math question. And she's just trying to answer all the math questions, starting with the simplest ones. We've figured out the halting of all of the machines of size four, and so she's working on ones of size five. In some sense, she's doing math in order of easiness, in order of simplicity, and she's just working on those last however many questions.
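
Work like this rests on simulating Turing machines. Here is a minimal simulator running the known two-state busy beaver champion (the five-state case Buck mentions is the hard, open one); the transition table encoding is my own choice, but the two-state results — 6 steps, 4 ones on the tape — are the standard known values.

```python
# Transition table: (state, symbol read) -> (symbol to write, head move, next state).
# 'H' is the halting state. This is the 2-state busy beaver champion machine.
BB2 = {
    ("A", 0): (1, +1, "B"), ("A", 1): (1, -1, "B"),
    ("B", 0): (1, -1, "A"), ("B", 1): (1, +1, "H"),
}

def run(table, max_steps=10_000):
    """Simulate a Turing machine on an initially blank tape; cap steps in case it never halts."""
    tape, head, state, steps = {}, 0, "A", 0
    while state != "H" and steps < max_steps:
        write, move, state = table[(state, tape.get(head, 0))]
        tape[head] = write
        head += move
        steps += 1
    return steps, sum(tape.values())

steps, ones = run(BB2)
print(steps, ones)  # 6 4
```

The busy beaver project is, in effect, running searches like this over every small table and proving non-halting for the machines that never reach `H`.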

SPENCER: Why do you associate the halting problem with math itself?

BUCK: You can phrase a certain large class of questions about math in terms of whether a program halts. For example, the question "Is every even number the sum of two primes?" can be phrased as: suppose I have a program that looks at all the even numbers one by one, and for each one it tries all the pairs of numbers that add up to it and checks whether both those numbers are prime; if it ever finds an even number that is not the sum of two primes, it halts. Does this program halt? And the answer to that question is just the same as the answer to the math question of whether there exists an even number that isn't the sum of two primes.
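
Buck's Goldbach program can be written out directly. `goldbach_counterexample` is the hypothetical program whose halting is equivalent to the conjecture being false — so we never actually run it to completion, and instead only run the bounded check.

```python
from itertools import count

def is_prime(n):
    return n > 1 and all(n % d for d in range(2, int(n**0.5) + 1))

def is_goldbach(n):
    """Is the even number n a sum of two primes?"""
    return any(is_prime(p) and is_prime(n - p) for p in range(2, n))

def goldbach_counterexample():
    """Halts (returning a counterexample) iff Goldbach's conjecture is false.
    If the conjecture is true, this loop runs forever."""
    for n in count(4, 2):
        if not is_goldbach(n):
            return n

def holds_up_to(limit):
    """The bounded, safely-runnable version of the search."""
    return all(is_goldbach(n) for n in range(4, limit + 1, 2))

print(holds_up_to(1000))  # True
```

Asking "does `goldbach_counterexample` halt?" is exactly the math question, repackaged as a halting question.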

SPENCER: So the Goldbach conjecture (what you're just describing) is sort of equivalent to that program halting?

BUCK: Yeah. And there's a bunch of other math questions.

SPENCER: So another idea — I'm not sure whether it's related — that you talk about is this idea of distinguishing between parts of the subject that you already know well, parts that you only roughly understand, and making that distinction clear. Do you want to tell us about that?

BUCK: Yeah. The way that the human psyche works is that your brain implies to your conscious mind that you're able to visually pick out lots of details of the scene that aren't in the center of your field of view, and this is in fact false. But it doesn't matter to your brain. Right now, I'm looking at a photo on my wall, and my bookcase is off to the side. And my brain is implying that I can see the bookcase off to the side. My field of vision has content over there, and this is a reasonably fine thing for me to believe, because if I wanted to actually look at my bookshelf, I would just look over there. But there's still a sense in which my vision — to a substantial extent — has a lot of illusory detail in it; my mind is acting as if I can see things that I actually can't see clearly. And similarly, I think this is the way that understanding subjects works a lot of the time. You might have hoped that when you know some statistical mechanics — but not enough statistical mechanics to answer statistical mechanics questions really reliably — and you get to the edge of your understanding, it would be very clear to you that you don't really know what's going on. But I think that, in fact, the way it often works (at least for me) is that my mind is like, "Oh, yeah, I know how to do this kind of homework problem or something." And I don't immediately have intuitive clarity on the fact that I really don't know what's going on here, and I'm just cargo culting some techniques that someone told me once. And I think that it's really valuable to be better at noticing whether you are, in fact, cargo culting something or whether you, in fact, really know what's going on.

SPENCER: Do you want to explain the idea of cargo culting?

BUCK: Yeah, cargo culting is a term for when you're just copying the actions someone else was taking, rather than understanding why they were taking those actions, I guess I want to say. Maybe you can give a better description.

SPENCER: Yeah, I think I read about this in Richard Feynman's book, where he talks about these island nations where people with much greater technology came and brought all this technology and, in some ways at least, brought a lot of conveniences and prosperity. And then what happened is, once the more technological group left, the people on the island — who didn't really understand how the technology worked — would start making things that tried to replicate the technology. They would make [quote] "radios" and "landing strips" but they didn't really understand the purpose of them. And so the radios didn't actually work, they just looked like radios. And the landing strips didn't actually have planes landing on them, they just looked like landing strips. And so the idea of a cargo cult is, they were basically imitating the outer structure without getting the internal details right, so they didn't get any of the benefits of the structures. I think this comes up all the time. And part of it is: what does it really mean to feel like we understand something? In a sense, it's just a feeling that we have internally, like when someone's explaining something to you and then you have this feeling, "Oh, I understood that." But did you really? And I've experienced a lot of times that, if you try to explain it back to the person, you realize there's some gap that you didn't notice when you felt like you understood it. And so trying to explain it back, I think, is a really powerful way around this, or trying to draw a diagram of the thing. Or if it's a method, then try to apply the method to something, because this often happens in math class: You learn an idea, and then you apply it to solve a problem, and you realize you're stuck and there's something you didn't quite understand about it.

BUCK: Yeah, for sure.

SPENCER: I also think this happens a lot with everyday objects around us. If you think about "How does a toilet work?" I think a lot of people have a sense that they know how toilets work. But if they actually had to draw a diagram, they would really struggle. And my guess is a lot of them will get it wrong.

BUCK: Yeah, this is a great example. This one they've empirically done, and people, in fact, do not know how toilets work, even if you ask them if they know how the toilet works. I think can openers and bicycles are some other highlights in my mind.

SPENCER: It reminds me of that meme where it's like, "Imagine going back in time and teaching them about electricity." And then you show the person a rock, and someone's like, "Well, how does electricity work?" [And then you're] like, "Uh, I don't know." [laughs] For the final topic, I want to discuss AI with you, and especially issues around aligning artificial intelligence to make it operate on behalf of humanity instead of against the interests of humanity. And I know you've recently founded an organization, is that right?

BUCK: Yeah, I wasn't the person who put the initial impetus into it. But I'm currently helping to run it.

SPENCER: Cool. And Redwood, is it called?

BUCK: Redwood Research. That's right.

SPENCER: Right. Redwood Research, great. And do you want to just give us a little background on what you do at Redwood? And then I'll have some questions about the kind of things you learned around that.

BUCK: Yeah, at Redwood Research, we're trying to do a bunch of applied alignment research. So we're trying to answer applied machine learning questions where we're actually training models. And our basic approach here is we want to take problems that we're worried about eventually causing grave risks for humanity. We want to find the problems that occur today that are most analogous to the problems that will occur later, or the technical difficulties today that are most analogous to the technical difficulties that will cause existential risk later. And then we want to try and solve those problems using the solutions available today that are most analogous to the kinds of solutions that will work long term. So we're trying to solve baby versions of problems that we really will care about eventually, with baby versions of techniques that maybe will solve those problems eventually. So in particular, we're not focused on problems that we don't think will be analogous to things that are a big problem long-term. And we're not focused on solutions that we don't think will scale to larger systems.

SPENCER: I think there's something really appealing about that, which is that it's so hard to work on abstract things like, "Oh, there's this general class of vague problems I can't quite define and to try to find a solution." Whereas if you can pinpoint it as a concrete problem that can be specified today, you can actually try things, iterate, see if they work, see how they fail, and kind of move from there.

BUCK: Yeah, that seems totally right. I mean, you still have this difficult abstract problem of deciding which things are worth working on. I think there's no real way around the fact that you have to engage in futurism, in order to decide which alignment research is going to be useful. Guessing how the future of technology is going to develop is how we decided we should work on AI alignment at all. And so it feels like there's no way to get around the risks associated with basing your plans on your guesses about the future.

SPENCER: That makes sense. But at least on a day to day basis, you can be working on things that should have a more concrete feedback loop to them.

BUCK: Absolutely. A lot of the time with the techniques we're working on, there's no question that we will eventually succeed, or at least there's no question that someone competent would eventually succeed. The question is just like, "Is it going to be by using technique A, B, or C?" And there's something pretty satisfying about the knowledge that we're just trying to develop some technology that is almost surely developable, rather than trying to solve a problem that might be impossible.

SPENCER: Can you give an example of the sort of thing you work on?

BUCK: Yeah. A problem that we're working on at the moment is trying to make a classifier that takes a snippet of fiction, and then it tells you whether the last sentence involves someone getting injured. Our goal is to make it so that we know of literally no cases and people can't find any cases where this classifier thought that something didn't involve someone getting injured, but it actually did.

SPENCER: It's funny, I saw someone post about this to Facebook, I didn't realize this was your project, where they were giving an example of a parent slowly and lovingly suffocating their child to death and [inaudible], right?

BUCK: Exactly, yes. The way that our project is a little bit weird is we're trying to have it so that there are no false negatives. And we're okay with a lot of false positives. So we're currently pushing for the goal of no false negatives and 60% false positive rate.
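
The asymmetric goal Buck describes can be sketched with a simple threshold policy (all scores and labels below are invented): push the decision threshold down to the lowest score any known injurious example receives, which guarantees zero false negatives on that set, then measure the false positive rate that choice costs.

```python
def pick_threshold(scores, labels):
    """Lowest classifier score among true positives: flag everything at or above it,
    so no known positive is ever missed."""
    return min(s for s, y in zip(scores, labels) if y == 1)

scores = [0.95, 0.40, 0.80, 0.30, 0.55, 0.70, 0.20, 0.65]
labels = [1,    0,    1,    0,    1,    0,    0,    1   ]  # 1 = injurious

t = pick_threshold(scores, labels)
flagged = [s >= t for s in scores]

false_negatives = sum((not f) and y == 1 for f, y in zip(flagged, labels))
false_positives = sum(f and y == 0 for f, y in zip(flagged, labels))
fpr = false_positives / labels.count(0)

print(t, false_negatives, fpr)  # 0.55 0 0.25
```

The hard research problem is not this bookkeeping, of course; it's making the underlying scores good enough that the forced-zero-false-negative threshold doesn't drag the false positive rate toward 100%.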

SPENCER: So that means that anytime you have something that involves someone getting hurt, it always will say that a person was hurt, right?

BUCK: That's right. And we are willing to sacrifice also having false alarms on a very large proportion of our input data.

SPENCER: Right. If you were willing to have 99% false alarms, that's probably really easy to do. But doing it at 60% false alarms, that's actually probably really hard.

BUCK: It's actually pretty hard at 99%. Or it's somewhat hard at 99%, as well. Most people don't try very hard. Neural nets are kind of complicated to get universal behavior from. The example you were saying (about the parents slowly and lovingly smothering their child), neural nets just really love to learn basic heuristics about what's going on in text. It's just actually a pretty good guess that if someone talks about slowly and gently and lovingly doing something, probably that didn't involve any injuries happening. And so to get it to be suspicious of texts like that, you really have to train it on a bunch of examples that were generated by humans who are trying to mess with the model. And once you start training the model on adversarial examples that people have found where it was making mistakes, then it gets a lot more suspicious of stuff like that.

SPENCER: That's interesting because the training data set is a very weirdly undefined thing in this case. The training data set has sort of the weirdest cases humans can come up with that trick it, rather than being drawn from some kind of fixed distribution.

BUCK: Yeah, that's right. So our goal is to be reliable on any text that seems like normal human text, where by normal human text, I mean it has to not be complete gibberish — it's possible to trick the model by giving it complete gibberish, and I think I'm defining that to be not part of the problem — but yeah, you're right. Part of the difficulty here is that we are trying to be robust to humans trying to trick us.

SPENCER: That's a really neat problem. In a sense, you're trying to find at least one case where adversarial examples don't work.

BUCK: That's right. So traditionally, it's always been doable to find cases where your model makes a confident, wrong prediction. And here, we're trying to make it so there are no such examples.

SPENCER: I've wondered whether the existence of adversarial examples sort of hinges on the way neural nets are defined. Because imagine you take an example, and the classifier gets it right. And then you try tweaking it in small ways in different directions, and see which way pushes it towards the wrong label the most. It seems like you can construct an adversarial example from the bottom up through this kind of mutation process: you keep mutating it a little more, so the classifier gets a little more wrong at each step. And the way this looks is, if you're taking an image classifier, and you start with a picture of a cat, and you want to get the classifier to think it's a duck, you can actually do this. You can keep adding little bits of noise that to the human eye aren't detectable, but that make it more and more duck-like to the classifier. And it seems like that's something about differentiating the entire neural net with respect to the example, and just walking that gradient.
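
Spencer's "walk the gradient" idea can be sketched on the simplest possible model, a linear classifier, standing in for the neural net (all the numbers here are invented). For a linear score `w·x`, the gradient of the score with respect to the input is just `w`, so stepping each coordinate a small amount against `sign(w)` pushes the prediction toward the wrong label while keeping every individual change small — the one-step, FGSM-style version of the mutation process he describes.

```python
import numpy as np

w = np.array([1.0, -2.0, 3.0])   # classifier weights: score(x) = w . x
x = np.array([1.0, 1.0, 1.0])    # score = 2.0, so classified as "cat"

def predict(x):
    return "cat" if w @ x > 0 else "duck"

eps = 0.8                        # per-coordinate perturbation budget
x_adv = x - eps * np.sign(w)     # step against the gradient of the "cat" score

print(predict(x), predict(x_adv))  # cat duck
```

With a deep net the gradient has to be computed by backpropagation instead of being constant, and many small steps replace the single one, but the geometry of the attack is the same.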

BUCK: Yeah, that's totally correct. Things are a little bit different with language. Because there's no continuous word space. There's no way to take the word 'through' and make it slightly more like the word duck.

SPENCER: But if you think of each word or each token as having a vector embedding, then you can do a discrete version of that. You could just, say, swap it with the nearest vector embedding.

BUCK: This is fairly reasonable. We have in fact done things like this. The thing which is tricky is: Suppose I have some sentence and the sentence is, in fact, injurious. And then I want to tweak it bit by bit until the classifier thinks it's not injurious. The basic problem is that one easy way of doing this is tweaking the sentence so that its meaning is different and, in fact, is no longer injurious.

SPENCER: Right, so you have to move it under a constraint: it has to stay an injury, or at least not become less of an injury. You need to find a direction to move it where it becomes less like an injury to the classifier, but not actually less like an injury, right?

BUCK: Yeah. And the crux of the whole project is: How do you know whether it has become less of an injury? If you have some other thing sitting around that will tell you whether there is an injury in a text, then you should just use that instead of your classifier. There's this 'no free lunch' theorem. For the first month of this project, we would just — over and over again — come up with plans that we would eventually realize ('eventually' meaning after one minute) hinged on some other way of knowing whether a snippet was injurious or not. And then maybe you could train the classifier on those or something. But no, actually all you've got is a classifier: anything that will tell you whether a snippet had an injury just is a classifier. Does that make sense?

SPENCER: Yeah. But I think there are ways around this. Take the example of a picture of a cat and you want to trick a classifier and think it's a duck. You might know that certain operations will never turn it into a duck. If I add a little tiny bit of random noise, and I only give myself a certain number of those random noises to add, then I know I can never turn it into a duck. And so, as long as I'm just constrained by that, anything I do in that way that makes the classifier think it's more like a duck is moving towards an adversarial example. So if you had a way of transforming text in such a way that you knew it was never becoming less injurious, but you still had a relatively high dimensional space to move around in, that would seem to be a way to break this, right?

BUCK: That's totally correct. But if we could do that, then we could also use this as part of our classifier. So here's a classification procedure: you take some text, and then you tweak it using rules that you know do not change whether it was an injury until you make it look as dangerous as possible, and then you see whether this new modified snippet is in fact dangerous according to your classifier.
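
Buck's folding-in move can be sketched in a few lines. Everything here is a toy stand-in: the "classifier" is a keyword check and the label-preserving transformations are trivial rewrites, but the structure — flag a snippet if any label-preserving variant of it looks dangerous — is the point.

```python
def base_classifier(text):
    # Toy stand-in for the learned classifier: flags a few explicit injury words.
    return any(w in text.lower() for w in ("hurt", "injured", "suffocated"))

# Hypothetical rewrites assumed NOT to change whether an injury occurred,
# e.g. spelling out a euphemism.
transformations = [
    lambda t: t,
    lambda t: t.lower().replace("smothered", "suffocated"),
]

def robust_classifier(text):
    # Worst case over label-preserving variants: if any variant looks
    # dangerous, the original must be treated as dangerous too.
    return any(base_classifier(f(text)) for f in transformations)

print(base_classifier("She lovingly smothered him."))    # False: base misses it
print(robust_classifier("She lovingly smothered him."))  # True: a variant triggers
```

This also illustrates the no-free-lunch point: the transformation set is itself injury knowledge, so once you have it, it belongs inside the classifier rather than beside it.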

SPENCER: So then you'd want to integrate that basically into the system.

BUCK: Yeah. Basically, my no-free-lunch claim is that there's a fixed amount of knowledge about injury; if you have a system which knows whether two snippets mean the same thing, you can just directly apply that to your problem.

SPENCER: Hmm, interesting.


SPENCER: So I know you've been doing machine learning stuff for a while now. What are some of the things you changed your mind about in the last few years when it comes to machine learning?

BUCK: So I have some hot takes on this topic. I'm gonna say it. I think that machine learning people agree with me somewhat, but not entirely on this. When I was first getting interested in machine learning (I first seriously started to learn deep learning stuff in 2017), it felt to me like deep learning was just this crazy and unprincipled bag of tricks, like people would have all these techniques that made no sense and there was no possible justification for them. And my impression of the field was that the key skill involved in doing machine learning research is somehow developing this weird intuition, this weird ability to divine what ways you need to tweak your neural nets in order to train them. And I now have a pretty different belief about the core problem with machine learning. My sense with machine learning now is that the core problem is that it's just a very difficult engineering challenge. Running experiments is hard, the experiments are slow, you care a lot about the experiments going quickly, and it's computationally expensive to do machine learning. And so, you have to use lots of software that is optimized for running really efficiently rather than being easy to use. The computers that you're running your experiments on are expensive to rent, because they have big GPUs in them. And so you have to be a lot more careful, for various reasons. Another problem here is there's no real way of knowing whether a change that you made to your code has done the correct thing or not. A lot of the time you try being like, "Well, what if my neural net was shaped like this, instead of that?" And suppose that the new way of shaping your neural net didn't go very well. It's hard to know whether this is because you coded it up wrong, or because this new way of shaping the neural net is, in fact, bad at learning things. And so, my new perspective is that machine learning is mostly an execution problem. 
It's mostly just, in the same way as if you wanted to build a bunch of crazy lasers in a lab that have to be aligned in very straight lines or something. There's just a lot of mechanical difficulty in making this happen. And it's mostly about execution. And that's my new perspective on machine learning.

SPENCER: I'm somewhat sympathetic, because we do some machine learning in our work. And there are so many times where we just can't tell if a machine learning algorithm is working properly. We're like, "Okay, we think it's working. But it's not getting as good a performance as we hoped. Does that mean there's a bug? Or does it mean there's something misguided in it? Or do we need to tweak the parameters of it? Or do we need to change the shape of it?" And even just things as simple as setting your hyperparameters: that could mean running something for 30 hours. And then what if the range you were trying was not right? There are all these really annoying, pragmatic questions in doing machine learning research that just seem like not the sort of thing you'd expect.

BUCK: Yeah. But in particular, my claim is: the main difficulty is, in fact, just doing kind of obvious experiments. A lot of the time, if you want to know how you should set your hyperparameters differently, it's not impossible to know the answer to this question. You just have to do things that are fiddly. So you might want to look at how quickly the weights in your different neural net layers are changing over time. And you can, in fact, learn things about how your hyperparameters should be set differently by looking at diagnostics like that. And the thing which is hard about this is, fundamentally, it's really annoying to implement diagnostics. It's just a hassle to implement these diagnostics, and the diagnostics have to run quickly. And the natural way of writing them doesn't give you a way of testing them until you've already trained your model. And it feels much more like an execution problem, where, "If only I could just do all of the ideas that are kind of obvious, I would have solved most of my problems." Whereas before, it seemed to me more like the challenge was coming up with good ideas, or divining how I should tweak things so that they would work better.
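The kind of diagnostic Buck mentions can be sketched in a few lines. This is a hypothetical, framework-agnostic toy (the layer names and numbers are made up, and real code would read weights out of a training framework): given parameter snapshots before and after an optimizer step, it reports how far each layer's weights moved relative to their magnitude, a common sanity check on per-layer effective learning rates.

```python
import math

def layer_update_ratios(before, after):
    """before/after: dicts mapping layer name -> flat list of weights.
    Returns layer name -> ||delta|| / ||w||, i.e., how large this step's
    update was relative to the layer's current weight norm."""
    ratios = {}
    for name, w0 in before.items():
        w1 = after[name]
        delta = math.sqrt(sum((a - b) ** 2 for a, b in zip(w1, w0)))
        norm = math.sqrt(sum(b * b for b in w0)) or 1.0
        ratios[name] = delta / norm
    return ratios

# Toy snapshots: layer1 barely moves, layer2 moves a lot.
before = {"layer1": [1.0, 2.0], "layer2": [0.5, 0.5]}
after = {"layer1": [1.001, 2.002], "layer2": [0.9, 0.1]}
ratios = layer_update_ratios(before, after)
print(ratios)
```

In this toy, layer2's ratio comes out hundreds of times larger than layer1's, which is the kind of signal that would prompt you to look at its learning rate or initialization.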

SPENCER: Does that mean you're putting more time into tool-building?

BUCK: Yeah, it at least means that I should be putting more time into tools.

SPENCER: [laughs] Well, if you model yourself in the first order as a rational agent, then you're putting more time in tool-building?

BUCK: Yeah, for sure. I would say that compared to how I used to look at these things, I've updated towards thinking we should spend more of our time trying to get really fast at executing the basic tasks of machine learning, of getting diagnostics, training things conveniently, testing our code to make sure that we aren't making dumb mistakes, and so on. And I've also updated towards thinking that it's easier to get value out of hiring less experienced people, as long as those people are very good at programming, because it feels to me like most of the problems are basically simple programming problems.

SPENCER: Right. Like good software engineering, like unit testing, and things like that?

BUCK: Or it's not simple software engineering; it's just difficult software engineering, rather than difficult software engineering which is also extremely conceptually entangled with a complicated mass of intuitions. The intuitions feel less important to me now.

SPENCER: Any other changes in your view on machine learning?

BUCK: I don't know, a lot of stuff seems a lot more principled than it used to be. It feels like a more elegant subject than it used to.

SPENCER: The methods people are using, they now have better justifications for why it makes sense to use them rather than just feeling like a hack or grab bag of tricks?

BUCK: Absolutely. It no longer feels like a grab bag of tricks. When I read a machine learning paper nowadays, I'm usually like, "Yeah, this is a pretty solid idea. It's not massively galaxy brain. I can imagine thinking of this idea kind of quickly." It sounds like a pain to actually make it work out in practice, and props to the authors for making it work out in practice. And I'm sure that of all the ideas that would sound as good as this to me, only one in ten would actually do something good. And so, we have to actually try a lot of things before it works out. But fundamentally, this isn't an absurd mystery, and it doesn't feel like there's no way I could imagine coming up with this thing on my own.

SPENCER: Is this because people have dropped using the methods that are harder to understand or because we've come to understand the methods that we used to not understand?

BUCK: I think it's some combination of the deep learning practitioners have figured out a bit more about the fundamental theory of what they're doing and the unifying themes, and also, I think I personally have just learned more about this.

SPENCER: What do you think some of those unifying themes are that maybe people didn't know five years ago?

BUCK: By 'people didn't know,' I think I really want to mostly go from the claim that I didn't understand. And I think my claim here is less that really legit machine learning practitioners were wrong. I know less about machine learning than actually very legit practitioners. But I at least was wrong, and I think that the ways I was wrong are shared by a lot of people who are passingly familiar with machine learning. Some things feel like key themes in machine learning that didn't feel as obvious back in the day: one example is variance reduction, or techniques where you trade bias for variance. Do you know about 'importance sampling'? Should I say what 'importance sampling' is?

SPENCER: Yeah. Can you tell us what that is?

BUCK: Yeah. Suppose I want to estimate the average of some random variable; maybe I want to estimate the average income of Americans. One way of doing this is just randomly picking Americans one after another, asking them how much money they make per year, and then taking the average over all these Americans. But there are faster ways of doing it, or ways of estimating the average income that are more accurate, in terms of how much you expect to be off, when you have some limited number of people whose income you can ask about. So suppose I have these two states: I've got California, and I've got some poorer state, New Mexico. And let's say that I know, for whatever reason, that people in California have higher incomes. And let's suppose that the income distributions look the same, just scaled up in California. This, in particular, means that as well as the mean income being higher in California, the standard deviation of income is going to be higher in California. I want to estimate my average income for the nation by first taking a statewide average income for each of these two states, and then mixing them together. And there's this question of, "How many of my samples should I draw from California? How many people's income should I ask about in California versus in New Mexico?" You might naively think that the answer is, "Well, you should take an equal number of samples from each, assuming that these states have the same population." But it turns out that we can get a more accurate estimate if we take more samples from the state which has a higher standard deviation of income, which, in this example, is California. And this is basically because there are diminishing marginal returns to getting samples from a population, and you would rather have a more accurate estimate of the state that's going to contribute more variance to your overall estimate.

SPENCER: Right. I think this becomes clear at the extreme. Imagine, for some weird reason, everyone in New Mexico has exactly the same income as everyone else in New Mexico. Then all you need to do is sample one person, and you'd completely know everything about the state. So that shows why, the more variance there is, the more people you want to draw from that state.

BUCK: Yeah, that's a really great example. That's a really great way of putting it. And so, I claim that a whole lot of stuff in machine learning is basically just using this trick over and over again.
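The allocation Buck describes can be simulated in a few lines. This is a toy sketch with made-up numbers, not anything from the episode: two states with equal populations, incomes modeled as normal draws, with California's distribution being New Mexico's scaled up by 2.5x, so it has both a higher mean and a higher standard deviation. Sampling each state roughly in proportion to its standard deviation (so-called Neyman allocation) beats an even split of the same total sample size.

```python
import random
import statistics

def sample_income(state):
    base = random.gauss(40_000, 20_000)           # NM: mean 40k, std 20k
    return base * 2.5 if state == "CA" else base  # CA: mean 100k, std 50k

def estimate_mean(n_ca, n_nm):
    # Stratified estimator: average the two per-state sample means
    # (equal population weights).
    ca = statistics.mean(sample_income("CA") for _ in range(n_ca))
    nm = statistics.mean(sample_income("NM") for _ in range(n_nm))
    return (ca + nm) / 2

def estimator_spread(n_ca, n_nm, trials=4000):
    # Standard deviation of the estimator across many repeated experiments.
    random.seed(0)
    return statistics.stdev(estimate_mean(n_ca, n_nm) for _ in range(trials))

even = estimator_spread(50, 50)    # even split of 100 samples
neyman = estimator_spread(71, 29)  # split ~proportional to stds (50k : 20k)
print(even, neyman)                # the Neyman split is noticeably tighter
```

Both splits give unbiased estimates of the national mean; the uneven split just wastes fewer samples on the low-variance state.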

SPENCER: So what's an example in machine learning that you think is essentially using this?

BUCK: Okay, so here's an example of a trick in machine learning that is basically trying to allow you to reduce variance in a way that means your thing learns faster. You probably know about how AlphaGo works, right? AlphaGo is this Go-playing machine learning system that tries to pick good Go moves by playing a bunch of games against itself. So how do you learn whether a Go move was a good move? Here's one strategy: You can have the system play a bunch of games against itself, and now you have a big data set of moves that the Go bot made and, for each move, whether that player eventually won the game or not. I can just play a game, then write down a list of all the moves, and then note whether white won this particular game. Say white won: I can just annotate all of the white moves with, "And then eventually, we won the game," and annotate all the black moves with, "And then eventually, we lost the game." And then, when we're trying to figure out how to update our strategy, when we're trying to figure out how the neural net should change how it plays in future, you can just tell it to update a little bit against all the moves that the losing player made, and a little bit towards all the moves that the winning player made. And this, in fact, is going to eventually work. But it's also going to be really noisy and slow. And the key reason why this is really noisy and slow is that if you play this really long game, and you only made one mistake, or there was one move which was strategically brilliant, the approach I just described is not really going to pick that out. It's going to cause you to update just as strongly towards all the moves that you made, even the ones that didn't matter very much. And this means that it's much slower. It's kind of a sadly slow algorithm. Does that make sense?

SPENCER: Yeah, absolutely.
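The naive labeling scheme Buck describes can be written out as a tiny sketch. This is hypothetical illustration code, not anything from AlphaGo itself: every move in a self-play game gets annotated with that player's final result, regardless of whether the move mattered.

```python
def label_game(moves, white_won):
    """moves: list of (player, move) pairs from one self-play game.
    Returns (player, move, target) triples where target is +1 if that
    player eventually won the game, -1 otherwise."""
    labeled = []
    for player, move in moves:
        won = white_won if player == "white" else not white_won
        labeled.append((player, move, 1 if won else -1))
    return labeled

# Toy game: white wins, so every white move gets +1 and every black
# move gets -1, even the moves that had no effect on the outcome --
# which is exactly the noise Buck is describing.
game = [("white", "d4"), ("black", "d5"), ("white", "c4")]
print(label_game(game, white_won=True))
```

A policy trained on these labels is nudged toward all of the winner's moves equally, brilliant and irrelevant alike.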

BUCK: So here's a way that we can get around this: We can train another neural net. What this neural net does is it takes a Go board, and then it predicts the probability, given this Go board, that white is going to win. And I can train this using the same data set that I was using for the previous thing: I can just play a bunch of games, and then annotate all the boards with who eventually won. And now the thing that I can do — which is great here — is that whenever we play a move, instead of judging the move based on who eventually won the game, we can judge the move based on whether this other trained neural network thinks that, as a result of that move, your probability of winning went up. And the nice thing about this is that it means there's much more specific and immediate feedback.

SPENCER: Nice. And so effectively, is this equivalent to reducing the variance in the learning process?

BUCK: Yes. So this is just a variance reduction technique, and here's why: The value function (the function that takes a board and tells you the probability that white will win) is just equivalent to what would happen if you played out from that board position an infinite number of times and counted how many times white won. The value function is going to be the same as the expectation of just playing the game once and seeing who wins from that position. But using the value function is going to be lower variance. Does that make sense?

SPENCER: Right. Because if you just look at whether they won that particular game, then you're essentially just getting one data point, one sample, because you can imagine that game could have gone different ways. Maybe if you played that game from that point 100 times, one player would have won 70 times. But if you're using this probabilistic estimate of the chance of winning, it's essentially taking into account a whole cluster of possible future games that could happen from that point onwards, instead of just one random sample from the possible games.

BUCK: That's right. And the one random sample from the possible games has the right expectation. It's an unbiased estimator of this true quantity that we care about, but it's much higher variance. And so, you're going to learn much worse from it.
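The point about the two estimators can be made concrete with a toy simulation (invented numbers, not Go): suppose white truly wins 70% of playouts from some position. Judging a move by a single rolled-out game gives an unbiased but high-variance signal, while an ideal value function reports the 0.7 directly: same expectation, far less variance.

```python
import random
import statistics

random.seed(0)
TRUE_P = 0.7  # assumed true win probability from this position

def single_rollout():
    # "Play the game once from here and see who wins": returns 1 or 0.
    return 1 if random.random() < TRUE_P else 0

rollout_signals = [single_rollout() for _ in range(10_000)]
value_signals = [TRUE_P] * 10_000  # what an ideal value function outputs

# Both signals have (about) the same mean...
print(statistics.mean(rollout_signals))    # close to 0.7
print(statistics.mean(value_signals))      # exactly 0.7

# ...but the single-rollout signal is far noisier.
print(statistics.pstdev(rollout_signals))  # roughly sqrt(0.7 * 0.3)
print(statistics.pstdev(value_signals))    # 0.0
```

A real learned value function would of course carry some bias of its own; the toy just isolates the variance side of the tradeoff.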

SPENCER: Yeah, that's a nice way to put it.

BUCK: I think this is probably the cleanest example of a variance reduction approach. You do things like this in a whole bunch of other places in reinforcement learning as well.

SPENCER: So maybe the idea is that there's a whole bunch of fundamental principles, like the bias-variance tradeoff, and importance sampling, and so on. And what a lot of these tricks amount to is some way of relying on one of these principles. It may not even be obvious at first how a trick is relying on the principle, but somehow it's using it to get better results.

BUCK: Absolutely.

SPENCER: For the last topic, I wanted to get your thoughts on AI alignment in particular. How do we build a beneficial advanced artificial intelligence? And I think you have thoughts specifically on the use of resources in that kind of research?

BUCK: Yes. There's now like many tens of billions of dollars that philanthropists have that they would like to spend on reducing existential risk from AI. That's enough to afford a lot of people doing a lot of stuff. And so, I'm interested in the question of what it would look like to, in fact, be deploying a bunch of these resources long-term in a way that reduces the probability of extinction from AI. Yeah, so I've been thinking a lot about this recently.

SPENCER: So basically, how do we use money to reduce the risk that something goes really horribly wrong? So what are your main thoughts on that?

BUCK: I think the main thought about this — that I've been really interested in recently — is trying to think about AI alignment efforts as split into things that we want to be doing at the end and things that we don't want to be doing at the end, where by 'at the end' I mean when we're building these really, really powerful systems that radically transform the world. So I think that some of the stuff that we do in AI alignment is stuff that we are gonna want to be doing at the end. For example, one type of AI alignment research is interpretability work, where you take some model that someone has made, and then you try and probe around with it and figure out exactly what it's doing and why. And I think that this is an example of a capacity that we're going to want to have a lot of at the end. When we're building these really powerful systems, we're gonna want to know if it's safe to deploy them, and we're gonna want to poke around at them a lot. And another example of a capacity that I think we're going to really want to have is the ability to take a model and try to really red team it — really look for cases where it does things that are as bad as possible — so that we have a sense of whether there are any cases like that. Those are both examples of things that we're going to, in fact, want to be able to do a lot when we're actually building really powerful systems.

SPENCER: So you do that kind of backward chaining. You say, "Okay, imagine we have something that's on the verge of being a superintelligent AI. What are the things we want to do with it? And what are the things we want to make sure don't happen with it?" Is that right?

BUCK: Yeah, that's right. Like what are the actions we're going to want to be taking on the eve of deploying these really powerful systems or whatever. And maybe we should start practicing those now.

SPENCER: So if you use this form of reasoning, what else is it going to tell you are important things to work on now?

BUCK: So I think that interpretability and red teaming are two examples of capacities that we are going to want to actually be deploying a lot of when we have these really powerful systems. And I've been thinking recently about this in contrast to some activities that we are not going to want to be doing very much at the end. For example, I have some coworkers who spend a lot of time thinking about the theory of AI alignment: in theory, what are algorithms we could use that lead to good behaviors? And it is definitely not my hope — unlike the previous things — that we are spending a lot of resources on this kind of theory work on the eve of deploying the AI, because basically it is much too late by then. So I've been thinking recently of trying to break things down into capacities that we actually need to build up at scale, and types of research that we should be doing early but expecting to scale down later. And I think the main example of the research that we should be excited about now but should be hoping to scale down later, or hoping to finish early, is the question of: Are there algorithms that we can develop that look on paper like they're going to have the properties we want? Or are we going to be in the more winging-it situation, where we don't have a principled reason to believe that our systems will be aligned, but we've built lots of tools for investigating them, and we've searched really hard to construct examples where they do bad things and haven't found any, and so we feel pretty confident in them?

SPENCER: So dividing things up into: these things that we probably will want to have really good capacity at towards the end before the system is deployed and these things where we really have to finish them much earlier. How does this change your view on what should be funded or how funders should think about this?

BUCK: I think the main thing is, it makes me feel excited for the type of work that I'm doing, where we at Redwood Research are really thinking of ourselves as trying to do things that are as analogous as possible to some of these capacities that we're gonna need a lot of in the future. We're trying to do the fire drill for really, really caring about finding out whether a model does bad things under any circumstances. So in our case, we have this dumb model that doesn't understand lots of things, and it doesn't matter at all if it is in fact wrong about stuff. But we're practicing pretending that we really, really cared that this model never makes a certain type of mistake. And I hadn't, until pretty recently, conceptualized this in the fire drill sense, as just practicing doing one of these actions that we're in fact going to want to do a lot of at the end.

SPENCER: I like the kind of work you're doing. I think that's a really interesting perspective. But I am a little confused about how you get there from this way of thinking, because you could also seem to make the opposite argument saying, "Well, if there's work that has to be done really early, because it can't be done at the end, we better do it right now. And we're going to focus on that because we won't have time later on."

BUCK: I totally think we should also be doing that work. It feels like one important difference is that we probably are never going to need to build as much capacity for that type of work as for these types of work that we are going to be doing at the end. But I'm very excited for people focusing on the types of work that we have to do now, and we share an office with several such people, and that seems great.

SPENCER: How did you pick the type of research you're doing? What makes you decide to work on that, rather than the more theoretical side?

BUCK: Mostly personal fit, or fit of our team.

SPENCER: Okay, so you're not arguing that everyone should do what you're doing. But just that maybe the way to think about this is there are two different types of research that you believe should be funded? And so, funders can think about, "What are the things in the theory space that — if they're gonna work at all — we've got to do them on the sooner side?" And then, "What are the things that we're going to want to have really good capacities at in the event that AI gets super advanced?"

BUCK: Yeah, for sure, that seems right.

SPENCER: Thank you so much for coming on and this was really interesting.

BUCK: Yeah, thanks for having me.


JOSH: Is there something you do every day that almost nobody else does?

SPENCER: It's hard to know for sure what nobody else does. But there are some things that come to mind that I think are really rare. One of them is keeping a prediction book, where I make predictions about important things happening in my life and what probability I think they'll happen with. And I found that a very interesting exercise. When I have a big event coming up or something I'm not sure about that's significant, I'll try to make a prediction about it. Another thing I do is I run Twitter polls really, really frequently. The reason I do this is because I find that they can be a fascinating glimpse into human psychology. You have to keep in mind the demographics, who's gonna respond. It's not a representative sample of the world, but it is a population that's interesting to me, because it's people who are interested in my work. And so, I actually try to make predictions about how people will reply. Under most of my Twitter polls, I make a prediction of what percentage of people I think will give different responses. And I use this as a way to learn about psychology, and also to give people interesting questions to ponder (which I think they find valuable). A third thing that comes to mind that I do that, I think, is pretty unusual is, when I'm trying to learn about a topic but I want to understand it on an intuitive level, I sometimes do this method that I call 'intuition flooding'. And the way that intuition flooding works is I'll get a ton of examples of that thing. And I'll just look at one example after another after another until my System 1 (the intuitive part of my brain) starts being able to model or make predictions about that thing. An example is I wanted to better understand behavior change interventions, and I found this database of real world cases where behavior change interventions had been implemented.
So I just went through every single example in the whole database, one by one, just seeding my intuition with all of these different examples of real behavior change occurring.



