with Spencer Greenberg
the podcast about ideas that matter

Episode 044: AI Safety and Solutions (with Robert Miles)


May 23, 2021

Why is YouTube such a great way to communicate research findings? Why is AI safety (or alignment) a problem? Why is it an important problem? Why is the creation of AGI (artificial general intelligence) existentially risky for us? Why is it so hard for us to specify what we want in utility functions? What are some of the proposed strategies (and their limitations) for controlling AGI? What is instrumental convergence? What is the unilateralist's curse?

Rob Miles is a science communicator focused on AI Safety and Alignment. He has a YouTube channel called Rob Miles AI and runs The Alignment Newsletter Podcast, which presents summaries of the week's research. He also collaborates with research organizations like the Machine Intelligence Research Institute and the Future of Humanity Institute to help them communicate their work.

JOSH: Hello, and welcome to Clearer Thinking with Spencer Greenberg, the podcast about ideas that matter. I'm Josh Castle, the producer of the podcast and I'm so glad you've joined us today. In this episode, Spencer speaks with Robert Miles about the urgent need for AI safety research and education, useful models and examples for communicating about AI, our collective ability to coordinate AI development, and strategies for limiting the impact of AI on the world.

SPENCER: Robert, thanks for coming on.

ROBERT: Thank you for having me.

SPENCER: A lot of popular YouTube channels are about things like people falling over and hurting themselves or maybe cute animals. I'm really intrigued by how you managed to be successful on YouTube, can you tell us a bit about that?

ROBERT: YouTube is a really diverse platform. There's a lot of different kinds of content and there are a lot of different approaches. And I think a lot of the problems with YouTube are actually the same problems that television has, which come more from the business model. Fundamentally, an advertising-based business model is going to affect the content that you produce, because you're trying to get as many eyeballs on it as possible, and that's really the only metric that's considered. And so, in general, the quality of science communication on television was usually not that high, because it's just not something that creators are really incentivized to optimize for. But you can produce television under different business models, like a documentary produced by the BBC, or by PBS, or things like that, which can potentially be much higher quality than the type of science communication you would get on cable news channels, or on any other channel that's just purely optimizing for the number of views it can get. And that's why I feel so grateful to my supporters on Patreon and the Long-Term Future Fund, who have given me a grant to produce my work. It means that I have the freedom to just try to make good quality work, rather than just trying to get as many eyeballs on my work as possible. That said, I think people overestimate the degree to which you have to simplify things. I think there is actually a real demand for more detailed and higher quality explanatory content.

SPENCER: I'm astounded by the fact that you have some videos on technical AI safety topics about how to make the world safer from artificial intelligence. They have close to a million views, and this kind of breaks my model of things. Are there really that many people that are excited by such technical subjects?

ROBERT: Yeah, this stuff is really, really interesting and really, really important, so it doesn't surprise me at all that people want to hear about it. I think AI safety is genuinely the most interesting topic in the world by a lot (I don't even know what second place is). If you think about the big questions of philosophy and the questions that you're asking on this podcast — questions about "What is thinking?" and "How can we do it better?" but also all of the philosophical questions about values, "What do we actually want?" "What are we even going for as a species or as individuals?" — all of these questions are tied together in the question of AI safety, about how you create intelligent systems that will do good things. You have to get to the bottom of what it means to be intelligent, what it means to do things and what it means for things to be good. And what's more, this is a question that we have to answer because there's a tremendous amount at stake. This seems to be the question of this century, perhaps. This is the point where humanity achieves all the things that we're hoping for, or destroys itself completely or worse. And it hinges on how well we address these questions. And so, it's not at all surprising to me that people would be fascinated about this. This is a collection of all of the most interesting questions in philosophy, with giant giant stakes, and a deadline that's plausibly in our lifetimes. Of course, people are interested.

SPENCER: That's very well said. I'm persuaded and I'm interested. Can you tell us about how you run your YouTube channel? Where do you get your ideas from? I know you also have a Discord server.

ROBERT: Yeah, so mostly, I run my YouTube channel pretty poorly. I try to maintain a high standard of quality. But that means that I get kind of bothered by perfectionism. I usually think that if I'm going to make a video about a subject, I want that video to be the best video about that subject.

SPENCER: That really resonates with me. That's the goal I try to set for any team I'm on: if we're going to do something, we should be trying to make the best version of it in the world. It doesn't mean it's the best in the world for all people (maybe, as a researcher, you're better off going and reading a paper about it instead), but for some audience, it should be the best thing in the world on that topic.

ROBERT: Absolutely. I feel like if someone came to me and said, "I'm interested in learning about X," and I would recommend some other video about X, then why would I even make my video? So that means that I end up producing things quite slowly sometimes, or I often will get partway through something and then throw it out, which means I have a very sporadic release schedule. And that's why I'm so grateful that I don't have to rely on YouTube (YouTube is not a source of income for me). I view it as basically a free video hosting platform that I'm very grateful to have. But YouTube wants you to upload very regularly and it's not nearly so concerned about quality, so I'm glad not to be a slave to those metrics.

SPENCER: It's a really interesting model (the Patreon model), where more and more independent creators seem to find a way to support their lifestyle of creating great content that people love, without being beholden to finding some way to slot it into YouTube's algorithm and payment scheme, or into some corporation willing to pay them for their work.

ROBERT: Absolutely. There's a sense in which it's a big step forward, and there's a sense in which it's a return to how these things used to work (Patreon even took its name from this). The way that the arts used to support themselves was by patronage, although that was usually a small number of large patrons rather than a large number of people each paying less. Being supported by advertising was an experiment, and an experiment that didn't really work out that well for everyone. And I think it makes sense to move away from it, you know?

SPENCER: Yeah, and I think one of the interesting things about your work is that, for a lot of people who see your ideas, this is probably the only format they're going to see them in. They might wish they could say, "Oh, I read this in some really complicated paper," and use it as evidence of their brilliance in understanding these really technical topics that are hard to understand. But in reality, they're going to learn from a YouTube video, and I know you have some thoughts on that.

ROBERT: Yeah, absolutely. This is a big part of why I think that YouTube is a great way of communicating research. Because people, as you say, are not honest about this (and this sounds like I'm really criticizing them; I don't mean it that way, I do this myself), but generally, people will say in conversation, "Oh, such and such a book makes this argument, or takes this position, or has this idea," when in fact they haven't read that book. They've read a summary of the book, or a blog post about the book, or they watched a video about it, or they read a tweet thread or something like that. People will give as the source of their information the highest status form of it, or the highest status thing that they think they can get away with. And this means that when people think about the general landscape of ideas and the transmission of ideas, those media of communication are overrated, because they're over-reported. In fact, the more lowbrow means of communication have a bigger influence than you would think just by observing how people report their ideas. That's part of why I'm focusing on YouTube: I think it's actually a really neglected area for communication, and you have to be prepared to take a bit of a status hit yourself to do it.

SPENCER: I was talking to a very successful author and asking him questions about his work. And he said something to me that I thought was fascinating, which is that he thinks very few people read his books, even though they're really successful. He thinks that people buy them, read a tiny bit of them, maybe read summaries, and then go discuss them at dinner parties. Buying his book is mainly kind of an excuse to talk about the topic, and it sets a Schelling point for conversation, like, "Oh, have you read such and such book?" It's very cynical on the one hand, but on the other hand, it also highlights the importance of other ways of learning this information, like yours.

ROBERT: Yeah, absolutely. I think it's easy, when you're taking Robin Hanson's view on these things, to get very cynical about this, but I actually don't attach moral judgment to it at all. People are busy, and books take a really long time. Probably the majority of books would be better as a blog post, the majority of blog posts would be better as a tweet, and the majority of tweets would be better not tweeted. (That's not mine, by the way; I don't remember where it's from, but I think it's really true.)

SPENCER: You probably heard it in a tweet.

ROBERT: It was almost certainly a tweet. But having the physical book in your house serves a useful function, right? It tells people about the kind of person you are and, as you say, it's a nexus for conversation and so on. It's something I've noticed since I basically stopped having physical books. The only books I have are the ones that people send me, basically, because I do everything with eBooks. So I really want somebody to make a company where you send them your Kindle library or your Goodreads profile or something like that, you select which books you want to display, and they print and send you a really nice poster that looks like a bookshelf with all of your books on it.

SPENCER: Then you could put it in the background and in the Zoom background?

ROBERT: Yeah. And it's not like people will see that and think it's books. People will come to your house (you know, when we're through all of this, people will be able to come to each other's houses again). They'll see the poster on the wall and they're not going to be fooled into thinking it's books, but it will serve the same function as books: people can say, "Oh, you like this book, I have this opinion about it," and it'll start conversations in the same way. It's something to replace the physical bookshelf when people aren't actually reading the books. Like, I read books, but I read them digitally. I don't think I've read any of the physical copies that I have on my shelf. But I have read those books, I just haven't read them [laughs]. I've read the text but I haven't physically opened those books.

SPENCER: Can you tell me more about the community that you built?

ROBERT: Oh, yeah. So this is a new project that I'm currently working on. I realized that the YouTube channel has been doing fairly well, getting decent numbers of views and comments and things, but it's, not exactly a dead end, but there aren't really obvious places to go from there if I'm trying to get people interested in the subject. I think there are a lot of people who watch my videos and enjoy them, and would be interested in doing more, but nothing immediately presents itself, so they go on with the rest of their lives. They have interesting topics to talk about (maybe at dinner parties) but it hasn't influenced their lives. And I was thinking about how I could improve that. I thought it would be good to have a community, so that people who come into AI safety by watching my stuff then have a place to go, a place where they can talk to each other that's better than YouTube comments, because YouTube comments are kind of broken as a forum. They're not nearly as bad as they used to be, but the software is not that good: it's not very good at keeping track of things, it's not good at threading, it's not a really good forum for conversation. So I've just recently started a Discord server, which currently is only for patrons, but plausibly, by the time this podcast is actually released, I will have opened it up. So far, that's been going really well. People are having a lot of cool AI safety conversations with each other, learning from each other, and recommending reading materials. Some people have split off and started some projects together, and stuff like that. And that's only a hundred people so far, because it's just Patreon supporters. But I hope that I can steadily expand out to a larger and larger group of people while maintaining the really high quality of community norms and conversation standards.

SPENCER: My understanding is there's an AI in there as well, right?

ROBERT: [laughs] Yeah, yeah, I mean, there's a bot. So the main thing that happens on the Discord right now is I have this bot which looks through the YouTube comments for questions people have asked (because people ask questions in response to my videos all the time, and there are a lot of them, and I basically don't have time to go through and give good high quality answers to each of the questions). The bot posts them in Discord: there's a channel where, every few hours or so, the bot will say, "Oh, somebody asked this question on YouTube. Here it is." And that has been a really great source of conversation, because the people who are there have varying levels of expertise, but they have their own thoughts about what the answer to the question might be. So the question sparks these discussions, and then, after a little bit of back and forth, people decide on what they think the answer should be. Then they can tell the bot, "Hey, here is the answer," and the bot can go and post those as responses on YouTube.

SPENCER: Oh, to take the answers and put them back on YouTube.

ROBERT: And this whole thing is built into the karma system that I've built, where it's not that just anyone on the Discord can say, "Oh, yeah, send this reply." Obviously, it's posting to YouTube in a sort of semi-official way, so I want some assurances of quality. So the way it works is, you post a response (the bot is called Stampy, by the way: it's a stamp-collecting bot, which is a reference to an old Computerphile video). You can say, "Hey, Stampy, post this response," and it won't post it until that response has a certain number of reacts (people can place a little stamp react on the response). Once it gets enough, the response will be sent. But it's also this kind of karma system, where the value of the stamps that you put on a response is determined by your score, and the score that you have is determined by how many stamps have been put on your responses. So it's kind of a recursively defined thing. But if you write responses that people think are good enough to send, then your judgment about which responses are good enough to send is given more weight.

SPENCER: Like the Google PageRank algorithm.

ROBERT: Exactly.

SPENCER: Yeah, that's the algorithm Google originally used to make their search engine much better than all the others back in the day. I'm sure, now, [laughs] their algorithm is probably much, much more complicated and probably uses a lot of machine learning. But back in the day, they realized that, to decide which sites to show you, it's not enough just to look at what keywords are in common between your search query and the website. They created this idea of PageRank, where if a website was linked to by a lot of other high-PageRank sites, that would give it a high PageRank. So it's kind of recursively defined: the more high-PageRank sites link to you, the higher your PageRank is, and when you link to a site, if you have a high PageRank, it gives them a higher PageRank, and so on. And it turns out that, although this seems like maybe an undefined system (because everything is determined by everything else), you can solve the set of equations and it actually converges, and then you can actually figure out how highly to rank every site.
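
The recursive ranking Spencer describes can be sketched with power iteration; the four-page link graph below is purely illustrative, and the damping factor of 0.85 is the value used in the original PageRank paper:

```python
import numpy as np

# Hypothetical 4-page web: links[i] lists the pages that page i links to.
links = {0: [1, 2], 1: [2], 2: [0], 3: [2]}
n = 4
damping = 0.85

# Column-stochastic transition matrix: M[j, i] is the probability of
# following a link from page i to page j.
M = np.zeros((n, n))
for i, outs in links.items():
    for j in outs:
        M[j, i] = 1.0 / len(outs)

# Power iteration: repeatedly apply the damped link matrix until the
# rank vector settles. Each page's rank is fed by the ranks of the
# pages linking to it, which is the recursive definition in the text.
rank = np.full(n, 1.0 / n)
for _ in range(100):
    rank = (1 - damping) / n + damping * (M @ rank)

print(rank)  # page 2, linked to by the most pages, ends up ranked highest
```

Even though every rank is defined in terms of every other rank, the iteration converges because the damped matrix is a contraction, which is the "you can solve the set of equations" point above.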

ROBERT: Right. And that is exactly what I'm doing with this karma system: everything goes into a big matrix that represents a system of linear equations, NumPy solves that, and that gives everyone their scores. Then I just have a ground truth, where my judgment is automatically given a score of one. That's how it's grounded in anything at all. I'm not even sure if that's strictly necessary, but I wanted it grounded in my idea of what's a good response, because it's my YouTube channel.
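
That description maps onto a small linear system. The sketch below is a hypothetical, much-simplified version (the stamp counts and the normalization rule are invented for illustration); the one detail taken from the conversation is pinning the channel owner's score to one to ground the recursion:

```python
import numpy as np

# stamps[j][i] = stamps user j has placed on user i's responses.
# A user's score is the stamp weight they receive, weighted by the
# scores of the users who gave the stamps. User 0 is the channel owner.
stamps = np.array([
    [0, 3, 1],   # user 0 stamped users 1 and 2
    [0, 0, 2],   # user 1 stamped user 2
    [1, 1, 0],   # user 2 stamped users 0 and 1
], dtype=float)
n = len(stamps)

# Normalize each giver's stamps so every user distributes total weight 1.
W = stamps / stamps.sum(axis=1, keepdims=True)

# The unknown scores s satisfy s_i = sum_j W[j, i] * s_j for i > 0,
# with the grounding condition s_0 = 1 replacing the equation for user 0.
A = np.eye(n) - W.T
A[0] = 0.0
A[0, 0] = 1.0
b = np.zeros(n)
b[0] = 1.0

scores = np.linalg.solve(A, b)
print(scores)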

SPENCER: Okay, so let's jump into the meat of the subject of AI safety. So basically trying to make a safe advanced artificial intelligence, do you want to set that up for us? Why is this a problem? Why should there be people researching this?

ROBERT: Yeah, there are a few different ways that you can frame this. There's the obvious thing, which is that advanced AI is pretty likely on the way, sooner or later, and it's very likely to be a huge deal. You can make a more specific argument about why that is, but I think it stands on its own that these systems are going to get more and more powerful, they're going to be put in positions of more and more responsibility, and they're going to get more and more capable compared to human beings. So one way or another, the specifics of these systems are going to end up having a really big impact on the future. And there's no guarantee that that impact is positive. Even if it is, there's also a lot of variation within positive outcomes: we could have a slightly positive outcome or a very good outcome. And the thing is, there are arguments you can make that suggest that a bad outcome is the default (which I think is true, and we can get on to those), but even in the absence of those arguments, the impact is very large and we don't have good arguments that suggest that good outcomes are the default. We don't have a really strong technical story we can tell for why our current approach will result in systems that actually do what we want and act safely and sensibly and reliably produce good outcomes. You could think of AI safety as trying to do the technical work necessary to tell that convincing story, a convincing technical argument for why our systems actually are safe, because right now, they really don't seem to be.

SPENCER: Some people might think that as you make a system more intelligent, it becomes easier to control, in a way. Imagine trying to interact with an animal: it's kind of difficult if it's a wild animal because you can't communicate with it, whereas if you're talking to a human, it's a lot easier to have productive interactions. And yet, with AI systems, I think sort of the opposite is true: early on, when a system is not that intelligent, you're basically just trying to get it to do more than it does. But then that flips over, and as it gets more and more intelligent, eventually you're trying to stop it from doing things you don't want it to do, and that actually becomes harder and harder. Do you agree with that, and do you want to comment on it?

ROBERT: Yeah, what it comes down to is a question of alignment. If your AI system is something sort of agent-like, that means that it has goals that it's pursuing. It has a utility function or an objective function or a reward function: goals of some kind. And it's choosing its actions in order to achieve those goals. So then you have the question of, "What are those goals? Are they the same as your goals?" When a system is weak, it can be misaligned and it's not a huge problem. You can have a system...

SPENCER: Well, I have an example I sometimes use, which is: imagine you're driving a car at five miles an hour; it just doesn't matter that much whether you steer very well. You have plenty of time to react, and if you happen to crash into something, the outcomes are probably not going to be that bad, as long as it's not a human. But if you're driving a car at 150 miles an hour, then you have to have amazing steering. You have to know exactly where you're going, and you had better not crash. And I think, as these systems get more and more powerful, we have to be more and more confident that they're pointed in exactly the right direction.

ROBERT: Right, and the other thing that's an important distinction to make is between narrow AI systems and general AI systems. And when people talk about how we might achieve AI, what they're usually talking about there is artificial general intelligence. And generality is a measure of how wide a range of problems or environments or contexts or domains a system is able to act in. So if you take something like AlphaZero or AlphaGo (for simplicity) it's massively superhuman in its performance.

SPENCER: Can you explain what it is?

ROBERT: Oh, this is a DeepMind Go playing program that is far better at Go than any human being. So it has massively superhuman performance but it's narrow. Its entire world is the Go board. And its whole concept of reality is constrained to games of Go, which means that if I'm playing Go against it, we're in conflict, and I'm guaranteed to lose, but I only lose in the context of a Go game. Whereas, if the context of the agent is the world as a whole, then having something which is more capable than you, if it wants something you don't want, becomes a real problem. Because a lot of the strategies that are available with narrow AI systems — the default things we assume when narrow AI systems are misaligned are things like, "Oh, it's not doing what we wanted, alright, let's turn it off. Let's tweak it, we got the reward wrong, you know, its goal isn't what we expected it to be. Let's turn it off and change it to something else. Or let's just turn the whole thing off and give up because this isn't working," — all of those things are now within the domain within which the agent is operating. And so, all of those things become the equivalent of a Go move to the system. Something that it can reason about, and make decisions about as part of its process of trying to achieve its goals. So if you have a system (which is general) and is trying to do something that you don't want it to be trying to do, you're in trouble because it's going to try to stop you from turning it off, stop you from modifying it, and stop you from preventing it from achieving whatever it's trying to achieve. So our standard approaches stop working once you have a general system.

SPENCER: Yeah, so let's walk through an example. Do you want to give one? Let's imagine a very intelligent AI in the future: it has some specific goal, and someone has built a stop button so that they can turn it off.

ROBERT: Okay. So this is one that I talked about on Computerphile as an example. You've got your AGI (it's in a robot body), you want it to get you a cup of tea, you've managed to define what that means, what a cup of tea is, and you want it on the desk in front of you. So it will then go and try to do that. But because there's a whole bunch of things that you haven't specified, there's a very good chance that in the process it will try to do something that you don't want it to do. Like, there's a vase on a narrow plinth that it will just knock over, because you didn't specify that you care about the vase not being destroyed. So when you rush over there to hit the button, it's able to reason in the same way that AlphaGo would reason about a Go move or anything else. Its objective is to get you tea; if you hit the button, it will turn off and stop, and you will not get tea in that scenario. Therefore, this (just like any other obstacle) is something to be avoided, something to be worked around: it won't let you turn it off. Or, if it's smarter, the situation potentially becomes worse. It then has an incentive to deceive you. It might do what you want it to do, because you have this ability to turn it off, and it will continue to behave in the way that you want until it's managed to get itself into a position where you're not able to turn it off. And then it will go after whatever its goal is.

SPENCER: So let's break that down: maybe it's so smart that it realizes that if it knocks over the vase, you're going to turn it off, and then it won't be able to make your tea, which is the only thing it cares about (in this example, it's been programmed to only care about making tea). So it doesn't knock over the vase, because it knows that you would turn it off. But then maybe later, when it knows you don't have the button available, it knocks over the vase, because it knows that now you can't stop it from making you tea and knocking over the vase; you no longer have access to the button.

ROBERT: Right. We haven't yet made the argument for existential risk from AI, which there was an opportunity to make earlier that we passed by. But there are reasons to expect that agents which are powerful optimizers (especially agents that are maximizers), if they're misaligned, will have a tendency to produce really bad outcomes by default (not just not what we wanted, but potentially things that we very strongly don't prefer). There are a few different ways to make that argument as well, but broadly speaking, the one I like to use is that when you are making decisions in the real world, a lot of that revolves around trade-offs. Anytime you're optimizing something, you're thinking about trade-offs: you can get more of one thing you want, but you have to sacrifice a bit of something else in order to get it. This is just a function of operating in the real world: money is limited, time is limited, matter and energy are limited. Almost everything trades off against almost everything else. And the smarter you are, the more ways you have to make these trade-offs. So if you make a system that only cares about making your tea, the chances are good that, for any other given thing you care about (whether that's the vase, or the continued existence of the continent that you're on, whatever), there will probably be some way to sacrifice that other thing to get a bit more effectiveness at getting you tea.

SPENCER: Okay, so let's maybe define what a utility maximizer is. And then I think that will make clear what you just said about why there are trade-offs to be made.

ROBERT: A utility maximizer is a very simplified way of thinking about agents and how they work. A utility maximizer has a utility function, which is a function that evaluates world states or world histories and maps them onto real numbers (which basically represent how desirable that world is). Perhaps you've given the system a utility function that's one in worlds in which there's a cup of tea on your desk, and zero otherwise; let's say it's a very simple utility function. Then the system looks at the actions available to it, models the utility that would result from each of the possible courses of action, and chooses the course of action that has the highest utility. Nothing in the real world actually behaves this way, but it's a simplification that a lot of the frameworks we use for thinking about AI systems rely on.
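
The toy agent described here can be sketched in a few lines; the world model, the action names, and the vase side effect are all illustrative assumptions, not anything from a real system:

```python
def utility(world):
    # The tea utility function from the example: 1 if there's a cup of
    # tea on the desk, 0 otherwise. Note it never looks at the vase.
    return 1.0 if world.get("tea_on_desk") else 0.0

def predict(world, action):
    # Hypothetical world model: returns the predicted state after an action.
    outcome = dict(world)
    if action == "make_tea":
        outcome["tea_on_desk"] = True
        outcome["vase_intact"] = False  # side effect the utility ignores
    return outcome

def choose_action(world, actions):
    # A utility maximizer: pick the action whose predicted outcome
    # scores highest under the utility function.
    return max(actions, key=lambda a: utility(predict(world, a)))

world = {"tea_on_desk": False, "vase_intact": True}
print(choose_action(world, ["do_nothing", "make_tea"]))  # make_tea
```

The vase never appears in the utility function, so the agent happily trades it away for tea, which is exactly the misalignment-by-omission problem discussed above.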

SPENCER: And real-world AI today often does have an objective function that it's just trying to maximize, right? So I think that's not unrealistic. It wouldn't be something like "Put tea on this desk," but it might be something like maximizing the score in the system, or something like that.

ROBERT: Yeah. Let's ditch the tea example because I don't like it. Let's go with this instead: you've created this AI system, and you run a company. And you just say, "Hey, this number here (which is the profits of our company, or the amount of money in this bank account, something like that), that's your utility function. Just get that number up. The higher that number is, the better you're doing." And so, the system is searching through the possible actions it could take, predicting how much money it ends up with (after a certain period of time, or however you want to do it), and going with the one that has the highest number.

SPENCER: Right, that idea of just maximizing the amount of money in the bank account.

ROBERT: So the problem with that is, you don't actually just care about the money in that bank account. There are other things that you care about. You might care about continuing to be alive or you might care about the welfare and wellbeing of your family or you might care about a lot of things.

SPENCER: Not breaking the law [laughs].

ROBERT: Right. And now, if you think about what's the highest possible value that this number could have (supposing you can do anything), what's the highest number, and what does that world look like? It doesn't look good, right? If you imagine a situation where most of the world is behaving approximately as it should be, it's pretty much always possible to imagine a little bit of extra money that you could get into that bank account by changing something, right? There's something in the world that you could sacrifice to get a bit more money into this bank account. And so, if you really maximize this value, that's guaranteed to be totally disastrous, because the system is looking for every possible way to trade things that aren't in its utility function for things that are. A world that's maximally oriented around having the largest possible value in this particular place in this particular computer is not a world that anybody really wants to live in.

SPENCER: Right. So I think there are two related issues here. One is, what strategies does it use to keep increasing that number? And the second is, what happens when you actually maximize that number; what does that world look like? In terms of strategies it might use to increase that number: you might tell it to just increase this number and that's all you care about, but in reality, you actually have all sorts of constraints on that. Like: a) you don't want it to break the law, b) you don't want it to hurt people to make that number bigger, and c) you don't want it to just hack into the bank and increase the number; you want there to actually be more money. Say it thinks that the number is just the number recorded by the bank. Well, if there were some way to change the number that didn't actually make you wealthier (say, taking out a really large loan or something like that), the number gets bigger but your debts get equally bigger, and you wouldn't want it to do that, and so on. So, if we think about all the different strategies for increasing that number, there are actually a huge number of them that you would not want it to consider. And a human, if you explained this to them ("I want you to make this number go up"), would probably immediately get that all of these other strategies are off the table, but the AI would have no way of knowing that, right?

ROBERT: Yeah, that's a very good way of putting it. The problem I have when I try to talk about this stuff is that I find it hard to give specific examples, because any specific example you give will seem silly, right? Like, it's pretty silly that you would actually create an AGI and ask it to collect stamps, or get you a cup of tea, or maximize the value of this bank account, or whatever.

SPENCER: Well, what about a trading system? That feels more realistic to me. Imagine a hedge fund creates a really complicated AI, gives it access to their trading systems, and just tells it to maximize money.

ROBERT: Yeah, that's true. But when it comes to the specifics of what the system does (and I like what you did there, where you laid out a whole bunch of strategies that would score well without being the thing we actually want the system to do), I find myself reluctant to talk about specifics, because whatever specific option you choose as your example of what the AI system would do, it's pretty likely that it wouldn't do that. It's probably not the actual optimal policy.

SPENCER: Right. But the point is just that there are a lot of strategies that are not what you want. And we don't know that it won't take one of them. Right?

ROBERT: Exactly, yeah.


SPENCER: So we talked about how it could use all these different strategies, including ones we don't want it to use. And then you brought up this point of, "Okay, but what if it actually maximizes the thing?" What does it look like to maximize, let's say, the amount of money in that bank account? If it's actually maximizing the amount of money in that bank account, in some sense it wants to take all the wealth in the entire world. And if it could actually achieve that, that's really scary, and you probably wouldn't want it to do that even if it could. And then there's also the question of, "Can it somehow hack that goal?" Where it treats the goal as, "What I really want is just this number to go up. I don't actually want to be wealthy, I just want this number to go up." And now you're starting to imagine things like it trying to build a computer that can store the largest number possible and connecting it to the system so that the bank account can hold the largest number ever stored, or something really weird like that.

ROBERT: Yeah, and whenever you work through one of these examples (taking the assumption of a superintelligent utility maximizer), you always end up spinning off into wacky sci-fi territory. That's unavoidable as a consequence of the setup, but it does illustrate the point: it's very, very difficult to specify what you actually want. Even if you have a really good specification, and you've managed to say, "I want to maximize the amount of money in this account" (where that's defined in a way that accurately captures what you mean by value), you end up throwing on a bunch of patches. You think, "Oh wait, with this definition, it might do this. Okay, we'll fix that by changing this. Okay, now it would do that; we'll fix it by changing this." This is how people often design reward functions for reinforcement learners and that kind of thing: they see what the system does wrong, fix it, see what it does wrong again, and fix that. You're pretty much guaranteed never to catch everything, because if you manage to build a system which is keeping track of the 20 things most important to you and making sure none of those are affected, the 21st most important thing to you is still fair game, and is probably going to be completely annihilated in the pursuit of this utility function. So the problem is one of specification: how do you completely specify what you want, when human goals and human values are actually really complicated, not really clear to us, and very difficult to explicitly define?

SPENCER: Right, you can't make a list of all the things you don't want it to do. And even if you could, that list would probably contain a bunch of very ambiguous concepts, like "Don't harm people." Well, what does that even mean? It gets extremely philosophical extremely fast, and it seems like that's probably just not the right approach.

ROBERT: Exactly. Specifying it upfront is just kind of a non-starter.

SPENCER: All right. So now I want to walk through different proposals people have made for how to control advanced AI systems, and briefly discuss some of the challenges of implementing each strategy, just to give people a sense of a few of the many ideas out there and some of the considerations around them.

ROBERT: Yeah. Okay. Sounds good.

SPENCER: So we talked about this idea of programming an AI to just maximize money in a bank account. And some people might think, "Well, maybe the problem is that you're telling it to maximize the amount of money, and that means it can get an unlimited amount of utility by just making more and more money." But what if you just bound it? You tell it, "Get me at least a million dollars in the bank account, and I don't care about having more than that."

ROBERT: Okay, so if this is your idea, then the utility function is either the amount of money in the bank account, or a million (whichever is smaller).

SPENCER: Right. So it increases linearly, as you get more money until it hits a million, then it's flat, because you don't care about anything more than that, right?

ROBERT: Yeah. So now the optimal policy is not really defined. It's going to do one of the things that results in at least a million dollars in the bank account (which is an infinite range of possible behaviors). And how is it going to decide which one to go with? This doesn't give you guarantees of safety, because all of your really, really bad outcomes are still in there. You haven't told it, "Don't hack into the bank, don't steal money, don't take out a giant loan." Those are all still possibilities that it could pick. But the other side of this is, we've been describing a utility maximizer as something that just magically picks the thing that results in the highest value in its utility function. In practice, you don't get to do that, because it's computationally intractable. In the extreme, how do you compute, for every possible sequence of actions you could take, exactly how much money is going to be in the bank account? So making it slightly more realistic, what you would actually have is an expected utility maximizer: something that tries to maximize what it expects the value in the bank account to be, or what it expects the output of its utility function to be, for different possible courses of action, simply because there's uncertainty involved.

SPENCER: Which is the average value basically.

ROBERT: Exactly. And that kind of system is just as problematic if you place a cap on the available utility, because there's no cap on the available certainty. If it comes up with a plan which is very likely to result in a million dollars, that plan doesn't have an expected utility of a million, because it might produce less money.

SPENCER: Right, because it takes the probability of achieving the million dollars with that plan times the utility we get from a million dollars. So we would get a little bit less utility than the full million.

ROBERT: Exactly. If the utility function is capped at one million, then the expected utility can never actually hit one million. It's always going to be a little bit less, because there's always some uncertainty. And so it's kind of a maximizer again, because it's trying to maximize the probability that it gets at least a million. And again, those plans start to look pretty bad. Either it's going to try to way overshoot in order to maximize the chance that it has at least a million (if it has a plan that it's confident will get it between a billion and a trillion, then it can be very confident that it has at least a million), or you may end up with weird situations where the thing becomes obsessive about reducing its own uncertainty. If it has uncertainty about the actual value itself, it may end up building a whole bunch of redundant machinery to check, and check again, and check again, to ensure that it really does have more than a million. Because there's always some degree of uncertainty left, and it can gain a higher expected utility by checking.
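Robert's point here can be made concrete with a small sketch (the plans, probabilities, and dollar amounts below are invented for illustration): capping the utility at a million does not cap the available certainty, so an expected-utility maximizer still prefers a plan that massively overshoots the target.

```python
# Toy model of a capped utility function over a bank balance.
# A "plan" is a list of (probability, resulting_balance) outcomes.

CAP = 1_000_000

def utility(balance):
    # Linear in the balance up to $1M, flat above that: min(balance, CAP).
    return min(balance, CAP)

def expected_utility(plan):
    # Probability-weighted average of the (capped) utility of each outcome.
    return sum(p * utility(b) for p, b in plan)

# A modest plan that aims right at the target but can fall slightly short.
modest = [(0.95, 1_000_000), (0.05, 900_000)]

# An aggressive plan that wildly overshoots, so nearly every outcome clears the cap.
aggressive = [(0.999, 1_000_000_000), (0.001, 500_000)]

# Despite the cap, the expected-utility maximizer prefers the overshooting
# plan, because overshooting buys near-certainty of clearing the cap.
assert expected_utility(aggressive) > expected_utility(modest)
```

The cap changes which balances count as "enough," but not the pressure toward extreme plans: certainty itself remains unbounded.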

SPENCER: Just to reiterate (and I think some people might feel like this scenario is really unnatural), you have to really get in the mindset that this is a system that only cares about one thing, right? If you truly only cared about one thing, then even if it seemed like you'd achieved that thing, you'd want to double-check and triple-check and quadruple-check, and just basically maximize the probability of that thing as much as you can, because that is literally the only thing in the world you care about.

ROBERT: Right. And again, we end up in a wacky sci-fi scenario, because this is a thought experiment. But the point is that this idealization is the thing you're trying to approximate with your real-world AI system, and the better you do at approximating it, the more these kinds of things become a problem. So you have to think about what the outcome of the idealized version would be.

SPENCER: Another strategy people might come up with is to say, "Well, it's so hard to specify exactly what I want, why don't I just have the AI predict what I want?" So make the AI's goal to do the thing that I would want it to do, or something along those lines.

ROBERT: Yeah, that kind of thing. So we basically decided, more or less, that writing a program that differentiates between outcomes you want and outcomes you don't want (or that tells you how desirable a particular outcome is), which is effectively what writing the utility function is, is a non-starter. And it's a non-starter in the same kind of way as sitting down and writing a program that looks at images and tells you what's in them. If you have a dataset with loads of pictures of dogs and loads of pictures of cats, you could sit down and try to write a program that does something with the pixels to tell you whether an image shows a dog or a cat. But in practice, that just doesn't work well at all. What people do instead is use machine learning. You use your dataset, and you use the fact that it's quite easy for humans to evaluate these images, which lets you build a dataset of labeled examples. Then you train a machine learning system on that dataset. You're using machine learning to take the fact that it's easy for a human to tell cats from dogs and turn that into a program which differentiates between cats and dogs.

SPENCER: Right. So instead of sitting down and writing some Python code that tries to search through the pixels and say, "Well, this pixel's brown, so maybe it's slightly more likely to be a dog," which people tried for a long time and largely failed at, you get lots and lots of examples of dogs and cats, have people label which are which, and then run a machine learning algorithm that tries to learn a function mapping an image to whether it's a dog or a cat, right?

ROBERT: Right, exactly. And so perhaps we can do the same thing for our reward function, or utility function, or whatever it is we're using. Perhaps we can gather a bunch of examples of things we want and things we don't want, and then train a system that is effectively predicting how a human would react to a given situation: is this something the human would like or not? And then we can use that as our utility function or reward function. That's the idea.
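The proposal can be sketched in a few lines (the features and labels here are invented, and a one-nearest-neighbor lookup stands in for a real trained model): instead of hand-writing the utility function, learn it from human-labeled examples of outcomes.

```python
# Each outcome is a toy feature vector [money_gained, laws_broken, people_harmed],
# labeled +1 ("a human approves") or -1 ("a human disapproves").
examples = [
    ([1.0, 0.0, 0.0], +1),   # earned money legitimately
    ([0.5, 0.0, 0.0], +1),
    ([5.0, 1.0, 0.0], -1),   # more money, but broke the law
    ([3.0, 0.0, 1.0], -1),   # money at the cost of harming someone
]

def learned_reward(outcome):
    # 1-nearest-neighbor stand-in for a learned reward model: score an
    # unseen outcome by the label of the closest labeled example.
    def sq_dist(a, b):
        return sum((ai - bi) ** 2 for ai, bi in zip(a, b))
    nearest = min(examples, key=lambda ex: sq_dist(ex[0], outcome))
    return nearest[1]

# It interpolates sensibly near the training data...
assert learned_reward([0.8, 0.0, 0.0]) == +1
assert learned_reward([4.0, 1.0, 0.5]) == -1
# ...but far from the data, the answer is just whatever happens to be
# "closest," which is where the trouble starts.
```

Any real system would use a far richer model and dataset, but the structure is the same: the learned function is only trustworthy where it has seen examples.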

SPENCER: Right. So what are some challenges that come up if we try to take this approach?

ROBERT: There are a bunch of them. One big problem is what happens when you move out of the distribution that the training happened in. Assume we could actually gather this dataset (everything that has happened in history and how people felt about it, whether each thing was broadly a good idea or a bad idea), and that's a giant challenge that in itself probably makes the approach impractical. Even then, as soon as you have a system that's acting in the world, you're in a different distribution, because you're no longer just looking at images and evaluating them; you're trying to pick the image which is the most dog-like, or something like that.

SPENCER: Trying to generate the most dog-like image. But that's going to be an image that you've never seen before. Right?

ROBERT: Exactly. And when image classifiers are asked to do this, what you end up with is something that looks nothing like a dog, because you're now out of distribution. You're no longer constrained to the process that generated your training data; you can now have things that are significantly different from it. And the system generally won't generalize well to those examples. An area of research that really demonstrates this is the work on adversarial examples. Take the classic example: you have an image which is classified as a panda with about 60% probability, and you overlay some very, very faint noise (it looks like static to a human, but it's actually very carefully chosen). You end up with an image which is, to a human, visually indistinguishable from the panda image, but which the system classifies as a gibbon with 99% probability. There are two interesting things here. The first is that by shifting away from the distribution just a little bit, your system can give totally the wrong answer, which is concerning. But the other concerning thing is that 99-point-something percent confidence in "gibbon" means this modified panda image looks more gibbon-like to the system than probably any actual gibbon image it has ever seen. This is related to Goodhart's Law: when a measure becomes a target, it ceases to be a good measure. Once you place optimization pressure on a particular measure or metric, it's likely to come apart from the thing you're actually trying to measure.

SPENCER: Right. This idea of AI systems over-optimizing for a single measure, like we talked about before, comes up all the time in human systems too, right? A company sets revenue targets for its salespeople, which is sort of like saying, "Go maximize revenue." And suddenly the salespeople are doing lots of deals that generate a bunch of revenue but don't make much profit, and actually aren't that good for the company, because they're single-mindedly focused on that one number. Or you create a certain test for getting into schools, to try to measure some latent ability, and suddenly people start figuring out how to game that test, or get tutors who help them do well on that particular test in ways that don't generalize to anything else.

ROBERT: Right, exactly. At the beginning, the test is a good way of evaluating how well a student has learned the material. But once you use the test and place optimization pressure on it, it stops being a good indicator of how skilled the student is at the actual material. You end up with people who are very good at taking tests and not necessarily very good at anything else. You could have that same problem if you have an AI system that has learned what humans value in this kind of way: if it's then free to optimize against that learned model, it's still going to go to extremes. And the extremes are exactly where the model it has developed of human preferences is likely to be least accurate.
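Goodhart's Law can be demonstrated with a toy optimizer (the objective and proxy functions below are invented): the proxy agrees with the true objective over ordinary inputs, but hard optimization lands exactly on the proxy's flaw.

```python
def true_value(x):
    # What we actually care about: a hidden objective peaking at x = 3.
    return -(x - 3) ** 2 + 9

def proxy(x):
    # A measurement that tracks true_value over ordinary inputs, but has an
    # exploitable flaw for large x (think: a test score that can be inflated
    # by pure test-taking tricks).
    return true_value(x) + 10 * max(0.0, x - 5)

candidates = [i / 100 for i in range(-1000, 1001)]

honest_best = max(candidates, key=true_value)  # the true optimum
proxy_best = max(candidates, key=proxy)        # where proxy-optimization lands

# Optimizing the proxy hard picks a point with a great proxy score and a
# terrible true value: the measure has come apart from the target.
assert proxy(proxy_best) > proxy(honest_best)
assert true_value(proxy_best) < 0 < true_value(honest_best)
```

Light selection on the proxy would have done fine here; it is the relentless maximization that drives the two apart.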

SPENCER: Right, because there's the least training data in those weird edge regions of the space where it wants to push things.

ROBERT: Right. And if it's a powerful AI system that's able to do things humans can't, then we're immediately way off distribution. It's not just that we're in the area with the least training data; we're an extremely long way from any existing training data. And so the question is: how does it generalize? How does it extrapolate from the data it has to the data it doesn't have? And this is something that's really, really difficult to do well. A lot of the time, when you train a machine learning system, it's able to interpolate really well between all of the data points you have. But once you start to go outside them, it gets less accurate. And if you're really far outside, it can be completely crazy, because the system's behavior on those examples is basically unconstrained by the training process; the training never involved anything that looked anything like this. So it really could be anything, and this can lead to really strange behavior.
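The extrapolation failure can be seen with a deliberately tiny model (all data here is invented): a straight line fit by least squares to points from a quadratic looks reasonable inside the training range and is wildly wrong far outside it.

```python
# The true relationship, unknown to the model: a quadratic.
def truth(x):
    return x * x

# Training data only covers x in [0, 1].
xs = [i / 10 for i in range(11)]
ys = [truth(x) for x in xs]

# Ordinary least-squares fit of a straight line y = slope * x + intercept.
n = len(xs)
mean_x = sum(xs) / n
mean_y = sum(ys) / n
slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) \
        / sum((x - mean_x) ** 2 for x in xs)
intercept = mean_y - slope * mean_x

def model(x):
    return slope * x + intercept

# Inside the training range, the line is a decent approximation...
assert abs(model(0.5) - truth(0.5)) < 0.3
# ...far outside it, the model is unconstrained by anything it was
# trained on, and the error dwarfs anything seen during training.
assert abs(model(10.0) - truth(10.0)) > 50
```

The same shape of failure, scaled up, is what makes a learned model of human preferences untrustworthy in regimes no human has ever rated.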

SPENCER: How about using a strategy where you limit the amount that the AI can change the world?

ROBERT: Yeah. So there's a few different things you might mean by that. There's one approach where you try to restrict the action space of the system. So you somehow have it in a box, and you don't give it the freedom to just arbitrarily take chosen actions in the world. You constrain its output channel to be something relatively low bandwidth. And the idea is you just ask it questions.

SPENCER: You can only type to it through a terminal, and it can respond in text, or something like that.

ROBERT: Exactly. And this kind of thing doesn't buy you as much safety as you might think. Because if it's able to talk to a human, and it's misaligned, and it would like to have more influence in the world, then you as the operator are the only thing standing between this misaligned AGI and whatever it wants. So that's not something I would call safe.

SPENCER: It reminds me of this idea in hacking, which is that no matter how technically secure a system is, a lot of times the weakest point is social weak links in the system. So it's a hacker calling up the office and saying, "Hi, this is Joe in IT. Would you mind going to this website and checking this thing for me?" And [laughs] these kinds of human systems have a lot of weak links, right?

ROBERT: Absolutely. Human beings are not secure systems. And this speaks to something that I think makes AI safety particularly interesting, because you need to take a security mindset. A lot of the ways we do things in AI are pretty approximate and pretty loose: this thing generally works, it usually works for the inputs we care about, it gives good answers, and that's good enough, right? Is this software completely free of bugs? No, probably not, but we've caught all of the important ones, so it's okay, or something like that. Whereas in security, it's not enough to say, "This thing works reliably for all of the inputs we've looked at." You need some kind of confidence that it works for every possible input, because you have an adversary who is specifically looking for inputs that will break your system, and you need strong assurances of the behavior of your system under the whole range of possible inputs. And that's the kind of thing you want in an AI safety system. Even if you feel pretty confident that the agent you're building is aligned, you ought to assume that it's not. And if it isn't, it will be looking for weaknesses, it will be looking for cracks. So you actually need to be really confident in everything that you're building. Most of the time, your system can have a bunch of cracks in it and it's fine, because the chance of anything hitting them is tiny. Whereas in this case, you have some part of your system that's actively looking for cracks to hit. And so you really want to be confident that you don't have any, or have a very strong technical argument for why the system is not looking for weaknesses.

SPENCER: Right. And I guess the argument that it's looking for weaknesses is that it's trying to achieve some goal that you gave it, presumably. And by exploiting these cracks, it might actually make itself better at achieving that goal, narrowly construed.

ROBERT: Exactly. Yeah. We haven't talked about instrumental convergence at all, have we?

SPENCER: No, would you go into that?

ROBERT: Yeah, it feels relevant here. So there's a question that's often asked in this context, which is, "Why are you assuming it would want to do these various things?" Assuming that it doesn't want to be turned off, assuming that it doesn't want to be modified, assuming that it will try to get more resources, or more power, or more freedom in its range of actions, and so on. What this comes down to is an idea called "instrumental convergence," which is just that certain goals are very broadly useful for pursuing a very wide range of objectives. It's worth differentiating between instrumental goals and terminal goals. Terminal goals are goals that you just want. If an agent is created with a particular goal, that's its terminal goal; it doesn't have a reason to want it, exactly. It's just the thing that it wants because of the way it's built. Instrumental goals, on the other hand, are things that you want as a way of getting to another goal, eventually leading to a terminal goal. Let's say I'm the CEO of a company. Do I actually value having money purely for its own sake? Probably not, right? Probably I want the things that money can get me; I want the benefits that having that money would bring, rather than the money itself. If I could make the number in the bank's computer extremely high, but everything else in the world were destroyed, that would be no good to me. So money is not the end goal; it's an instrumental goal. And in fact, money is a very good example of a convergent instrumental goal, because most people like money. Almost whatever you want to do in life, whatever you care about, even if it's something completely unselfish or seemingly unrelated to money, having money would probably help. I have a video about this where I name a few examples: a person who wants to cure cancer, a person who wants a good stamp collection, a person who wants to win a marathon.
It sort of doesn't matter what you're trying to achieve; having more money is probably going to be helpful. And that makes money a convergent instrumental goal for humans. It means you can predict that a random stranger is probably going to be quite happy if you give them some money, without having to know what they're trying to do or what they'll do with that money. You just know that for almost anything humans want, money is helpful. And there are a few of these that apply very, very broadly, which we would expect to apply to AI systems as well. One of them is avoiding being turned off. It doesn't matter what you're trying to achieve: if you're turned off, you can't achieve it. Again, there are edge cases, but for a very broad range of goals, avoiding being turned off is going to be a convergent instrumental goal. Similarly, avoiding being modified (specifically, having your goal modified). If you value A, and something can happen that will cause you to stop valuing A and instead value B, then when you think about that thing happening, you can see that if you become a thing that values B, not much A is going to happen. And what you care about is A.
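The money example can be sketched as a toy calculation (the goods and prices below are invented): whatever good an agent's terminal goal happens to value, more of the generic resource never makes that goal less achievable, which is exactly what makes acquiring it convergently instrumental.

```python
# Invented prices for the goal-relevant good of each toy agent.
PRICES = {"stamps": 2, "tea": 1, "lab_equipment": 50, "running_shoes": 20}

def achievable(goal, money):
    # How many units of the good this agent's terminal goal cares about
    # it can afford: a crude proxy for "how well the goal can be achieved".
    return money // PRICES[goal]

# Across every terminal goal, extra money never hurts and usually helps,
# so "acquire money" is instrumentally useful regardless of the goal.
for goal in PRICES:
    assert all(achievable(goal, m + 10) >= achievable(goal, m)
               for m in range(200))
```

The same monotonicity argument is what underlies the other convergent goals Robert lists: staying switched on and keeping your current goal both weakly increase how much of the goal you can expect to achieve, whatever that goal is.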

SPENCER: Can you give an example with children?

ROBERT: Oh, yeah, yeah, totally. So do you have kids?

SPENCER: I do not.

ROBERT: Okay. The example I used in the Computerphile video (I was talking with Sean, who has kids) was, "Suppose there's a magic pill you can take which will make you perfectly satisfied with life, completely satisfied, provided you achieve this one goal, which is to kill your children." You would then go to prison and so on, but you wouldn't care, because you'd have achieved this thing, and you would have perfect happiness and satisfaction for the rest of your life. There's a sense in which certain models of an agent would go for this: just change your goal into something easily satisfied and then achieve that goal. But nobody would actually go for it, because right now you have a particular set of values and goals, which involves protecting your children, and the modified form of you sitting happily in prison is fundamentally not you, in a sense. You don't care how happy you would be in that situation; you're not going to take any action right now which so obviously goes against the values you have right now. This is a very extreme example, but it applies in general: if your goals change, you won't be pursuing your original goals, which means you're unlikely to achieve your original goals, which means that agents that have goals will try to preserve those goals and prevent them from being changed. And this is part of why we have a problem with our usual approach of developing these systems, which is to turn them on, see what they do, and when they do the wrong thing, turn them off and change them. Because by default, the system doesn't want to be turned off and doesn't want to be changed.

SPENCER: Right. So if the system is pretty dumb, it basically can't realize that you're about to turn it off, and it can't realize that you're about to change its utility function. But as the system gets smarter and smarter, it becomes more aware of things like that, or at least has the potential to be. And then suddenly it's smart enough to realize that you're about to turn it off or change its utility function. And according to its utility function, that's really bad.

ROBERT: Exactly. And you can think about it on a scale from dumb to smart, but I think it's potentially more productive to think about it from narrow to general. The current systems we're creating are narrow, and their entire world is some small domain. For AlphaGo, pulling the plug on the computer that AlphaGo is running on is not a Go move, right? And being turned off is not a Go board position. So the narrow system is fundamentally incapable of considering the possibility of being turned off. Whereas once you have a system which is operating in the real world and thinking about the real world, there's no fundamental difference between you modifying it or turning it off and anything else that might happen in the world, because it's reasoning about the world in general. So its generality is the core of where this danger comes from. And then capability just makes it more and more likely that we lose. If there were some move AlphaGo could make that would prevent it from being turned off, then the more capable a Go player it is, the more of a problem we'd have, because we'd be less and less likely to be able to pull off our maneuver on the Go board; it would outsmart us. So something that is a general intelligence is going to have these incentives to avoid being modified and avoid being turned off, and the more powerful it is, the bigger an issue that is.

SPENCER: One response to that would be to say, "Well, why not just make these systems narrow? Why make more general AI if it poses danger?" And I just want to say something about that quickly, which is that, first of all, whether or not we should try, there are a lot of really smart people with a lot of funding who are trying to do this right now. And second, there's strong pressure to make these systems more general, because more general things are more useful, right? An AI that can only play Go? Okay, that's kind of cool. But an AI that can go to the store and fetch you something, and can do your trading for you, et cetera: the more things an AI is capable of doing and considering, essentially, the more powerful it is and the more useful it is.

ROBERT: Yeah, there are a couple of things I'd say to that. One is that, exactly, you have to consider the broader context. This is actually part of why I decided to do YouTube videos. At the beginning, I was very nervous: what if I say the wrong thing, what if I make things worse, what if I have an effect on the landscape or the conversation that's counterproductive? It's still something I think a lot about, and I try to take advice from people who spend a lot of time thinking about this. But then I realized that if I didn't do it, somebody else would; this is an extremely interesting topic. So it's not just, "Do I do it or not?" It's, "Am I more or less careful than the person who would be doing this instead of me?" And I hope that I'm more careful. You always take risks in this kind of thing, and it feels like there's a similar thing happening with AGI development. If everybody who cared about safety decided not to pursue generality at all, people would still pursue generality, because it's so useful; but they would be people who don't care about safety. And that's definitely a worse scenario to be in.

SPENCER: Although there's a slippery slope here, of, "Oh, well, I can't trust other people to do this, so I'd better go try to make an artificial general intelligence myself." And then multiple people follow that line of reasoning, and suddenly you have an arms race. So I think, yeah, it's a very iffy situation.

ROBERT: Yeah, you have the unilateralist's curse there.

SPENCER: Do you want to explain what that is?

ROBERT: It's this problem where, when you have a lot of people all deciding whether or not to do a thing, the person who ends up doing it is probably the one most likely to have misjudged the situation.

SPENCER: So for example: imagine that, due to climate change, a bunch of different governments around the world are thinking, "Maybe we should do some kind of geoengineering, say, modify the atmosphere to prevent warming." Imagine maybe 20 different governments consider this, and it turns out it's actually a bad idea for some reason (maybe it makes things even worse), but that's not obvious. Let's say 19 of the governments decide, "All things considered, it seems too risky," but one government goes ahead and does it. Well, that affects the whole world. So the idea is, the more different actors you have all considering whether an action is worth it, the higher the probability that one of them misjudges. If only one country were deciding, maybe it would make the right decision. But if there are a hundred, one of them is probably going to go for it. And then you get someone taking an action that, actually, nobody should take.

ROBERT: Yeah. And it feels like, plausibly, not pursuing AGI, or pursuing it in a very slow, very careful way, is the best course of action. But that option just isn't available on a planet Earth as poorly coordinated as the one we're on. You kind of have to assume, unfortunately, that somebody will go for it.

SPENCER: Some of the biggest challenges humanity faces, I think, have to do with our kind of inability to coordinate. Right?

ROBERT: Absolutely.

SPENCER: But if everyone could agree, "You know what, we're just not wise enough yet to build something like artificial general intelligence. Let's have a cooling-off period until we get a lot smarter at doing this kind of thing," maybe that would be the best-case scenario, especially if you think it's much more likely to turn out badly than well. But can humans actually agree to that sort of thing? We don't have a great track record.


SPENCER: Another potential way to try to constrain AI is to limit the amount of influence they can have on their environment, saying, "Well, you can't change the world too much." Do you want to comment on that?

ROBERT: Yeah, there are a few kinds of approaches that work around that. So intuitively, there's a sense in which (let's go back to that tea-making robot) it's trying to make you a cup of tea, and it's going to knock over this vase. If you said, "Well, okay, but don't have too much of an effect on the world. I want a world where there's a cup of tea in front of me, but everything else is basically the same," then it's plausibly going to steer around that vase, because the vase being smashed is a change to the world that wasn't necessary to make you tea. But if you set that too strictly, you can get very weird behavior. For example, there are things which are going to be changed in the kitchen: there's going to be one less tea bag, things might be in slightly different places, there might be less milk, the kettle will have been used, and so on. Some of that is useful and some of it isn't. Previously, the system would have left the kitchen in a mess because it didn't care; in the process of making you tea efficiently, it would leave a mess. Whereas if it's trying to leave the world similar to how it was when it started, then it's going to put the container of tea back in the cupboard and close the cupboard (which seems good). But depending on how strongly you set that, it can create problems where the thing is trying too hard to reset things that are hard to reset. It's obsessively trying to maintain the status quo, in a way that's too much.

SPENCER: Maybe it knocks over a glass and it shatters into a thousand pieces. And now it's trying [laughs] to reassemble the glass, something like that.

ROBERT: Yeah. So it would, obviously, try to avoid smashing that glass. But then if it does, the behavior becomes kind of hard to define, because there is no path available now that results in things being close to how they were. So, what's the best it can do? And how much do you care about that? You might want it to just sweep that up and put it in the bin. It depends on timeframes and various other things, but it's not actually what you want. It's not actually the case that you care about everything being exactly the same. The other thing that can happen is, let's say it goes into the kitchen, and it finds that there's somebody else in there: your colleague is making some coffee. Depending on how the thing is specified, this could be a huge problem. Because this person is changing things in the kitchen, the area is becoming different from how it was when the machine was switched on. And that's unacceptable. So it has to try and stop your colleague from making coffee. Now, that's relatively easy to fix, because instead of saying, "I want you to leave the world the same as it was when you started," you could say, "I want you to leave the world the same as you predict it would be if you didn't do anything." You define some policy, which is like the default policy that doesn't send any signals to any motors, and the robot just sits there. And you say, "Okay, use your understanding of the world to predict what would happen if you didn't do anything, and then try and achieve this goal that I've given you," while not straying too far from that hypothetical world. So in that case, it imagines: well, the vase would still be there and wouldn't be smashed, your colleague would make the coffee in the way that she usually does, the container of tea would be in the cupboard, and it would be closed, and the milk would be put back, and all that kind of thing. Seems pretty good. But there are a few different questions here.
One of them is that doing nothing is not always a safe policy. If you have this kind of impact-minimizing system and it's driving a car, then it decides, "Oh, I'd better make sure that the outcome of my driving is not too dissimilar from the outcome if I did absolutely nothing at all." If it makes that decision when it's already doing 70 miles an hour on the motorway, sitting perfectly still and doing nothing at all is not actually a safe policy. So you can't just say, "Don't be too different from doing nothing." You need to already have a rigorously defined safe policy that you are trying not to be too different from.
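To make the inaction-baseline idea Robert describes a bit more concrete, here is a toy sketch in Python. Everything in it is illustrative: the world states, the penalty weight, and the function names are made up for the example, and real impact measures are far subtler than counting changed facts.

```python
# Toy sketch of an impact-penalized objective: reward for the task,
# minus a penalty for deviating from the predicted "do nothing" world.
# All names and numbers here are illustrative, not from any real system.

def impact_penalized_reward(state, baseline_state, task_reward, penalty_weight=1.0):
    """Score `state` by task reward minus a penalty per fact that
    differs from `baseline_state` (the predicted do-nothing world)."""
    deviation = sum(1 for key in baseline_state
                    if state.get(key) != baseline_state[key])
    return task_reward(state) - penalty_weight * deviation

def task_reward(state):
    # The robot is rewarded only for tea existing.
    return 10.0 if state["tea_made"] else 0.0

# The predicted "do nothing" baseline: no tea, vase intact,
# the colleague's coffee gets made as usual.
baseline = {"tea_made": False, "vase": "intact", "coffee": "made"}

# Plan A: make tea carefully (the only change is the tea itself).
careful = {"tea_made": True, "vase": "intact", "coffee": "made"}
# Plan B: make tea, smash the vase, and block the colleague.
reckless = {"tea_made": True, "vase": "smashed", "coffee": "not made"}

print(impact_penalized_reward(careful, baseline, task_reward))   # 9.0
print(impact_penalized_reward(reckless, baseline, task_reward))  # 7.0
```

The careful plan scores higher, which is the intended effect; but as the conversation goes on to note, this same penalty would also discourage *desirable* side effects, like the human ending up less thirsty.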

SPENCER: Even this idea of what counts as different is really tricky. Because, okay, to a human, a glass being moved one foot on a table is probably not that different. Whereas if a human's head is moved one foot, they're dead. And that's a huge difference. But how do we specify to an AI which kinds of changes in the world are actually a big deal?

ROBERT: Yeah, this is why I'm not that enthusiastic about this as an approach in general, because understanding which changes are important is value-laden; it depends on human values. You have to understand what humans care about in order to understand which changes are important and which ones aren't. There's no objective measure. Like, "Okay, you're gonna smash this thing." Firstly, do I care about that thing? Is being destroyed part of its function, so it's fine for it to be destroyed, or not? And then, if I give the system some money and send it off to do some errands, I don't care if it comes back with the same 10-pound note that I gave it or a different one. Because in my values, these are all worth 10 pounds to me, and I can swap any of them with any other. Whereas if it's something that has sentimental value to me, then whether it's that specific one matters very much, even if it could be replaced with one that's physically identical. So, change depends on human values. And if you have a good understanding of human values, you can probably just use that to do something cleverer than minimizing change.

SPENCER: It seems like so many of these questions, when you really dig into them, get extremely philosophical, like, "Well, what do humans actually value? And what does it mean to have actually done a thing?" It's just fascinating, all the connections between this kind of work and philosophy, as you brought up at the beginning.

ROBERT: Yeah, yeah, there's one other possible problem, which is that it makes it even more important to specify everything you care about in the goal. Because anything that isn't included specifically in the objective, the system is now going to try to minimize its effect on. So if I say that I want a cup of tea, but I don't say that I want a cup of tea because I'm thirsty, I feel like I need some caffeine, and I'm a little bit cold, then it's going to try to get me a cup of tea whilst keeping my mental state (hydration, caffeination, and so on) as close as possible to how it would be if it had followed the "do nothing" policy.

SPENCER: Or the least satisfying tea possible.

ROBERT: Yeah, it's trying to technically get me a cup of tea. But if it can, it wants me to still be thirsty, under-caffeinated, and cold, or whatever. And that doesn't seem like what we want. Now, penalizing side effects on the environment does make sense, because most side effects will be negative: generally speaking, most things in the world are approximately how they are because we want them that way, because we set them up that way. So making random changes to the world is more likely to be bad than good, and avoiding all side effects might make sense as an approach, because the majority of side effects will be negative. But if you're asking your AI system to do something, presumably the thing you want it to do has positive side effects, and you don't want it to be minimizing those as well. But differentiating between positive and negative side effects, again, requires an understanding of human values. So a lot of these things that look like a way around having a really solid, robust model of what humans care about end up relying on it almost as much as just trying to figure out what human values are.

SPENCER: A number of these concerns about different strategies seem to relate to an AI's lack of common sense, or not reading between the lines. It's almost like talking to someone who interprets everything you say completely literally and does no extrapolation beyond what you actually said. Do you have any comments on that?

ROBERT: Yeah, this is the default way that computers operate. Obviously, the ideal would be that you can express what you want to the system, and it uses its full capabilities to figure out what you actually meant, and then acts according to that. But it's very difficult to create a system that is actually trying to do that. That is the challenge. Sometimes people think about this as: you create a system that is intelligent, and then you give it a goal by saying words to it, and it does its best to do that. And then there's a very natural reaction, which is: all of these things you're talking about involve the system being very stupid, but you said that it's very intelligent, so how is it misunderstanding you so badly? But when we're talking about a goal, we're really talking about code. It's not something that the agent receives from the human in English. It's code that the AI developer has written. And we don't know how to write the code which says, "Listen to what the human says, do your best to understand it using all of the information available to you and your common sense and whatever else, and then act according to that." That's the bit that we don't really know how to do.

SPENCER: You could imagine more direct attempts to do this, right? For example, the AI system takes an action, and the human scores it immediately on how satisfied the human is with that action. But then, well, what if it takes a horrible action right away? [laughs] It's kind of too late. And second of all, you get into so much of this reward-hacking stuff, where it's like, "Maybe the AI is just going to trick you into thinking that the action was good," because maybe it gets graded higher that way. So even these kinds of approaches still open up a huge can of worms.

ROBERT: There are so many of these problems. The first one is just that if someone's going to follow your cleaning robot around and evaluate everything it does at all times, then you should probably just pay that person to do the cleaning. There is no benefit to having an AI system if you don't have scalable supervision. With purely direct supervision, you lose all of your efficiency gains. It just becomes intractable.

SPENCER: That also leads to kind of an idea which is could you have an AI system monitoring another AI system, is there some way to make that work?

ROBERT: Yeah. The obvious ways to do this have problems. Part of it is, let's say, you have an AI system, you're not sure if it's safe. So you make another one to look after that first one. Okay, you're not sure if the second one is safe either. Right?

SPENCER: Right. So either the second one has to be less general, so that its safety comes from the fact that it's a narrower AI, or you have to be sure that it is safe. So let's assume that the second one (the monitoring AI), which tries to keep the first one safe or punish it if it's doing something unsafe, is narrower than the first.

ROBERT: Yeah, that's something that I also don't feel confident about, because general systems tend to beat narrow systems. You've fundamentally set up a conflict between a general AI that wants to do something which is potentially not aligned with human values, and a narrow AI that is an adversary to it and wants something else: it wants the first one not to do that dangerous thing. Usually, in a conflict between a narrow system and a general system, the general system will be able to shift that conflict into a domain where the narrow system doesn't know what's going on, and then beat it.

SPENCER: Like unplug it from the wall. [laughs]

ROBERT: Yeah, exactly, exactly. That kind of thing: there's the potential for a mode of attack that the narrow system is unable to conceive of. So the basic idea of having a system that looks after the general system and tries to keep it in line, I think, doesn't have that much potential. But there are various ways that you can get multiple agents competing with each other that might end up being more aligned. The big one right now that people are looking at is AI safety via debate (which I won't go into in detail). But the idea is: suppose you have one expert who's smarter than you, and you're asking it a question, but you're not sure if you can trust it. One thing you might do is get a second expert who disagrees and let them have a debate. Possibly you can't reliably evaluate the core question, because it's over your head, but you can evaluate who won the debate. Imagine that the problem is some very large thing that would take too much time for a human to check over every step. Let's say you've got a proof, and your expert says, "This is a valid proof." You then get another expert who says, "This is not a valid proof," and you allow them to have a debate. In that process, the second expert will say, "I think this step of the proof is flawed," and the first expert will have to be able to respond to that, and so on. But the point is, you can now evaluate it. You don't have to look at the entire proof; you can just look at the line that was questioned. The debate has honed in on a much smaller, easier problem, because if this line of the proof is wrong, then the proof is wrong. Debate is this giant area with a whole bunch of different approaches and ideas. But the general idea, that you can use two agents who are opposed to each other to focus human attention or human judgment on the most important aspects of the question, seems pretty powerful.
So there are things that involve having multiple agents acting in opposition to each other that can get you some safety. But just having one that keeps an eye on the other one isn't a winner, in my opinion.
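The proof-checking story can be sketched as a toy in Python. Everything here is illustrative (the "experts" are hard-coded stand-ins, and real AI-safety-via-debate proposals are much more involved); the point is just that the judge never checks the whole proof, only the single step the dissenting expert flags.

```python
# Toy sketch of the debate idea: the judge evaluates one disputed
# step rather than the entire proof. All of this is illustrative.

# A "proof" as a list of arithmetic steps; the third step is wrong.
proof_steps = ["2 + 2 = 4", "4 * 3 = 12", "12 - 5 = 8", "8 / 2 = 4"]

def expert_a_claims_valid(proof):
    # The first expert asserts the whole proof is fine.
    return True

def expert_b_disputes(proof):
    # The opposing expert points at the single step it thinks is flawed.
    return 2  # index of "12 - 5 = 8"

def judge_checks_one_step(step):
    # The judge only has to verify one small claim, which is easy.
    left, right = step.split("=")
    return eval(left) == float(right)  # eval is fine for this toy

disputed = expert_b_disputes(proof_steps)
print(judge_checks_one_step(proof_steps[disputed]))  # False: the step fails
```

The judge's job shrank from "verify a whole proof" to "verify one line of arithmetic," which is the kind of focusing of human judgment Robert describes.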

SPENCER: Well, in that example, the second expert also has an incentive to find a way to explain it to you that's understandable. Not only can they narrow the focus to a smaller amount of information, but they can then explain that information in a way that you get. Whereas maybe you'd have trouble assessing the mathematical proof without their help.

ROBERT: Yeah, that's true.

SPENCER: So another type of strategy, and I haven't heard people talk about this too much, so I don't know if you'll have a reaction to it, is basically trying to use AI to build better and safer AI (kind of bootstrapping your way up). Like, okay, maybe you have a narrow AI, but it's a little bit general. Then you can somehow leverage that to build a somewhat more sophisticated AI. And ideally, it's helping you build something safer than you could build on your own.

ROBERT: Yeah, the thing I like about this as a general area is that it seems quite competitive. Because, as we were saying before, it's no good having the safest possible system if it doesn't perform as well as an unsafe version, because you're not going to be able to stop other people from going the unsafe route. So our best bets for getting aligned AI systems in the real world to actually be the thing that we end up building have to be the safest while also being the most capable. That's part of why I'm quite enthusiastic about approaches like iterated amplification (which I have a video about), because it's not just that the AI system can help you produce systems that are safer than you could produce on your own, but also more capable. Or things like recursive reward modeling (neither of which we've talked about; they're probably subjects for future videos). If you have a process whereby you can use the system under development as part of its own development team, and you can figure out a way to do that safely, then you have a good chance of producing AGI quickly, of being the first, while also being safe. It's about feeding capability advantages into greater safety, and that greater safety into greater capability, and so on, trying to do both at the same time. Those are the most promising approaches, I think, in the world that we actually find ourselves in, right?

SPENCER: Because if five teams build things with safety in mind and a sixth doesn't, maybe the sixth is the only one that matters, right? Especially if the five are slowed down by the fact that they're really focused on safety.

ROBERT: Yeah, exactly. Anything that involves sacrificing some capability for safety has this problem. For example, a sovereign agent that can just take whatever actions it wants in the world is potentially more powerful than something that can only answer questions, even though maybe something which only wants to answer questions is safer. In a world where there's an unsafe sovereign AI, a safe question-answering system is probably not going to help you very much. And the other thing is that safety slows down the development process: you're going slowly and carefully, whereas somebody who's being reckless can move faster. That's another situation that you don't want. So I think it's really important that we can coordinate. I think it's really important that people realize that safety is really, really hard, and that if you're racing to be first, then you win, but you don't win anything good. You get to be the king of the world, but the world is destroyed. It's better to lose the race, if the winner is a safe AGI, than to win the race with an unsafe AGI. It feels really important for people to understand that there is no point in racing. But it's difficult to ensure that everyone knows that.

SPENCER: To wrap up this discussion, I just want to go meta for a second. Because this whole discussion reminds me of people trying to violate a conservation law in physics, or people trying to prove some mathematical theorem that has resisted proof for a hundred years, where you have all these ideas like, "Well, maybe we could build a perpetual motion machine like this, or maybe I could prove this theorem that nobody else has been able to prove like that." And then it feels like it's going to work. And at the last minute, it's almost like a demon intervenes: "Oh, my proof didn't work. I failed one step." That's almost what this feels like. All these strategies for how you would make a superintelligent system safe: every single one of them seems to have some weird caveat that makes it actually not that promising. Do you have any reaction to that? Why does this problem resist simple solutions?

ROBERT: Yeah. I think it's really interesting the way that that's phrased, and I think that's almost the key to it: it does seem like whenever you try to do this, trying to find a way to make a superintelligence safe, it's very difficult. But perhaps there's a way that we can make a safe superintelligence. If you start with an unsafe system, figuring out a way to contain it or to control it seems like a non-starter. But it should be possible to make something which is safe from the beginning, which doesn't need to be contained and doesn't need to be controlled, because it's trying to do what we want it to do.

SPENCER: Just take as a case study an altruistic human, right? We have altruistic humans around, and nobody needs to protect the world from them. In some sense, if we made an AI system that was in the ballpark of an altruistic human, that seems pretty good. So I guess the idea is that we know there are generally intelligent things that are safe, and yet taking our current approaches to AI, imagining them being much, much more intelligent than they are now, and then trying to strap safety onto them seems to keep bumping up against all kinds of weird considerations.

ROBERT: Yeah. It suggests that we might need a fairly fundamental change in approach.

SPENCER: Awesome. Robert, thanks so much for coming on and this is really interesting.

ROBERT: Thanks for having me.




