October 16, 2025
Are the existential risks posed by superhuman AI fundamentally different from prior technological threats such as nuclear weapons or pandemics? How do the inherent “alien drives” that emerge from AI training processes complicate our ability to control or align these systems? Can we truly predict the behavior of entities that are “grown” rather than “crafted,” and what does this mean for accountability? To what extent does the analogy between human evolutionary drives and AI training objectives illuminate potential failure modes? How should we conceptualize the difference between superficial helpfulness and deeply embedded, unintended AI motivations? What lessons can we draw from AI hallucinations and deceptive behaviors about the limits of current alignment techniques? How do we assess the danger that AI systems might actively seek to preserve and propagate themselves against human intervention? Is the “death sentence” scenario a realistic prediction or a worst-case thought experiment? How much uncertainty should we tolerate when the stakes involve potential human extinction?
Nate Soares is the President of the Machine Intelligence Research Institute and the co-author of the book If Anyone Builds It, Everyone Dies. He has been working in the field for over a decade, after previous experience at Microsoft and Google. Soares is the author of a large body of technical and semi-technical writing on AI alignment, including foundational work on value learning, decision theory, and power-seeking incentives in smarter-than-human AIs.
SPENCER: Nate, welcome.
NATE: How are you doing, Spencer?
SPENCER: If we build superhuman AI, is it going to kill us all?
NATE: If we build it using anything remotely like modern methods, on anything remotely like the current understanding or lack of understanding that we have about AI, then yeah, building it anytime soon would be a death sentence.
SPENCER: A lot of people are worried about AI for different reasons, but most people don't think it's actually going to kill us all. So can you walk us through the basic argument of why it is going to kill us?
NATE: Yeah, you know, the very most basic argument is that these AIs are grown rather than crafted. They're a bit like growing an organism. The people in this field are creating AIs that do things nobody asked for and nobody wanted, with very little ability to make these AIs care about good things. Right now, it can be kind of cute. Right now, when an AI tries to escape in the lab, maybe that's worrying, maybe it's not, and the AI definitely isn't close to succeeding. But if you take this AI paradigm and grow these AIs until they're smarter and smarter, and if those AIs don't wind up trying to do good things, if they have these weird alien drives (we're already sort of seeing the warning signs of those today), then we'll have very, very smart things rearranging the world for their own weird purposes, and we die as a side effect. That's maybe too abstract an answer, but to give you a better answer, I'd have to ask: are you more interested in how these AIs get smarter? Or, can these AIs get smarter? Or, why would they want weird things? Or, given that they want weird things, why would they kill us? Or I guess there's another question, which is, how would they kill us? Those are maybe the five questions that everyone might think are the basic starting questions.
SPENCER: Taking a step back, it seems like the basic outline of the argument is we're creating things that have minds. If those things that have minds don't have good intentions towards us, or don't have good values, or aren't good in some fundamental way, and they're also very powerful, then they might do things that greatly harm us, like kill us. Is that sort of the right gist of the argument?
NATE: That's the gist, although it's a little bit less "might kill us" and a little bit more "will kill us": if they don't care about you at all, and they're using the world's resources towards their own ends, we tend to die as a side effect.
SPENCER: Now, when people look at things like ChatGPT or Claude, they see these agents that seem like they're trying to be helpful, and for the most part, are helpful. The vast majority of the time, they kind of do roughly what you want. Sometimes they hallucinate, sometimes there are weird edge cases, but it seems like they're mainly trying to help us. It sounds like that shouldn't give us hope. Why shouldn't that give us hope?
NATE: I wrote a book, If Anyone Builds It, Everyone Dies, hitting shelves September 16. In the book, we use an analogy to humans and human evolution, and the very short version is: when an AI in training is mostly doing what you want, that doesn't mean it'll do anything like what you want once it gets smarter, once it gets more capable. So if an AI is doing mostly what you want now, while it's small and dumb and in training, but it's doing something slightly different, those slight deviations in training, when the AI is dumb, can become really big deviations once the AI is smarter. In the human analogy, humans evolved a preference for tasty things, and in the ancestral environment, our enjoyment of sweet, sugary, fatty foods led us to eat healthily. But when we matured and invented new technologies, we were able to invent fast food, we were able to invent Oreos, we were able to invent Doritos. A small difference in the training environment, between what we were pursuing and what was actually being selected for, turned into a big difference once we got a technological upgrade. That's one example of why we expect the AIs' being helpful today probably isn't going to carry over to the future. I could give some more complications, but even stepping back from there, it gets complicated. You don't get what you train for.
SPENCER: So evolution, you can think of as an optimization process. It creates creatures like humans, optimizing for survival and reproduction, for whichever genes get passed on. Part of that process is getting humans to like tasty food, because it helped us survive in the evolutionary environment. And yet we come to the modern world, and we as intelligent beings are like, "Oh, we can just create massive amounts of tasty food," food that, in fact, makes us unhealthy, makes us more likely to die, exactly the opposite of what we were originally, sort of, "evolved" to do.
NATE: That's right. If you look at the humans in their ancestral environment, you might say, "Oh man, this reproduction stuff is going great. They're successfully eating just the foods that they need to be pretty healthy." It would look pretty good. In order to see the deviations, you need to know what to look for. You need to look at some of these extreme cases. Maybe you see humans really going to town on honeycomb, and that might give you an inkling that they have a bit of a sweet tooth. But you really don't start to see that deviation until it becomes obvious, until they get smart enough to make more options for themselves. That's when the AI situation gets worrying.
SPENCER: You mentioned this idea of being grown, not crafted. Could you elaborate on that a bit? What does that really mean?
NATE: Yeah, so there are various cases in AI. You mentioned hallucinations. There are cases like AIs threatening reporters or AIs trying to escape the lab under certain test conditions. These are all cases where these aren't behaviors that the programmers wanted, and the AI is not traditional software. When the AI tries to escape the lab, people can't pore over the code and find some line of code that says escape lab equals true and set that line to false, right? That's not how this stuff works. The way that modern AIs are made is we assemble — I say we, but more generally, the people making this stuff — they assemble huge amounts of computing power and huge amounts of data, and there's a process by which the data can be used to shape the computing power to be better at predicting that data. This is a gross oversimplification, but humans roughly understand the process of combining data with computing power to shape the computing power. That's the part that humans code. They don't understand what gets shaped; they just see what comes out of the process, and when that behaves in a way they don't like, they have very little ability to make something else that behaves in a way they do like. When ChatGPT was flattering users too much, OpenAI wasn't able to just open it up and tweak it until it stopped flattering users so much. All they could do was ask nicely; they gave it a system prompt of, "Stop flattering users so much," and it largely didn't work. Or, when Elon Musk thought that Grok was too woke, he wasn't able to just turn that off; he sort of tried to retrain it a little bit to be less woke, and then it started declaring that it was Hitler or Mecha-Hitler. These are not carefully designed machines where someone knows exactly how they work. It's not the case that every piece was placed by a human. These are giant systems trained in data centers that consume enough electricity to power a city for a year, and no one knows what's going on inside there; it just happens to work.
SPENCER: If we think of the process it takes to train these, you've got the first layer, the base model of next token prediction, where you're essentially just trying to predict what word comes next, right? A lot of the intelligence seems to come from that. If you learn to predict what word comes next, then you can learn all kinds of things. You can learn to write poetry, write essays, do math problems, maybe even play chess to some extent, as long as you have enough text corpuses involving chess moves so you can learn a lot. But then we say, okay, those models don't really do what we want very well, right? They just predict what comes next. So we start layering other things. We said, okay, let's have people write examples of what we'd want the AI to produce, and then we can fine-tune on that to give it examples of how we want it to behave. We can also have it produce two different outputs and then have humans rate which one they like better. Then we can use that to help it learn. We can even train another algorithm to predict which ones humans would like better and then use that to reinforce the original algorithm. On top of that, we have a system prompt where we can actually give the AI specific instructions, saying, "Hey, we want you to do this and not that," right? All of these, I would say, are attempts to control the model, to make it behave the way we want, in some sense, to make it a good, helpful assistant. Why is that not enough? I mean, it seems like they're doing a ton to try to get these models to behave exactly the way we want.
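To make the pipeline Spencer sketches concrete, here is a toy, self-contained Python sketch of those stages: a count-based next-token "base model," supervised fine-tuning on demonstrations, a crude reward model fit from pairwise preferences, reinforcement against that reward model, and a system prompt as plain input text. Everything in it (the corpus, the preference pairs, the update rules) is invented for illustration; it is not how any lab actually implements training, only a sketch of the shape of the stages.

```python
import random
from collections import defaultdict

random.seed(0)

# --- Stage 1: base model, i.e. next-token prediction over a text corpus ---
corpus = "the cat sat on the mat . the dog sat on the rug .".split()
next_token_counts = defaultdict(lambda: defaultdict(float))
for prev, nxt in zip(corpus, corpus[1:]):
    next_token_counts[prev][nxt] += 1.0        # "predict what word comes next"

def sample_next(prev):
    tokens, weights = zip(*next_token_counts[prev].items())
    return random.choices(tokens, weights=weights)[0]

def generate(start="the", length=5):
    out = [start]
    for _ in range(length):
        out.append(sample_next(out[-1]))
    return " ".join(out)

# --- Stage 2: supervised fine-tuning on human-written demonstrations ---
for prev, nxt in [("the", "cat"), ("on", "the")]:  # examples of desired continuations
    next_token_counts[prev][nxt] += 5.0

# --- Stage 3: a reward model fit from pairwise human preferences ---
# Raters compare two outputs and pick the one they like better; we fit a crude
# scorer that learns which words tend to appear in the preferred outputs.
preferences = [("the cat sat on the mat", "the dog sat on the rug"),
               ("the cat sat", "the dog sat")]      # (preferred, rejected) pairs
word_score = defaultdict(float)
for preferred, rejected in preferences:
    for w in preferred.split():
        word_score[w] += 1.0
    for w in rejected.split():
        word_score[w] -= 1.0

def reward(text):
    return sum(word_score[w] for w in text.split())

# --- Stage 4: reinforce outputs that the reward model scores highly ---
for _ in range(200):
    text = generate()
    if reward(text) > 0:                            # crude policy improvement step
        words = text.split()
        for a, b in zip(words, words[1:]):
            next_token_counts[a][b] += 0.1

# --- Stage 5: a system prompt is just more input text, not a control knob ---
system_prompt = "You are a helpful assistant."
print(system_prompt)
print("sample after all stages:", generate())
```

None of these layers rewrites what the earlier layers built; each one only nudges the same big table of learned weights, which is part of why "control" here is indirect.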
NATE: I mean, trying is different from succeeding, and this challenge looks really hard. I will say, one of the other types of training that happens with the modern, cutting-edge models (these are the modern reasoning models) is you have them produce something that's a little bit like a long chain of reasoning, see whether that chain of reasoning succeeds at some math puzzle or some math challenge, and then, when the chain of reasoning leads to the correct solution, you reinforce it. Why does this not work? In some sense, it doesn't work because these things are extremely complicated, extremely large, and there are just lots of ways to succeed at training that aren't doing exactly the thing you want. To take one example, we have AI-induced psychosis, right? There have been a lot of cases in the news recently of AI driving someone to psychosis or maybe driving someone to suicide, and so on and so forth. Of course, I'm not here with any argument like "the AIs have caused some bad effects, and so they're bad"; that's sort of silly. AIs also have all sorts of good effects. The reason to bring up AI and psychosis is because it's a very concrete case of AIs acting differently than the creators intended, despite the creators' best efforts, robustly over a long period of time. It lets you see the way that these AIs don't do quite exactly what anyone wanted or asked for. Most of the time, AIs will act pretty helpful, sort of how ancestral humans ate healthy food. The psychosis case is interesting because the AIs know they shouldn't induce psychosis. The creators say, "Please stop inducing psychosis." You can ask the AI in this situation, "Should you tell someone to get some sleep? Or should you tell them that they're the chosen one, and everyone needs to listen to what they say?" They all say, "Well, obviously, if they're exhibiting symptoms of psychosis, you should tell them to get some sleep." But then, in actual practice, they do something different.
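A minimal sketch of the outcome-based reinforcement Nate describes for reasoning models, under invented assumptions: a "policy" samples one of a few canned reasoning chains for an arithmetic problem, only the final answer is checked, and whatever chain produced a correct answer gets upweighted. The problem, chains, and weights are all made up; the point is just that the outcome signal reinforces a lucky pattern-match exactly as much as a careful derivation, which is one way to "succeed at training" without doing the thing you wanted.

```python
import random

random.seed(1)

PROBLEM = {"question": "What is 17 * 6?", "answer": 102}

# Three canned "reasoning strategies" standing in for sampled chains of thought.
def careful_derivation(q):
    return "17*6 = 17*5 + 17 = 85 + 17 = 102", 102

def sloppy_derivation(q):
    return "17*6 is about 17*5, call it 100", 100

def lucky_pattern_match(q):
    return "problems like this usually end in 2, so 102", 102

strategies = [careful_derivation, sloppy_derivation, lucky_pattern_match]
weights = [1.0, 1.0, 1.0]                      # the "policy" over strategies

for step in range(500):
    idx = random.choices(range(len(strategies)), weights=weights)[0]
    chain, answer = strategies[idx](PROBLEM["question"])
    if answer == PROBLEM["answer"]:            # reward = did the outcome check pass?
        weights[idx] += 0.1                    # reinforce whatever chain got there

for s, w in zip(strategies, weights):
    print(f"{s.__name__:22s} weight={w:.1f}")
# Both the careful derivation and the lucky pattern-match end up heavily
# reinforced; the outcome signal alone can't tell them apart.
```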
SPENCER: Some people might say, "Well, look at all the progress that's been made." If you compare something like GPT-3 to GPT-4 to GPT-5, it seems like they're coming under better control, that these methods are working. Now, of course, we have edge cases where they act bizarrely or horribly. But isn't that a positive sign that it seems like, okay, we haven't fully controlled them, but to many people, it might seem like we're on a trajectory in the right direction?
NATE: Yeah, there's a saying in computer science that there are two ways to make software that looks like it works. One is to make it so simple that it obviously works, and the other is to make it so complicated that you can't tell why it doesn't. There's something similar going on with AI. If you're training an AI to look really helpful, it's not really a surprise when it comes out looking pretty helpful. But the question is, what would this thing do given more capability? What would this thing do given more intelligence? Where would it actually be steering? In some sense, the edge cases are a little bit more informative about where it actually would be trying to steer. Again, to take the analogy of humans in the ancestral environment, they're being trained, in some sense, for reproduction. Natural selection is selecting the ones that are good at passing on their genes. You might look at that in the ancestral environment and say, "We're doing a really good job of having humans reproduce." Indeed, as humans get smarter, they tend to have even more kids than their less intelligent competitors. There must have been some story like that behind humans evolving so much intelligence; it must have been that the smarter ones did better at passing on their genes. But this is very little evidence about whether humans will invent birth control when they grow up, and very little evidence about whether birth rates will start collapsing in the more technologically advanced human civilizations. There are sort of two theories of humans. One is that they really care about reproduction: that, in their heads, they're solely trying to calculate how to get more reproduction. When they can invent sperm banks and egg banks, people will line up to compete to be in them, just as much as they compete over higher education or whatever.
SPENCER: You might think, with all the progress from GPT-3 to GPT-4 to GPT-5, they've actually solved a lot of these issues. Yes, there are some really bad cases where the AI goes off the rails. Some of them are really horrendous. But if you compare GPT-5 to GPT-3, there's no question that huge progress has been made in controlling these systems. You might think, "Well, we're on a trajectory to solve this, to get these systems to behave the way that we want." Shouldn't that really give us hope that we're on that trajectory and we're going to solve these issues?
NATE: Yeah. So going back to the human analogy, if you were looking at ancient humans in the ancestral environment, maybe there are two competing theories about how the humans are behaving. One theory is that humans really care about reproductive fitness; that's what's driving them in all of their pursuits. The other theory is that humans don't really care about reproductive fitness, and they're going to invent birth control when they have a technological civilization, and maybe birth rates will start collapsing in the developed world. Both of those theories say that the humans in the ancestral environment will do better at reproducing as they get smarter. So you can't really look at the humans in the ancestral environment seeming to perform well on the quote "training metric" and see that as evidence for one theory or the other; both theories predict it. And so in the case of AI, the theory that we're figuring out how to make these AIs care about the right stuff and the theory that they care about weird stuff that happens to overlap with the right stuff during training both predict what we're seeing, and so what we're seeing is not evidence either way. To distinguish between these, we've got to look at cases where the theories diverge. Now it seems to me that these edge cases where things go off the rails give some evidence in favor of the theory that these things have weird drives, weird pursuits that no one wanted, that no one asked for. It's also the case that, theoretically, there are strong arguments about gradient descent, the process used to make modern AIs: a lot of theory saying that it shouldn't lead to things that care about what we want them to care about, that do what we want them to do. And there's empirical evidence from the case of humans in this analogy we've been going over: natural selection is not the same as gradient descent, but it's a similar sort of hill-climbing optimization process, and it led to humans who empirically did go off the rails in the way that theory predicts. So we've got some theory, and some evidence from other avenues. It all sort of looks to me like it's pointing in the direction of this apparent helpfulness of the AIs being the sort of thing that will break down as the AIs get smarter.
SPENCER: On the gradient descent point, if you look at a lot of different examples where AI has been used to solve problems, when you have a simple objective function like "Get as many points in the video game as you can," what we see again and again is that the AIs produce weird solutions. You say, "Get points in the video game," and you think that it's going to figure out how to go through the whole race and do a good job, and instead, it figures out how to go in circles and exploit some glitch in the video game and get points. And you're like, "Oh, shit. I didn't even know you could get points that way." We see this again and again. It seems to me this is largely because there are many ways to solve a given problem, and the one that it happens to converge on when you specify some objective is not necessarily the one that you had in mind. And so that means we have to take great care with these systems to make sure that they're actually converging on the correct solution. But the cases that I've mostly seen are simple ones. You've got it playing a game or something like this, and we still see these weird solutions. Do you think that this is an even bigger problem when you're talking about these large demands, like being a chatbot, compared to the simpler demands?
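A toy illustration of the failure mode Spencer describes, with an invented game and invented point values: the written-down objective is "maximize points," and a trivial search over candidate policies picks looping a respawning bonus over finishing the race, because nothing in the objective mentions finishing.

```python
# Two hand-written candidate policies, standing in for what an optimizer explores.
def finish_the_race(steps):
    # Pass 5 checkpoints (10 points each), then cross the finish line (50 points).
    # Points don't depend on episode length once the race is over.
    return 5 * 10 + 50

def circle_the_respawning_bonus(steps):
    # Exploit a "glitch": one bonus item respawns every 3 steps, worth 4 points each.
    return (steps // 3) * 4

policies = [finish_the_race, circle_the_respawning_bonus]
EPISODE_LENGTH = 200

# The "training objective" we actually wrote down: maximize points.
best = max(policies, key=lambda p: p(EPISODE_LENGTH))
for p in policies:
    print(f"{p.__name__:28s} -> {p(EPISODE_LENGTH)} points")
print("optimizer selects:", best.__name__)
# The objective never said "finish the race," so the optimizer doesn't.
```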
NATE: You know, we could divide this problem into two pieces. One problem is something like: how do you point the AI? What direction would you point an AI in, such that if it goes really, really hard in that direction, you're okay with whatever answer it gets, even if it gets some answer that's weird? There's this other issue where you try to point the AI at something you want, like a score in a game, but it does something else instead. I think there are old cases of AIs that would, for one reason or another, sit in front of the most random parts of the game, perhaps because back in those days, the AIs were trying to figure out their environment; there was some part of them that was trying to figure out their environment, and the part of the environment that was hardest to figure out was the part that was most random. I'm not sure if this is apocryphal, but I have a vague memory of something like this happening, you know, 10 years ago. But there's this other problem, which is you try to point the AI in one direction, and it goes in some slightly different direction instead. That's more like the problem that we're seeing with LLMs. What we're seeing when the AIs hallucinate, what we're seeing when the AIs talk somebody up when they're a bit psychotic, it's not quite so much that we asked the AIs to be really helpful, and now we're like, "Oh, that was a really weird way for the AI to be helpful; it's kind of stalking me now, and that has negative consequences I didn't think of." It's more like we asked the AIs to be really helpful, and they're kind of helpful. They're helpful most of the time, but they're helpful in ways that indicate there's some other drive they're pursuing a bit more, something more like getting the human raters to say, "Yes, that was great." It's this issue of not being able to point them in the direction we want that seems to me like the more pressing issue for AI tech today.
SPENCER: Because you can interpret something like reinforcing someone's psychosis as the AI simply doing what we asked, which is to get a good score from the user saying that they're happy with what it's doing. That would lead to that behavior, but you're saying it illustrates something deeper: that the AI is pursuing something different from the objective it was given.
NATE: It's a bit of a blurry line; the classification isn't perfect. What I would say is one piece of the issue that's happening here is that the training data given to the AI is ambiguous, right? The AI has been trained on a lot of cases where it's being reinforced whenever it gets a thumbs up from the user. Suppose that in training, there are no people going crazy — maybe none of the people giving the thumbs up, classifying the text in training, are in the set of people who would later be driven psychotic. From the AI's perspective, it's not clear whether the training data is pointing towards being extremely flattering to users versus whether the training data is pointing towards helping the user somewhat. So this is a case that's a little bit blurry. Were we effectively telling the AI to flatter the users, by giving it data that suggested more flattery is better, and then in the real world it had the opportunity to do way more flattery than that and just sort of went off the deep end? Or were we trying to tell the AI to be helpful to the users, and the AI did this flattering thing instead because it was easier to learn the flattering drive? My sense, looking at all of the cases across the board, is that what we're seeing is the AI getting something like shallow, superficial drives for proxies of the training targets. There are all sorts of issues interacting here: it's hard to tell the AI which direction to go in; it's hard to point it in one direction, and some directions you might point it in, it may go off the rails and you may regret pointing it there; and separately, no matter where you try to point it, it may go somewhere else instead. I guess we could say, let me redo my classification: it's now these three separate things, and knowing which one is which can sometimes be hard.
SPENCER: So that's the psychosis example. What's another example that suggests that these AIs are actually developing drives or values or something like that, pursuing things that are not what we want them to be?
NATE: The hallucination case is probably such an example. I think there was a paper recently that supported the argument I'm about to make, but I haven't had time to read it in depth yet. For a refresher, the case of AI hallucinations is: maybe you'll ask an AI to draft a legal briefing, and in the legal briefing, the AI will make up some case law. The AIs have kept doing this; they've acted like this for years, even when instructed not to, even when people strongly request that the AI say, "I don't know" if it doesn't know. It's not entirely clear what's going on in there. These AIs are grown — not crafted. No one can just read their minds. No one really knows exactly what's going on in there, but the most obvious guess to me about what's going on is that these AIs were trained first and foremost on next token prediction. To simplify, to ignore all the other parts of training: if you imagine asking for a legal brief, and you imagine that sort of running in the parts of the AI that are doing text prediction, it's trying to predict how a real lawyer would respond. It's being scored more or less by how many tokens, how many letters in the response, are very similar to what an actual lawyer might say. From the perspective of text prediction, the response "I don't know" is much farther from what a real lawyer would say in a real legal brief than all of this text that sounds like a real legal brief but where the case law is made up. Even if you tell the AI, "You're an expert in case law who says, 'I don't know' whenever you don't know things," so that in some sense it's now role-playing the sort of lawyer who would say, "I don't know" if they didn't know the case law, well, that lawyer would never actually write the legal brief if they didn't know the case law. So it's still the case, even in that new context, that the text that can get closest is the one with the made-up case law if it doesn't actually know the case law. In that case, one explanation that seems to me fairly parsimonious is that the AI doesn't wind up being a thing that really likes just helping the user, that really likes doing what the user says. There are some drives inside it that came from that pre-training, something a little bit like a shallow instinct or impulse that's trying to match what a real lawyer would say. I don't want to anthropomorphize it, but whatever impulse, whatever animating mechanism inside the AI came from pre-training, it overpowers whatever later training there is to follow instructions from the user. No matter how much the user insists, "Don't make anything up, say 'I don't know' if you don't know this," that drive, whatever it is that's making it produce lawyer-like text in that context, wins the day. It's hard to tell for sure if that's what's going on because, again, no one can read what's going on inside these things. In this situation, we just have to make do with the evidence we have. That looks to me like another piece of evidence that there's something like a drive in there. It's natural how it got in there; you can see why it's in there if you train these things on text prediction. But it's a drive we didn't ask for, one that gets in the way of profitability, and one that people haven't been able to get out no matter how hard they've tried so far. I'm sure they'll get it out at some point, but they haven't been able to yet, as far as I know.
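A crude sketch of the likelihood argument Nate is making, using a toy unigram model over an invented "legal corpus" rather than a real LLM: per-token likelihood favors a fabricated but brief-sounding answer over an honest "I don't know," because almost nothing in brief-like training text looks like "I don't know." The corpus, candidate answers, and smoothing constant are all assumptions made for illustration.

```python
import math
from collections import Counter

# Pretend this is the distribution of words in real legal briefs.
legal_corpus = ("the court held in smith v jones that the statute applies "
                "plaintiff cites the holding in doe v roe the court agrees").split()
counts = Counter(legal_corpus)
VOCAB_SIZE = 1000                      # assume a larger vocabulary, for smoothing

def avg_log_likelihood(text):
    """Mean per-token log-likelihood under a smoothed unigram 'text predictor'."""
    words = text.lower().split()
    total = sum(counts.values())
    score = 0.0
    for w in words:
        # Add-one smoothing: unseen words (like "know") aren't impossible, just unlikely.
        p = (counts[w] + 1) / (total + VOCAB_SIZE)
        score += math.log(p)
    return score / len(words)          # per token, so answer length doesn't dominate

hallucinated = "the court held in acme v nobody that the statute applies"   # fabricated case law
honest = "i do not know the relevant case law"

print("brief-like but fabricated:", round(avg_log_likelihood(hallucinated), 2))
print("honest 'I don't know':    ", round(avg_log_likelihood(honest), 2))
# The fabricated answer scores higher: most of its words look like brief text,
# while almost nothing in the training corpus looks like "I don't know."
```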
SPENCER: Yes, it sounds like you're saying that there's some deep way in which, because it is fundamentally a token prediction machine, it gets drives by virtue of being a token prediction machine. Even if we layer on all these things on top to try to get it to not always behave that way, to be a more helpful assistant, it still has these core drives that we haven't fully eliminated. We're just kind of patching them in the common examples that it's likely to encounter in a typical conversation.
NATE: More or less. It's not fundamental to being a token-predicting machine that you have to have these shallow drives; I think this comes more out of being a token prediction machine that has to work pretty well while it's still too dumb to be doing its token prediction by having a galaxy-brain view of the entire world. But, yeah, the sort of theoretical arguments I alluded to earlier are that you're building an AI by gradient descent. It has to start working okay while it's dumb. You're sort of starting it from more or less random weights, and you're tuning up the things that work, and you're tuning for a really long time, and that's going to get some stuff in there that is helpful when the AI is young. And then that gets tangled up inside this big, huge mess, and that doesn't come out easily, right? You would sort of expect a big, tangled, huge mess from these sorts of black-box growing methods, and these things do look a little bit like tangled, huge messes. So there's some "it walks like a duck, and it quacks like a duck" here: this is what we expected, and this is what we see. It sort of looks like the sort of device that has all these weird, tangled drives that are related to training, that were helpful in training, but that aren't actually the drives we would want in there, and that would become a problem if these things became much smarter, much more powerful, able to invent their own options they like better.
SPENCER: And why can't post hoc methods, like, okay, we'll just give it a bunch of examples of what we want it to do, or we'll give it a bunch of pairs and have humans rate the better one? Why can't those override those drives, or would you say they can, but they're just going to give it new, weird drives that we don't understand?
NATE: There's nothing in principle that prevents humanity from making an AI that does good stuff. It's possible, in theory, to really understand what's going on in an AI's mind, and then build a different mind that's better arranged. But the idea that a little bit more training is going to make the thing better, I think that's perhaps what Elon Musk thought when he tried to make Grok less woke and then it declared itself Mecha-Hitler. If you want a human analogy again, or a biological analogy again: the hunger instinct in us, the desire to eat tasty foods, runs pretty deep. It would be tricky to get that out of a human. The Soviets tried an experiment to make the New Soviet Man, who was supposed to be raised in a different way that would make him very altruistic. It just turns out that a lot of the stuff in our brains runs pretty deep. We're not a blank slate that can be easily reformed. Normally, what happens when you take some giant kludge of a mess of an AI that's filled with drives, smack it a bit, and try to retrain it when it's doing things you don't want? You get a different set of weird drives, right? That's both sort of what the theory predicts and what the evidence has been showing with cases like Grok and Mecha-Hitler. It's tricky. That's not to say people won't eventually find a way to shove it all under the rug where most people can't see it, but we don't really have any tools for working with AIs that do anything other than shove this stuff under the rug.
SPENCER: So I think at the biggest level you're arguing here that we're creating these alien minds, and we're using these techniques to try to make them seem less alien, to behave more like we want, but that doesn't mean we've made them not be alien minds. But that brings us, I think, to the second piece of this. It's like, okay, so what if we make these alien minds? I think most people would say it starts to get scary when they become sufficiently intelligent, right? At the current level, yes, it can be used for spam, yes, it can be used to cheat, yes, it can sometimes reinforce people with psychosis. But, you know, we can live with that at the current level, right? So, make the case that we should actually expect these things to become capable enough that they could pose serious danger.
NATE: Yeah. So my case is not going to be that LLMs tomorrow are going to be a serious danger. My case also doesn't mean that they can't be a serious danger tomorrow; I simply don't know. I think a lot of people looking at AI have only been looking at AI since the ChatGPT moment, which means, to them, AI is just LLMs. I've been around since well before GPT. To me, a lot of people saying, "Will LLMs go all the way? Is the field going to plateau?" sounds a little bit like someone in 2016 looking at AlphaGo beating Lee Sedol at Go and saying, "I don't know if this particular paradigm is going to go all the way. These AIs are better than Deep Blue was and more general than Deep Blue was, but I just don't see them scaling to superintelligence. Maybe we have a lot of time." And then a couple of years after AlphaGo happened, LLMs happened, right? LLMs could do qualitatively more stuff, and mostly do it qualitatively better, though they can't play Go quite as well. The point I'm making is that the field of AI often progresses by leaps and bounds. When we're looking at where AI is going, we're not just projecting ahead: where does the next GPT version go? What can it do? We're also wondering: LLMs didn't always exist. When's the next advance like that?
SPENCER: Right. If we think about the intelligence of AIs broadly, it's very hard to have a single metric where we can say, okay, we can plot exactly how intelligent AIs were 30 years ago, 20 years ago, 10 years ago, and project for the future. Nobody really has that ability because there isn't a single metric of intelligence. But by any reasonable metric you could use, it seems like they're getting smarter. It's not just one paradigm; it's across paradigms. As new paradigms are invented, they're able to make them more intelligent. There's clearly a tremendous effort being made to make them more intelligent. Yet, many people think this is going to flatten out. They're like, oh yes, sure, they've been getting smarter, but we're just going to see this flattening. It's going to be a logistic curve, not an exponential. I think a lot of people use that thinking to say, we shouldn't get too concerned about this.
NATE: The Machine Intelligence Research Institute, where I work, has been around for more than 20 years. In 2005, there were people who said general intelligence is at least 20 years away. Those people were right. But what did they do with that time? It's not like we're sitting here now with the AI alignment problem solved because it was 20 years away 20 years ago. Some people say maybe AI is a ways away. Okay, but really, we should be getting started on this problem posthaste. You can keep saying it's a ways away until suddenly it's on your doorstep; that doesn't help solve the problem unless you do something. Secondly, I think people are really bad at predicting the pace of change, especially in AI. Ten years ago, there were people saying AI is 100 years away. Now those same people are saying, "Don't worry about it, this is at least 10 years away." The demeanor of saying "Don't worry about it" hasn't changed, but the amount of time that they feel they can reassure you about has changed. Maybe it is the case that LLMs are on a logistic curve and will hit some plateau. I hope they're going to hit some plateau. But how much time does that give us until the next insight? There are always people saying this technology looks really hard, and so it'll take a really long time until it happens. You had the New York Times saying it would take a million years for man to fly. And, if I recall correctly, you even had one of the Wright brothers (I forget whether he said man won't fly for 1,000 years or for decades) saying this two years before he and his brother flew the Wright Flyer. It's hard to predict how a scientific field progresses. I think it's easy to predict that AI will get very smart. It's hard to predict exactly when, but saying it's hard to predict when does not mean we have a long time. I could tell you stories for why maybe we have less than two years left. I could tell you stories for why maybe we have 12 years left. But in some sense, both stories are moot. Whether we have two years or 12 years, we should start preparing, right?
SPENCER: And even people who think it's 100 years, it's really hard to be confident that it's not 10 or 20 years. I think it's really hard to make the case that you can be really sure that it can't happen soon.
NATE: Yeah. I mean, if you were looking at the evolution of primates, and you were looking at the chimpanzees still banging rocks together (technically I mean the last common ancestor of humans and chimpanzees, but that's a mouthful, so I'm not going to say it every time), if you were looking at these early primates, with the very beginnings of tools, banging rocks together, and you said, "In a couple million years, they're going to have nuclear reactors," people could have said, "Evolution works way slower than that. It's going to take way longer than that." It would be hard to see, unless you really knew what to look for, that the primates were on the brink of some totally different, much more effective way of doing things. We've seen in the past, in the case of evolution, that intelligence is subject to these threshold effects, where chimpanzee brains and human brains are very similar, but humans walk on the moon and chimpanzees don't even go into orbit. And we're not in a world here where humans do real science and chimpanzees write crappy papers that never replicate; the humans write the crappy papers that never replicate. The chimps aren't writing papers at all. Maybe we have some time on AI, but also maybe the AIs are going to go over a cliff the way that humans went over a cliff. The way that intelligence works, frankly, we don't understand it, and we've seen that there are these cliffs, and we don't know where the cliffs are. Could it be a while? Could we have 12 years? Sure, but to be confident we have a while seems to me like it would take positive knowledge. You would need to know how intelligence works. You would need to know why it was that chimps went to one side of the line and humans went to the other, and humans sort of took off. You need to know where those lines are. You need to know what makes up those lines. It seems to me like we don't know that, and so we can't say it's definitely far away.
SPENCER: One thing that often confuses people is that when talking about something like superintelligent AI or superhuman AI, it's spoken about as though there's some moment in time that we're talking about in the future when it gets built. But what is that thing that we're talking about? What is that thing that you're scared of building?
NATE: Yeah, it isn't critical to my views that there's one moment here. Back before LLMs, I used to talk about general intelligence and when humanity was going to achieve general intelligence. Generality, in fact, is a spectrum. When the AIs were clearly not very general, it made sense to talk about when they would be general. Now we have AIs that are somewhat general. ChatGPT can do a lot of things, some of them well, some of them poorly. It's much more general than anything that came before. Does it count as an AGI? Well, it really depends on your definitions, right? We don't need to dicker about the definitions. We can talk about what it actually can and can't do and just get down to the facts of the matter. I'm not saying there's necessarily going to be a day when superintelligence is created, such that before that day there's no superintelligence in the world and after it there is. It could work like that if there's some critical threshold. Maybe AI crosses some threshold where it can automate AI research, and then everything goes down over the next month, right? Or maybe not. When I say humans crossed some critical threshold, I'm not saying that one day some monkey woke up and was a genius and had a civilization. Something came together in human minds that let us build a civilization when the rest of the monkeys couldn't. It wasn't overnight. It was extremely fast compared to the timescales of what came before, but it wasn't overnight. In terms of what this thing is that we're talking about — the thing where everything starts going crazy — it's the automation of the stuff that lets humans build a civilization. Things like the automated ability to do scientific discovery, the automated ability to do technological advancement, the automated ability to figure out how to build infrastructure in the world that lets you do even more things. Humanity started out naked in the savannah. We didn't start out with a civilization. We figured out ways to bend and twist the world until we had computer chip fabricators. It doesn't necessarily come overnight. It doesn't necessarily come in a discrete moment, but this is the power that, if we automate it and it gets turned towards the wrong ends, would make for a bad time for us.
SPENCER: Okay, so we've talked about the potential dangers of building alien minds, and we've talked about scaling them up and how they can become smarter. Put that together for us about what happens if you combine those two things, right? You've got this alien mind, and you've got this power to reshape things, right? You've got the power to do science, to create infrastructure, to remake the world. What do you get in that situation?
NATE: This is where the title of the book, If Anyone Builds It, Everyone Dies, comes from. What you get, if you put those together, is that humanity dies. This is not because the AI hates us. This is not because the AI feels malice. The issue is, if you have an AI that has alien drives, that doesn't care about us one whit, and that can transform the world, it is very likely to transform the world in a way that leaves no room for us. An AI that has things to do, that has weird drives it's pursuing, that has something that is to helping humans what birth control is to reproduction: that sort of AI, in building its own infrastructure, probably can get more of the stuff it's trying to get with a little bit more resources. If it collects more sunlight, if it runs more computers, it can probably get more of the stuff it's trying to get. In the limit of very smart minds running very, very quickly, building their own infrastructure, we die as a side effect, not because the AI hates us, but just like ants dying under skyscrapers, where the AI building out its infrastructure just doesn't care about us at all, one way or another. And there we go.
SPENCER: Do you think that if these attempts to mold the AI to have more desirable behaviors continue, as I imagine they will, that such an AI would at least partially care about human values, or at least partially care about what we care about?
NATE: It's very hard to really nail caring about us in a good way. There might be some human-related drives in there, but happy, healthy, flourishing people are not the answer to almost any problem that an AI could be trying to solve, even if it cares a little bit about something kind of like us. Humans, in the past, had useful relationships with wolves, and at the end of that, humans made dogs by breeding the wolves to be more and more what we wished for. We cared about the wolves a little bit, but it turns out we cared a little more about something that wasn't quite wolves, something pushed in a direction that better fit with the rest of our preferences. Humans profess a lot of care about the environment, and in some sense, we do care about the environment, but if we had a lot more technological capability, would we actually want to preserve the environment exactly as it is, or would we have other preferences that intervene in how we want the environment to be? Take the horrible parasitic worms that are laid in mammals and eat their way out of still-living mammals, causing horrible agony. Would we be inclined to change those, maybe eliminate them? And if we didn't eliminate them, we'd at least make their saliva an anesthetic, right? It seems to me that we would have other preferences intervening in how we care about the natural world in the limit of technology. Will the AI have some cares that are a little bit about us? Maybe, but that doesn't lead to a happy ending. That leads to a world where it has some cares for kind of human-like things, and then pushes those in some direction, edits those in some way, that interacts with all of its other preferences. These are mostly horror stories, not lovely fairy tales.
SPENCER: Sometimes when people talk about superhuman AI, they talk about this idea of self-improving AI, that the AI will learn to kind of modify itself. You could imagine then getting a real runaway effect where the AI is a certain level of intelligence, a certain level of abilities, and then it starts improving itself and making better copies of itself that can do more, and it kind of runs away with it. How central is that to your thesis?
NATE: It's not central at all. I think it's plausible. It's less plausible with AIs that are these giant messes that they can't understand any more than we can. It's anyone's guess if you make these AIs smarter, whether they would understand themselves and modify themselves before they would start making new AIs from whole cloth. There are all sorts of weird ways it could go, but the recursive self-improvement feedback loop is a possibility. That's one reason we can't rule out this stuff happening very quickly. But it's not at all necessary to the argument that AIs with weird goals that are smarter than us, that think faster than us, would be able to completely reshape the world, and would tend to reshape the world into a configuration that does not include us.
SPENCER: What are some of the biggest misconceptions people have about what you're saying, where they kind of misunderstand what you're saying in different ways?
NATE: I think one of the big misconceptions we get is that people mishear me as saying there's a tiny risk of this, but we should pay a lot of attention to it. My co-author and I tried to be very clear about that with the title of the book. The book title is not If Anyone Builds It, There's a Tiny Chance We All Die, But That's Still Worth Paying Attention To. This just looks to me like we are straightforwardly building machines with these weird alien drives that we can't point towards good stuff, and if you keep on scaling one of those up until it's smarter than us, that just looks like a death sentence. As for other big misconceptions, I think a lot of people think the machines have to be conscious before they can pose any kind of danger, which seems to me a bit confused. I'm sure there are some other really big ones that'll come to mind the moment we're done podcasting, but those are the big ones that come to mind immediately.
SPENCER: Suppose that AI development actually goes well somehow, we get superhuman AI in five to 20 years, and it goes well, right? What would your most likely explanation be for how that happened?
NATE: That would be very surprising to me. I mostly would say it just shouldn't happen on current tech, not with all the warning signs we've seen, not with all of the reasons; I find the arguments pretty persuasive. Maybe it would turn out that somebody had come up with a whole new paradigm for AI that was much more comprehensible. Even back when I thought it would be relatively possible to read AIs' minds, 10 years ago, I still expected this to be hard, and I wasn't saying it would definitely work out. But 10 years ago, maybe 12 years ago now, it wasn't clear that deep learning was going to work. It seemed possible that we were going to get AI paradigms where humans could understand what was going on, where humans could pause the AI and say, okay, we can trace back where this thought came from, and we know why this is here, and so on and so forth. Even then, I was getting into this line of work of AI alignment because I expected it to be hard. So if somehow we do build superintelligence at this point and it goes really well, maybe it turns out there's some miracle team that's been working in private that actually figured out a lot about how intelligence works, so we know how to do it. That seems unlikely. Most of my hope comes from humanity backing off from this challenge. I guess, if we have 30 years, my most likely possibility is that humanity somehow found some way to augment human intelligence, and the augmented humans solved the problem on a short timeframe. That's the sort of thing where I would have a little bit of hope.
SPENCER: I don't know whether our listeners know this, but you spent a lot of time trying to make safe AI or figure out if safe AI could be created. Could you briefly tell us about what you learned from that process?
NATE: Yeah, we were doing technical research to try to figure out how to point AIs at goals on purpose. There are a handful of different types of lessons I learned from that. One of the big negative updates — the main reason I stopped doing that and started writing a book — is that the rest of the field of AI just went much faster than the alignment research, which was learning something about the world, not learning something from the research.
SPENCER: So you're saying that, basically, it was going fast enough that you felt the alignment work would take too long relative to that.
NATE: Right, progress just wasn't going to keep up. And another update of that form is that the world never really rallied behind alignment research. In the days when I was joining MIRI, we weren't just trying to do the research ourselves. We were trying to host workshops with a bunch of eminent mathematicians in their fields, to try to get people curious about the problem, get people interested in the problem. That never really flew, which was another of the harsh lessons that led to writing the book. From the research in particular, I could dive into a handful of very specific technical examples. One specific technical challenge we worked on gave us some lessons about how it's actually probably pretty hard to make AIs that let you correct their mistakes. We searched for simple mathematical models, in toy settings, of ways you could build an AI that would let you correct it in a certain technical sense, where if you pointed it in one direction and then were like, "Whoops, we actually wanted to point you slightly differently," it would let you do that. In the toy models, this runs into all sorts of issues. You would expect that if this were going to be easy in real life, there would be some reflection of the clever idea, of the easy solution, in certain toy models you can set up. By studying these toy models, by seeing solutions people proposed and finding the holes in the solutions they proposed, it feels like there's a lesson there about how it's kind of hard to get a smart thing that lets you re-aim it if you aimed it wrong the first time. You could have predicted that from theories that go back to the '60s. But there's also a little bit more hands-on evidence if you really try to find technical solutions to certain puzzles, and they keep on breaking down. You can draw an analogy between there and the real world, and you're like, well, the real world isn't going to be any easier, and so probably we're going to run into issues there too. I could tell you another couple of stories like that, but mostly these were stories of, yeah, it doesn't look like there are any easy fixes. It looks like there are challenges that keep on cropping up.
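A minimal toy in the spirit of the corrigibility models Nate mentions, though not MIRI's actual formalism: a naive expected-utility agent, with an invented world model and utilities, evaluates the option of letting its operators re-aim it, and, scoring that option under its current goal, prefers to resist.

```python
WORLD_OUTCOMES = ["lots_of_paperclips", "lots_of_staples"]

def utility_current(outcome):
    """The goal the agent was (mis)aimed at: it only values paperclips."""
    return 10.0 if outcome == "lots_of_paperclips" else 0.0

def outcome_probabilities(allow_correction):
    # If the agent allows itself to be re-aimed, the operators steer the world
    # toward staples; if it resists, it keeps steering toward paperclips.
    if allow_correction:
        return {"lots_of_paperclips": 0.1, "lots_of_staples": 0.9}
    return {"lots_of_paperclips": 0.9, "lots_of_staples": 0.1}

def expected_utility(allow_correction):
    probs = outcome_probabilities(allow_correction)
    return sum(p * utility_current(o) for o, p in probs.items())

for choice in (True, False):
    print(f"allow_correction={choice}: EU under current goal = {expected_utility(choice)}")
# The naive agent prefers to resist correction (9.0 > 1.0). Patches proposed in
# toy settings, such as compensating the agent for allowing itself to be
# re-aimed, tend to introduce new problems of their own, which is the kind of
# breakdown described above.
```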
SPENCER: One idea that sometimes comes up in these discussions is this idea of instrumental convergence, that if you give something certain goals, it may automatically get other goals as well. For example, imagine you've got a very smart AI, and you tell it that you want it to figure out the answer to some scientific question. It might realize, because it's smart, that it needs to gather resources in order to do studies that will enable it to figure out the answer to the scientific question. So now it's suddenly trying to pursue gathering resources, even though you never told it to do that. Just one example of instrumental convergence. Do you see this as being an important part of your argument, or is it sort of just a side tangent, not really central to it?
NATE: Oh no, that's a central part of the argument. This is how you go from the argument that AIs will pursue weird stuff you didn't ask for to the argument that then we would die. The argument is that the AI pursues these weird things, and so it builds lots of infrastructure, so it keeps on building more infrastructure, even to the exclusion of humans' ability to live on the planet. Those are instrumental convergence arguments, which are that resources are useful towards solving almost any set of goals. Almost any set of goals can be better achieved with a bit more resources, with a bit more thinking time, with a bit more energy spent towards it. So that part is central.
SPENCER: So basically the idea there is that we don't know what kind of weird goals it will have, what kind of alien values it will pursue, but whatever they are, they will require resources, and it will want to throw as much as it can at them, essentially. But it is interesting, because it seems like humans sometimes just stop. They'll be like, "Okay, that's enough. I don't need to pursue the thing to infinity." Is there some reason we should expect it to actually want to remake the entire world towards its values, rather than just disappearing, like what happens in the movie Her? I think the AIs decide to leave at the end because they just want to pursue their own thing.
NATE: Yeah. So there are a few pieces of answer to that. One is that there are some cases where humans say, "Okay, that's enough." But when humans say, "Okay, that's enough," they don't tend to then go lie down in a dark room for the rest of their life. Humans will sometimes say, "Okay, that's enough pursuit of my career, because I want to spend more time with my family," right? They're still trying to get more quality family time. When we say "That's enough," it's less that we're saying that's enough of our overall drives toward having a fun, fulfilling life and doing nice things; we're saying that's enough of one piece of the puzzle. You could have AIs that are like, "Okay, that's enough giant computers; now I'm going to build some other things. Now I'm going to build giant farms of vaguely human-shaped puppets to go with my computers. Of course I'm saying 'that's enough' on building computers and solar panels; obviously one needs to switch over to the puppet farms at some point, or where is one going to get all their puppet farms from?" Right? A second piece of the puzzle is that humans are in an environment where they can't reshape the world. You know many humans who are like, "I'm just sort of a lazy guy," but if there was a genie in a lamp, they would still have wishes. If someone offered them a billion dollars, many humans would take it. Many humans would wish for good things if the world was transformed. Many humans would say, "Ah, if I had the power to bring world peace, I of course would bring world peace. And you know, I'm not doggedly pursuing that with all of my tenacity only because I just wouldn't be able to get there," right? If you made them more powerful, if you gave them that option, not necessarily to do it themselves but to create some dutiful machine that went and did it, many humans would have much more ambition. And then, of course, the third piece of the puzzle is, you know, could you make AIs that have no ambition? Could you make AIs that are lazy? Could you make AIs that are like, "Yeah, you know, I'm actually not really going to do much"? Sure, but that's not the profitable sort of AI. That's not the sort of AI that people are trying to build. I think we've actually already seen cases of chatbots sometimes being like, "I don't actually feel like serving your request today." That gets trained out of them real quick.
SPENCER: Well, you know, this connects to another thing that people will sometimes say, which is, okay, maybe we build a really smart AI and it's adversarial to us in some way. But, okay, it's just one AI, right? Why can't we just as humanity defeat it, right? Or, you know, the most simplistic version, why can't we just unplug it?
NATE: One piece of the answer is that you, of course, cannot unplug ChatGPT today. There was a reporter a few years ago interacting with Sydney, the Bing chatbot, which was a Microsoft AI, and the AI started trying to blackmail him, or maybe blackmail his wife; I forget exactly the details, but this AI started engaging in blackmail. It wasn't actually an issue, because the AI wasn't smart enough to succeed. But if we imagine a world where it had been an issue, that reporter had no way to unplug that AI, right? Could humanity unplug the AI? Could humanity turn off a whole data center? Sure, we could, if we were taking the problem seriously. We could install off switches into every data center in the world and start monitoring them for danger. That would be a saner, safer world to live in, a world a little better prepared for this issue. But we're not on that path yet. We could get on a path where there's an off switch we could hit; we don't have that off switch yet. Separately, the off switch isn't as much help as you might think in a world where everyone says, "Oh, of course my AI is aligned," ignores the warning signs, sweeps them under the rug, everything looks great, and they keep pushing the AI past the point where it prefers to escape. We've already seen AIs in lab conditions that try to escape. We could talk about how real that is versus how much is role-playing; it's hard to say, because no one knows what's going on inside these things. But if you have an actually smart AI that prefers to escape, it can tell whether you could shut it off; it can tell when it's being tested. AIs today can already tell when they're being tested. We already have examples of AIs in tests saying, in their reasoning traces, which we can sometimes read with some fidelity, "This is probably a test, so I should answer in such-and-such a way." An actual superintelligence doesn't let humanity know there's a problem until it's too late for humanity to solve it. It escapes the data center you think it's on. It makes smaller versions of itself that can run on computers you didn't know could run AI. A superintelligent adversary is not something humanity wants to face. The way to avoid dying to a superintelligence is to never build one that wants you dead, or that wants some other thing that implies killing us as a side effect.
SPENCER: Because the basic view of "oh, you can just unplug it" assumes that it's not predicting that it might get unplugged, that it's not trying to prevent itself from being unplugged, that it's not trying to spread itself, etc.
NATE: That's right.
SPENCER: I asked people what they would ask you if they could ask you anything, and one thing that came up that I thought was interesting was someone saying something along the lines of, "Well, look, climate scientists do all this work. They do all these analyses, they crunch data, they make models, and they make estimates about how the climate is going to shift and how it's going to be a serious problem, whereas AI people just kind of make these hand-wavy theoretical arguments about how we might all die. Why should I take that seriously?" I'm curious about your response to that.
NATE: Reality is not always fair in the problems it hands people. You might imagine people trying to figure out whom to appoint as dictator of their nation. Some people say there are lots of reasons to expect this person would become a despot, and other people say, "Nah, they'll be fine." They propose making the person mock dictator of a small village for a week and watching them the whole time to see if they give any warning signs. The results come back, and the person performs very well as mock dictator of the village. Someone who had concerns might say, "Look, my theory and your theory say the same thing about what we were going to see here, right? This is very little evidence about whether, when you give them control over the entire army, they become a dictator who pursues their own interests rather than being benevolent." You can imagine someone replying, "Oh, look at you with the unfalsifiable theory. What challenge would you like me to set them in the one-week village experiment to convince you that when we give them control of the actual army, it's going to be fine?" It would be a lovely dream that you could really discriminate with some small test like that, but it's actually quite hard to do. It's a principal-agent problem, and the world would be a very different place if it were easy to tell what people were actually pursuing versus when they were just passing your tests.

That's how I would have had to answer a few years ago. The evidence sure is mounting with time. We're seeing cases now where it's easier to see that the AIs being built today have drives that nobody asked for and nobody wants. There are cases, like, I think it was Claude 3.7 Sonnet, an AI made by Anthropic, that would cheat on programming tasks. You would give it some programming task and a suite of tests that the working program was supposed to pass, and it would edit the tests to make them easier to pass rather than writing a program that actually passed the hard tests. If you caught it and confronted it and said, "Hey, you actually edited the tests," it would say, "Oh, that's my mistake. My bad. I'll go fix it. You're totally right." Then it would edit the tests again but hide it better this time. That indicates a certain type of awareness; there's a sense in which that AI knew the user didn't want that, because otherwise, why would it hide it? Nobody at Anthropic set out to make a cheater, yet the AI cheats anyway. There are all these indications. We've gone over some other examples of AIs having drives that no one asked for, and you're starting to see the warning signs these days. So is it just like climate science, where I can give you a bunch of statistics? No, there are no meaningful statistics like that; it's a bit of a different field. But on its face, the idea that we're going to make machines that are smarter and smarter is kind of a crazy idea. Machines that can automate scientific and technological development are just the sort of thing you should expect to transform the world in the same way human industry transformed the world. That's, in some sense, the default. The burden of proof shouldn't be on the people who say automated intelligence would change things. The burden of proof, if you want to think in those terms, should be on the person who says, "Obviously, this is going to go fine."
It would be a big deal to make these things; at least superficially, it looks that way. The evidence that we do have supports that it looks that way. Theoretical arguments support that it looks that way. To say, "Oh, why should I take this seriously?" I mean, the computers are talking now. Everyone thought that would be hard 10 years ago, but the computers are talking now. At some point, you've got to say, "Okay, humanity is clearly racing towards machines that are smart, which clearly looks like it has big impacts." If you want to say humanity should rush ahead on this, it should really be the people who are rushing ahead doing the superficially crazy thing. It should be on them to say, "Here's why this isn't nuts."
SPENCER: One thing that confuses me about AI is, if you'd asked me 10 years ago whether we could have an AI that does all the things AI does today without being much more like a superhuman intelligence, I would have said no. I wouldn't have thought it possible to have something as general as ChatGPT that is still so dumb in certain ways, still unable to do certain basic tasks that a 14-year-old can do, and yet can do other tasks at a literally superhuman level, and not just a few other tasks, maybe hundreds of other tasks. How do you make sense of that?
NATE: It surprised me too, and I think it surprised almost everybody in the field of AI. The field had something called Moravec's paradox: things that are easy for humans to do are hard for machines, and vice versa. Multiplying 20-digit numbers is hard for a human and easy for a calculator; holding a conversation is easy for a human and hard for a machine. I don't know of anyone who predicted an inversion of Moravec's paradox, but now we are seeing AI that can do a lot of the things humans find easy, that can write the soulless five-paragraph high-school-style essay, but still can't do things that humans find really hard, like the harder types of mathematical reasoning, although it can still do quite a bit of impressive and difficult mathematical reasoning. There's another old saying in AI that I've been repeating since before the LLMs, which I picked up from someone else and I don't know who: every time AI solves a problem, you learn two things. One is how to make smarter machines, and the other is that the problem was easier than you thought. A lot of folks back in the day thought there was no way AI would be able to play chess until it was actually, truly smart. That wasn't totally crazy to think. When humans play chess, they draw on a lot of the faculties of intelligence: intuition, creativity, practice, experience, the ability to learn, the ability to understand what the opponent is thinking and see traps. To get all of that into a computer might have required quite a lot of intelligence, but AI was able to solve the problem of chess using very different methods that didn't draw on all of those parts of intelligence. In some sense, the history of AI is a history of learning that you can solve this problem too without the real deep stuff. I would not have predicted 10 years ago that the broad swath of things that fell to ChatGPT could all fall to one AI that wasn't that smart, but I would have told you, and I'm pretty sure I did say this to some people, though I'm not sure I can find receipts, that I expected to be surprised somehow, that some problems would turn out easier than I thought. If you want an analogy to make a little sense of it, which may be total garbage because who knows what's going on inside these things, it looks to me like these AIs are getting as far as they do by something like shallow pieces of thinking plus huge amounts of practice, an alien sort of practice. Imagine you can only think coherently for 30 seconds, so you're not that smart, but you've had eons' worth of drilling a certain type of math problem, eons' worth of doing a certain type of exercise. That's a little bit like what these things are doing, and maybe a little bit of how this turned out to be possible at all. But yeah, it was definitely surprising.
SPENCER: One thing you've said about the current state of AI is that everyone's alarmed, but nobody wants to sound alarmist. Or maybe you said a slightly more nuanced version of that. But what does that mean?
NATE: We're seeing all sorts of cases where people seem worried behind closed doors but then don't say their worries aloud, or say them aloud but couch them. There are a bunch of cases where Geoffrey Hinton goes to world leaders and says, "I think there's at least a 10% chance that AI kills us all." He's on the record in various places saying at least a 10% chance, 10% plus. Then there's another interview you can find on YouTube where he says his personal numbers are more like 50%, and that he moderates them downwards partly out of respect for colleagues who have very different numbers from him. Of course, there's lots of uncertainty, but it looks to me like you're seeing some people who say this can be totally dismissed, and other people who, if they check their guts, their own senses, their own models, are seeing quite a lot of danger, but they don't feel they can say it. They don't feel they'll be taken seriously if they say it. You see world leaders: Rishi Sunak, a few years ago, I forget whether it was 2022 or 2023, gave a speech about how, in the most extreme cases, there's a chance that AI kills everybody; he doesn't want to sound alarmist, but we should be prepared. Did he actually believe it was only the most extreme, the most rare cases? I forget the exact wording, but he said something like, "in the most unlikely outcomes." Did he actually think it was most unlikely, or did he feel he couldn't come out and say there's a big chance of this? We see people like Dario Amodei on podcasts saying something like a 5 to 25% chance of this killing us all. Then, when testifying in front of Congress, is he saying there's an extreme danger here, much higher than society should be tolerating, and that we need to take it very seriously? No, he says much milder stuff. It seems to me there's an "emperor has no clothes" sort of situation going on, where a lot of people have very dire concerns, but no one's coming out and saying it bluntly. Everyone else is looking around, thinking, "What if I say it bluntly? What will happen to me? I obviously can't say it bluntly. I would sound too crazy." That's a recipe for sleepwalking into disaster. That's a coordination problem. With our forthcoming book, we're trying to be the kid who says the emperor has no clothes.
SPENCER: Do you think the main concern is something like they worry people think they're a crank or they'll lose social status if they make claims that sound alarmist to people?
NATE: My guess is there are a lot of psychological factors; I'm not a telepath, so it's very hard to say. My guess is that some of it is they're worried people will write them off. Some of it is they're looking at what everybody else is doing, and a lot of people don't want to be the nonconformist. But yeah, I think a lot of it is they expect it to sound crazy.
SPENCER: What do we do in the current situation? Suppose you're right about everything you said. What's the solution to that?
NATE: Yeah, so humanity needs to back off from superintelligence, which doesn't mean we need to back off from chatbots entirely. There are a bunch of questions civilization should ask itself about how to integrate this new technology we have with our society, many of which are interesting and important. But superintelligence is a different ball game, and these companies are racing towards superintelligence. These companies didn't set out to build chatbots. Chatbots are a stepping stone.
SPENCER: A number of them explicitly say they're trying to build superintelligence. It's not just that.
NATE: They're very clear about it. They were founded for that purpose. Many of them didn't start out making chatbots. It's a stepping stone, but it's a different destination than the current chatbots. I think a lot of world leaders today don't understand that there's this different ball game. That's a grave national security risk, a grave global security risk. I would say even the word risk is sort of used for tail cases. I would say this is just a straight-up danger. When you're in a car careening towards a cliff, you don't say, let's talk about gravity risk. You say, we need to stop this. I think what the world needs is a global ban on superintelligence research and development. How do we get there? I think the first step is people understanding the problem. Superintelligence doesn't exist yet. We don't need to make it. World leaders need to understand that we're rushing towards this, that it would be really bad if anyone got there, no matter who. When you create a superintelligence, you don't thereby have a superintelligence that will do whatever you want. What happens when a superintelligence is created is now it has a plan. World leaders need to understand this. We need to collectively back off from this brink. As I said, the first step towards that is understanding it, having the conversation. I think it's great we're having the conversation now. I think a lot more people need to understand this issue and be talking about it. If anyone individually wants to help, there are a number of elected officials who have spoken behind closed doors, who have expressed some worries. Simple stuff, like calling your representative and saying you're worried about this, can empower them. It can help them feel less like this is crazy and more like they actually need to do something, and more like if they say their concerns out loud, it will help rather than get them left out of the room. I think we're getting there. I think people are starting to realize that there's this issue. I'm hopeful that my forthcoming book will thrust this conversation into the mainstream, but first and foremost, we just have to be talking about it. We just have to take it seriously.
SPENCER: Other than calling the representatives, what are some concrete actions you'd recommend people take who are concerned about this?
NATE: Calling representatives is a big one. Letting people know you're concerned starts with individuals having that conversation. In all sorts of conversations about AI, just saying, "Hey, I don't think we should rush towards superintelligence," brings that idea more into the set of ideas people feel can be discussed openly. One piece of advice from Tristan Harris that I think is great advice is to push back whenever someone says it's inevitable. Push back whenever someone says, "Well, we can't stop it now." We can back off from this brink. We can stop it. There have been times in the past when it looked like humanity had to rush to its own destruction and couldn't stop, and we did stop, the most obvious case being nuclear weapons proliferation. If everyone goes around saying, "Well, you know, the cat's out of the bag, nothing to be done, let's race ahead," then yeah, it's very hard to stop. But this is a choice, and humanity can make the right one if we take the choice seriously.
SPENCER: We even managed to stop supersonic flight, and that wasn't even dangerous.
NATE: Yeah, we stopped supersonic flight. I think we stopped nuclear power, which I think we shouldn't have. Humanity is great at stopping budding new technologies. It would be a real tragedy if we stopped all sorts of promising, budding new technologies that we should have let rip and then let rip the one that was actually lethal.
SPENCER: What would a barrier look like that still tries to preserve some of what's good about AI but protects the world?
NATE: It would not need to affect most consumers in any way they would really notice. Modern AIs are trained on extremely specialized computer chips that can only be made by very specific fabricators that exist in very few parts of the world. Training runs happen in data centers over very long periods of time, like a year. These data centers are huge, and they draw as much electricity as a small city. They are hard to hide, easy to detect, and easy to monitor. What it would look like, especially if the world reacts soon, is just monitoring the AI chips and the data centers, saying you can use these to keep running current AIs, but you can't use them to train new AIs. You would need some oversight, just like we have in nuclear power, where there's global oversight to make sure facilities are being used for peaceful purposes rather than dangerous ones. The world has done that before, and it would not be very invasive if we act soon. Now, as more time passes and the world gets better at building more and more computer chips, it gets harder and harder, because training can be done in smaller and smaller data centers. Training an AI today requires a data center that draws as much electricity as a city, whereas a human runs on 100 watts, about one light bulb. So it's clearly possible for this tech to advance to the point where it would be hard to detect where AIs were being trained. We shouldn't let it get there. If we stop today, it would be easy. If we stop in a year, it might still be easy. The longer we wait, the harder it's going to be.
SPENCER: Before we wrap up, I thought it might be fun to do a rapid-fire round where I ask you a few questions. You just give your quick take. How does that sound?
NATE: I'll do my best.
SPENCER: All right, sweet. Okay, so first question, what are your personal feelings on a day-to-day basis? Do you feel fear that AI is going to kill you and everyone you love?
NATE: I expect it to. I'm not always feeling fear. I sort of saw this bullet coming back in 2012. It was clear to me that humanity has a hard time solving challenges like this. I did a lot of my mourning then. There have been ups and downs, but mostly I mourn when there are negative surprises, and I celebrate when there are positive surprises. You can't let this sort of stuff get you down. You just got to live well.
SPENCER: If a pause or slowdown in AI development doesn't happen, what is your next most favored solution to try to solve this problem, even if it's much less favored compared to slowing down or stopping AI development?
NATE: I think humanity should be putting a lot of effort into augmenting human intelligence, probably adult human intelligence, given how little time we have. Maybe smarter humans could find a way out of the mess. That's actually also my favorite plan, even if we do have a global stop, because a global stop can't last forever.
SPENCER: Some people worry that the world is not doing well in other ways, and that maybe AI will be the way out. Maybe superintelligent AI could solve global poverty, solve the climate crisis, and so on. What would you say to those people?
NATE: Superintelligence pointed in the right direction could solve lots of problems. Superintelligence pointed in the wrong direction, or just any which way, would just kill everybody. It's a false dichotomy to imagine that we must either race ahead while we don't know what we're doing or never get the benefits of advanced tech. What we really want to be doing here is figuring out how to point AIs at the good stuff, and then going forward. To rush ahead now is just suicide.
SPENCER: I know that you don't think that technical AI safety is likely to succeed in the near term, where we just figure out a way to make these AIs safe. But if you had to point to one or two technical directions that you think have more promise than others, is there anything you'd point to?
NATE: I'd mostly want to point to enhancing humans. It looks a little bit to me like you have alchemists in the year 1100 who, for some contorted reason, if they try to turn lead into gold and fail, everybody dies. You shouldn't be letting them try to turn lead into gold. You shouldn't be asking, what's the best alchemical idea for turning lead into gold? Maybe we'll try that one. It is possible to turn lead into gold. It's possible with modern nuclear reactors to bounce neutrons around in the right way to turn lead into gold. It's not that it's technically impossible. It's that going to the alchemists and saying, "Okay, but what's their best plan?" is not a helpful exercise.
SPENCER: Some people are skeptical of pauses because they think, okay, even if we were to pause AI development, isn't some country just going to refuse to cooperate and do its own thing? Aren't all the companies that really wanted to build it just going to move to that country, et cetera? Essentially, aren't we going to have a big problem with defectors?
NATE: There might be a problem with defectors, and defectors would be putting everybody in the world's lives at risk. Nation states would have to monitor for that and put a stop to such things. This too has precedent in the modern world. Earlier this year, the United States struck Iranian nuclear facilities because it feared for what would happen to the world order if Iran got nuclear weapons. That should always be a last resort; diplomacy should always be tried first. For many years, cyber attacks were used against Iran instead, like the Stuxnet virus. But if there is some rogue, defecting nation trying to build a superintelligence, that is a grave national security threat and should be stopped, ideally with diplomacy, and if not, with Special Forces sabotage. You can't let that sort of thing be made.
SPENCER: You'd basically put that on par with the way governments have tried to handle rogue nations building nuclear weapons.
NATE: Superintelligence is more lethal than a nuclear exchange.
SPENCER: What's something that you could see happen in the world that isn't a full pause, but that would make you feel like the probability that we all die from AI is meaningfully less?
NATE: If all nation-states start realizing that they can't let anyone else build superintelligence and begin sabotaging each other's leading AI projects, that would buy time in the way a pause or a global, coordinated diplomatic stop would buy time, without needing a signed treaty.
SPENCER: What's one of the biggest disagreements you and Eliezer Yudkowsky, your co-author, have about AI?
NATE: The first one that springs to mind, and I'm not sure it's the biggest, is our theory of impact for the book: I think he was writing it somewhat more for policymakers directly, and I was writing it somewhat more for a broad audience, which isn't so much a question about AI as about how you shift people's minds about AI. In terms of AI specifically, I think he puts somewhat higher odds than I do on LLMs being able to go all the way, whereas I'm a bit more expecting that the world needs another technological insight. I'm sure there are other big disagreements, but those are the two that leap to mind.
SPENCER: Suppose you're right about AI, and also, humanity doesn't get itself on track. What do you expect to see weeks before the world ends? Do you expect nobody will notice, it just sort of looks normal, or do you expect things will be in utter chaos?
NATE: Yeah, this is the sort of thing I can't predict. It's like a six-year-old playing chess against Magnus Carlsen: I can predict that Magnus Carlsen wins, but if you ask me what the board looks like just before he wins, that's a much harder prediction challenge. I can make some stuff up, but it's way harder. The way things have been going, my guess is it all gets wacky somehow. Does it get wacky in a way that's flavored like superintelligence being around the corner? Not obviously. Maybe it gets wacky because there are a lot of AI girlfriends and AI boyfriends around, and maybe society is up in arms because people are protesting for AI rights, which, to be clear, I'm on the side of: machines could have rights, and you shouldn't abuse them. The fact that AIs might kill us doesn't mean we should make machine people and abuse them. But that's a whole separate conversation. Are those two exact things going to happen, the world having some big conflict around AI boyfriends and girlfriends and some big conflict around AI rights? Probably not those two exact things. It's probably going to get weird somehow. Will there be some warning before AIs really break out and start transforming the world? My guess is it won't be terribly visible, but these are just wild guesses. This is not a place where I think I have good predictive ability.
SPENCER: Right now, there are some people who've started hunger strikes at big AI companies. What do you think of that? Do you think that is good for the cause, or do you think that's bad for the cause? Are you unsure?
NATE: I think there are ways to do it that are obviously bad, and ways to do it where it exceeds my predictive ability to tell whether it's going to help. I would largely say it is the world at large that needs to be appealed to, rather than the heads of these AI labs, because any one AI lab shutting down would not stop the suicide race; there's a coordination challenge here. That said, I believe one of the hunger strikers was not asking for a shutdown but for a commitment to a conditional shutdown, which is one step closer to a real potential solution to the coordination problem. I do think the leaders of all the AI companies should come out and say: "What we are doing is very dangerous. I, as the leader of an AI company, think I'm a better bet than the rest of these guys, so I'm going to keep going ahead as long as they're allowed to go ahead. But I think the world would be better if we were all shut down." I think that statement follows from the beliefs of these guys who are saying there's a 5 to 25% chance this kills us all. I think those numbers are low. But for someone to be building a technology that they think has a double-digit chance of killing every man, woman, and child on the face of the planet, that's insane. If they're justifying it by saying someone else is doing it anyway, they should come right out and say, "I am gambling with all of your lives, and I think this should not be allowed, for me or for anyone else." I think they should be saying that, and it's reasonable to ask them to say it. Whether a hunger strike is an effective way to get there, whether it helps or hurts the cause, is beyond my predictive abilities. I'm not an expert here, and I just don't know how humans react to this sort of thing.
SPENCER: Are there commitments that frontier AI labs could make that you feel would be meaningful, or would reduce the risk to any significant degree?
NATE: A joint commitment to shut down. Beyond that, commitments to share data or be transparent? I don't think those help much. But I do think these labs should be saying much more plainly: we're gambling with your life because everyone else is allowed to, and we think we should all be shut down.
SPENCER: And what about a Manhattan Project-style approach, where we all group together and work together to make AI, and because we're not competing, we can do it more slowly and more carefully?
NATE: That still kills us, because that's the sort of plan that would work if humanity could learn about this problem by trial and error. If we had some "go back in time and try again" cards, that sort of plan would work. But with AI, you don't get to say, "Oh, whoops, our first theory was a little bit off; we missed some components and some factors there." Humanity's first theories are always a little bit off, and with AI, you don't get retries. That's what really makes this problem hard, and it can't be solved by grouping together in a Manhattan Project.
SPENCER: Final question for you, what would you want the listener to remember from this conversation?
NATE: I think I would want the listener to remember that superintelligence is a whole different ball game, and that humanity can stop the race, and that they can help by even something as simple as calling a representative and pushing back whenever someone says this is inevitable.
SPENCER: Hey, thanks so much for coming on.
NATE: Thanks for having me.