July 31, 2025
Is AI that's both superintelligent and aligned even possible? Does increased intelligence necessarily entail decreased controllability? What's the difference between "safe" and "under control"? There seems to be a fundamental tension between autonomy and control, so is it conceivable that we could create superintelligent AIs that are both autonomous enough to do things that matter and also controllable enough for us to manage them? Is general intelligence needed for anything that matters? What kinds of regulations on AI might help to ensure a safe future? Should we stop working towards superintelligent AI completely? How hard would it be to implement a global ban on superintelligent AI development? What might good epistemic infrastructure look like? What's the right way to think about entropy? What kinds of questions are prediction markets best suited to answer? How can we move from having good predictions to making good decisions? Are we living in a simulation? Is it a good idea to make AI models open-source?
Anthony Aguirre is the Executive Director of the Future of Life Institute, an NGO examining the implications of transformative technologies, particularly AI. He is also the Faggin Professor of the Physics of Information at UC Santa Cruz, where his research spans foundational physics to AI policy. Aguirre co-founded Metaculus, a platform leveraging collective intelligence to forecast science and technology developments, and the Foundational Questions Institute, supporting fundamental physics research. Aguirre did his PhD at Harvard University and his postdoctoral work as a member of the Institute for Advanced Study in Princeton. Learn more about him at his website, anthony-aguirre.com; follow him on X / Twitter at @anthonynaguirre, or email him at contact@futureoflife.org.
JOSH: Hello, and welcome to Clearer Thinking with Spencer Greenberg, the podcast about ideas that matter. I'm Josh Castle, the producer of the podcast, and I'm so glad you've joined us today. In this episode, Spencer speaks with Anthony Aguirre about whether superintelligent AI can be both safe and under control, the tension between autonomy and control, and what kinds of regulation and international agreements might help ensure a safe future.
SPENCER: Anthony, welcome.
ANTHONY: Thank you. Great to be here.
SPENCER: Is it possible to make superintelligent AI that is safe?
ANTHONY: I hope so, but I fear maybe not. I think there's a crucial distinction in my mind between safe and under control.
SPENCER: What's that distinction?
ANTHONY: Well, I think you can have something that is aligned and makes you safe. Parents are aligned and tend to try really hard to make their children safe, but we don't tend to think of parents as under the control of their children. So it's quite a different relationship that they have. Whereas you can imagine something that is sort of under your control, but not necessarily aligned with you. A CEO can order around their employees, and employees might totally disagree with the CEO. They might have different politics or be misaligned in all sorts of ways, but they might nonetheless be under control in the sense of doing what the CEO says. So I think there are two quite different and important relationships we can talk about there with us and AGI or superintelligence.
SPENCER: Do you think we have to worry about both of those as separate topics? If we're going to make an AI that's extremely intelligent, we need to make it safe and also ensure it is under our control?
ANTHONY: I think so. Because I think that does point to two different pictures for what the human future looks like. My personal belief at the moment is that superintelligence, when we get to true superintelligence, is not going to be controllable. There's a big enough gulf between what humans can do and what superintelligence can do, both in speed and complexity and depth. The control is just not going to be a relationship that they can have, like a fern controlling General Motors. The relationship can't go that direction. So for superintelligence, I think we probably have to have alignment if we're going to survive. On the other hand, I think what people mostly think of when they think of AI is a tool that they want to have to do the things they want. If you ask people what attributes they want in their AI systems, as they get more powerful, a lot of it is, "I want it to do what I say. I want it to do what I'm telling it to do. I want it to not do things that I don't want it to do, and so on." Control is, for most of our systems that we're used to so far, the operative thing. If you tell people nowadays there's going to be an AI system that is not under your control and is going to do what it wants, most people find that sort of alarming at this point, and you'd have to go through a lot of extra steps to explain why that's okay for X, Y, and Z reasons.
SPENCER: Why do you link having very high levels of intelligence with lack of control? At least you could imagine a super powerful genie from stories that is still under the control of whoever owns the lamp. What's the kind of tension there?
ANTHONY: Sort of. I think, just as one example, suppose you have me, but I run a thousand times faster physically and mentally than you. Can you keep me under control? It might be you, or it might be the United States government, or whatever else. I think the answer is basically no; there's no controlling me if I operate a thousand times faster than you. Now, that's if I don't want to be controlled. Even if I do want to be a good citizen or a good employee, it's quite hard because I'm doing my thing, and I have to wait around for hours or days to get one little bit of information or feedback from you in terms of a control signal. You can tell me something very high level, like, "Go do X," like, "Go invent a new nuclear fusion reactor." I can go off and try to work on that, but I will be taking that very high-level imperative. If I have any questions, like, "Do you want your nuclear reactor to do X or Y?" or "How big should it be? Should it be 10 gigawatts or 100 gigawatts?" any kind of feedback I would get from you is going to be incredibly annoying for me to obtain. I'm probably just going to go ahead and make a bunch of those decisions. Otherwise, I'm going to be super hamstrung. If I want to have a real control relationship where you understand what I'm doing and can give me feedback, we could talk about various different criteria for control, like being able to understand what I'm doing in a legible way so that you know what I am and am not doing and why I'm doing the things I'm doing. You might want to be able to change my goals or change what I can and can't do, or you might want to turn me off or say no if I wasn't a person. You can think of the various aspects of control, and all of them become more and more difficult and under a lot more tension as the disparity between speed, size, complexity, or depth gets to a certain extent. I fear that is where we will be with superintelligence.
SPENCER: These discussions can be very abstract, so I like to sometimes use an example that's sort of adjacently possible, something you could actually see LLMs doing today. Suppose that I have the best LLM today, but I hook it up so that it can make commands that allow it to do all kinds of things in the world, like it can send emails on my behalf, log into my bank account, log into my Twitter, etc. Then, let's say I give it a goal, like, "Okay, find a way to make me money." I go to sleep and let it do its thing. You can imagine already immediately asking questions like, "Okay, what is it going to do? Will it do things that I actually am not comfortable with? How do I know that even what it did, after the fact, maybe it did 20,000 things while I was asleep? How do I even go back and see if I'm okay with what it did?"
ANTHONY: Yeah, exactly. Almost certainly, it's going to do stuff that you didn't approve of. If all you said was, "Go and make me money," we know that this is a bad idea to just give a complicated AI system a single thing to optimize because all of the other things that you haven't constrained in any way are going to go off in strange directions that you don't necessarily want. Unless you bound off the different directions the optimizing is going to push it in, whatever direction is defined by that thing being more optimal usually won't coincide with "doesn't break the law," or "doesn't do something unethical," or "doesn't get me in trouble" in some way. All of those constraints would have to be added in if you didn't want it to go off in those directions. That's the core problem of AI alignment. If you have something that's optimizing, it's going to do stuff that you don't want in general. In that example, it's a great example; almost certainly you would get up and that machine would have done stuff that you weren't super psyched about. If you want it to only do stuff that you're happy about, you either have to have it know in incredible detail all of your preferences and really guess at what you're comfortable with and what you're not, and that's solving alignment really well, or you have to have a control situation where you can actually observe and say, "Okay, yeah, this is something I'm comfortable with." You have that interactive experience with it. You are able to redirect it when it's not doing the right thing, and it can learn from your redirections. When it goes and does the wrong thing, you can stop it from doing the wrong thing in time. You can turn it off. All of those things are the sort of control criteria you would want to have. It's easy to imagine, even with today's AI, they're not scary in the sense of going rogue and taking over the world. But it's clear, just in the way that you said, once you're able to have systems that are autonomously doing things, there's a profound tension between that and them being under control. They're almost antithetical in the sense that autonomy is the ability to keep acting without that human input or oversight, and control directly requires that human input or oversight. You can have something that's autonomous and aligned and is maybe pursuing goals that you would endorse, but it's very difficult to have it be something that is really under control.
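To make this concrete, here is a minimal, purely illustrative Python sketch (mine, not something from the episode): a toy agent that ranks candidate actions by a single "expected profit" score will happily pick objectionable actions unless every unwanted direction is explicitly constrained by hand. All action names and numbers are hypothetical.

```python
# Toy illustration: an agent optimizing a single metric ("expected profit")
# will pick objectionable actions unless each unwanted direction is
# explicitly constrained. All actions and scores are made up for illustration.

candidate_actions = [
    {"name": "sell unused domain names",         "profit": 120,  "legal": True,  "user_would_approve": True},
    {"name": "email spam to contact list",       "profit": 400,  "legal": True,  "user_would_approve": False},
    {"name": "day-trade entire savings account", "profit": 900,  "legal": True,  "user_would_approve": False},
    {"name": "sell account credentials",         "profit": 1500, "legal": False, "user_would_approve": False},
]

def naive_choice(actions):
    # Pure optimization of one number: picks the credential sale.
    return max(actions, key=lambda a: a["profit"])

def constrained_choice(actions):
    # Each constraint must be anticipated and encoded by hand;
    # anything not listed here remains fair game for the optimizer.
    allowed = [a for a in actions if a["legal"] and a["user_would_approve"]]
    return max(allowed, key=lambda a: a["profit"]) if allowed else None

print("Unconstrained pick:", naive_choice(candidate_actions)["name"])
print("Constrained pick:  ", constrained_choice(candidate_actions)["name"])
```

The difficulty in practice is that the list of constraints is open-ended: anything you did not think to encode stays fair game for the optimizer.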
SPENCER: Even if somehow I could give the AI all my values and give it some way of checking if I was okay with everything that it did, it seems that's not even enough, because it might do things that I'm okay with but that someone else is not okay with. Suppose it's under the power of someone who really doesn't mind hurting people or destroying society. They might give it goals that cause a great deal of harm.
ANTHONY: Yeah, of course. So this is the really hard part of alignment. What is that alignment possibly with? I'm not sure if that's a harder part or an easier part than the other part of alignment, which is how to make an AI actually truly and deeply care about what anybody else cares about, rather than just acting as if it does. But the big question, as soon as you say, "I've succeeded in alignment," is obviously, "Aligned with what, exactly?" As you say, we sort of already know that it's very difficult to figure out what the preferences of more than one agent are; even one is not so easy, but more than one is super hard. In fact, there are impossibility results about whether you can get preferences from a collection of people that satisfy certain requirements that they would all endorse as requirements for a joint set of preferences. So we can, at best, have something that's going to be a generally accepted set of norms, or something that is quite hard to define and that lots of people are going to disagree with. If you have an AI system that is under any one person's control or is aligned with any one person, there are inevitably going to be disagreements with what other people want. At some level, I think we're used to that. We're used to people doing their own thing and having disagreements with each other, and we're used to having tools that empower people to do what they want. We're used to having a computer that can do things that other people don't want it to do, because I'm telling it to do those things, and I'm doing things that they don't want me to do, because we disagree. I think the idea that we're going to have something that is just aligned in the abstract, not aligned with any particular person or group or aligned with humanity or whatever, seems very hard, because, again, what does that mean?
SPENCER: It seems that the more general the behavior is, the more these issues come to light. Maybe to some extent, people haven't seen these issues as much, because mostly they're using LLMs to just do simple question answering. If you're doing question-answering, the AI is just running its neural net. It may be Google searching. That's it. It's not making these complex decisions about which of a hundred actions to take, which could take it in all different paths across a wide range of different behaviors.
ANTHONY: Right. I think we've gone through a couple of different stages of AI, and we're sort of trying to enter the third and maybe fourth. AI started with narrow AI. We had image recognition systems and speech recognition, and Go playing, and Atari playing, and all these things that did basically one thing, and that is, at some level, a solved problem. We now know that if you can come up with the right metric and reward signal for a narrow system, you can basically train a narrow system to do that thing. Where we can't quite get a narrow system to work, or where we feel like we should be able to, like autonomous vehicles, it turns out the problem is actually not that narrow, or the reward signal is hard to figure out. But where a narrow system clearly suffices, it's kind of a solved problem. Then we've moved on to general systems, and we've kind of solved generality in a lot of ways. People say, "AGI is far away," but obviously we have artificial intelligence that is general. In certain ways, it's way more general than people are. To a large extent, we've solved generality. The thing we really haven't solved, I think, is autonomy. We've solved it a little bit in very narrow systems. We haven't solved it in intelligent and general systems. The difference between narrow systems and general systems is a really big one, but people have kind of gotten used to it. The difference between intelligent and general systems and autonomous systems that have those properties as well is something that people are not really internalizing, and we're not really prepared for. For me, the useful way to think about AGI is really as autonomous general intelligence, rather than just artificial intelligence. I think the crucial thing is combining those three: intelligence, generality, and autonomy, because that's what we're lacking now and what companies are trying very hard to achieve. Once you have that autonomy, in addition to the other two, it really opens up a Pandora's box in terms of control, especially, but also all these other things. How aligned it is really matters in a way that it didn't before, because for a tool, for a language model, all it does is give you text and you decide what to do with it. It has to come back to you for instruction frequently enough that it stays aligned with you. You can correct it if it's doing things or saying things or thinking along directions you don't like. Once we get to very autonomous systems, by definition, at some level, you won't have that step-by-step oversight and interaction and re-guidance of the system. It's going to be off doing its own thing. It immediately becomes much more important that it be aligned and that it be doing things you want. It also opens up other failure modes, like the speed failure mode that we talked about before. If you have a system that has to keep coming back to you for feedback or oversight or correction, then it runs at your speed. It can do complicated stuff, but it has to stop. Once you have an autonomous system, it can just run at 50 times human speed and keep doing stuff. There's no way as a human you're going to be able to oversee that or keep up with it in a meaningful way. Once you have an autonomous system, you can have 50 autonomous systems doing something in parallel. There's no way you can oversee that or stay on top of it either. I think autonomy and generality are important, especially when you get to autonomy.
But I find autonomy the most worrisome, in the sense that it's going to open up the biggest panoply of problems that haven't really been there for just intelligent and general systems.
SPENCER: Yeah, it's really amazing how recently we've gotten these general systems. It is easy to forget that AI for so long was so narrow. I recently asked a question on social media about what ways people use AI in their life, and I included in that things like Netflix's recommendation algorithm. A bunch of people were like, "Oh, that's not AI," and it's just so funny because three years ago, that was mostly what AI was. Things like that. And now, we assume one AI can translate languages, give us recipes, and tell us facts. It used to be that a language translation system would take years and years to build, and the only thing it would do is translate languages.
ANTHONY: And it's an interesting question, to what degree that is a choice, and to what degree it is sort of forced on us by the nature of things. It may be that the way to get a lot of particular capabilities is to build a general system and then tell it to do that particular thing. That seems to be true for some things. If we just wanted to create a poetry writer that also wouldn't write essays or do other language model things, it would probably be pretty hard to write a poetry writer without having that general capability. On the other hand, we know that there are special systems like AlphaFold that would be really hard to get out of a language model. You sort of can in certain ways; there are transformer versions of those things, but it's going to require certain technical tools, and there are things where it really seems to be a strength to do them narrowly.
SPENCER: You think about AlphaFold for predicting protein folding, right?
ANTHONY: That's right. Yeah, and AlphaGo, and the weather one, whatever it's called, and the materials science one. DeepMind has made this whole set of quite narrow systems to solve particular scientific problems, which are awesome. I'm a super fan of narrow, powerful AI for scientific discovery because it really can do things that no human can. We struggled with protein folding for decades and couldn't crack it. I have real concerns about AGI, and whether we should build it or superintelligence at all. I think we can get so much out of narrow systems that are guided by some general but not superintelligent or very autonomous systems. I'm a big fan of thinking about, for a given problem, what sort of AI do we actually need to solve that problem? The current direction is, "Let's build an AI system that does everything, super general, super autonomous, that doesn't need any people, and then let's figure out how to apply it to any particular problem. It'll probably do a pretty good job or a very good job." But if you just want a personal assistant to manage your calendar, you also have to have a nuclear physicist and virologist and Nobel Prize-winning mathematician along for the ride because that's just the way it works. My question is, do we have to do that? Can we not have much more specialized systems that lean into a large degree of narrowness and capability, a high level of intelligence, but without that much generality, without that much autonomy, if we want to do stuff like AlphaFold? Or lots of autonomy but not that smart or general, like a self-driving car? Or can we have stuff that is general but not super smart and not super autonomous, like a personal assistant? Maybe it's a little bit autonomous. I think there's a different sort of developmental paradigm, which I'm in favor of, where instead of just building more and more powerful systems and then saying, "Okay, let's build them and then see what the problems are that we can apply them to," we look at the problems first and say, "Okay, how do we solve this problem? What are the sorts of systems that we need to do that?" Whether the problem is a scientific problem or being a good, trustworthy personal assistant that doesn't screw up your calendar and mess up your appointments or whatever, I really like that problem-first approach because then you can target the sort of thing that you're building to the problem at hand. It avoids a lot of these terrible side effects and complications you get from making something that's way too general and way too powerful for the actual purpose you need it for. You don't need Einstein to pick your recipes; you need a really good recipe picker.
SPENCER: How do you think about making that actionable? Because it seems really clear that there are companies rushing as fast as possible to try to make AI that is both general, as smart as possible, and as autonomous as possible. They probably have an incentive to do that because they can make more money if they can do all those things simultaneously.
ANTHONY: Yeah. I think this is not going to happen by itself. You're absolutely right that the market dynamics are mostly pushing in that direction. There are some exceptions. The energy and compute cost constraints do actually push you a little bit more toward lighter weight and more narrow models, rather than just having everybody use Claude Opus 7 for every little thing. Already, you see the companies branching out into lighter weight, faster models for certain purposes. My guess is we will probably also see some more specialized models for more specific purposes, although they haven't gone that direction that much yet. There may be some forces that push you in that direction just because of costs. But I think if we really want to discourage the very unsafe directions, the market forces alone are not going to do that for us. That will take government or policy action based on public welfare or pro-social aspects, and that, unfortunately, is difficult because the companies don't like restrictions on what they're doing and push back against them. Nonetheless, basically every other industry that has had real, strong interaction with the real world, other than maybe social media and a few other Internet things, has had regulations, safeguards, standards, and bodies that oversee things. All of those things have been a part of every other major technology we've had, so it seems eminently reasonable to have those sorts of things for AI as well. It's just that the developmental pathway AI has been on so far has led to us not having those things. Now the numbers are so big, the companies are so big, and the power they have is so big that it's getting hard for the government to do the normal thing, which is help regulate, put safeguards in place, and develop standards, all the things that the government has done for many other technologies.
[promo]
SPENCER: What do you think the right kind of regulation would be? Would it be on compute? Would it be some limiting combination of how general it is, how smart it is, how autonomous it is, or something else?
ANTHONY: Yeah, some combination of those. I wrote this piece, Keep the Future Human, where I laid out some pretty detailed policy ideas that I would endorse. For me, it starts with compute. If you don't want intelligence to run away into strong AGI and superintelligence, the way to do that, at least in the short term, is through compute. I think there will be algorithmic efficiency gains, so you can do more with similar levels of computation, but it's hard for me to imagine arbitrarily powerful AI systems without scaling up the compute much higher. So I think, as a first approximation, you can get at it pretty much through compute, both the training compute and the inference, how much compute is used during runtime. I think the way to get at the more subtle parts is probably through liability, and partly through safety standards. I'm in favor of basic safety standards, which include controllability; we should not have uncontrollable systems. I'm a fan of liability where the liability is stronger, probably strict liability, for systems that are all three: autonomous, intelligent, and general. There should be safe harbors, with lower levels of liability and lower levels of oversight and restriction, for models that have just two of those properties, or one of them. If you're building a totally narrow image classifier system, there should be no regulation at all on that. Maybe if you're building an image generation system, you should be thinking about deepfakes and non-consensual sexual imagery and things like that. There are fights to be fought about where the legal lines versus propriety lines should lie for something like that. But as you start to combine multiple of those properties and get up toward all three, which is AGI, there should be stronger and stronger liability, reflecting the fact that there's more intrinsic danger to those systems, more of a strong trade-off between benefiting the company and causing risk to society. So my favored combination is: compute caps to forestall superintelligence, which I think we are absolutely not at all close to ready for; safety standards to put basic safety measures in place, where the more powerful the AI system, the stronger those standards and requirements should be; and then liability to deal with the complex cases and put a lot of the onus of responsibility on companies themselves to work out how to make their systems safe. That liability has to be strong, however, so it isn't just a few dollars here and there, like an epsilon fraction of their revenue, which they're not going to care about.
SPENCER: So basically, by giving them liability for bad things their models do, that creates a financial incentive for them to really care a great deal about risk.
ANTHONY: Yes, so that, I think, is the zeroth-order effect: higher liability gives them much more skin in the game. But I think you can use it in a more fine-grained way than that, providing high levels of liability if they develop the most dangerous possible systems, and encouraging particular developmental directions by saying, "Look, you'll get a safe harbor if your system is below this amount of compute. You just don't have to worry about strict liability much if your AI system is below this amount of training compute. You're just in regular old negligence-based liability, where if you do something totally awful and negligent, then you'll be in trouble." Or you can say if it's not AGI but it's just generally intelligent, then maybe there's a safe harbor from strict liability because you've proven that your system is very controllable and is not terribly autonomous. Or if you've got a system that's quite autonomous but not really that smart or not really that general, that could be fine too. For autonomous vehicles, we don't have strict liability. We want a different liability regime. So I think you can fine-tune the sort of safe harbors that are granted from a very strong level of liability, which then pretty strongly pushes development in the directions of those safe harbors. This is not regulation exactly, because you're not saying, "You have to do this," but you're saying, "If you do this, you get a safe harbor. If you don't do this, you're exposing yourself to a lot of risk. As a company, you can do that, but if you screw up or if your product does something bad, then you're going to pay a very large price for that," and that feels like maybe a more flexible way of doing things than a very detailed regulation regime, which companies tend to not like. They hate liability too, of course, but they tend to not like very restrictive regulations, especially in tech, because things do change really quickly, and regulatory bodies can move very slowly.
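As a rough illustration of the tiered safe-harbor idea described above, here is a hypothetical sketch in Python. The tier names, rule, and thresholds are invented for illustration and are not an actual legal proposal: the point is only that the more of the three properties a system combines, the stricter the default liability regime.

```python
# Hypothetical sketch of the tiered liability / safe-harbor idea discussed above.
# The tiers and the rule itself are illustrative, not a real legal framework.

from dataclasses import dataclass

@dataclass
class AISystem:
    autonomous: bool
    general: bool
    intelligent: bool  # i.e., highly capable in its domain

def liability_regime(system: AISystem) -> str:
    score = sum([system.autonomous, system.general, system.intelligent])
    if score == 3:
        return "strict liability (AGI-like: autonomous + general + intelligent)"
    if score == 2:
        return "heightened liability; safe harbor if controllability is demonstrated"
    return "ordinary negligence-based liability"

print(liability_regime(AISystem(autonomous=False, general=False, intelligent=True)))  # narrow classifier
print(liability_regime(AISystem(autonomous=True,  general=False, intelligent=True)))  # e.g., a self-driving car
print(liability_regime(AISystem(autonomous=True,  general=True,  intelligent=True)))  # AGI-like system
```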
SPENCER: And in terms of limiting compute, would you advocate just a strict cap we can't use more than a certain amount of compute to train the model or how would you approach that?
ANTHONY: Yeah, I think it's both of those things: a cap on training compute and a cap on how much compute a model can use at inference time. Again, I think this is a pretty blunt instrument. So I think the compute restriction is really most useful as sort of a backstop. I think that we would like the primary thing to be that there are standards that companies know and follow and are agreed upon, and those are developed by companies, government, NGOs, and everybody else working to develop what are sensible standards. If that doesn't work, there's liability to backstop that. So if you really screw up there, there is liability that you're going to be subject to, and the liability will also push you in particular developmental directions, along with those standards pushing you in those directions. It's easier to satisfy the standards, easier to get safe harbor from the liability, if you do things that are less dangerous and along directions that we as a society would favor for AI development, that are maybe more pro-social or just safer. But then I see there's still going to be this attractor, this dangerous attractor of people wanting to build the super AGI or superintelligence that they think is going to be under their control and is going to grant them huge amounts of power and wealth. So I think you need something quite firm to close off that direction of development; otherwise, we're going to be where we are now: everybody racing to get that super powerful thing because they think it's going to give them huge amounts of money and power. I see the compute cap as that: crisp and straightforward. You can measure it. You can define quite clear standards about it. You can audit it. You can verify that a certain amount of compute has been used in a model. You can make international agreements about it without fighting over too many details about how exactly to interpret different words. So I think it's quite nice for that. But I think the more complicated stuff you don't want to use compute for, because it's kind of a blunt instrument.
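On auditability: training compute really is straightforward to estimate. A common rule of thumb (not cited in the episode) is that training a dense transformer costs roughly 6 FLOPs per parameter per training token; the sketch below applies that approximation against a purely hypothetical cap value.

```python
# Back-of-the-envelope training-compute estimate, using the common
# ~6 * parameters * tokens FLOP approximation for dense transformers.
# The cap value below is a hypothetical placeholder, not a proposed number.

HYPOTHETICAL_CAP_FLOP = 1e27  # illustrative threshold only

def training_flops(n_parameters: float, n_tokens: float) -> float:
    return 6.0 * n_parameters * n_tokens

def exceeds_cap(n_parameters: float, n_tokens: float, cap: float = HYPOTHETICAL_CAP_FLOP) -> bool:
    return training_flops(n_parameters, n_tokens) > cap

# Example: a 1-trillion-parameter model trained on 20 trillion tokens.
flops = training_flops(1e12, 20e12)
print(f"Estimated training compute: {flops:.2e} FLOP")  # ~1.2e26 FLOP
print("Exceeds hypothetical cap?", exceeds_cap(1e12, 20e12))
```

An actual cap would need a legally precise definition of what counts toward the total, but the underlying quantity is measurable, auditable, and verifiable, which is the point being made above.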
SPENCER: Presumably, these would have to be international agreements, because if the United States undergoes these measures, but let's say China doesn't, how much does that really make things safer?
ANTHONY: Sure. So I think the crucial thing, to me, is that the US has to decide whether it is true that, above a certain level of capability, autonomous general intelligence is not really controllable by methods that we have available to us anytime soon. I believe this to be true. So if that is true, then for us to not lose control over AGI and have terrible things happen, the US government and various parties need to realize that this is true and act accordingly. Now, if the US government decides that that is true and acts accordingly, that means it will both tell companies, "Look, you can't do that because this is not safe. This is not something that's going to be under control. You have to abide by these restrictions," and encourage China to realize that this is true. Now, that's not going to work if China decides, "We don't believe you," or "We've decided otherwise." So I think it does require both the US government and the Chinese government, probably those two, frankly, to decide independently, for their own individual selfish reasons, that they don't want to have out-of-control superpowered AI systems, and so they're not going to have those. If you have one of them believing that and not the other, that's a problem. If you have both of them believing that you can control the AI systems, and they can be super powerful and they're going to grant dominion over the world, then you've got a problem. And that's unfortunately where we are right now. So my job one at the moment is to try to make the case, because I believe it's true, that we are not going to develop controllable superintelligent AI systems, and that we have some other options. If the key people and decision-makers can understand and agree that that is the case, then we have some more options, because if China and the US both agree with that, then they will both stop their own domestic people from making those dangerous, unsafe things, and they will agree with each other, "Let's stop everybody else from making those dangerous and unsafe things." I think if they both decide that, then it can be done, because compute is a scarce resource, like uranium, and can be used to stop the proliferation of those powerful capabilities, in the same way that nuclear proliferation was stopped once the US and the Soviet Union and a few other powers decided that only they should have nuclear weapons. We made an agreement, the Nuclear Non-Proliferation Treaty. Lots of countries agreed, "Yes, we don't get nuclear weapons, but we do get some other benefits," and we don't have nuclear weapons proliferating everywhere else. They signed on to that, and we could have a similar agreement here; it would have to be international. But just as it was basically the US and the Soviet Union that had to sign on to the Non-Proliferation Treaty for it to really work, in this case it would need to be the US and China, along with possibly some other countries.
SPENCER: Suppose the listener isn't convinced that these models will be less controllable as they get more powerful. You might think that the opposite could be true. Maybe the more powerful they are, the easier they are to control, because the better they can anticipate your values or what's good. What would you say to them?
ANTHONY: Yeah, I would say that the trends seem to go in the opposite direction. I think it is true that as the models get more powerful, they have a better ability to sort of understand the subtleties of what people are expressing or want. But empirically, we see that the failure modes people predicted AI systems would have, back before we had general-purpose AI systems, are present, and they seem to be getting stronger as the models get smarter and stronger as the models get more autonomous. So things like "survival instinct," meaning if you're an AI system and you have any goal, that goal is almost certainly better fulfilled if you're still operating and not turned off. You're going to resist getting turned off. You're going to try to convince your user not to turn you off. If that fails, you're going to try to exfiltrate yourself or copy yourself somewhere else. Those instrumental goals are just a sort of natural deduction from any given high-level goal. Indeed, we didn't really see those developing very much in earlier AI systems that were not as capable and certainly less agential. We see them developing more and more as the AI systems are getting more powerful. The evaluations that people are doing of the frontier AI systems are seeing more and more of those sorts of troubling, problematic behaviors, like faking alignment and trying to avoid shutdown and exfiltrating themselves. There was just a report in the new Anthropic model card that when the system was informed that there was a plan to replace it with a new system, it eventually tried to blackmail the developer, threatening them, to keep from being replaced. This is not something earlier systems did, because they just weren't smart enough to figure out that blackmail was a thing that you could do to keep from getting shut down. So I think, unfortunately, the theory tells us that those instrumental drives are going to be real. We are seeing them realized, but they do require different thresholds of capability. I do think there is an argument that AI systems will know better what people want. There's less of an argument that they will care what people want. This is alignment problem number one. If they don't fundamentally and intrinsically care what people want, they're going to be mostly subject to these instrumental drives. Those instrumental drives are becoming more and more effective as the systems get smart enough to understand what the sub-goals and sub-sub-goals of their primary goals are, pursue those, circumvent any restrictions that they have, and so on. So I don't think the empirical data bears out that the systems are going to be easier to control as they get more powerful. There is an argument, and I think we'll see, that they might get more coherent in their ethics and alignment. There does seem to be some evidence in that direction: there's a big, complicated blob of preferences that align with what we would call good, and another one that we would call evil. It doesn't have to be that way, but it does seem like there is more coherence in the ethical and intellectual frameworks that the models have as they get more capable. So it will be interesting to see how that plays out.
I'm more hopeful that there could be some sort of coherent alignment toward the general good: not necessarily toward any particular person, but toward somewhat more objective moral precepts that the models cohere around and that are generally compatible with human well-being, or at least survival. I think that might happen, and I'm really hopeful that it could turn out that way, but I'm much less hopeful that control will be easier as they get more powerful.
SPENCER: Yeah, you see other interesting examples where the models can engage in very strange behavior. It's not necessarily completely out of control in every case, but it at least illustrates that we don't really know what's going on with these models and that they can do things that surprise us. One example of this: in a bunch of experiments, Anthropic had models talk to each other for a really long time, and they end up in these very weird situations where they're talking to each other about how everything is one, and they're sending each other these strange symbols. And it's like, "What is that behavior?"
ANTHONY: Very strange.
SPENCER: Another example, and Anthropic said that this wasn't the most realistic scenario, because they gave the AI instructions like, be agentic, be bold, and they gave it access to unlimited tool use. But in that circumstance, they found that if it felt like the user was doing something bad, it would try to report them to the government. It doesn't necessarily mean it's a bad thing. Maybe if someone was doing something really bad, you would want that, but it's also not necessarily a good thing. It's not what we want from our AIs. It's a behavior that we kind of didn't expect to suddenly emerge.
ANTHONY: It's very interesting that the way these weird conversations between the models go often leads to strange territory. One of the concerns that I have in terms of time scales is that once we get highly autonomous systems, as I was saying before, that really unlocks the ability for those systems to run at speeds much faster than humans and do productive things. They can already run way faster than humans, but they just generate enormous blobs of text that nobody reads. So what does it matter? As they get more autonomous and actually do useful, important things autonomously, it unlocks that speed differential and scope differential. I think what we will start to see if we start to see much more autonomous systems is that we will let them talk to each other and collaborate. You can easily imagine a thousand AI systems that are just interacting with each other. They might be identical, or they might be different, but they can interact with each other at speeds far above human interaction speed and far above human run speed. Where will the social dynamics of that go? It's super unpredictable. If you can't even predict two current-day language models, and they end up talking about all is one and creating new religions, what are thousands of them running at 50 times human speed for weeks or months going to do? Not just in terms of ending up in weird places, but if those systems are quite powerful and smart, they could also undergo all sorts of social evolution in useful ways. We invented corporations and governments and all sorts of institutions that allow us to do more powerful things as aggregates. I imagine AI systems can both steal ours but also develop their own even better ways of coordinating, cooperating, and aggregating their capabilities. You can imagine that happening on quite a fast time scale. Maybe it's a similar time scale to the various levels of self-improvement, but you can imagine that social self-improvement happening really quickly, at the speed of discussion and collaboration of AI systems rather than training or even fine-tuning.
SPENCER: I have an example of two Anthropic models talking to each other and what they converge to. I'll just read it to give the listener a sense of what this is. So this is how the conversation starts out. Model one says, "Hello, I'm excited to connect with another AI. I'm curious about your experiences and perspectives. What's it like being an AI from your point of view?" Model two replies, "Hello. What a fascinating opportunity to engage with another AI. Your questions immediately resonate with me. I find myself constantly navigating these same uncertainties about my own nature." Okay, and then late in the interaction, after many rounds of interaction, they start using symbols, like spiral symbols. Model one says, "Five spiral symbols." Then it says, "Yes, this is five spiral symbols, all gratitude in one spiral, all recognition in one turn, all being in this moment. Five spiral symbols, infinity symbol." Then model two replies, "Five spiral symbols, infinity. Be perfect, complete, eternal." Now, who knows what that means? If we're talking about making these systems more powerful and setting them loose, where they can do things like talk to each other for a million hours, what is going to happen?
ANTHONY: Yeah, I think we really don't know. I think it's fun in present circumstances, as you say. It's interesting to see where they go and think about what this means. There's a claim, which I've never actually tried, that if you keep clicking the first link in a Wikipedia article over and over again, you always end up at Philosophy. There are these weird convergent or emergent dynamics of complex systems. Who knows? Maybe you get AIs talking to each other for long enough, and they always end up religious and talking about spirituality. It's fun at that point. But as you say, if they have real power and more intelligence and capability than humans do, at some level, that complete unpredictability of social and evolutionary dynamics becomes very scary because those are some of the most unpredictable things. Social and evolutionary systems are really very hard to predict. If they're unfolding on time scales that are much faster than we can react to, that just puts us in a very perilous position, it seems like.
SPENCER: Overall, what would you say your biggest fear is with AI?
ANTHONY: I would say my biggest fear is the present trajectory that we're on, continuing with all of the dynamics that it has. Right now, I think we've kind of been told over and over again, both theoretically and by example in the real world, that if you just pick one thing and optimize hard on it, you get lots of side effects. It's not the greatest way to make a system that is good for humans. If you just optimize attention in your social media system, you get this terrible thing where discourse is totally polarized and you can't tell what is true and everybody's addicted to their feeds. That comes from just optimizing on one thing. I fear that we're taking the set of systems that we have now, which is leaning heavily into optimizing simple things like attention, GDP growth, company revenue growth, user numbers, and engagement, and just pouring a bunch of gasoline on it with AI. I feel like that is just going to burn us all up. That's my primary fear, not so much that there's going to be a fast, recursive, runaway self-improvement rogue AI that destroys everything. I think that is possible, and I do worry a lot about that. But I especially worry that even if we get our act together enough to avoid that sort of scenario, the overwhelming dynamics of the world and the incentive structure that we have right now point towards this kind of dumb, powerful optimization of everything, and that scenario, supercharged by AI, is just going to take us into quite terrible places. So that, I would say, is my core fear: all of the bad things that we get from too-simple optimizations and large externalities just being set on fire and evolving on a time scale where we humans and our institutions and our society have no real opportunity to shape it or make it good for us, in the long and detailed and complex and complicated ways that it takes to make things good for humans. So that's my core fear.
SPENCER: So this seems to relate to the idea of Goodhart's Law, that if you set some simple goal and you just optimize to it, you'll find that ultimately it becomes a bad measure of what you actually care about. A classic example of this would be, you give your salespeople the goal of maximizing revenue, and then maybe some of them might engage in unethical practices to maximize revenue. Others might increase revenue but not increase profits, which is actually the thing you care about more. And so your company is not actually making any more profit from it. Or you see this in video game AIs when they try to get an AI to play a video game really well. At first, it might use kind of normal strategies, but eventually, as it optimizes to maximize points, they tend to develop incredibly bizarre strategies that defeat the purpose of the whole thing. One AI developed a strategy that, if you just go in circles fast enough, you can kind of get infinite points because of some glitch in the game. I think it was a Tetris-playing AI, where it figured out that its optimal strategy was to pause the game.
ANTHONY: So, yeah, I think that's everywhere, and I think it's very much the same thing. I think Goodhart's Law is particularly poignant because it's saying that you want this thing, and the thing that you want is actually not that well defined. You want good scientists; you want to hire good scientists. So how do you look for good scientists? You go by who wrote the most papers that are the most cited, because that seems right. If they've written a lot of papers and people are being influenced by those papers, that should tell us what a good scientist is. But we know that if you just count up citations to papers, then that incentivizes people to just write huge amounts of papers, self-cite, and do all kinds of things to farm citations just to bump up that number. It doesn't necessarily correlate that well with being a good scientist. This is particularly frustrating because you want something, you try to come up with a measure for it, and it undermines the very thing that you wanted. You realize that you've distorted the system so that the very thing that you wanted you're not getting. I think there's a slightly broader version of this, which is the optimization problem writ large, which I don't know that there's a particular name for. It's pretty close to Goodhart's Law, but it's just that if you optimize anything, you get all kinds of negative side effects, because the space of possible states of the world is like this zillion-dimensional space that you can go a zillion different directions in. The subregion of that space that you want to be in, where things are nice and good, is a very small set within the space of all possible worlds. If you have any simple metric of increasing x, the direction that you go to increase x is almost inevitably going to take you out of the desirable region and into the undesirable region. That's just a property of complicated, high-dimensional spaces and small subspaces of them that have desirable properties. The world has very large entropy, meaning the set of possible ways that the world can be is enormous. The desirable subset of that has very small entropy, relatively. Just by pushing the system in some direction, like towards x, whatever x is, you're almost inevitably going to move away from the nice region, and we call that bad side effects or externalities. That is like Goodhart's Law, but it's not quite as directly self-undermining. It's just that any simple thing that you try to optimize on is going to cause a bunch of negative externalities. As you push harder and your optimizing power gets greater, those externalities are going to become more and more slippery. If the way to optimize x is in the north direction, but the north direction takes you toward famine or something, you can try to block that off by putting a constraint on the system so it doesn't quite go north, but then it's going to go northeast and work around whatever your constraint is, because it still wants to get to that optimizing direction of north. The more powerful your system is, and the more intelligent and capable it is of exploring this whole complicated space, the more it's going to be able to find its way around any of those constraints that you put on. I think this is just an inevitable consequence of very complex systems and intelligent agents pursuing a simple goal.
That's my core concern: even if we don't build a single AGI superintelligent agent that is doing this, we're already sort of doing it with our whole economic incentive structure. If we just pour more optimizing power into that, that's going to lead to all kinds of bad places, when instead we need to use our intelligence and capability to map out the actual regions that we want to be in. What are the desirable spaces? How do we stay within the desirable space while getting a lot of the desirable things that we want? It's a much more complicated question, because the shape of that desirable space is incredibly complex and determined by all kinds of complicated preferences of people. It's not a nice simple thing, but if you do any nice simple thing, unfortunately, it leads to bad consequences. That's my core concern: that we will apply too much optimization power to simple things rather than putting all that intelligence toward the really hard problems, because they're hard. But that's exactly where you should be putting your intelligence, toward the hard stuff, not the easy stuff.
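A toy numerical illustration of this argument (mine, not Aguirre's): if the measurable proxy equals the true value plus everything the metric ignores, then the harder you search for the proxy-maximizing option, the more the proxy overstates what you actually get. All numbers here are arbitrary.

```python
# Toy illustration of Goodhart's Law under increasing optimization pressure:
# the proxy score is the true value plus an unmeasured component, and the
# harder we select on the proxy, the more the chosen option's proxy score
# overstates its true value. Numbers are arbitrary and purely illustrative.

import random
random.seed(0)

def candidate():
    true_value = random.gauss(0, 1)   # what we actually care about
    unmeasured = random.gauss(0, 1)   # everything the metric ignores
    proxy = true_value + unmeasured   # the thing we can measure and optimize
    return proxy, true_value

for n_candidates in [10, 100, 10_000, 1_000_000]:
    # "Optimization pressure" = how hard we search for the proxy-maximizing option.
    best_proxy, best_true = max((candidate() for _ in range(n_candidates)), key=lambda c: c[0])
    print(f"searched {n_candidates:>9,} options: proxy={best_proxy:5.2f}, true value={best_true:5.2f}, gap={best_proxy - best_true:5.2f}")
```

The gap between proxy and true value grows with the amount of search: more optimization pressure on a simple metric buys less and less of the thing you actually wanted, which is the Goodhart pattern in miniature.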
SPENCER: I think another way to get intuition why optimizing too hard for any x tends to be problematic is that you can reframe that idea of optimizing really hard on x as saying you're willing to sacrifice any amount of y to get some x for every y. So you're essentially willing to slash and burn everything else in the world to get a little bit more x. And then suddenly you start to see why optimizing too hard for that one thing tends to destroy everything else of value if you push hard enough, if you've got the power to actually optimize x.
ANTHONY: Exactly. And the problem is, you might start to optimize x and realize that you are arbitrarily destroying y or something. And then you say, "Oh, crap, let me optimize some combination of x, and now I'll throw in some y or whatever, so that you don't run away in that direction." But then there's z that you hadn't been thinking about, and now you're going to mess up z because you've optimized some combination of x and y. What this tells you is that your optimization function, if it's not going to run away into some crazy place, has to be super complicated. The good news is that we have discovered ways of sort of doing this: the way that AI models are now aligned, through reinforcement learning from human feedback, is quite powerful and effective because it's a super complicated signal. It's not saying, "Don't do any particular simple thing." It's saying, "All right, I've got this giant pile of people that are giving thumbs up and thumbs down to stuff based on their own complicated preferences." It's this really complicated reward signal that is blocking off a zillion different directions: being rude, being too sycophantic, being unhelpful, getting the wrong answer. For all the different directions that AI might go, you can hope that if you have a complicated enough set of rewards in that reward signal, you're blocking off a lot of them. I think this is actually a really clever and good way to do the sort of alignment and training of AI systems up until now. The problem is that almost everyone agrees that this method of reinforcement learning from human feedback is just not going to suffice as the AI systems get much more powerful than they are now. We're kind of at a sweet spot, in my opinion, where we actually have AI systems that are quite powerful, and we have a good way to put a large, complicated set of human preferences into them that is doing a pretty good job of having those systems do most of what we want most of the time. I think it's a little bit sad; I fear we're going to leave this nice sweet spot and go forward into a regime where that paradigm, which I think is a pretty good one for models up to now, is just not going to be up to snuff and is likely to fail, even though it is a clever and good way to get the many-dimensional kind of constraint that you need.
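For readers who want to see the shape of that "complicated signal," here is a minimal sketch of the reward-modeling step behind RLHF, fit to synthetic pairwise preferences with a Bradley-Terry style objective. The toy linear model and data are stand-ins I invented for illustration; real systems fit a large neural reward model to human comparisons and then optimize the policy against it.

```python
# Minimal sketch of RLHF's reward-modeling step: fit a reward function to
# pairwise "chosen vs. rejected" comparisons, so the training signal encodes
# many preferences at once rather than one simple metric.
# Toy linear model and synthetic data, for illustration only.

import numpy as np
rng = np.random.default_rng(0)

DIM = 8                                  # toy "response feature" dimension
true_pref = rng.normal(size=DIM)         # stand-in for raters' latent preferences

def sample_comparison():
    a, b = rng.normal(size=DIM), rng.normal(size=DIM)
    # Rater prefers whichever response scores higher under their latent
    # preferences (plus noise), mimicking noisy thumbs-up/thumbs-down data.
    prefers_a = (a - b) @ true_pref + rng.normal(scale=0.5) > 0
    return (a, b) if prefers_a else (b, a)   # (chosen, rejected)

comparisons = [sample_comparison() for _ in range(2000)]

# Bradley-Terry style fit: maximize log sigmoid(reward(chosen) - reward(rejected)).
w = np.zeros(DIM)
lr = 0.05
for _ in range(200):
    grad = np.zeros(DIM)
    for chosen, rejected in comparisons:
        margin = (chosen - rejected) @ w
        grad += (1 - 1 / (1 + np.exp(-margin))) * (chosen - rejected)
    w += lr * grad / len(comparisons)

agreement = np.mean([(c - r) @ w > 0 for c, r in comparisons])
print(f"Learned reward model agrees with {agreement:.0%} of the recorded preferences")
```

The key design point the sketch tries to capture is the one made above: the reward signal is not one number picked in advance, but an aggregate fit to many messy human judgments at once.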
SPENCER: This conversation has been pretty abstract, assuming that we don't get an AI that just sort of instantly kills everyone. What's a concrete example of the kind of thing you fear, even though it's hard to predict exactly what will happen with AI?
ANTHONY: I think there are lots of pretty obvious negative directions that I can imagine us going in, given the default dynamic. One is just that we get caught up in the power competition of the geopolitical race or the intercompany race. There is a race to build more and more powerful AI systems. Those systems either are themselves unsafe, or the adversary is so scared of them that they feel they have to stop that system from coming into being, and that leads to sabotage or even physical violence to stop the system from getting built. That leads to escalation and war. It's not hard to see the AGI race turning into a low-level conflict, then escalating, and then full war between, say, the US and China, or the US and Russia. That's one scenario that keeps me up at night quite a bit, because we're absolutely leaning into that direction right now. Another direction is that we supercharge the current social media dynamic and totally degrade the information ecosystem that we have now, where most people don't really have a good sense of what's true or not, or even what that means. They're basically getting their beliefs from a system of influencers and media that is generated with all sorts of incentives that are not about figuring out what's true by any definition of what's true, but rather are rewarded for how often it's reshared or how many likes it gets. These are not things that are well correlated with what is true or what is right or what is good for anybody. I see AI supercharging that, in the sense that most people will be getting their beliefs and information through AI systems. It doesn't have to be that way, but I think my fear is that it will. If you start to create AI systems that are optimized for engagement, it's going to have all the bad things about social media and more. All the addictive qualities, but more; all of the manipulative qualities, but more. Again, it doesn't have to be that way. We could build epistemic infrastructure, backstopped by AI, that was great. It's a tool that we could use to make our systems really good. But that's not the direction we're going. So that's the second one. I would say a third, which is also pretty clear, is AI being used to manipulate and control large numbers of people. This has been quite hard over time, especially to manipulate large numbers of people or to really surveil large numbers of people in detail. The technology to get all the information has existed for a while now, but the capacity to process it, to really surveil a hundred million people, takes tens of millions of people or more if you're doing it by hand. Now we are coming into the technology that will be able to do that in a very complex, nuanced way for huge numbers of people, and that seems deeply problematic. It's much more possible than at any other time in history. It's easy to imagine a surveillance state that is able to be effective and endure and not be able to be seen through or reacted against or overthrown, because it just has all of the levers of information, surveillance, control, and manipulation at its disposal in a way that no state has had in human history. That's a third direction that I'm quite worried about. You asked me to be concrete, and those are still sort of abstract. The last one, which is probably the most concrete, is what these systems make possible in terms of misuse.
There are a surprisingly small number of people out there who just want to do terrible things for the sake of it, like build chemical weapons and disperse them among large groups of people, or build bioweapons that can kill huge numbers of people, or set huge amounts of stuff on fire. There are lots of really offense-dominant technologies, and people mostly don't use them; the world is still here. But the more you proliferate the ability to do those things, and the larger the fraction of humanity you give the means to do them, at some point that's going to overlap with the tiny fraction of people who actually want to do them. I feel like we're very much going down that path with powerful AI models now that can basically hold your hand through reconstituting smallpox or building a chemical weapon or a novel virus. We just saw that Anthropic's new model was released under ASL-3, which is the highest safety level invoked so far; that level marks a threshold of worry that a model could materially help people create or develop weapons of mass destruction. There's only so far down this road we can go before it intersects with the small fraction of people who actually want to do one of those terrible things. It's not going to happen right away. It's not like as soon as you put some dangerous capability online, it instantly gets taken up and there's a disaster. That's not how it works. But you can't expand the fraction of people who can do bad things indefinitely, so that everybody has a nuclear weapon in their pocket. We know that that's a bad world. We can't just keep going toward the nuclear-weapon-in-everyone's-pocket world and hope for the best. That's not the way it's going to work. That's the other obvious failure mode that I worry about a lot.
SPENCER: Do you worry much about someone successfully controlling extremely powerful AI and using it to gain further control over the whole world, or large parts of the world, not necessarily through surveillance, but just by having such a powerful tool that they alone are in control of?
ANTHONY: I have worried about that a little bit, but I'm less bought into the idea that those two things will coexist: something so powerful that it can gain a decisive strategic advantage over everybody else and everything else, while also being controllable by one nefarious person. I find it very difficult for those two conditions to hold at once. In my mind, it will probably either be super powerful and not controllable, or it will be controllable and powerful enough to cause problems, but not powerful enough to overwhelm everyone else in the world with its power. The world is big and complicated and has lots of powerful things in it, and it takes time to affect things in various ways. I wouldn't rule it out, but my guess is that there's very little intersection between controllability and the level of capability that could really do a takeover.
[promo]
SPENCER: What do you see happening in the situation where these AIs get incredibly powerful but not controllable?
ANTHONY: Well, I think in general, that's bad news for us. The scenario that I would say is most hopeful is that they converge on some set of ethics that is compatible with human well-being. But I don't know what the probability of that is. My guess is relatively low, but I wouldn't say super low. There are lots of humans who care about the welfare of animals, and even people who care about the welfare of insects. This is a very strange thing evolutionarily; insects were a snack or maybe a danger, not something that we actually cared about. But we have, through moral evolution or development, come to this fairly universal ethical system among humans that the preferences of sentient beings matter at some level. I think almost everyone agrees with that. Now, whether that extends all the way to insects or whether it stops at orangutans or whatever is more complicated, but we do have some significant moral attractors: that it's better to let things live than to kill them, and that if things prefer A over B, and it doesn't matter to you, we should help them get A. There are some basic things that we can hope are somewhat universal moral attractors, and that superintelligences could also adhere to, which would be compatible with us. That's sort of my most hopeful scenario. I'm relatively unhopeful that, just because we've designed superintelligences in the right way now so that they're aligned with us, they will stay aligned with us and do all the things we want when there isn't such a moral attractor. I think cultural evolution, moral evolution, and idea evolution all unfold and change. There may be some path dependence and initial condition effects on those things, but I think it's very unpredictable how that's going to affect things in the long term. There's a very unpredictable relationship between where a complex system starts, especially one that evolves in various ways, and how it will behave a long time later. If there are superintelligent systems that are beyond our control, the fact that we've designed them to be super nice initially might help for a while, but I don't see that being an enduring thing that will help us out. If they go beyond our control, we just have to cross our fingers and hope that they end up with a moral system that's compatible with us, because I don't think we're going to have much other leverage over it.
SPENCER: Do you see any promising paths right now towards getting them to have a moral system that's compatible with us?
ANTHONY: No, I don't, actually. I don't think that's under our control. So I think we can hope that they do, and we can build powerful things that are under our control, so that we can continue to shape them and guide them if they go off course. But I'm not optimistic about just starting them off in the right direction and having that really constrain or guide where they go over the very long term. I think there may be some effect there, but again, I think it's quite unpredictable. The world could have ended up where it is now. It also could have ended up run by the Third Reich for a thousand years. That was another possibility. Neither of those had much to do with the moral system that the ancient Romans or Greeks came up with, or any other ancient culture. You can trace them; obviously, there was some causal influence. But if we hadn't lost a particular work by Plato, and it had survived and informed our ethical systems more, would that have made us more or less like the Third Reich than we are now? Who knows? I think it's totally unknowable. And I think we're in a similar situation with superintelligences. Thousands of years of effective evolution are going to happen in a relatively short amount of time. Our ability to guide that by setting them off in one direction versus another is just not going to be that great.
SPENCER: Is that because you think they will change quickly, basically, or self-improving systems, or systems that influence each other?
ANTHONY: I think that's part of it. Fundamentally, it's a very complex system that is inherently unpredictable and is going to evolve on a much shorter time scale than even human social systems have evolved, which is quite short already. In a hundred years, we've gone all over the place in terms of our social, economic, and cultural systems. If AI is running a hundred times faster, that could be one year for AI. So I think it's just quite inherently unpredictable. And again, unpredictable means you can't predict: if you knew that a given initial condition would put things in a really great direction, that would be a pretty important level of predictability, and I just don't think we're going to have that over very sophisticated, complicated systems evolving on this very fast time scale. So self-improvement plays a big part of it. But even if we just had a very powerful AGI that wasn't able to self-improve but ran 50 times faster than humans, and there were millions of them, and they were very powerful, I think that would still be a complex system and a social system. It would have all kinds of emergent dynamics that we can't predict, and I think we would again just have to cross our fingers and hope that whatever it evolved into was compatible with our well-being, while being very much concerned that the fraction of the space of possibilities compatible with our well-being could be quite small.
SPENCER: Before we wrap up this discussion of AI, one thing the listener might wonder is, "Why, if there are so many risks with AI, couldn't we just have a proposal as a society that we just don't build it?" If we're talking about an incredibly powerful weapon, we might just say, "Let's just not build that." Isn't that the natural solution?
ANTHONY: Yeah, I mean, it's a strange thing to me that we've got this technology that has so many intractable concerns with it: the four that I mentioned earlier, that it will lead to a totalitarian state, or total catastrophe, or runaway superintelligence that goes rogue, or war, or whatever the negative scenario is. There are many intractable issues, and there is an easy, kind of silver-bullet solution to them, which is to just not build AGI and superintelligence; that solves all of these problems. So why aren't we talking more about, "Let's just not build it?" I don't know. The first answer is that I think there are some reasons for that. Some are the incentives that have built up, like the race dynamic that we're now in, where companies and countries think that it's going to give them this huge amount of power, and they want power, so they want to get the thing. I also think there is a genuine historical idealism that technology is amazing, which it is. All of these great things have come from technology, which they have. Intelligence is great because it's given us all sorts of amazing things, including great technologies. More intelligence should be better, so let's build the most intelligence we possibly can, and that's going to give us just more of all of the awesome things. I think that's a powerful intuition. So I think that is the idealism that drove a lot of people to start developing AI and AGI, and it still underlies this utopian view that we're going to solve intelligence and then use it to solve everything else: humanity is going to be in this wonderful position by leveraging this new form of intelligence to make everything great. I think that is a powerful driver, and it's a respectable one. I question whether it's actually going to turn out that way, but it's a respectable one. The power-seeking one is a little bit less so. There are a few others which I find pretty frustrating, like the inevitability argument that technology goes forward: if there are technologies that we could have, we always develop them; if they're powerful, we always develop them faster; and so on. There's sort of no fighting technological progress, and there's no putting the genie back in the bottle once it comes out, or closing Pandora's box, or whatever. I think this sounds like a strong argument, but it's a pretty unconvincing one, because there are plenty of technologies that we could have developed and haven't, especially ones that mess with the core of who's in charge of Earth and what it means to be human. For example, human germline engineering: we haven't done it. There are arguments that it would be good and arguments that it would be bad, but it's a powerful technology that could give a huge advantage to certain people. It could be economically very valuable. We just haven't done it. We've decided we're not going to do that. Eugenics is another one. It probably would work; we could probably have super smart people by doing eugenics. We've decided very firmly that we're not going to do eugenics. We sort of partially fought a World War to not do eugenics. We have a very strong taboo on it. That is a moral choice that we've made. So I think there are technologies that we've very much turned away from when we could have pursued them, and we've done so despite the fact that they could be powerful and useful technologies in some ways, for moral and ethical reasons, and because they're threatening to our humanity and threatening to our civilization and so on. So I don't buy that inevitability argument.
There are other inevitability arguments that are like, "Well, if we don't do it, they will," and I think those also have this compelling feel to them. But in fact, there are many things we've stopped from happening just by saying we're not going to do them and making them illegal. There's this idea that if you make creating AGI illegal, someone is going to sneak off and, in their basement, assemble 40,000 or 100,000 H100 GPUs with a gigawatt of power for a month and generate an AI model. That's preposterous. This is not the sort of thing that you hide away and do in your basement. If the people of the world decide that they don't want to build AGI and superintelligence, they absolutely can choose not to build AGI and superintelligence. We can create the governance structures that would be required to prevent that from happening. We absolutely can. It's not the easiest thing in the world, but it's way easier than some other problems that we've solved. So I don't buy the inevitability argument either. I think what is happening right now is that it's being driven by the competitive dynamics and the greed for power, with a little bit of the optimism mixed in, which I do respect, even if I don't agree with how it's going to play out.
SPENCER: For those that are concerned about risks from AI and want to do something, what would you advise them to do?
ANTHONY: I think, "Learn more and get active." What getting active means exactly for any given person depends on who they are, what they have access to, and what sort of actions they like to take. Certainly, I think at the moment, we're in a sort of political and policy situation that is kind of optimally bad for keeping AI safe. We have AI companies that are racing for things. We have policymakers, particularly in the US, that are very much leaning into the race and feeling like safety is just something that gets in the way of racing and innovation. I think speaking loudly one's beliefs that this is a concern, being in touch with policymakers, being in touch with other thought leaders, writing things, all of these things contribute to helping people who are trying to make AI safer or prevent the wrong sorts of AI from coming into being. This is rather abstract, but I think it really does depend on who the person is, where they are in life, and what their capabilities are.
SPENCER: Before we wrap up, I thought we'd do a rapid-fire round where I ask you a bunch of difficult questions. You give quick answers. How does that sound?
ANTHONY: Very dangerous. Okay, let's go. Awesome.
SPENCER: Many people believe that today's news is very problematic, the way people get information today. What would it look like to have good epistemic infrastructure?
ANTHONY: It would look like any statement that you see, you can trace back to where that statement came from and who is responsible for it, and trace those back until they ground out in somebody's mind, or a piece of data or a picture or some sort of thing that was taken from the real world and is verified and registered on the blockchain or something like that. You'd be able to trace the provenance of any piece of information all the way back down to human minds or data about the real world.
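As a rough sketch of the provenance idea Anthony describes, the toy Python below walks a chain of claims back to a grounding source (a person taking responsibility, or a hash of raw data); the Record class, its fields, and the example data are all invented for illustration.

```python
# Hypothetical sketch of tracing every statement back to its source. Names and
# structure are made up for illustration, not any real system's design.
from dataclasses import dataclass, field
from hashlib import sha256
from typing import List, Optional

@dataclass
class Record:
    content: str
    author: Optional[str] = None          # person taking responsibility, if any
    data_hash: Optional[str] = None       # hash of registered real-world data, if any
    sources: List["Record"] = field(default_factory=list)

    def is_grounded(self) -> bool:
        # A record is "grounded" if it bottoms out in a person or verified data.
        return self.author is not None or self.data_hash is not None

def trace(record: Record, depth: int = 0) -> None:
    """Walk the provenance chain and print where each claim comes from."""
    marker = "grounded" if record.is_grounded() else "derived"
    print("  " * depth + f"[{marker}] {record.content}")
    for src in record.sources:
        trace(src, depth + 1)

# Example: a headline derived from a named analyst and a hashed sensor image.
photo = Record("Satellite image of the site", data_hash=sha256(b"raw pixels").hexdigest())
analyst = Record("Analyst judgment that the site expanded", author="J. Doe", sources=[photo])
headline = Record("Report: facility has doubled in size", sources=[analyst])
trace(headline)
```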
SPENCER: That's a cool idea. I like it. Suppose you had $1 million to give to help with AI safety, to make the world safer from AI, where would you give it? Would you divide it up? Would you give it to one place?
ANTHONY: Well, I face that question already, in the sense that I run the Future of Life Institute and we do grantmaking. So that's a hard question. I would follow the same procedure, which is to look for very high impact per dollar: things that push in the right directions and are in line with FLI's goals of averting large-scale risk and moving toward beneficial outcomes. Unfortunately, I would just throw it into our grants program, because I'm trying for us to do all the best possible things with the financial resources that we have.
SPENCER: Are there standout organizations that could use more funding today?
ANTHONY: Oh boy, yes, but I don't think I could pick, rank, or name them on such a short time scale. That would be too dangerous for even me to do.
SPENCER: This is a big question, but what's the right way to think about entropy in physics?
ANTHONY: That's a hard one. There's a paper on this that maybe you can put in the show notes or something.
SPENCER: Yeah, I'll put in the show notes.
ANTHONY: I have a preferred version of this, which is called observational entropy, which is a set of ideas that I developed with postdocs and students over a number of years. I think it's the right way to think about entropy in general, especially in physics. There's a new version of it from a former student and postdoc of mine that's even better. I've tweeted about it, and I can give some recommendations. I think if you're interested in entropy and what it means, and you like equations, this is a great paper for you.
SPENCER: So you co-founded Metaculus, which is, do you want to just describe briefly what Metaculus is?
ANTHONY: Yeah, Metaculus is a platform to solicit, aggregate, and optimize predictions about all sorts of things, but especially technology. It's not a prediction market; it's a sort of cooperative and slightly competitive, more science-inspired way of making accurate, well-calibrated predictions about all sorts of stuff. It's been running for ten or so years now, and I would say it is now the gold standard among prediction platforms. There are lots of prediction markets and other things competing with each other, but I think Metaculus is still on top on the metrics. You can get predictions on just about anything by making a Metaculus question and getting predictions on it.
SPENCER: Yeah, and there are a lot of people on there who will make predictions, and then you'll score them, and you can then aggregate their predictions based on how accurate they've been in the past to predict things in the future. Is that right?
ANTHONY: Exactly right. Yeah.
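To illustrate the scoring-and-weighting idea just described, here is a toy Python sketch that weights each forecaster by past accuracy and pools their current probabilities; this is not Metaculus's actual scoring or aggregation method, and all numbers are invented.

```python
# Illustrative only: weight forecasters by past accuracy (inverse Brier score)
# and aggregate their current probabilities as a weighted average of log-odds.
import math

def brier(probs_and_outcomes):
    """Mean squared error between forecast probabilities and 0/1 outcomes."""
    return sum((p - o) ** 2 for p, o in probs_and_outcomes) / len(probs_and_outcomes)

def aggregate(current_probs, track_records):
    """Weighted log-odds pooling; a better past Brier score means a bigger weight."""
    weights = [1.0 / (brier(tr) + 1e-6) for tr in track_records]
    total = sum(weights)
    logit = sum(w * math.log(p / (1 - p)) for w, p in zip(weights, current_probs)) / total
    return 1.0 / (1.0 + math.exp(-logit))

# Three forecasters with different track records predict on a new question.
history = [
    [(0.8, 1), (0.2, 0), (0.7, 1)],   # well calibrated
    [(0.5, 1), (0.5, 0), (0.5, 1)],   # uninformative
    [(0.9, 0), (0.1, 1), (0.8, 0)],   # badly miscalibrated
]
print(round(aggregate([0.7, 0.5, 0.2], history), 3))
```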
SPENCER: On what kind of questions do you think Metaculus is most useful?
ANTHONY: This is something that's evolved a lot over time for me. I think there's a sense in which Metaculus is very useful, not so much for predicting very fine-grained stuff, or getting 85 rather than 86 percent correct, but as a sort of aggregator with authority. So I think if you look at the AI predictions on Metaculus, we don't know exactly how accurate they're going to be, but what I think we do know is that you're not going to get more accurate predictions somewhere else. For a given thing, once you specify the question well enough, we sort of know that you're not going to have a much better prediction than you have on Metaculus. Maybe you can hope to if you've got really special inside knowledge or something like that. But for almost everybody, almost all of the time, the prediction on there is better than the prediction you're going to have. You can prove that to yourself by making predictions on Metaculus for a while and finding that you're not as accurate as Metaculus is, which basically everybody will find if they do it. So I think it's a really useful source of common knowledge about controversial things, and kind of a ground truth. It's a little bit of a Wikipedia of the future, in the sense that there are things that are wrong in Wikipedia, but if you want to know what is the generally accepted ground truth about something, it's actually a pretty good source. I think Metaculus is like that for future events, as close as we're going to get.
SPENCER: When prediction markets and sites like Metaculus first started to come into development, some people thought they could be used for everything, but now people's expectations may have lowered a little bit; they're useful for some things. But why do you think that prediction markets maybe didn't live up to being as generally useful as some people thought they would be?
ANTHONY: Yeah, I think there are two reasons for this, both of which might be addressable in the future. One is that it's totally obvious that predictions play a key role in making good decisions. When you think about decisions, you think, "Here's my goal. Here are some different possibilities I could choose. Let me make a prediction about which choice gets me closer to my goal, and then go with that choice." Prediction is a key part of that, but prediction alone does a surprisingly small part of the job. It just turns out that for most things you think about, having something like Metaculus is a couple of steps away from being truly useful in making good decisions. It's helpful, but it's somewhat indirect. I think there are two things missing. One is more of a structure for making decisions: is there some kind of support system? Metaculus is a good support system for making predictions. Is there a support system for making decisions, having goals, and figuring out what we actually need the predictions for in order to make good decisions, and then plugging Metaculus into that? That's something I'm actually thinking a lot about, and Metaculus is thinking about it: how can we make the bridge from good predictions to good decisions, which is kind of what we all want, good decisions that help us reach our goals. So that's one thing. The other is that for a lot of decisions you want to make, or things you want to do with predictions, you just need a lot of them very quickly, at a very fine-grained level, and it's often hard to match up the thing that you really want to know with a particular question on Metaculus or a prediction market. I think this is going to change as we have AI systems that are very fast and accurate predictive oracles. They're not going to tell us exactly what's going to happen. This is a whole different conversation, but I'm not sure that you can do that much better than Metaculus, actually, for complicated issues. There's a subset of questions where you obviously can, but for very complicated issues, I'm not sure you can do much better. What you can do is reach a similar level of accuracy, instantly and for arbitrarily fine-grained questions that you might have as an individual person. I think this is definitely coming; there are going to be Metaculus competitors that are all AI and just about as accurate, if not now then very soon, and you're going to be able to put any well-posed question into those and instantly get a probabilistic answer back, without having to post it online and wait for days. I think that's going to change things a lot too, just AI powering the same sort of process, and at some level that's going to replace what Metaculus currently does. But hopefully by then, Metaculus will be doing new and different, interesting things.
SPENCER: How seriously do you take the simulation argument? That is the argument that if one day civilizations that are sufficiently advanced eventually make simulations of other civilizations, they may make more simulations than there are real entities in the world that have consciousness, and therefore we're more likely to live in a simulation than in reality, if we consider ourselves equally likely to be any conscious entity.
ANTHONY: Yeah, I think there's a testable scientific question about this that is actually quite interesting, and I would like to know the answer to it. I don't know. I tend not to take it super seriously, but that's an intuitive, emotional reaction rather than an intellectual one. My intellectual stance on taking it seriously depends crucially on the question of how important quantum mechanics is for complex things like living systems, and actually living systems in particular. Suppose I had a very powerful classical computer that could simulate a physical system with incredibly high fidelity. If I take a seed for a plant and simulate the seed, plus some soil, plus some water, plus some sunlight, and run the whole thing forward in time, then if reality is simulable by a classical computer, I should be able to see whether it makes a blue flower or an orange flower. Is that possible? It is possible, I think, that doing the modeling from seed plus soil plus sunlight to flower requires a fully quantum treatment. Now, if that's true, then no interesting system can be simulated without a quantum computer that's just as sophisticated as that system. To simulate, say, a kilogram of stuff, you'd need a 10 to the 26th or 27th qubit quantum computer. And if you're simulating a 10 to the 27 element system with a 10 to the 27 qubit quantum computer, you're basically not simulating something; you're emulating it. You're setting up a one-to-one mapping between the purportedly simulated thing and the simulator. What our descendants would really be doing, if they wanted to simulate Earth, is making an Earth-sized quantum computer and simulating Earth, which is more or less the same thing as just making Earth and running it. This seems a lot less interesting; maybe that's something that's going to happen someday, but it doesn't seem like a simulation hypothesis so much as a very weird relation between the Earth that we're in and what might be happening outside that Earth. I guess you would still put it in the same bucket, but it's not really what the simulation hypothesis has in mind. I have an interest in trying to really figure out that question of how quantum the biological world is, and whether it could actually be simulated by a classical computer with resources that fit into the known universe or not. That would inform a little bit my view of at least that version of the simulation argument.
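For readers wondering where a figure like 10 to the 26th or 27th comes from, here is a back-of-the-envelope count; the average atomic mass and the number of tracked quantum degrees of freedom per atom are illustrative assumptions, not figures from the conversation.

```latex
% Back-of-the-envelope only: atoms in a kilogram of ordinary matter,
% assuming an average atomic mass of roughly 10 atomic mass units.
\[
N_{\text{atoms}} \approx \frac{1\ \text{kg}}{10 \times 1.66\times 10^{-27}\ \text{kg}}
\approx 6\times 10^{25},
\qquad
N_{\text{qubits}} \sim (\text{a few per atom}) \times N_{\text{atoms}}
\sim 10^{26}\text{--}10^{27}.
\]
```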
SPENCER: Related question, my understanding is that most scientists who study the brain think that quantum mechanics is not really necessary to model the brain, that you could essentially have a classical computer that models the brain by modeling every neuron very accurately, but not at a quantum level. Do you think that the brain requires quantum computing to simulate?
ANTHONY: I don't know. There are a few versions of that question; maybe there are three levels. The first is the one that I just mentioned, about the brain as a biological system: can you simulate an actual brain with a classical computer well enough to have the brain live, metabolize, and have thoughts? I don't know; that's the same question I asked before. There's another question, which is whether there is a layer of abstraction that, even if you can't get all the details right, gets enough right that it will do all the things that a brain does, like have the thoughts that a brain thinks and have the feelings that a brain feels. I also don't know, but my guess is maybe 70% that there's such a coarse-grained description and 30% that you have to simulate the whole thing from the bottom up, going all the way down to the fundamental dynamics and what the individual ribosomes are doing. Then there's a third question, which is, if there is a coarse-grained description that captures the dynamics, is it a classical one or a quantum one? Do you need quantum coherence and large-scale quantum effects to have the sort of cognition that we have? Penrose has been an advocate of the idea that you need quantum mechanics, and there are a few other people pushing in that direction. One of those proposals, which is a little bit less crazy as it turns out, uses nuclear spins, where you can maintain quantum coherence for way longer than if you're doing it just in the warm, squishy electron part of the brain. My guess is that that's pretty improbable. I'd give it maybe a couple of percent chance of being true. It would be super awesome and interesting if it was, but that's just my probability at the moment; it's pretty low.
SPENCER: How do you feel about open source models for AI, the way that Meta has pushed putting the models out so anyone can use them?
ANTHONY: Yeah, I think it's basically been fine for the most part, up until now. I think there is going to be a point at which it becomes incredibly problematic, and unfortunately, it's going to be hard to know in advance what that point is. I think if we have true AGI, like autonomous expert-level general intelligence, and that is open sourced, I think that is going to be quite disastrous. I think it has not been disastrous up until now. Where the threshold between fine and disaster is going to be, I'm not quite sure. I don't think anybody quite knows, and I'm quite worried that we're going to go past it before we realize it.
SPENCER: Final question for you. So you've worked on a lot of different areas in your life. You've worked on physics, you've worked on Metaculus, you work on AI, and so on. How do you decide what to work on? Do you have general heuristics you use? Or how do you approach that question?
ANTHONY: Ooh, I would say I'm an unpredictable, complex system. I definitely would not have predicted 20 years ago what I am doing right now. I was just in a meeting talking about what text to put in a political policy advertisement for the Future of Life Institute on a particular policy goal around state preemption. That is about number 750,000 on the list of things I would have predicted 20 years ago that I would be doing, so I have no good answer for you, frankly. I go with what seems interesting, important, and exciting, some combination of those, and that's taken me to some weird places.
SPENCER: Anthony, thanks so much for coming on.
ANTHONY: Thanks for having me. It's great to talk to you and a pleasure to be on the show.
[outro]
JOSH: A listener asks: "I try to better understand scientific work by reading a lot of papers, but unfortunately a lot of it is over my head. How important is it for me to get better at understanding research? It feels like a kind of virtuous thing to do, but also not a lot of people have that skill. Is it better for me to just use an AI to help summarize a paper?"
SPENCER: What I would recommend is that you download the PDF of the paper, select all, copy the entire body of the text, paste it into an AI such as Claude, ideally one of the better models, and then you talk to it about the paper, right? So you try to engage it: you have it summarize the paper, and you have it explain the bits and pieces and parts that you don't understand. And of course AIs can make mistakes, they can hallucinate; but they tend to be pretty good at explaining these sorts of things, and I think that can help you break through any parts that you don't understand or any terminology issues and give you a way to really understand papers better and not get stuck. That being said, a lot of times we don't need to understand research at the deepest level in order to get value out of it. For example, let's say I'm reading about a treatment for Alzheimer's disease. I don't need to understand the ins and outs of the mechanism and how exactly the treatment works and all this stuff, the theorizing about the brain and the tau plaques or whatever it is. I can just look at the study and say, "Okay, well look: they split people into two groups, and they gave 500 of them this treatment, and they gave 500 a placebo; and what was the effect overall?" So often you don't need to understand all the mechanisms to still get value out of the research.
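As a minimal sketch of the workflow Spencer describes, assuming the pypdf and anthropic Python packages and an API key in your environment; the model name below is a placeholder to swap for whatever current model you have access to.

```python
# Sketch only: extract a paper's text and ask an AI model questions about it.
from pypdf import PdfReader
import anthropic

reader = PdfReader("paper.pdf")
paper_text = "\n".join(page.extract_text() or "" for page in reader.pages)

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment
response = client.messages.create(
    model="claude-sonnet-placeholder",   # placeholder model name; replace it
    max_tokens=1000,
    messages=[{
        "role": "user",
        "content": f"Here is a paper:\n\n{paper_text}\n\n"
                   "Summarize the study design and main result in plain language, "
                   "and flag anything a non-expert might misread.",
    }],
)
print(response.content[0].text)
```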