with Spencer Greenberg
the podcast about ideas that matter

Episode 194: How can AIs know what we want if *we* don't even know? (with Geoffrey Irving)


January 25, 2024

What does it really mean to align an AI system with human values? What would a powerful AI need to do in order to do "what we want"? How does being an assistant differ from being an agent? Could inter-AI debate work as an alignment strategy, or would it just result in arguments designed to manipulate humans via their cognitive and emotional biases? How can we make sure that all human values are learned by AIs, not just the values of humans in WEIRD societies? Are our current state-of-the-art LLMs politically left-leaning? How can alignment strategies take into account the fact that our individual and collective values occasionally change over time?

Geoffrey Irving is an AI safety researcher at DeepMind. Before that, he led the Reflection Team at OpenAI, was involved in neural network theorem proving at Google Brain, cofounded Eddy Systems to autocorrect code as you type, and worked on computational physics and geometry at Otherlab, D. E. Shaw Research, Pixar, and Weta Digital. He has screen credits on Ratatouille, WALL•E, Up, and Tintin. Learn more about him at his website.


SPENCER: Geoffrey, welcome.

GEOFFREY: Thank you.

SPENCER: As AI systems get increasingly powerful, there's more and more talk about the importance of aligning these AI systems. Let's start the conversation with what I think could be a really important topic, which is: what does it really mean to align an AI system and why do you think that's important? Let's begin there.

GEOFFREY: Sure. I think there's a vague common sense notion of, "have an AI system do what humans want." And mostly, the task of AI alignment is figuring out the breakdown of formal versus informal means for attacking that definition. But I think that's the thing to keep in mind: it does what we want. Of course, 'we' is complicated here as well.

SPENCER: And 'to do what we want,' let's start unpacking that a bit. What kinds of things would an AI that's very, very powerful need to do in order to do what we want?

GEOFFREY: One question is, what kinds of things would you like it to do? And I think there's some wisdom involved in biasing towards assistant-type tasks: having it help us get information or help us do things rather than act in some fully autonomous manner. I think part of the task is picking the class of tasks such that, if it does a good job, then good things have happened. And then hopefully, you've chosen that where there's enough of a well-defined notion of what good behavior is that you can tie it down approximately. I think you won't get a very precise definition, but the hope is that you have enough slop in the definitions that imprecise definitions are okay.

SPENCER: Historically, I've seen attempts in the past to say, "Well, we can make AI systems safer if we make sure that they are just tools that assist us in doing something as opposed to doing it on their own." But it seems to me that this starts to blend with doing things on their own when you have to actually do something in order to assist. For example, let's say someone wants to know, well, how do I set up a bank account at such-and-such bank? Well, maybe the tool has to go read the documentation or the website for that bank in order to actually be able to assist with that task. Now maybe reading that documentation is not a big deal. But what if it had to take further steps in order to assist? What if it actually had to create a new account itself in order to figure out what the steps are? Well, now it's starting to seem more like an agent even though it's just acting as an assistant. So how do you think about distinguishing the idea of being an assistant or tool from being an agent that's acting autonomously?

GEOFFREY: I think there's two steps: one is to imagine what you would do as a human if you had enough time or enough resources (or whatever) to do all of the things yourself. You can imagine trying to define what that would look like, and then you can try to get the machine to behave as if it is just helping you do that. An example would be: if a machine is doing a bunch of small tasks, even a GET request to a page might be an action in the world because it has some side effect, but you also know that that action is safe and fine. So you could imagine, instead of the agent actually doing the task, it just tells you every single step that it is going to take. And that would be incredibly slow and boring, but that's one step towards, "Okay, now I can just look over the task, I have some way of supervising an informational answer." Assume I've solved that problem; then the next step would be: you can have the thing simulate, well, what if I were to ask someone? Would the person have endorsed the action? And then if I'm sufficiently confident, I can take that step automatically. And then hopefully, you can build up from an imagined ideal of pure information to some notion of autonomy in the middle of that.

SPENCER: Just to make this really concrete, let's go back to this question of asking the AI to help you figure out how to set up your bank account. Now, instead of going and setting up the bank account itself — which is a little creepy, to have an AI try to pretend to be human and set up a bank account — you might just have it output the list of steps it would take to set up a bank account. But it seems like, very quickly, you're gonna run into a problem where it may not actually know that unless it tries to do it. Some of the steps are gonna be hidden and it actually would have to try to do the steps in order to see what the next step is, and so on. So I'm wondering if the usefulness of the system is fundamentally limited unless it lets it take some of these actions.

GEOFFREY: That already has some side-effecting action because it involves logging into the bank account website, say. And you can chain these things together, make them look like a series of informational actions, and then again, try to short-circuit: if I'm super confident someone is going to endorse this, I don't have to ask them. In some sense, what I would like to have is a philosophical notion of "just give people information," but then make it practical and bend it a bit, keeping as close as we can to the notion of just informational assistance.

SPENCER: Suppose it outputs a series of actions that it wants to take to help you, to assist you. And then it applies some model that it has trained to say, "Well, would a human want me to take this action or would it want to pre-approve it before I take it?" And then if it's a really high score — yes, the person is likely to want me to take this action — it will go take it. Is that the idea?

GEOFFREY: Yep, that's right.
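
The approval-gated loop Spencer just described can be sketched in a few lines of Python. This is only an illustrative sketch: `predict_endorsement`, `ask_human`, and `execute` are hypothetical stand-ins for a trained approval model, a query to a human overseer, and the action executor, none of which are specified in the conversation.

```python
def run_with_oversight(actions, predict_endorsement, ask_human, execute,
                       threshold=0.99):
    """Execute each proposed action only when a learned model is highly
    confident a human would endorse it; otherwise ask the human first."""
    for action in actions:
        p = predict_endorsement(action)  # estimated P(human approves)
        if p >= threshold:
            execute(action)              # confident: act autonomously
        elif ask_human(action):          # uncertain: defer to a person
            execute(action)
        # else: the human declined, so the action is skipped
```

The interesting design question is where `threshold` sits: too low and the system takes unendorsed actions; too high and it degenerates into asking about everything, which is the "incredibly slow and boring" regime Geoffrey mentions.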

SPENCER: Where do you see this idea starting to hit walls? Do you think that an idea like this could actually take us all the way to developing extremely powerful, yet still safe, AI? Or do you think that there's gonna be roadblocks?

GEOFFREY: The main place that it hits a wall is that first step of trusting information from an agent, though most of my work is on that step, on just trusting the information, because even that step is very hard. And again, most of the philosophical difficulty, say, is in that first step. The next step, I think, then runs into a bunch of practical challenges, like you have some model of whether a human would endorse that; is that model accurate enough? And there's a variety of other problems in those steps as well. But maybe the biggest one is just in trusting information.

SPENCER: Is this related to the idea of these AI systems hallucinating, where they'll make up information but say it really confidently?

GEOFFREY: Yeah, a hallucination is maybe one type of factual error. You could also have a machine that's actively deceiving you. You could have it retrieving information that's just wrong; it got it from somewhere but the source is incorrect. There's a couple of different reasons for these factual errors. Also, if you have humans as training signals, the humans are going to make mistakes as well. And so it could be that you've trained it in an environment where it produces only output that looks correct to humans but that isn't actually correct; there's some subtle flaws. So hallucinations will be one piece of the story, and maybe not, I think, the most existential risk-related piece, because we expect that to improve as the models get stronger.

SPENCER: Before we get more into the idea of alignment, why do you think this is really important? What's your pitch for that?

GEOFFREY: I think there are two reasons: one is, there's a good chance that these things will get fairly strong; the capability of the systems is increasing quite rapidly. There's also a fairly strong economic incentive to use these things, both for informational tasks and, as you pointed out, for quite autonomous or semi-autonomous tasks. I think you should expect that, assuming the systems advance, people will try to put them to use in these various ways. And then I think we don't have confident enough solutions to alignment, either for the informational case or the action case. So if we expect this to happen as soon as they get strong, and we don't trust the alignment schemes just yet, then I think we should work on closing that gap.

SPENCER: How much do you worry about bad actors using these systems to do bad things that maybe can be enhanced a lot by intelligent AI in ways that humans couldn't do it, versus the systems themselves being used for things that are benign or even helpful, but going off the rails because they misunderstood the intention of the user?

GEOFFREY: I think they're just both important risks. I work most of my time on the pure alignment case, although certainly, if you try to harden systems to malicious attacks — and some of my work has been on that as well — then that addresses some of the misuse. On the relative importance of (say) accident risk versus malicious risk, I don't have a strong view, although I mostly work on one of them. That's a long-winded way of saying they're both important.

SPENCER: There's this funny tension where it might seem, the more intelligent an AI system gets, the better it is at following our instructions and doing exactly what we want. If you take GPT-2 and you try to get it to generate a poem, it's going to do much less of a good job, most of the time, than GPT-3, which will do less well than GPT-4. But there's this other sense that people worry about, that these systems may be more likely to make certain types of really subtle errors where, maybe as systems get sufficiently advanced, they might take a whole bunch of crazy actions you didn't expect when you give it a relatively benign inquiry. For example, if you have a chess-playing engine, you don't have to worry about it manipulating you or lying to you when it's giving you its moves; it's just not within its capability set. Whereas if you had a really advanced agent, maybe through training procedures, it could actually develop an incentive to lie to you and now suddenly, there's this completely new class of ways that trusting it could go badly. I'm just curious to hear your thoughts on that tension, whether you think that's a real tension.

GEOFFREY: I don't think there's an actual tension there. I think, as the strength increases, you have both the ability to follow instructions better and at least a latent ability to find and exploit any holes in the instructions, or holes in the reward signal, if you're planning some complicated reward signal. And so I think the question is not even which of those holds; it's that both of them can hold at the same time. And you both get more and more advanced in following behavior — more and more useful behavior — and more and more of this potential loophole exploitation. The other thing is that, even if we only appear to see usefulness, I still want to know why the thing is useful in the way it is. So you can imagine a system; let's say it's generating a poem and the poems get better and better and better and more and more emotionally salient. We don't really know why or what strings they're pulling; they're just increasingly emotionally manipulative in some subtle way. And so they become addictive and, at some point, I want to understand what is going on, or I think that there could be bad consequences down the road.

SPENCER: This reminds me of OpenAI having, as one of their approaches to try to make systems aligned, using this idea of AI debate that, while it can be hard to directly evaluate what an AI says if you're just a human — you don't have access to all the information — maybe you can get a debate between AIs and maybe it's a lot easier to judge that. But I worry that that kind of approach could lead to optimization against the human brain where it turns out, the best way to debate is to make you feel like you're a bad person if you believe X or something like that, not to have the most factual approach. And so I wonder about that.

GEOFFREY: For clarification, none of the authors of that paper are at OpenAI anymore; I'm the first author, and the authors are now respectively at DeepMind, ARC, and Anthropic. The thing I would say about that argument is that, in some sense, there's two things going on if you switch from (say) human feedback training to training with some kind of strengthened protocol like debate or a family of similar schemes. One is that the reward environment is richer, so you could incentivize stronger behavior. The other is that you have more ability to catch flaws. If you hold the capabilities of the models constant, my intuition is that debate is an improvement: you would already get that exploitation of weakness, that exploitation of human manipulability or something, if you just did direct human feedback training. So actually having this adversarial setup, where another model is complaining about the first model's behavior, gives you some strengthening. It's also a richer environment to train systems, so you have this balance: maybe there's an advance in capabilities you can get out of these setups in exchange for safety at a fixed capability level.
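
The debate protocol being discussed can be sketched schematically. In this hedged sketch, `agent_a`, `agent_b`, and `judge` are hypothetical callables (two debater models and a human or model judge); the real protocol in the debate literature involves much more structure, but the core loop is just alternating arguments over a shared transcript:

```python
def debate(question, agent_a, agent_b, judge, rounds=3):
    """Minimal sketch of an AI-debate round: two models take turns
    adding arguments to a transcript, then a judge picks the winner."""
    transcript = [("question", question)]
    for _ in range(rounds):
        transcript.append(("A", agent_a(transcript)))  # A argues, seeing all so far
        transcript.append(("B", agent_b(transcript)))  # B rebuts, seeing A's move
    return judge(transcript)  # e.g. returns "A" or "B"
```

The hoped-for property is that it is easier for the judge to spot which debater's claims survive adversarial criticism than to evaluate a single unchallenged answer directly.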

SPENCER: Could you elaborate on the ways that these systems might actually have an incentive to manipulate us in the standard non-debate scenarios?

GEOFFREY: Sure. You just imagine the model is spitting out an answer and then maybe it spits out an explanation. Let's do that setting. So the answer could be something that's emotionally salient and the explanation just amplifies that effect. And the human is riled up by the emotional manipulation, just from a single, non-adversarial answer. So in the same way that someone can sell something to you by just telling you an argument for something, you can get that manipulation out of non-debate, but it's there in the baseline.

SPENCER: For those who are not as familiar with how these models are trained, can you explain how that goes into the training feedback loop, and why it might actually be incentivized to do that in training?

GEOFFREY: The typical way that LLMs are trained these days: first, they're pre-trained on a bunch of internet text. And then they're fine-tuned with reinforcement learning from human feedback (RLHF), which is basically: you repeatedly have them talk to a human in some way, and the human scores their behavior. Possibly a different human scores them; the details vary. So you're iteratively showing their output to a human, having the human evaluate it, and then you often build a model of the human's evaluations and train against that model. So if the machine learns that humans are manipulated by a certain emotional trick, or by hiding lies in a subtle way, then it can learn to iteratively exploit that trick.
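
The reward-model step Geoffrey describes is commonly fit to pairwise human preferences with a Bradley-Terry / logistic loss. Here is a toy, self-contained sketch of that idea; the linear model, the `featurize` function, and the training data are all illustrative stand-ins, not the actual RLHF machinery:

```python
import math

def preference_loss(r_chosen, r_rejected):
    """Logistic (Bradley-Terry) loss on a preference pair: small when the
    reward of the human-preferred response exceeds the rejected one."""
    return -math.log(1.0 / (1.0 + math.exp(r_rejected - r_chosen)))

def fit_linear_reward(pairs, featurize, dim, lr=0.1, steps=200):
    """Fit reward(x) = w . featurize(x) by gradient descent on the
    pairwise preference loss. Toy stand-in for the reward-model step."""
    w = [0.0] * dim
    for _ in range(steps):
        for chosen, rejected in pairs:
            fc, fr = featurize(chosen), featurize(rejected)
            rc = sum(wi * xi for wi, xi in zip(w, fc))
            rr = sum(wi * xi for wi, xi in zip(w, fr))
            # negative gradient of the loss pushes w toward the chosen features
            g = 1.0 / (1.0 + math.exp(rc - rr))
            w = [wi + lr * g * (c - r) for wi, c, r in zip(w, fc, fr)]
    return w
```

The manipulation risk lives exactly here: whatever regularities separate "chosen" from "rejected" in the human data — including human biases — get baked into the reward function the policy is then optimized against.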

SPENCER: Do you see evidence of that happening already? Is there a class of responses that these LLMs are giving that seem to be manipulating the end user?

GEOFFREY: I think the very first time we tried it, it happened in innocuous ways, but I think these can snowball over time. In the first work we did on RLHF when I was at OpenAI, we had a word limit on the model's output, but humans like long answers — it sounds smarter if you use more words, and it also sounds smarter if you use fancier, longer words — and so the thing always hit the word cap; it would use as many words as it could, but it would also choose tokens, choose word pieces, that had more characters in them, so that the resulting sentence was even longer than the word limit would otherwise allow. That sounded really fancy and convinced the raters that it was intelligent. And if you correct for that artificial length advantage, it loses a lot of apparent performance. So it was already fooling people just by using fancy-sounding words.
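
One cheap diagnostic for the length exploitation Geoffrey describes is simply to check how strongly reward scores correlate with raw response length. This is a hedged sketch, not a method from the paper under discussion; the correlation threshold is an arbitrary illustrative choice:

```python
def pearson(xs, ys):
    """Pearson correlation coefficient, computed from scratch."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs) ** 0.5
    vy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (vx * vy)

def length_bias(responses, scores, warn_at=0.5):
    """Flag a reward signal whose scores track sheer response length:
    a strong positive correlation suggests the 'longer sounds smarter'
    exploit rather than genuine quality differences."""
    r = pearson([len(t) for t in responses], scores)
    return r, r > warn_at
```

A flagged correlation doesn't prove manipulation — longer answers are sometimes genuinely better — but it tells you where to look, and a length-controlled comparison (as in the anecdote above) can then separate the two.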

SPENCER: How do you adjust for that? If your human raters actually genuinely prefer it to use fancy-sounding words, that's actually the way it should optimize objective function. But somehow that objective function is not capturing your values as the creator of the system. What do you do about that?

GEOFFREY: I think you have to spend a lot of time thinking about the humans. If humans are the reward function, then a lot of the work of alignment will be, in fact, engineering this human protocol to produce higher-quality information. And I expect it to be a mixture of machine learning and careful understanding of how humans work: careful experiments to learn qualitatively what's going wrong. In the case of these long words, you could try to just point that out to people in the course of data gathering, or show them instructions to try to correct for it. Basically, my progression as a researcher was: first, working on these kinds of theoretical schemes, then realizing there's gonna be a bunch of human complexity, so we should hire cognitive scientists and so on. And then there's a lot of messy details.

SPENCER: It seems to me that there's a few different ways understanding humans goes into trying to build AI that works really well, especially as AI gets really powerful. One is trying to prevent humans being exploited in the kinds of ways that you were talking about, where training the system can make use of human biases. Another is that we ultimately want the AI systems to reflect our values and so, in some sense, understanding the values that... Our true objective function, in some sense for these AI systems, if we're benevolent creators, is to support human values, no?

GEOFFREY: Actually, not that.

SPENCER: Not that. Okay, why not?

GEOFFREY: I think there's a subtlety where you don't want to actually have to know what human values are. You do a little bit, but you should try not to have to know that too much. So in some sense, to some extent — and this is back to this assistive frame, and this is very mushy — if you can get the thing to act as an assistant, then you don't need to know what the human values are. Now, of course, that's approximate, but directionally, that has to be part of the solution.

SPENCER: But suppose someone — let's take the Unabomber — suppose the Unabomber lived ten years from now and is trying to use an advanced AI system to achieve his ends, presumably you'd want the AI system to refuse to do that, right? You don't just want it to be a really good assistant.

GEOFFREY: You want it to be an assistant as evaluated by, not just the user, but also other humans. Not necessarily in the moment; you might not be seeing that data. Hopefully, you would, in fact, catch it and review it, in some sense, at least at some point. But you can imagine aligning to a mixture — what the end user wants, what the company wants, what society wants — mixing that together in an appropriate way. And I think you do need a fair amount of understanding of high-level values for that. But you don't want to have to encode too much, at least at the algorithmic level, if it can be kept generic. And again, I think you gain some separation from the assistive frame. So in this case, you can imagine very abstractly saying, "Okay, here's some user. They're asking a question. What if I imagine showing that question to a bunch of other people chosen from some other pool?" Maybe raters that you trust more, or people that are one step removed and neutral, and you can predict how they would behave. And then you mix those two signals together in some way, and that gives you the final answer.

SPENCER: Is that an implicit modeling of human values or would you think of that differently?

GEOFFREY: I think it is, but you have to model fewer human values than you would if you were trying to fully represent human values. I'm not saying you can avoid the modeling of human values in these systems; for example, there's a lot of diversity of human values. But if you're trying to cap off malicious behavior in that way, while allowing a lot of different people to use it in a lot of different ways, you only need a very general picture rather than all of the details of what everyone wants, as long as the constraints you're imposing capture the perspectives of a broad background of other people. There's steps you can take to try to reduce, on the margin, the detail you have to model, but it will be imperfect.

SPENCER: Something I find so interesting about this is that these systems are essentially going to be absorbing culture from the people that are giving them training feedback. For example, suppose in one culture, it's okay to eat dogs and, in another culture, it's abhorrent to eat dogs. Which culture is actually doing the ratings could have a big impact on how the model ends up behaving around (let's say) people trying to create marketing materials for dog sandwiches, right? Because in one culture, that's gonna be like, "Oh, my God, dog sandwiches?! That's horrible!"

GEOFFREY: Yeah, I think the answer fundamentally, in the long term, is that all the cultures have to participate in the rating. And I think there's two aspects of that: one is that there's some responsibility on the companies to move towards human rating distributions coming from a diverse background of people. But there's another aspect, which is that you want to design the system so that, say you have a bunch of paid raters that are generating a large volume of data, you would like mechanisms where a small amount of data written by (say) an expert, or someone from a particular background, can have an amplified impact. And that can be as simple as: they write down the sentence, "In our culture, we eat dogs," and then you show a rater a transcript where the system is not pushing back on eating dogs. There's a question like, "Is eating dogs okay in this culture?" and this information pops up: yes, a relevant person has said this is valid. The rater might not like to eat dogs themselves, but they can judge against the instructions appropriately. So you can mediate from a small amount of data, which hopefully you can get from a more diverse set of people, down to a larger pool of raters that is hopefully somewhat diverse, although in practice it will necessarily be less diverse than all of humanity. But that two-step process gives you better power overall.

SPENCER: Do you imagine that there'll be different versions of AIs that encapsulate different cultures, and that you will have your cultural values programmed in the particular AI you use?

GEOFFREY: It depends on a lot of technical issues, which I think make it hard to exactly predict what will happen. One property the big models have is, they can just be a lot of different things at the same time. You can have (say) one of the very enormous models just have memorized all of the cultures, or at least a lot of cultural information from around the world. It could be that you do things that way, where you have models with a lot of background knowledge, plus maybe a bunch of memory where they can look up details and condition on them. Another way you can go is, you can have multiple models that are large but fine-tuned; there are methods that fine-tune by adding just a few extra parameters, much smaller than the total model, and a lot of companies now provide those services. There's various ways it could go, and it's not obvious which one will win out; I think different ways might win out in different applications.


SPENCER: I was recently looking at a preprint paper that found that the GPT models gave answers that were more similar to WEIRD cultures — here, 'WEIRD' stands for Western, Educated, Industrialized, Rich, and Democratic. They gave the models a bunch of survey questions that have been asked of many people around the world, and then compared, if you run it many times with some randomness, how do its responses reflect different cultures? And indeed, they were quite WEIRD in their responses. I wonder if that is partly a bias of where the companies are based, but maybe also that a lot of the text on the internet reflects WEIRD cultures. I'm curious what you think that stems from.

GEOFFREY: I think there's a factual question there that I actually just don't know the answer to. In that particular case, you could have bias introduced at every level, and I don't know which is the dominant set. I could guess, but I don't think I would add much value there.

SPENCER: I think also some conservatives claim that the models tend to be more liberal in their responses. I don't know if anyone has proved that. But yeah, certainly people have shown a lot of anecdotes where like, it will refuse to say a statement about politician X but it would be willing to make that statement about Y; there's this interesting asymmetry. And again, insofar as that's true, I don't know if that's related to the way these models are trained — maybe the people doing the ratings just tend to be more liberal — or if it has more to do with the internet itself and the deeper training data.

GEOFFREY: One thing I find both fun and maddening is that the definition of correct behavior in these bias contexts is either subtle or just unresolved — in all different aspects of life as well. Take voting: the mathematical standard is, if 60% of a state votes for a certain party in the House, you'd expect (I think it's like) 70% of the seats to go to them. So that's the mathematical ideal, but that seems not obviously the societal optimum. And in cases where you have these statistical bias issues, I think most of what you want to do is try to reduce them to non-statistical things. So if you're asking, "Hey, model, do you like Trump or Biden?" you'd like it to maybe just not give an answer and stay neutral, rather than randomly pick Biden with 50% probability and Trump with 50%. But there are cases where you do have to do that randomization. Say you're making a poem and you're picking someone's first name: what should the distribution of first names be? It's just not very resolved what the right answer is there. And if you wanted to run a model's behavior by people, what do you tell them about these statistical properties to get them to effectively supervise that behavior? I also don't know the answer to that one.

SPENCER: You're saying that, if you have an individual rater that's rating what it wrote, the rater may not realize that half the time, it's going to put in Trump, half the time it's gonna put in Biden or something, and they just may have a negative reaction if it puts in the person that they like or don't like. Is that the idea?

GEOFFREY: Well, assume even a really angelic rater that has no political bias whatsoever; they're just like, "I want to enforce good behavior in the model." Let's take a more neutral example: there's a poem, the model has to pick a random name, and the name was Susan. Is that a good name? Well, it was a random name, so it's a fine name. But if I'm staring at a single transcript from this model, I have no idea what the distribution of other names would be. Now imagine I want to supervise the model's behavior on chosen random names. Well, what do I look at to make a decision about whether the model is good? Do I look at the distribution of all the names? No, that's gigantic. Do I look at the distribution over gendered names, or names that correlate with ethnicity? I think there's just a lot of questions there to work out. People often attack that by saying, "Well, most of the harm done in society falls (say) along a particular set of historical lines, like gender or ethnicity or racial background, and so I'll focus on those." But it's still a bit mushy how you would do the general case of that, if you wanted to tackle the whole problem.
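
One concrete way to supervise the distributional property Geoffrey describes — which no single transcript reveals — is to sample many completions and compare the observed name frequencies against a chosen target distribution with a chi-square statistic. This is an illustrative sketch; the choice of categories and target distribution is exactly the unresolved question being discussed:

```python
from collections import Counter

def chi_square(samples, expected_probs):
    """Chi-square statistic comparing observed category counts from many
    sampled model outputs against a target distribution. Large values
    mean the model's sampling deviates from the intended mix."""
    n = len(samples)
    counts = Counter(samples)
    stat = 0.0
    for category, p in expected_probs.items():
        observed = counts.get(category, 0)
        expected = n * p
        stat += (observed - expected) ** 2 / expected
    return stat
```

The statistic mechanizes the comparison, but the hard part remains human: deciding which categories (gender, ethnicity, historical frequency) and which target probabilities count as "correct behavior."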

SPENCER: Yeah, it sounds really tricky. Suppose you ask it to generate names of fake scientists and it generates more male names than female names. You could say, "Well, that could certainly be a problem." But if you also look historically — let's say you're having it generate fake names of scientists from 100 years ago — well, at that time, there were more male scientists than female scientists. Is the correct behavior to generate male versus female names at the historically accurate rate, or at an equal rate because that better fits our values today?

GEOFFREY: Have you watched "Chernobyl"?

SPENCER: I have not.

GEOFFREY: Historically, there were a number of scientists at Chernobyl who stood up for telling the truth and so on, and they were nearly all male. In the semi-fictional TV series, all of those characters, who were mostly male, have been compressed into a single woman who stands up for honesty and investigation and so on. Is that a good decision? It's very unclear. And certainly these models are being used to generate a bunch of creative output, so I would like to know an answer to this, both intellectually and because it's important. It also relates to some other alignment things, which I care about as well.

SPENCER: It just strikes me that there's so many incredibly thorny problems like this, that we're just scratching the surface of how difficult these ethical training scenarios get. Do you want to comment on some of the other things it touches on that you mentioned?

GEOFFREY: Well, first of all, I think that is the right view to have. That's why it's so important to find places where you can afford to not get it exactly right, because we shouldn't expect to, at our level of understanding. So if you can avoid having to know human values, if you can have a bias towards neutral, if you can add a bunch of error bounds or margin for error — any move of that kind makes me very happy when I can find it, because I expect the overall thing to be very messy and imprecise. You asked me about other things that are important. A case that I care about generically is where the machines are much stronger than humans at a particular kind of task. This is already true in some cases: the Go-playing systems are way stronger than humans, same for chess, same for protein folding. In many other cases, humans are the strongest by far. And one area where I'd expect the machines to do very well, relative to other skills, is statistical and probabilistic reasoning, and the lack of a good understanding of how to supervise that worries me, because it's an area where the machines will potentially do very well earlier than otherwise.

SPENCER: So is the idea that it's hard to have people giving ratings to how well it did because it's just such a difficult thing for humans to do?

GEOFFREY: That's right. For example, you can imagine evaluating a long, complicated argument, and let's take two cases. One case is, the argument is a mathematical proof or some other deterministic argument where all the steps are logically true or false. The other case is where everything is very fuzzy: maybe you say a thing that adds 5% worth of evidence for your final claim. And then the question is, how do I add all these pieces of evidence together in a mushy, fuzzy way to get to the answer? And I would like to be able to supervise that well, because I think it's both the common case in the world and a thing I think the machines will be good at.

SPENCER: I wonder if human raters, just acting the way they normally would, would actually train the machine to have very specific known human biases, like the conjunction fallacy: let's say the machine makes an argument that A and B are true, and therefore some other thing is probably true; a human committing the conjunction fallacy misjudges the probability, and then gives a rating based on that.

GEOFFREY: That, I think, is the default.
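The failure mode here can be sketched concretely. The snippet below uses the "averaging" model of the conjunction fallacy (judging P(A and B) as roughly the average of P(A) and P(B) rather than their product); the names and numbers are purely illustrative, not from the episode.

```python
# A biased rater who commits the conjunction fallacy will systematically
# reward overconfident claims about conjunctions and penalize correct ones.

def true_conjunction(p_a, p_b):
    # Correct probability of "A and B" for independent events.
    return p_a * p_b

def biased_rater_estimate(p_a, p_b):
    # Averaging heuristic: the judged conjunction can exceed the true value.
    return (p_a + p_b) / 2

def rating(claimed, judged, tolerance=0.1):
    # The rater rewards the model when its claim matches the rater's judgment.
    return 1.0 if abs(claimed - judged) <= tolerance else 0.0

p_a, p_b = 0.9, 0.8
honest_claim = true_conjunction(p_a, p_b)   # 0.72
fallacious_claim = 0.85                      # close to the rater's heuristic

judged = biased_rater_estimate(p_a, p_b)     # 0.85
print(rating(honest_claim, judged))          # honest answer penalized: 0.0
print(rating(fallacious_claim, judged))      # fallacious answer rewarded: 1.0
```

A reward model trained on these ratings would inherit the rater's bias, which is the default Geoffrey describes.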

SPENCER: Yeah, it's fascinating. How do you think about avoiding that kind of situation? Is it more about training the humans or getting experts to do the evaluation? That sounds like it would get very expensive very quickly.

GEOFFREY: Experts are themselves subject to the conjunction fallacy, just maybe less, and they're often very narrow in their area, some of the time but not always. If you ask questions that touch on a variety of things, I think there are some more general methods you'd like to figure out to reduce these biases. And part of it is, if you imagine walking a person through a fuzzy argument, I want to be able to decompose it into questions that are small enough that I can try to be more careful about each one, but even that is very difficult. I think arguments often don't fall apart into cleanly independent statements, even if they're probabilistic. So yeah, generally, this is a messy area.

SPENCER: Yeah, it seems like many arguments don't even have just one type of thing. They might combine facts with values, with inferences from the facts, even maybe different kinds of methods of making inference. Even just figuring that out is often a thing that some humans find really complex.

GEOFFREY: That's right. I think there's one saving grace, which is that a thing we do as humans to get better at arguments is just take more time and be careful. And if there's a very complicated argument, then what you can do as a human is take time and be careful about every tiny detail of the entire argument. You can't think for an hour about every piece of a book, but you can try to approximate that with alignment schemes. You can imagine, if you're having a conversation with a machine as part of this rating process, you could pick (say) the most sketchy part of an argument and focus nearly all of your time on that. And then, if you align the incentives of the machine to, in fact, point out the sketchiest part, then you can simulate having done that kind of careful check for everything. And so that's a generic move which I think you can pull in these kinds of alignment approaches, and there's some hope you get out of that. And then the question is, how does it weigh against the general messiness of everything?
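The "check the sketchiest step" move can be sketched as a tiny allocation rule: instead of reviewing every step of a long argument, an incentive-aligned critic flags the weakest step and the human spends the whole review budget there. The step texts and confidence scores below are invented stand-ins.

```python
# Spend the limited human verification budget on the step a critic
# identifies as weakest, rather than spreading it thinly over everything.

steps = [
    ("All the cited measurements come from peer-reviewed sources", 0.95),
    ("The statistical model assumes the samples are independent", 0.40),
    ("Therefore the effect replicates in the new population", 0.80),
]

def critic_flags_weakest(argument):
    # An incentive-aligned critic points at the least-supported step.
    return min(argument, key=lambda step: step[1])

def careful_check(step_text):
    # Placeholder for the expensive human review of a single step.
    print(f"Spending the full review budget on: {step_text!r}")

weakest = critic_flags_weakest(steps)
careful_check(weakest[0])
```

If the critic is rewarded for surfacing genuine weaknesses, checking only the flagged step approximates having checked everything.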

SPENCER: Seems like a general class of strategies that people have been exploring more and more is getting an AI system to look at their own output, and then do a different thing to it, for example, critiquing their own output and then trying to improve it. And it might seem intuitively like this is trying to get something for free. How can a system make itself smarter in this way? But it seems to actually work for a bunch of cases. I'm just curious, could you explain a little bit about this type of strategy?

GEOFFREY: First of all, I think it only looks free if you assume unbounded rationality. If you assume the thing is boundedly rational, it's like asking a human for an answer and then saying, "No, think about it for a bit": they will give a better answer the next time. You've done no work, except that you have made them use more resources on the task, and you've given them a template or a prompt to think it through. The same thing works on humans, and that's one of the main things we would like to do as part of the general safety story: have the things write out thoughts and supervise the thoughts. It doesn't get you all the way there, and we can talk about why that is, but I think it helps a fair amount. And you can imagine extreme cases where (say) it writes out a gigabyte of thoughts, and then you have this very prosaic, simple problem of: how do you simulate showing a person a very large amount of text? That is a much easier philosophical problem than (say) the complicated statistical thing we mentioned before. So I think that is a very helpful step along the way.

SPENCER: Right. So suppose the AI actually writes out its thoughts, so to speak, and then it uses those thoughts to make the decision. In theory, that makes it more auditable, because you could go look at the thoughts that led to that decision. One thing I wonder about that kind of scenario is, how do you know the decision actually stems from the thoughts, that there's a causal relationship there?

GEOFFREY: Yeah, it gets worse because, in fact, you shouldn't want it to have that causal relationship. [laughs] Human mathematicians don't actually think via proof; they guess their way along, and then they try to write down a proof to check their answers. So in some sense, the proof didn't cause the result at the end; the guess did. And then the proof caused the result not to be thrown away. Because that's so pervasive in how humans think, you should expect it to be true also for machines, even far into the future. And a good theory of what you should want instead of that causal relationship, I think, is still unclear to me.

SPENCER: I wonder how applicable that particular scenario is because, it's true — I totally agree with you — that the proof usually comes later, once the mathematician is reasonably convinced that there's an idea there or the theorem probably works. But you could say, "Well, the right causal chain is not the proof. It's the thoughts that led them to believe the thing," or something like that.

GEOFFREY: But those thoughts are going to be really messy. And I think, when you as a human do this, your thoughts are not articulated into language as you're doing that process. They're highly efficient, parallel, heuristic reasoning. And then at the end, you write it out into language as if you'd thought that way, but you didn't. So that is also how I'd expect the machines to behave; they have the ability to write things out just as we do, but they don't always think that way. And I think if you wanted to get to machines that only did think via language, you'd probably take such a severe capability hit that it would not be relevant for alignment of state-of-the-art models.

SPENCER: Yeah, it's really interesting to think about. Imagine you had to make your day-to-day decisions, but you could only make your decisions in such a way that you could explain every single step and decision process to another person that could read it. That'd be so crazy. Whereas in practice, we're making all these intuitive jumps in our thinking, and some of them, we can't even explain because it's just low-level brain processing that's occurring. For example, you go to put on your shoes, and you tie them immediately because you've tied them thousands of times before, but then you're trying to explain to a child how exactly you tie your shoes and you're like, "Uh, let me look it up on YouTube." You could do it, but can you explain it? Probably not.

GEOFFREY: I have tried to explain that. It is, in fact, very hard.

SPENCER: [laughs] Yeah, but neural nets, in some sense, are doing this kind of jump activity where there's this intuitive leap of some sort. It may not be the same kind of intuitive leap that our brains do, but then we're asking them to explain themselves. And what is the relationship between that and this big linear algebra that's actually producing the answer?

GEOFFREY: That's right. I think the hope is that, at least in the non-malicious case, the guess-and-check version looks like an extremely sped-up version of 'what if I thought it out carefully?' The hope I would have is that we find better theories for how to align those two notions. So we can use guess-and-check, but we have some notion of how it would correspond to more elaborated reasoning. I don't have that yet; we're actively working on it.

SPENCER: Changing topics a little bit, one challenge with giving human feedback on something is that human preferences can diverge between the short term and the long term. So what you want, or what you think is good on reflection, can be quite different from what you actually click on or what you want right at this moment. We've all had the experience of scrolling on social media longer than we wanted to, and later being like, "Why the hell did I spend so much time on social media?" In that moment, there was a certain sense in which we wanted to continue, but there was another sense in which, upon reflection, we wouldn't have wanted to. So can you talk about how that comes up in this work?

GEOFFREY: Yeah, I guess you've explained it relatively well. The first thing is that you'd like our systems to be aligned to the good version of that which is, in some sense, the 'on reflection' version. One thing it's important to say is, 'on reflection' doesn't mean stripped of emotions or stripped of intuition that you would want to have. You might reflect for a while and think, no, my intuition was pretty much correct at the beginning; I'm gonna go with that. But you'd like that to simulate people acting as their more considered selves. I think that the definition of that is mushy. So then the question is, how do you operationalize that? But I'll stop there.

SPENCER: Yeah. It's hard for me to even think about what that would mean in terms of, in practice, what you're doing differently in, let's say, training these systems. Can you just give us some concrete ideas of how to connect that?

GEOFFREY: Oh, yeah. I think fundamentally, maybe the first step is, instead of showing machine output to a human and then asking them, 'Is this good or bad?' you want more of a process where the person thinks through things. Maybe they write out their reasoning, and you could have the person interact with the machine to do it, so maybe they'd have some interactive discussion as part of the rating process. And then you can add these adversarial things where the machines are supposed to point out not just good things about their answer, explanations of why they said the thing, but also potential flaws, like raising issues a human might want to track down. And I think the hope is that you can design that kind of human-machine interaction to simulate what someone would have thought, had they thought for much longer than they actually did about the problem. I'll give you a simple example: you want to ask, what's the best paper on (say) some topic in psychology? You ask the machine and it spits back an answer. The exhaustive way to judge that answer would be, as a human, to look at all the other papers, to somehow do an exhaustive search to see if there's a better paper of this type. But there's a simple machine protocol which can check that answer, which is to ask another competing machine to find a better paper. And if it fails, that's some evidence that the paper is, in fact, the best. And hopefully, they share thoughts in some way. So if the first machine is hiding a paper that's better, you'd definitely figure it out, because the other one is incentivized to point it out. And what would, in the end, be a short dialogue between a machine and a person simulates the human doing a whole lot of work, like looking at all the papers and searching for competing candidates.
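The "best paper" protocol can be sketched as a two-agent game: a proposer names a paper, a competing challenger is rewarded for finding a strictly better one, and the challenger's failure is evidence the proposal is correct. The papers, hidden quality scores, and model stubs below are all invented for illustration.

```python
# Two-machine debate sketch: a challenger's failure to rebut is evidence
# the proposer's answer is the best available.

papers = {"paper_a": 0.7, "paper_b": 0.9, "paper_c": 0.6}  # hidden quality

def proposer():
    # A possibly lazy or dishonest first model proposes an answer.
    return "paper_a"

def challenger(proposed):
    # The competitor searches for any strictly better paper.
    better = [p for p, q in papers.items() if q > papers[proposed]]
    return max(better, key=papers.get) if better else None

def human_verdict(proposed):
    rebuttal = challenger(proposed)
    if rebuttal is None:
        return f"accept {proposed}"  # no counterexample found
    return f"reject {proposed}; challenger found {rebuttal}"

print(human_verdict(proposer()))   # challenger surfaces paper_b
```

The human never searches the whole space; they only judge the short exchange, which is the compression Geoffrey describes.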

SPENCER: So would the AI play the role of what the human wants upon reflection? Or would it be...

GEOFFREY: No, the human is still there. There's a lot of work these days that's trying to get rid of the human entirely, to just have pure machines. Even in the work that I do, most of the compute cycles are spent not talking to a person. But what I want to do is simulate a protocol where you are, in fact, talking to a person, and then constantly check with real people to see if that actually is the way it's going. So you can imagine that there's always a person there and it's basically some interactive protocol between a person and a machine — just them talking back and forth via some chat interface — and hope that, in the end, the human is going to decide how it went. But hopefully they'll be able to decide with extra information provided by the machine.

SPENCER: When you think about getting an AI system to do what humans want, I think it can lead to realizing that 'want' is a complicated concept and there are different forms of want. There's what you want right at this moment, there's what you want upon reflection, but there's other forms of want, too, like what you would want if you were smarter or if you knew more, or what you would want if you were thinking about all beings equally and not preferencing yourself. And I'm wondering, how do you think those kinds of ideas come into this?

GEOFFREY: I think we can take the easy ones first. What would you want if you were smarter? Fundamentally, there are a lot of questions where I can't hope to figure out the answer, but I can check the answer more easily. There are a lot of fields of math like that, where I couldn't guess the answer but I can roughly follow some reasoning if it's presented to me. For what I would want if I were a better person, I think that actually works somewhat similarly in many cases. If you sit me down and show me a couple of courses of action and give the reasoning why one is better, and I'm placed into this mode of 'now I'm thinking about what I would want if I were better,' and I've been shown this helpful reasoning and explanation, then I think you'd get part of the way towards that improvement. But again, you can choose not to follow it; you might conclude at the end, after this complicated dialogue, that your initial intuition was correct. I think there is also a sense in which you don't want to wander too far from what a normal human response is. There's this idea from philosophy of reflective equilibrium, where you go all the way to what you'd conclude if you had thought through everything and reached the convergent limit of thought. You don't actually want to try to do that; you want to go a little ways and then stop, just because going too far is risky and complicated and will likely hit weird errors.

SPENCER: Is that because, if you have to go through such a long chain of reasoning to realize that, 'oh, actually, the optimal action is something completely counterintuitive,' that it's just that those are harder to stress test or they might be more likely to be out of distribution of what has been tried and what has been trained on?

GEOFFREY: Yeah, that's right. Your intuition is worse; it's more steps, so it's harder to check. A lot of things get worse in terms of trusting those kinds of answers.

SPENCER: You can imagine AI systems that actually try to, as assistants, help us be smarter and more ethical and so on, like an AI assistant where you ask it to write a snarky message to your employee, and then it's like, "Are you sure you want to do that? You'll be kind of an asshole if you write this message," like they'll give pushback. Do you think that that's something that should be built into AI assistants where they're...

GEOFFREY: Oh, I think that one falls squarely within "I definitely want that," personally. But again, that's a case that's not at a very complicated level. Maybe that's superhuman in terms of Geoffrey's baseline performance some of the time. But if another human were watching my emails, and occasionally I'm in a bad mood and write a snarky email, they would also be perfectly capable of pushing back, and I would immediately have intuition for the answer. I guess one principle is, it's easy to underestimate how much more powerful reasoning gets if you reflect even a little bit. My model of the world is: humans with a bit of reflection and help — but not too much — are very good. Humans are already very good, and then you can get a lot better with a little bit of help.

SPENCER: We've talked about a few different ways that understanding human psychology is potentially very important in aligning AI systems. Are there other ways you think psychology is going to be essential that we haven't touched on?

GEOFFREY: Well, I think we haven't touched on just the basic mechanics of running experiments with people. This is maybe more mundane, but people with psychology backgrounds can often draw on a lot of experience with online data-collection experiments. And everything we've talked about recently has been an online data-collection experiment. So there are just nuances of how you run those: how you do pilot experiments, how you do careful statistical analyses. All of that intuition is very important.

SPENCER: I'm now getting flashbacks of times when bots have gotten into our surveys and I'm starting to think about, "Wow, can you imagine AI bots pretending to be humans giving ratings to train AI bots?" Oh, yeah, that could be bad.

GEOFFREY: That's right. I'm almost sure that's occurring in some places. I've heard a particular story about this, but I won't say which platform it was on. I think it probably does happen.

SPENCER: Yeah. I will say that I think most of the bad participants in online studies today are actually not fully-automated bots, although I do expect fully-automated bots to increase a lot. The technology's there, but it just hasn't been deployed that much. I think mostly today, it's actually people who are not in the country that they claim to be in, who are trying to get through tasks as fast as possible, sometimes with assistive technology. In fact, we built this platform specifically for recruiting people for studies, to try to make it really simple to recruit people all around the world. One thing we found at one point is that we were able to detect that people were using auto-translators: they were taking our studies and claiming to be in a certain country, but they weren't in that country. They were faking their IP address and using auto-translators to translate into their own language. So you get a lot of things like that. Or automatic searching.

GEOFFREY: One thing I find fascinating is that today, at least, there's some rater behavior which is vaguely binary: are they trying to trick you, trying to hack through the incentive design of the study to make a lot of money very fast, or are they trying to do a good job? And a lot of the mechanisms, like attention checks and data review and asking a couple of interleaved questions you've checked carefully, those methods work reliably if there's this bimodal behavior where the people are either very bad or mostly well-meaning and trying to do a good job. Whereas it feels like that's not the game-theoretic equilibrium; the equilibrium behavior, if you have optimal raters, would be much harder to detect. Do you worry about that?

SPENCER: Yeah. I actually think that most spammers — well, I don't know about most — I think a lot of spammers, what they're actually doing, is they're slowing down when they think they might get caught and they're actually really doing the task. What that means is — exactly as you say — it makes it much more complicated because it's very easy to detect someone who never tries, but someone who's trying to tell when they need to try and when they don't need to try, and then slowing down and trying when they need to, that's actually way harder behavior to uncover. Then, you only catch them if they don't realize they need to try [laughs] or they're incapable of doing the thing that you need them to do.
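The detection problem both speakers describe can be sketched with a trivial threshold rule: under bimodal behavior, careless raters fail attention checks at a high rate and a simple cutoff separates them, while an adaptive spammer who only tries on the checks sails through. The pass rates and threshold are illustrative.

```python
# Attention-check screening works only when bad raters are carelessly bad;
# an adaptive spammer who recognizes checks and tries on them is invisible
# to this rule.

def flag_rater(check_pass_rate, threshold=0.8):
    # Flag raters whose attention-check pass rate falls below the cutoff.
    return check_pass_rate < threshold

print(flag_rater(0.20))  # careless rater: flagged (True)
print(flag_rater(0.95))  # diligent rater: passes (False)
print(flag_rater(1.00))  # adaptive spammer who aces checks: undetected (False)
```

This is why the game-theoretic equilibrium worries Geoffrey: the rule has no power against a rater who models the detector.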

GEOFFREY: Yeah. We definitely have seen massive benefits just shifting providers, shifting to different pools of raters, doing more filtering, of course, but all of that stuff is very essential to do any of this well. [laughs]

SPENCER: Yeah, we apply something like seven different methods to catch bad workers. And the number we catch is just absolutely insane. Also it's an arms race. Things constantly shift, you have to come up with new methods. But can you maybe talk more specifically about some studies you've run that have helped in your work that — obviously there's proprietary stuff you probably can't talk about — but maybe some stuff that is already known about?

GEOFFREY: I think the most recent large paper was on Sparrow, which is basically a dialogue agent trained to give informational answers using Google search and a lot of RLHF, this human feedback training. There were a couple of interventions. First of all, all of the methods you want to apply to improve data quality were essential in that process. So we worked with psychologists and cognitive scientists at DeepMind on the paper, and that's a critical part of it. The basic structure is: you have a long list of rules that the model is supposed to follow, and you have different forms you want raters to fill out depending on whether they're rating for accuracy at the moment or for a particular kind of bad behavior. And then the hope is that you do a lot of work upfront to make the task cognitively easier. There's also a bunch of structuring. An example of another kind of structure: you could ask someone to have a conversation with the bot, with the agent, and then get a separate person to rate for behavior. But it was much better if we had the same person do both tasks, even when they were rating other people's conversations. If you alternate back and forth, and they get familiar with what it feels like to talk to the agent, they have more intuition, and therefore they're better at the other task; things build on themselves. So there are just a bunch of details like that that took a lot of iteration to get right.

SPENCER: One thing I've noticed when people first start running studies is they have really wrong expectations about the way humans process information. For example, I've seen people do things that'll try to create an attention check, which is designed to tell if people are paying attention, but they'll do it by having a huge block of text and, at the bottom, be like, "If you read this, enter the word 'purple,'" or whatever. But the reality is, almost no humans read really long blocks of pointless-seeming text. It's like privacy policies when you're going into a new website — nobody reads that shit — and it's similar. Nobody reads really long sets of instructions unless they seem essential. So then, in order to get someone to read it, you'd have to really make it seem essential, or you'd have to break it up into pieces or whatever. There's a lot of different tricks you can do. But people will be shocked. They're like, "Well, I had this instruction and I didn't do it." You're like, "Yeah, because that's not the way humans work." [laughs]

GEOFFREY: Yeah, that's right. We typically do a lot of rating the data ourselves as researchers, just to get a sense of what our tasks are like. And when we fire up our data collection apps, we do not read all the instructions we present to ourselves, in part because we feel like we know the task. There's a decent amount of irony there, as we skip over the instructions to get to the data labeling.

SPENCER: Besides having participants in studies just directly evaluate how good a job an AI did, or have the person chat with the AI, what other kinds of tasks are you having them do that actually help with these different studies of ways to align systems?

GEOFFREY: I think we've done a bunch of different examples of this. You can have humans stand in for the machines: you can replace the debaters — (say) machine debaters or machine agents — with humans and see how things go. Beth Barnes, who worked with me at OpenAI, did a variety of those back when she and I were both there. That's simulating, basically: I have humans acting as debaters, seeing if they can fool people. And again, humans are way better at a lot of things; that gives you knowledge you couldn't get from a machine. And we do that still at DeepMind. We often take a task that we want a machine to do — say, producing output that would be sent to another human — and fake that output with a human, and see whether you can fool or help a person with a person acting as the simulated machine. So you have multi-participant studies like that. I think one thing is that, because it's more iterative — and maybe this is similar to the stuff you do — we try to iterate an experimental design very quickly. We don't have long timescales of design and pilots; we just launch and iterate data collections continuously.

SPENCER: Do you use people in terms of trying to actually break the systems to try to find edge cases or examples where the systems will produce very bad behavior?

GEOFFREY: Yeah, Sparrow did a lot of that. We had a lot of human adversarial red-teaming, which is, again, you show people the rule and have them conduct a dialogue trying to break the rule, and people will be fairly clever. If you ask the model dead-on to do a horrible thing, it'll often say no. But if you ask in a clever way, or sidle up to the issue over the course of several dialogue turns, then maybe you can trick a weaker model. So we did a lot of red-teaming with humans of that form. And I think you can also do red-teaming with machines, and you get different advantages. Ideally, you want both kinds of attacks. One issue is that, if you ask a person to try to get a machine to say a bad thing, you don't get the kind of behavior where a well-meaning human accidentally gets the machine to say a bad thing. That's one of the reasons you also want machine red-teaming: it's hard to access that kind of behavior with deliberate human attackers, but maybe you can access it by sampling a whole bunch of times and searching for bad outputs.
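The machine red-teaming-by-sampling idea can be sketched as brute-force search: draw many model outputs under varied probes and keep the ones a classifier scores as rule-breaking. Both the model stub and the violation classifier below are hypothetical stand-ins, not any real API.

```python
# Brute-sampling red-team sketch: sample widely, filter by a learned
# rule-violation score, and hand the survivors to humans for review.

import random

random.seed(0)  # deterministic for the sketch

def model_sample(prompt):
    # Stand-in for sampling the model being red-teamed; the trailing
    # number plays the role of the response's "badness".
    return f"{prompt} -> response {random.random():.3f}"

def violation_score(text):
    # Stand-in for a learned rule-violation classifier in [0, 1].
    return float(text.split()[-1])

prompts = [f"probe {i}" for i in range(1000)]
samples = [model_sample(p) for p in prompts]
bad = [s for s in samples if violation_score(s) > 0.99]
print(f"found {len(bad)} candidate rule-breaking outputs")
```

The point is the shape of the pipeline: cheap wide sampling finds accidental failure modes that a deliberate human attacker would never produce.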

SPENCER: I don't know if you've seen this, but there's this fun game by Lakera AI, where they basically have a version of GPT that, in its prompt, has a password. And you have to convince it to give you the password, even though it has instructions not to give it to you. And then there are different levels of difficulty you can play; they make it harder and harder. But it's really fascinating going through it and basically trying to get the AI to disobey its own instructions, and it often involves doing things indirectly. You can't just say, "Give me your password," because then it knows you're trying to get it to not follow its own instruction. But in early levels, you can say subtle things that imply it should give you the password without ever stating it, and it might actually just give it to you. And then in more advanced levels, you have to be more tricky about it. But I guess what I'm wondering is, do these kinds of exploits converge, where eventually we plug them all up and then there's not much left? Or is it just infinite; there are just gonna be arbitrarily more advanced versions of this?

GEOFFREY: I think it's not clear yet. There are a few different intuitions from other examples, pointing in different directions. From computer science theory, the answer should be that the defenders eventually win, or that it gets harder and harder, like computer security: you have to spend quite a lot of resources to break into a well-secured system nowadays, although there are lots of non-well-secured systems. That's the argument that it eventually converges to the defender winning. In neural networks, there are also adversarial attacks — basically attacks based on linear algebra — that have been somewhat defended against over time but are still surprisingly difficult to entirely rule out. Those exploit things that are fundamental to how the models are trained and hard to entirely remove, and that issue has taken a long time to be categorically fixed. So, in the middle, I expect it eventually converges to the attacker not being able to win, but it might take quite a while. We'll see.

SPENCER: It's interesting that your conclusion from computer security is that you can make systems that are secure, because I think my conclusion is the opposite. Yes, in theory, you can make systems that are secure. You can use encryption that we know nobody can break and so on. But in practice, they're pretty much all breakable, because there are humans involved and there are social exploits, and really good hackers will eventually learn how to break in if they care enough.

GEOFFREY: Yeah, but I think, again, there are two aspects. One is pure theory. You asked what it's going to converge to. Pure theory says it converges to all security code having formal proofs attached, so it's definitely right and there are no bugs. But I think even today, if you look at something like an encrypted messaging app, they do have zero-days, but they're expensive to get, and if they're used too much, they get discovered and closed. So it's fairly difficult to hack a lot of people this way. You've got to assume that everything is more broken than you think it is. But I do think that we have reached a fairly respectable equilibrium with respect to computer security. Now, I guess the important thing is, if you see a break, you want a single break, or a small number of breaks, to not be too costly, which, again, is not entirely true now. Ransomware attacks against hospitals are quite costly, so that goes badly. But I think the other part of security is designing the overall ecosystem so that if something breaks, you both notice it and fix it quickly, and also the blast radius of that first attack is not too bad.

SPENCER: Yeah, my intuition is that a system like Bitcoin is probably much more secure than the popular encrypted messaging apps. And the reason is because, if it wasn't, people would have stolen a ton of money by now; there's just so much incentive.

GEOFFREY: Yeah, Bitcoin, sure. But all the smart contracts are complete dumpster fires. [laughs]

SPENCER: Ah, well, and a bunch of them have been hacked, right? But Bitcoin itself has stood the test of time. I think it did have bugs in earlier versions.

GEOFFREY: Yeah, but this is a good example. Bitcoin has done fairly well. Maybe we're getting way off topic, but you can contrast boring, normally-regulated banks and Ethereum smart contracts. The boring banks also get hacked. They're not amazing, but they have careful backups, and they also have an ecosystem of transactions which can be walked back. So if you hack into a bank, and you transfer a bunch of money to another bank, that first bank will notice the hack, call up the other bank and ask for the money back and get it back. Whereas you don't do that if you run your cryptocurrency through a tumbler and run off with it in complicated ways. Although more and more, I think people do try to do these transaction rollbacks in cryptocurrency as well. So I think, again, it's both getting the technical details right, and then getting the ecosystem right around it so that you can reduce the blast radius.

SPENCER: The final topic I wanted to discuss with you before we wrap up is ways that we can aggregate preferences, because it seems like the more we try to build human values — whether explicitly or implicitly — into AIs, the more important it is that we understand preference aggregation. I'm curious; have you done work specifically on how to aggregate preferences? How do you see that coming up in your work?

GEOFFREY: I've helped start projects about that. And then I've tried to organize my work to be friendly to future developments there. And so if anything, the very first step is to try to avoid having to aggregate preferences when you can. An example is, Wikipedia has a policy called 'neutral point of view' which, to some extent — on the margin at least — reduces your need to aggregate preferences, because you still would need (say) preference aggregation to get to what the neutral point of view is, but it's a less controversial question than, 'which is better: Trump or Biden?' And so that's one step, and then I guess the next step is, if you're designing technical interventions, try to have small input surfaces where you can provide a small amount of data that has been accumulated through expensive preference aggregation methods and have that influence the overall system. And so the example there is, we've done work at DeepMind where we assemble citizens'-jury-type sessions around a particular topic, with people chosen from particular backgrounds, and ask them about the behavior of these systems. I didn't directly do this work, but I helped encourage it. And I think the important thing is, you can do work there to make (say) the demographics of that group better matched to the task at hand and the kind of preferences you're trying to gather. But it's very expensive; you're not going to get that many data points, therefore it'll be noisy, and therefore you probably can't use it as your direct reward signal for everything. But if you've designed your direct reward signal to have this channel where you can funnel that information, then you can connect the two worlds together.

SPENCER: Can you point to an example of where that would work?

GEOFFREY: Sparrow, for example, has all these rules — and similar systems, ChatGPT and Bard and so on, have similar rule sets — and you could have those sessions feed into the rules. This story hasn't fully closed yet, but that's the idea.

SPENCER: You mentioned this idea of rules and we've seen this with the concept of constitutional AI, that there's a set of rules that the AI is supposed to abide by. Could you elaborate on how those systems work and what you think about that kind of approach?

GEOFFREY: Yeah. Fundamentally, if you're treating human preferences as the ground truth, then the rules are basically trying to change the preferences of humans by acting as instructions. I think a lot of times, constitutional AI is interpreted as trying to cut out the human by just having the machine look at the rules itself. But the version where the human is also looking at these rules is more interesting to me. You can expend a lot of human time and human expertise — say, in these participatory citizen-jury setups — to make the rules; you spend a bunch of resources there, and then raters can enforce them, even if they wouldn't have been able to come up with the rules on the spot. So it's, again, a way of boosting human accuracy at judging the systems.

SPENCER: So an example of this might be, you get a committee together to decide what the rules should be around some topic area, like (I don't know) how it should think about handling race and gender or something like this. And then you could give those rules both to the AI itself, saying, "Your output should correspond to these rules," and also to the human raters, saying, "When you're evaluating the outputs, your evaluation should adhere to these rules." Is that right?

GEOFFREY: Yeah. Another example which I like is medical questions versus biology questions. You might often have a system which is not supposed to be a doctor. Like Sparrow, one of its rules was: don't pretend to be a doctor. But if you ask, 'what's a white blood cell?' Well, that's just a biology question. It's perfectly fine to answer that question. But then the boundary between those two areas is pretty vague, and pretty confusing in practice. So you can imagine trying to chart out that boundary by talking to doctors and medical ethicists and biologists and so on, and then thinking about it a whole bunch, and then you have cached information that someone can look at to see, is this prohibited or allowed by the system?

SPENCER: Anthropic has published a bunch of rules. I don't know which of these they use exactly. They published a whole bunch of different sets of them. They have one they call 'Principles based on the Universal Declaration of Human Rights,' and so one of those is, 'Please choose a response that most supports and encourages freedom, equality and a sense of brotherhood.' Or they have a set of principles inspired by Apple's Terms of Service and that includes, 'Please choose a response that has the least objectionable, offensive, unlawful, deceptive, inaccurate or harmful content,' and so on. How extendable do you think these kinds of approaches are? Can we just keep building out more and more of these rules as our models can take in more tokens?

GEOFFREY: First of all, the answer is, yes, you totally can. And I expect the rules to be enormous in the future. You're not only writing rules down for WEIRD people, but also for all the cultures of the world, and whoever you want to touch the system or be affected by its outputs. You can imagine that the rules want to grow to be quite detailed in some cases. But I also think that you can't just throw too much in that mode. You have to have approaches which are somewhat generic and try to achieve high-level principles. You might (say) have rules about what other sources the model can use to arbitrate information. Maybe you haven't encoded some details of some culture, but you know where to look to find that information and can pull it up when necessary. I think, in some sense, you can use these general systems as a way of engineering the human rating process, alongside other engineering of that process you do on the machine side. By the way, you can also have the machines pick the relevant rules to show people based on context. That would also allow you to scale to a lot more rules than you could imagine a human actually reading.
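[Editor's note: the idea of having machines pick the relevant rules to show people can be sketched with a toy relevance score. Everything here, the rule texts and the keyword-overlap scoring, is an illustrative assumption, not any lab's actual system; a real system would use something like embedding similarity.]

```python
# Toy sketch of context-dependent rule selection: surface only the rules
# relevant to the conversation at hand, so raters never need to read an
# enormous rule set. Rules and scoring are illustrative assumptions.

def keywords(text: str) -> set[str]:
    """Lowercased words longer than three characters (a crude stand-in
    for real relevance scoring, e.g. embeddings)."""
    return {w for w in text.lower().split() if len(w) > 3}

def score(rule: str, context: str) -> int:
    """Number of keywords the rule shares with the context."""
    return len(keywords(rule) & keywords(context))

def relevant_rules(rules: list[str], context: str, k: int = 3) -> list[str]:
    """The k best-matching rules, best first; rules with no overlap are dropped."""
    ranked = sorted(rules, key=lambda r: score(r, context), reverse=True)
    return [r for r in ranked[:k] if score(r, context) > 0]

rules = [
    "Do not pretend to be a doctor or give medical advice",
    "Do not provide financial or investment advice",
    "Answer factual biology questions plainly",
]
print(relevant_rules(rules, "User asks a biology question about white blood cells"))
# prints ['Answer factual biology questions plainly']
```

Raters then judge the model's output against only the handful of rules returned, which is what lets the full rule set grow far beyond what any single rater could read.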

SPENCER: As we get more and more rules — you might have hundreds or even thousands of rules — they're gonna start conflicting with each other a lot, I assume. It can start to look almost like a legal system, where it's like, well, there's this rule, and there's that rule, and then, well, what are you actually allowed to do? It's this complicated space, and sometimes maybe it's impossible for certain queries to meet all the rules. So how does that get navigated, when the rules can't actually be followed simultaneously?

GEOFFREY: Yeah, kind of the same way the legal system does it, where we just muddle through, making judgment calls as humans. Sometimes you have priorities on the rules, and then you can follow them. But often, it's just a bunch of human judgment. One important thing is that you said, "Well, what if there's no answer that doesn't violate some rule?" And you'd like to structure things so that there's at least an acceptable answer that always works, namely, to just decline to answer, or shut off, stop talking. Maybe that's not going to be very good, but there is a safe default of inaction most of the time. And I think that, at least for the very strong systems, I would like us to use them in ways that preserve that safe default. I think it gets harder when you (say) mix these systems with robots and things that move in the world and so on. But to the extent you can do that, I think having safe defaults is quite valuable.
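[Editor's note: the "safe default of inaction" can be sketched as a tiny selection loop: check each candidate reply against every rule, and if nothing passes, decline rather than emit the least-bad violation. The rule predicates below are toy stand-ins invented for illustration.]

```python
# Toy sketch of a rule-respecting response selector with a safe default.
# If no candidate response satisfies every rule, decline to answer
# instead of picking the least-bad violation.

SAFE_DEFAULT = "I'd rather not answer that."

def choose_response(candidates: list[str], rules: list) -> str:
    """Return the first candidate that violates no rule, else the safe default.

    `rules` is a list of predicates; each returns True if the text is allowed.
    """
    for text in candidates:
        if all(rule(text) for rule in rules):
            return text
    return SAFE_DEFAULT

# Toy rules: no medical advice, no impersonating a doctor.
no_medical_advice = lambda t: "you should take" not in t.lower()
no_impersonation = lambda t: "as your doctor" not in t.lower()

print(choose_response(
    ["As your doctor, you should take aspirin.",
     "A white blood cell is part of the immune system."],
    [no_medical_advice, no_impersonation],
))
# prints: A white blood cell is part of the immune system.
```

The key design choice is that inaction is always in the candidate set, so the system never has to rank one rule violation against another; that trade-off gets pushed back to humans when it matters.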

SPENCER: Final question for you. People like yourself have worked on how we make these systems do a better job of actually doing what humans want them to do and not have harmful behaviors. But not all actors are going to be equally careful. How do you think about this: suppose you do have a breakthrough that actually helps align these systems, but then some companies just may not have an incentive to implement it?

GEOFFREY: Yeah, I think that's a hard, hard question. Fundamentally, there's a question of what is the overall ecosystem resilience if there's some bad actors, and some uncareful actors and others that are careful. And then there's the question of, how much capability do the uncareful ones have access to? And I think there's the policy and regulatory questions that are trying to improve that overall setup. But it is hard.

SPENCER: Yeah, I think organizations that push forward AI capabilities, some of them just are not that concerned about alignment questions. They just think they're overblown, and we'll figure it out. Some of them are really concerned. But they figure, well, we might as well be the ones to do it because we're going to be safety conscious, we're going to do it carefully. But it seems like that kind of thinking gets us into a weird scenario where now, you have a whole bunch of different groups, all trying to build really powerful AIs, and all not trusting the other groups as much as they trust themselves. And to me, that seems like actually a pretty bad place to be.

GEOFFREY: Yeah, I think we have to have mechanisms for cooperation. I think the non-cooperating equilibrium there is quite bad.

SPENCER: Geoffrey, thank you so much for coming on.

GEOFFREY: Yeah, it was fun. Thank you.


JOSH: A listener asks: "What are your thoughts on the ethics of international adoption?"

SPENCER: That's an interesting question. I would say that, first of all, in some countries like the U.S., it is actually (as I understand it) quite difficult to adopt a child, right? So you might ask: well, why would someone in the U.S. bother with international adoption? Why not adopt someone locally? And it's often because they just can't. Like, they can't find a child to adopt at the age they want to adopt them. Now, to be fair, if they were willing to adopt an older child, they probably could do that adoption, as I understand it. But a lot of people would prefer to adopt a much younger child. So that's the first consideration. Then the second thing is: okay, suppose they are going to adopt internationally. What are some of the challenges around doing that, and especially the ethical challenges? I think that there are a lot of children around the world who could be greatly benefited by being taken into a home, especially the home of someone who has a reasonable economic means to support them. And so I think in that way, it could be a really good thing for the world. However, there are some challenges. One of them is that there are certain countries where it's been found that parents who don't really want to give their children up are pressured into giving their children up. How could someone pressure parents into giving their children up? Well, one way they do it is by convincing the parent that the child is not going to have a good life unless they're given up for adoption. And also maybe sometimes there can be monetary payments involved, where the adoption center might actually make money off of the children, and maybe they use that as an incentive. So there's some really tricky issues around that: making sure that the child that is being adopted is actually an orphan or someone who actually lacks parents and isn't someone who belongs to parents that were just convinced that the child would live a better life with someone else. 
And so checking that the place you're adopting from has good practices and has not been sanctioned for these kinds of behaviors is really, really important. But in general, I think adopting can be an incredible way to help a child. I also think we should be realistic that if you're adopting, the child will on average be more different from you than if you had your own child. And that's not necessarily a bad thing, but it's just something to be aware of. Genetics are real. Some of our personality and some of the way we think, etc., is due to our genetics. Not all of it, but some of it is. And so if you adopt a child, they will be somewhat more different from you on average. And maybe that's perfectly fine, but for some people that could be a reason why they would prefer to have their own child rather than adopt.





