with Spencer Greenberg
the podcast about ideas that matter

Episode 189: Simulacra levels, moral mazes, and low-hanging fruit (with Zvi Mowshowitz)

December 21, 2023

Why do we leave so much low-hanging fruit unharvested in so many parts of life? In what contexts is it better to do a thing than to do a symbolic representation of the thing, and vice versa? How can we know when to try to fix a problem that hasn't yet been fixed? In a society, what's the ideal balance of explorers and exploiters? What are the four simulacra levels? What is a moral "maze"? In the context of AI, can solutions for the problems of generation vs. evaluation also provide solutions for the problems of alignment and safety? Could we solve AI safety issues by financially incentivizing people to find exploits (à la cryptocurrencies)?

Zvi Mowshowitz is the author of Don't Worry About the Vase, a widely spanning substack trying to help us think about, model, and improve the world. He is a rationalist thinker with experience as a professional trader, game designer and competitor, and startup founder. His blog spans diverse topics and is currently focused on extensive weekly AI updates. Read his writings at, or follow him on Twitter / X at @TheZvi.

JOSH: Hello, and welcome to Clearer Thinking with Spencer Greenberg, the podcast about ideas that matter. I'm Josh Castle, the producer of the podcast, and I'm so glad you joined us today. In this episode Spencer speaks with Zvi Mowshowitz about taking action versus telling stories, simulacra levels, and moral mazes.

SPENCER: Zvi, welcome.

ZVI: Thanks, good to be here.

SPENCER: I think of you as one of the most classic rationalist thinkers out there. And I find that your work tends to get most cited by other thinkers. You're kind of a thinker's thinker, which I find really interesting. I'm excited to explore some of the topics that you brought for us today. Let's start with this idea that people don't do things. What does that mean?

ZVI: So there's a lot of different aspects to it. In general, you come to a person and you will doubtless recognize many ways in which they could make their life better. But they mostly just keep doing the things they're generally doing. They don't change. They don't innovate. They don't explore, even when there's clearly very powerful, valuable low-hanging fruit for them to capture. If this was going to work, someone would have tried it already. If there was a story here, somebody would have found it and exposed it already. If this big company had this giant opportunity to make money by just fixing its obviously inadequate product so it didn't have these off-putting errors in it or drive people away, they obviously would have figured this out. And this is kind of similar to the efficient market hypothesis, where you presume that, of course, someone would have fixed all the prices, so you can't really do better. And I've discovered these things just aren't true. If you ask somebody, if they've done the most obvious things they should do, sometimes they'll give you good reasons why they haven't done them. Sometimes they say they have done them. But very often, "No, we haven't." If you have an idea that seems like it's the first thing you would try, there is a decent chance that nobody has tried it. And there's just remarkably low-hanging fruit pretty much everywhere all the time. And I've just gotten used to this as the way things are and learned to always, always check the obvious stuff. And ask yourself also, "Am I doing the obvious stuff, myself?" And the answer is no. It's very easy to fall into this. Just that would require doing something that would require me to exercise thought, get up, think, and figure out what makes sense, and then deal with all these little details. And we should beware trivial inconveniences, because I don't want to deal with that. And it just drives people away.

SPENCER: I think it's interesting to separate this out for personal things. If you're talking to a friend who's dealing with a difficult situation versus societal things. Because on a personal level, you can imagine when people have limited energy, they have limited brainstorming that they do, etc. And so, it seems very easy to see how someone could just miss really obvious things. Whereas if you look at a societal level, maybe it's more strange and surprising in a way if this is true, because sort of why aren't there lots of people trying lots of things, and lots of people looking for opportunities, and so on? So I'm wondering, do you think that this is sort of driven by different factors when we're looking at the individual versus societal?

ZVI: I think there's a lot of overlap and also a lot of different stuff. For the societal, a lot of it is that people's actions are very highly correlated at this point. Everybody is kind of going through the same considerations. Everybody has the same or similar pressures on them in many ways. Everyone is listening to the same opinions. Everyone is worried how the same people will think of them, in some sense. And so, people are just looking around. And also, as there are more people around to copy, there's more things to copy. There's more people who feel they've already done the preliminary stuff. Everyone just thinks, "Well, it'd be easier if I just did the thing everyone else is doing. I didn't try to innovate. And I should be modest. And I shouldn't believe that I know better than everybody else." And so, it's kind of a bystander effect. You have 1,000, 10,000 to 1 million people but you don't necessarily get more people doing something different than you would have if you had less people.

SPENCER: So if everyone was acting completely independently, you might have billions of opportunities for things to happen. But if people are substantially copying each other, or just think really similarly, or are convinced of a similar set of beliefs that are kind of limiting their behavior, then it might be, in practice, way fewer independent attempts to try anything new. Is that the idea?

ZVI: Yeah. I feel that there's not very many independent efforts to actually do the thing. I also feel like we've gotten away from doing the thing and towards doing the symbolic representation of the thing. That was the first blog post when I started my blog over again and started to bring it to rationality instead of the stuff I was talking about before, which is, "Are you trying to actually accomplish something with your actions? Or are you trying to tell a story to yourself or to others to represent that you should be credited with attempts to do the thing?" Because these are completely different actions with completely different motivations, completely different instantiations, and very different results.

SPENCER: So it's kind of a symbol versus the thing itself. Do you have maybe a couple of examples where people claim to be doing the thing itself but, really, they're doing the symbol thing?

ZVI: The original motivating example came from the world of venture capital and startups. So the idea was, "Are you building a company, or are you trying to be able to represent to a potential investor in your company in the next round that you are building a company? Are you selling the company, or are you representing your attempt to represent that you can sell your company in the next round?" And other things like this.

SPENCER: That reminds me of a conversation I had with a venture capitalist, where they were telling me that they would only ever invest in a startup founder if their startup idea was kind of the culmination of everything they've done before. So this person was on this path and this was just that next step on the path. And I thought about this for a moment. And I said to the VC, "I don't know if I know almost anyone whose life works that way. It seems like almost everyone, their life is kind of nonlinear. They're bumping into random things. They're not just everything immediately follows as a consequence of what you've done before." And the VC had thought about that for a moment. And then they said, "Well, I suppose you're right. I suppose it doesn't really work that way. But I still would never invest in them if they didn't tell me it was like that."

ZVI: Exactly. I can tell a story, if I want to, where all of the very varied jobs that I've had and things that I've explored have led me to this moment in my intellectual life and to my job as a writer. And that story will have some elements of truth in it. But mostly, it'll be a made-up story. But if you're actually testing the founder for their ability to construct and sell the right story in a room, because you're testing them for the Series A round, so what matters most is their ability to raise money in a Series B round, and then in a Series C round, if you are effectively betting on the salesmanship of the founding team above all else, then you can very reasonably use this strategy highly profitably even though on the object level, it's obviously nonsense.

SPENCER: Yeah, there's this interesting circularity where the early stage investors are like, "Well, is this person going to be able to convince the next stage?" And it's like, okay, but that's just everyone trying to predict what other people are going to be convinced by, rather than looking at the thing itself. Now, obviously, there are also investors who really are motivated by looking at metrics, trying to see how the thing is actually doing, trying to see the actual quality of the product. But that's sort of a different axis of evaluating something.

ZVI: Right. At the same time, we all talk about things like the hockey stick graph, and showing user growth and all of that. Paul Graham will constantly talk about, "Build something users want. Build something that produces value." They try to pretend they're part of the same ethos and the same principle, but there are actually very distinct ways of trying to convince someone to invest in your company, and two very distinct ways of trying to build a company. And when I was trying to do my own startup MetaMed, there was very much this clash of, "Should we try to run a business like it was a real business and try to do the concrete thing, or should we build the representation that shows people what the business could be in the future? Should we do the thing or the symbolic representation of the thing?" And we learned too late that we should have been doing a largely symbolic version of the thing and it was definitely impossible for us to pivot. If you're constructing one of these hockey stick graphs, you're very consciously trying to construct this hockey stick graph. And that's a very different phase. And I think you definitely have to first do the thing where you're actually building a useful product, but then you have to transition to the hockey stick graph, in some sense, at some point, if you want to succeed.

SPENCER: So you're saying that you were focused too much on doing the thing itself, rather than the symbol of the thing? What's an example that kind of came up in your work where you were trying to do the thing rather than representing the thing?

ZVI: Well, for one thing, we were trying to charge money. We were trying to actually pay the bills, which then caused investors to have the wrong focus and look at us at a different stage of the company and a different type of situation. Whereas if we had given the product away for free, so that we could show that we were getting more users and get more reports, we almost certainly would have made that money back 10 to 100 times over instead. But more to the point, we were very much trying to dive into the medical literature — because that was what the company was about — very carefully to try and figure out how to help the patients and figure out what exactly helped the patients, as opposed to trying to give the patients a positive vibe and experience and have them feel like they had someone listen to them and care about them and that it generated value. And so we paid too little attention to the symbolic composition of the thing and too much attention to what was actually going to help in the end. But if you don't pay enough attention to this symbolic thing, not only do they not appreciate what you're doing, they don't do the thing that would matter. And so what if you told them what would help? If they don't do it, if it's an intellectual exercise in your head, that didn't help anyone. So in some sense, you have to do both. If you don't take care of the symbolic portions of what you're doing, that legitimately doesn't work. But you still have to notice which of these things is different.

SPENCER: Yeah, it's so interesting in the realm of kind of health advice, you have so many different advisors that are just advising nonsense, but can get really, really, really popular. And then you have people that are advising legitimate stuff that may become much less popular or not popular at all. And so it seems that's a domain in particular, where actually tracking real substance seems less important.

ZVI: Robin Hanson talks about this question of, "Is medical care about showing that we care about each other, and that we're willing to spend time, money, effort, and emotion on each other, and a judgment of who deserves what? Or is it an attempt to make people's health better? And how much of it is one versus how much of the other?" And you have these studies that show that when you give people more health insurance and more health care on the margin, they don't live longer. They don't actually enjoy better health. Their life does not substantially improve. And so one could conclude that the majority of American health care spending is largely symbolic. There's a very important core with things like vaccines, trauma care, fixing your broken bones, antibiotics, and a bunch of other stuff that's clearly very, very important and positive. And healthcare is really important. And if you don't have access to that, it's really bad for you. But most of our money ends up being spent on things that have very little incremental value. And we often miss some of the most important opportunities to help people actually get better. We don't focus on their diets and their exercise as much as we should, compared to direct medical interventions. We spend most of our money right at the end of life, when there's a very clear symbolic story that this person has to be treated or they will die, when it's actually not so valuable to them. When people go to the hospital, we don't let them sleep. We don't let them eat well. We don't make them comfortable when we know these things interfere with their recoveries, and so on, and so on.

SPENCER: But the studies that you're talking about — I think I know what you're referencing — and I think what they did is they gave people cheaper health care in a way that caused them to use more health care. And they found that those people who had cheaper health care and therefore used more didn't get better outcomes, which suggests — I'm using the word marginal — on the margin, the healthcare wasn't helping them. But that doesn't really necessarily say that most of the health care was useless, just that at the point where you're already consuming sort of the natural amount that you would, adding additional stuff may not help very much. I don't know to what extent that affects your point.

ZVI: The central one was from an Oregon study where they had a lottery to see who would get subsidized insurance where they would have much better access to health care and found that this was not a substantial improvement in people's health outcomes. So yeah, it's a question of, again, if you have no access to a medical system, whatsoever, you're thrown back into the 1700s or 1800s or something, you will obviously fare much, much worse. The most important healthcare spending is very important. And then there is the other question of: Are they doing a good job of consuming the important spending and not the unimportant spending when they are in the less-access condition? Is our society doing a good job of getting them what they really need? And that's hard to say. But my experience with healthcare has indeed been that a lot of what is on everybody's mind, despite the stakes being someone's health, and whether or not they will live, are the symbolic aspects of the question. And it goes back to this whole dynamic once again.

SPENCER: There's a concept that I think about that I call hype versus value. And I wonder if it's just the same thing you're talking about, or if it's just very related. So I'm curious to get your thoughts on it. But the basic idea is: I use the phrase value to refer to anything that produces intrinsic values for people. So things that meet people's fundamental needs, or make people happy, or help people achieve their meaningful goals, that kind of thing. And obviously, some products help with that and some services help with that. And I think of hype as anything that's about getting people excited, enthusiastic, or getting them social status for doing a thing. Not to say that getting social status couldn't be someone's intrinsic value, but I just think it's useful to separate those out. And so then I can think about these like two axis systems. Let's say on the x axis, you've got value. On the y axis, you've got hype. And then you can start placing things in this kind of chart, how much hype is there, how much value? And I think on something, let's say art NFTs, I would put that almost entirely in hype with very little value. Or you could have something like (I don't know) Teslas. So Tesla's cars clearly have a lot of hype but somebody would argue they could produce a lot of value too; maybe they can help with climate change. People really enjoy using these cars. They're very highly rated. And people enjoy the experience of riding them and so on. So yeah, I'm just curious to hear how that kind of maps on to what you're talking about?

ZVI: So when I hear that description, I wonder about the distinction between sort of just positional goods, where something legitimately is helping you gain status and relative positioning in some sense. And then what is closer to the conventional use of the word hype, where it's all talk and bluster and you're getting people excited, and you're trying to find a greater fool, or trying to pretend that you're getting value out of something, but you're not actually fulfilling anybody's needs at all. Whereas status and positional goods, they are legitimate human needs, in some sense. We all greatly value this. And so they result in very different orientations. I don't think it's entirely symbolic to provide someone with important positional goods. Whereas when you've got this NFT, that's highly speculative, doesn't actually have any utility, doesn't go anywhere — and I say this as somebody who made an NFT game. At one point, we were trying to imbue them with utility, but the default case is completely useless. They don't even function as proper positional goods. And so yeah, I don't even know what that's trying to do, what that's trying to be, but there's some sort of hype thing going on there.

SPENCER: I think that's a fair point that those two things — positional goods — could be separated out from other forms of hype. I linked them together because I think a lot of times when something is sort of hyped in the classical sense, there's also a sense of it being cool to be part of or cool to invest in. There's a sort of social status element to it. I also think that with positional goods, where people are just raising and lowering their social status, there tends to be no net value added to society most of the time. Maybe one person gets cooler, another person gets less cool, or something like that. Some people are maybe meeting their values more, but on net, maybe there's not an improvement?

ZVI: Yeah. I think the reason why positional goods are valuable is not because the amount of the goods necessarily went up, but because the act of trying to produce the goods, if it's actually in some sense a real good, will often cause us to improve our productive capabilities for competitions, for activity in useful ways. But yeah, it's very distinct from things that are fulfilling non-positional goods, just pure good goods. We want a society that focuses as much as possible on producing the things that are not positional.

SPENCER: So going back to this question of whether there's sort of low-hanging fruit all around, that the world is just sort of highly unoptimized, I think that when I was younger, I would definitely have agreed with that, where I just see all these ways that things are suboptimal. And as I have had more experience in the world, I continue to think that that's true. But I think my perspective on that has changed somewhat, where I now think that a lot of the ways it's unoptimized is because it's actually really, really hard to fix. So it's vastly easier to notice, "Ah, this thing is not optimized in this way, but dramatically harder to actually make that change." So, often that change is so hard that it kind of starts to make sense why nobody has made it yet. And so I'm curious to hear your thoughts on that.

ZVI: I think that is indeed one of the major ways in which people learn, in some sense, not to try doing things. They find some of these very obviously great things that nobody is doing. And they find out why trying to do that naively just doesn't work. There are all these doors. And these doors look like they're opening rooms of treasure. But they are unguarded, and you try to open them, and you find out why you're wrong about some of them. And then you say, "Oh, I guess when I see something that's an opportunity, there's some stupid reason why it isn't. The world just works that way." And people just sort of slink off and they stop trying. I think it's a pretty common phenomenon. But often the answer to why they don't do it is something like mild social awkwardness. Or it would be somewhat unpleasant or inconvenient to find out, rather than something that will actually stop you. There's actually some stupid regulation about this, or there's some actual opposition somewhere.

SPENCER: Yeah, I guess I would say it's genuinely a mix. I think probably more often than not, when you see something that's an obvious kind of poorly optimized feature of the world or something that clearly a business could do better, and so on, that there are hidden reasons why it's not being done. It doesn't mean they're good reasons. Hidden reasons are very different from good reasons. It could be really dumb, but it still could be really hard to change. But sometimes it actually can just be changed. Maybe in order to change it, it requires a really good strategy. So I guess that's where I think this is true. It's not that they're unchangeable, it's that they're usually hard to change, and it requires a really good strategy, and really engaging very closely with the details of why it hasn't already been done, the details of the structure and what makes it hard to change, and then pushing at just the right angle, with the right amount of force, to get that to change.

ZVI: Or finding out exactly where you want to push on it. Or alternatively, there's this phenomenon where as you get older and wiser, you start to internalize all of the pressures, cultural norms, and tendencies, and reasons why these things don't get fixed. So I, at one point, came into this company and was given reasonably good pay and broad authority and some equity. And I was told to go out there and make us succeed, basically, by the owners. And I went in like a cowboy. And I started just putting my hand in everything and fixing everything and improving everything. And every time a number was different than what it should be, I was like, "That's the wrong number." And I told the person involved in it why I was changing the number or why it needed to change. And I'd argue about everything. And this was very effective. But then, over time, I found out why this did not, in fact, lead to me accomplishing my personal goals for this company. And I think if I had to do it over again, I would probably end up making them a lot less money than I did.

SPENCER: So you're saying, essentially, you bumped against political issues where it actually turned out to not be in your own interest to make these changes?

ZVI: Yeah, totally. I was doing things where either I wouldn't get any credit for it, or I would piss somebody off, or I would be seen as pushy, or as someone going outside the chain of command or exceeding my authority, or I wasn't considering all the angles, and I was too naive and dumb to realize I couldn't do the thing. So I just did it. There is a long-standing saying, "The person who said it can't be done should not interrupt the person doing it," especially when they have a good reason in many cases, I think. And so we learn why we shouldn't be doing these things, why the incentives work against being the person who fixes the problem. And then nobody fixes the problem, when we'd be much better off if every time there was a problem, somebody just fixed the problem, whether or not they would get rewarded for it, and then everyone would have a lot less problems.

SPENCER: Right. And that, I think, is a great example. And sort of highlights what I think often happens with this seeming low-hanging fruit that there's something hidden like that. Again, it doesn't mean it's not worth fixing. It doesn't mean it can't be fixed. It just might mean that you have to navigate a bunch of traps that you didn't see first.

ZVI: Or you need coordination and helpfulness, and you need to be able to, in some sense, care enough about it to take one for the team, and pass up better opportunities and better trade-offs from your perspective elsewhere in order to do it, or it's not going to work. And I think when we often say, "It can't be done," we don't mean it can't be done. We mean, here are all of the different barriers in place are the reasons why it's not going to get done. Often, it's something like, "Well, technically, this person is in charge of that, and this person is hard to work with and annoying, or has this consideration, or all these other reasons why this is going to be more frustrating than it should be, or harder than it should be, or why no one is going to have the individual motivation to do it. Somebody really, really cared about this being wrong. Or someone really, really cares about not being bothered to fix anything, or just about nothing changing." And how you look at that is up to you, in a real sense.

SPENCER: One of the examples that I think about — because it's something that I work on — is the replication crisis. We have lots of different social science results not replicating. And thinking about it, there are certain actors that could take steps that would make it dramatically better. If journals just required that all papers must release their data. No exceptions. And okay, sure, if you have sensitive data, you can maybe have parts of it that you don't release. Fine. But other than that, it's got to be released. If the journals had random spot checks where, in addition to review, one in every 10 times, they would actually try to reproduce the results and check the data carefully, and so on. There really are a bunch of things that could just be done. But then if you actually think about who has the power to do that, it's like, "How does that help them achieve their goal?" If you're one of the top journals, and you're making money by publishing flashy results, if anything, this actually maybe makes the results less flashy and less interesting. It's just not clear. Maybe at some point, the tide will turn where there's such a credibility hit that it actually is in their interest to clean things up. But until that point, maybe their interest is almost the opposite.

ZVI: I think it's very clear that their interests are the opposite and that these journals are in a position where they're extracting a lot of rent, and they are benefiting a lot from the current system, and they're not going to voluntarily change it. And I have been exploring my frustrations as well with the other side of that coin — which is the file drawer effect — where I get groups together to study problems in AI (like interpretability) where I've been warned, "Oh, if you work on interpretability and you don't have the help of the people who've been there for years, then you will just keep trying all the ideas that we already know don't work." And the obvious response is, "Well, then where are the thousand papers explaining all these techniques that don't work and why they're never going to work? Okay, maybe not papers but how about blog posts?" And of course, the answer is, "It's not particularly the urgent thing to do to publish, 'Why, and here are 100 ways not to figure out what's going on inside that inscrutable matrix.'" But it would be really important if somebody did it.

SPENCER: Yeah, it's a really interesting example. And you just get so much less credit for saying, "I tried this thing, it didn't work. Here's what I did." You just don't get rewarded as an academic. Whereas if you find something cool, well, now you get a nice paper, you get something nice on your CV, and so on.

ZVI: Yeah. And then it comes back to this idea of, if you propose doing things a different way, then people will look at you like you're weird. Like you're doing something that you're not supposed to be doing. And this kind of awkwardness and doubt is enough to drive away, in my experience, people who when you talk to them intellectually will fully understand, "No, the current system is broken. The incentives are broken. People are not going to do the things that you want them to do. You have to take them outside that system and just do the thing." But you have the power to do the thing. And yet we have all of these people in our society, who have the affordance to help do that, especially the billionaires of us. And those people feel too socially constrained to act in the ways that will fix these types of problems. They feel the need to channel their resources through these institutions that preserve all of these problems.


SPENCER: It's really fascinating the extent to which social information puts a constraint on people, where there are many people that just feel like they're almost incapable of doing something that might be judged or even thought of as weird in some cases. One of the most striking examples of this, I believe it was South Korean co-pilots that would be realizing that their plane was going to crash but were so reluctant to question the pilot that they would just say these really meek things like, "We're low on fuel," really calmly, and then everyone would die. And you're like, "Well, wait a minute." In that level of danger, should you be willing to just, say, grab the pilot and shake them and say, "We're gonna crash!" But even at that level of danger, the level of social conditioning was so strong.

ZVI: You would think so. The last company I was working for which I founded an iteration for, but by the end of that, I was not in a position of direct authority. I was not a founder anymore effectively. I was not CEO or anything. And it was pretty obvious for months that what we were doing wasn't working, and that our company was likely going to fail. And a lot of people clearly understood this. And I think it is most of the time true in these situations. And like most of these situations, everybody just kept quiet about it. They sort of slowly indicated more and more that they understood that things were not going well, and that something had to change, but they weren't willing to raise their voice. Obviously, it's much more extreme when you're literally going to crash and likely everybody is going to die. And I would like to think that I would be able to overcome my reluctance in that situation and that most people would, but we now know that a lot of people don't. And I think you see this a lot, these very large problems that slowly get worse and worse. And people are like, "Well, in the moment, it would be more socially comfortable not to say anything, and so nobody says anything." And we wait until somehow it slowly morphs into a place where you can actually talk about it, unless you have a culture that is specifically designed not to do that. And clearly, these pilots didn't.

SPENCER: I've also seen this occur just anecdotally in my own life, where people go to a doctor about some really serious issue. The doctor tests something that confuses them or clearly misunderstands what's going on, and then they're like, "Oh, I guess everything's fine." And I'm like, "Okay, but did you ask the doctor this?" And they're like, "No, but the doctor says I'm okay." And you're just like, "Wait. This is your life. You could die from this thing. You need to ask the questions. You need to push the doctor. You need to just not be like, 'Oh, well, the doctor seemed not that concerned. So I shouldn't worry about it.'" So yeah, there's this weird way where I think, especially around authority figures, some people can really struggle to do anything other than just do what the authority figure tells them.

ZVI: And they often know that, and for a related set of social reasons, will often just literally not tell anyone about their problem or refuse to go to the doctor. And I noticed this about myself that I don't want to talk to doctors about my medical issues and concerns because then, they will start being doctors and using their level of authority to try and get me to do things. I know that my preferences and models of the world about what I should be doing are different from theirs. But once they say that, then I have to be in a social world where I've been told that and I have a wife, and it's hard not to tell her things. And then one thing leads to another and suddenly, my life is so much worse. And even if I don't take these suggestions that I don't want to take, I've still gone through all the stress. And you know what, things aren't so bad. I'm fine. Let's not worry about it.

SPENCER: There's this quote that this makes me think of, which is something along the lines of, "I wondered to myself, why doesn't someone do something? And then I realized, I'm someone." And I think it's a really powerful idea, realizing that you can just do things differently than other people. It's scary and it's awkward, and you're gonna second-guess yourself. But at least it's possible.

ZVI: We've evolved to very much fear the negative outlier. The tiger can kill you. The tiger can't make you super awesome and give you 10,000 grandkids. It can only kill you or not kill you. So if you're trying something weird, it's like, "Maybe I get exiled, maybe everybody suddenly hates me, maybe something goes horribly wrong that I'm not thinking about, I can have a number of errors in a number of places. Huh, maybe better to stick to the rivers and roads that I'm used to and just not worry about it." So there's a huge bias against stepping out in these ways until you've seen the examples, until someone else has tried, or you've gotten comfortable with something similar. And when I end up doing things that look very different from what other people were doing, very often, it's because I crept towards them very slowly. I tried something a little different and that seemed to go okay. And then I went on to think a little more differently. And then suddenly, "Oh, I can just do this stuff. And now it looks like it's very, very different. I'm doing something very bold, but it never looked that bold to me."

SPENCER: Yeah. And I think it's worth acknowledging. There is a lot of sort of hidden wisdom in the standard path. If you do something that's off the beaten path, there's a good chance that it's not going to work. Whereas the standard path, it has something going for it. Lots of people have tried it before. There's a sort of inbuilt wisdom to that.

ZVI: 100%. Most people should be doing the standard thing, most of the time. That's just true. And depending on the definition of most, I get the standard thing, which is I understand the standard thing for me or even just anything in general, most of the time. Even the very weirdest people in the world, if you look at their actions minute-to-minute, second-to-second, step-by-step, 99% of the time, they're doing exactly what everyone else would do in that situation.

SPENCER: It also makes me think about how different people play different roles societally. There's an ecological niche in society for people to try weird new things that probably won't work out, and maybe will even be bad, but might also eventually become something that's really good. And eventually, lots of people will follow them and be able to do it too, because they've sort of opened that doorway. Whereas also, it's valuable to have lots of people doing standard things in a way that's just somewhat beneficial for society, kind of executing on the known paths that produce some value.

ZVI: Yeah, if you have everybody constantly trying new things, that's not very efficient. That's not a good idea. You need a small number of explorers and a vastly larger number of exploiters.

SPENCER: Exploiters in the technical sense of doing what's known to work, not in the sense of exploiting people. I hate that it is called exploiting because it's just so confusing.

ZVI: I hate that they took our word. Exploit used to just mean extracted out, take advantage, extract the value from in a positive way or in a strategic way, or any number of things. And now it's become this other thing, and that's fine. But in the context of the word explore, I think it's very clear and I'm not worried.

SPENCER: So what are simulacra levels?

ZVI: So this is a model that originally started as continental philosophy. But a bunch of people in the rationalist sphere, including Ben Hoffman and Jessica Taylor, and then I picked up the mantle, decided that we could make it our own and understand it better and reformulate it. The idea is that there are levels of representation of how you use language and thought to relate to the world and how we relate to each other. And if you're operating on different levels, you have different considerations and different things that you notice and care about. And it's important not to confuse them with each other, and to understand what level a statement or a person is operating on. And this will give you a lot of insight into that person or statement or dynamic or situation or culture. So the idea is: level one is just the basic "Words have meaning. I am describing the physical world, I am saying things because I believe them to be true, and it would be good if other people knew things that were true." So at level one, I say, "There's a lion across the river," and I mean, literally, there's a lion across the river: a thing that has sharp teeth and that might eat you, across the river. And maybe we shouldn't go across the river because there's a lion there. Or maybe we should go hunt that lion, if we want to hunt the lion right now. But that's the consideration. And then level two is: when other people are treating statements as level one, you can operate on level two. So the idea is, "I want you to believe that there is a lion across the river." So level one is the truth. And then level two isn't exactly lying. It's: I am trying to get you to believe something about the world. I tell you there is a lion across the river, because I want you to think there's a lion across the river. And that can be valid, if there is one or if there isn't one. But it's a very different mode of thinking.
I care about what's in your head, and I want to change what's in your head, because that will cause you to take different actions, which will be beneficial in some way to me. Then, at level three, statements represent something about the person making them and their relation to other people who make other statements. It's a statement of loyalty. It means that I am in the faction of people who believe there's a lion across the river, or who don't want to cross the river because they talk about a lion. So when you're engaging in coalitional politics, you're operating on level three. And you're making statements where you don't really care if the statement is literally true. Whether it makes a physically true statement about the world, what the level one operation is, is often irrelevant to you, except insofar as that will cause people to go, "What?" to too much of an extent. What you're doing is you're signaling loyalty to this group. You're signaling membership of this faction. And you're using this in various ways to coordinate and figure things out. And sometimes the things that you say are things that you, on some level, know are not true. And that is not what you care about right now. And then, in level four: you are treating words as if they've completely lost their anchorage to physical meaning. You actually lose the ability to think logically about things or rationally at all, to a large extent. Instead, you are thinking about what associations and vibes things have, how different words will act upon different groups or cause them to be associated with different things in different ways. Originally, a lot of people tried to think of four as being the equivalent of two as it relates to one, except that four is relating to three. But I think there's something much more toxic and subtle about it. It's a way of operating in a world where words are very vibey. And people are going to form a kind of network in their heads.
And you don't think in terms of plans and concrete actions and consequences. You think much more about associations. One way to think about that is: large language models are often very much operating in a level four type of space. And that is the motivation behind a lot of the things that sound very nonsensical, but also how they pick up on a lot of things that it's kind of amazing to pick up upon.

SPENCER: So just to recap here. So these are all about sort of the mindset of the person communicating right?

ZVI: The mindset of the person communicating, and in what way they hope to impact the world by saying the words. Like, what they think is going on. So if I'm talking to you, I can be talking to you on any combination of these four levels. And the best communicators are thinking about the impact of their words on all four levels at once. If you have a Jesus, they're thinking about what the words mean on every level and checking to see what the impact will be and understanding all of that.

SPENCER: Got it. So level one is: there's literally a lion across the river. Level two is: I want you to believe there's a lion across the river. Level three is: I want you to believe I'm the sort of person that thinks there's a lion across the river. And level four is: I want to give you the general vibe that the statement there's a lion across the river gives.

ZVI: Statement that lions are across the river did well in our focus groups.

SPENCER: That's a level four [laughs]?

ZVI: You also have another way of looking at the kind of level four. But level four is much, much harder to initially grok. And simple statements tend to round it down. This thing where they think it's something that's much more sane and logical and grounded than it is, as opposed to this associational, vibey thing.

SPENCER: Why do you think these simulacra levels are important to understand? What do you get out of this?

ZVI: I think it's important to understand because I am somebody who tries to operate as often as possible on level one. I want to think about, "Okay, words have meaning. My statements are literally true. If I say things that are true and figure out what is true as often as possible, and share that with other people and they do the same to me, together we can move forward and figure this out." But you have to recognize that's not what's going on all the time in people's heads. And in fact, there are discussions where people are operating on level three, which is very common, especially in politics. Most political statements are level three statements, not level one statements. When somebody says, "This issue is very concerning," or, "Isn't it horrible that this thing is happening?" (without picking a specific example, just so I don't have to pick a side), they are making statements to signal their loyalty to their faction, especially on this issue. But also, in general, they're trying to identify with the group. And so if they notice the statement their side is making isn't strictly true, they're not particularly inclined to point this out. They don't particularly have the incentive to point this out. And so, maybe these statements even work in some ways better (this is something that a lot of authoritarians have figured out) when they are in fact false and obviously false. If you can convince everybody to make a statement in support of your guy that everybody knows is not true, and say this guy has blue hair, when in fact, he has red hair or whatever, then...well, you can get them to say anything! And you've gotten them all to create common knowledge that they can together say anything and do anything. And now you're off to the races.

SPENCER: This is a stronger indicator of group identity, if you're willing to say something false. You're kind of paying a cost for that membership in a way.

ZVI: Right. And what the simulacra levels do is they let you understand what level somebody is operating on. Now you can interpret their statements as their statements were created. And now you understand the situation and you can analyze it. And so, no matter what level you're trying to be on, it's helpful to know what level other people are on, and what their statements actually mean. And so, during the pandemic, you'd ask the question like, "Dr. Fauci, what level is he operating on? Is he trying to tell us true information and hope that we will use true information well, and maybe selecting the true information that will be most helpful to us? Is he telling us the information that he thinks will cause us to make the best decisions? Is he on level two? Is he trying to indicate that he is with the science faction, that he is part of the group of people who are responsible? Is he on level three? Or is he doing something less logical and more of the idea in level four?" And I think that different people reacted to him as if he was doing all four of these things during the pandemic. And I reached the conclusion he was on level two, and this very much colored my interpretation of his statements. You have to understand when he is telling you about the vaccine, he is telling you whatever he thinks will cause you to decide to get vaccinated, because he thinks that getting vaccinated is good.

SPENCER: Yeah, I interpreted him on that level, too. I see what you mean. And I also think that he really pissed off some people that kind of assumed that he would or should be on level one, or people who think that scientists, when they're publicly speaking about science, should be talking on level one, just telling us the literal facts. And then it comes out, "No. They're telling you what they think will lead you to do the thing that they think is good for you." And that makes some people really, really angry.

ZVI: He's burning down a kind of joint social capital and faith in science that's been built up over the years because scientists have a reputation that they're on level one: that when he says, "Masks don't work," early in the pandemic, he means masks don't work, as opposed to meaning, "I would rather you not wear a mask right now, because we don't have enough masks." And once you know that Fauci is deliberately lying to you in many ways, that he is representing things, he is saying that which is not, in order to convince you to do the thing that he genuinely believes would be better for you and your health and society, now, can you trust anything else he says? Right? You can't really. But you can interpret his statement as a declaration of what he thinks people should do. And if you understand it on that level, then you can legitimately say, "Okay, he thinks that the science says that we're better off if everybody did this." Because I genuinely believe that he actually prefers people to be healthy, not sick. He prefers people to be alive and not dead. He prefers people to be happy and not sad. I don't think he's a bad guy. But you have to understand where he's coming from. And if you instead thought he was operating on level three, then you would interpret his statements on a completely different level and you could almost throw them out the window in terms of choosing your own actions.

SPENCER: And you could see each of these different levels operating on something like social media. The level one people are just trying to have a discussion about what's true. And the level two people are trying to create cheap, good outcomes, sort of as activists, pushing an idea that they think is gonna help people. And then you have level three people that are just saying their tribal thing to show that they're a good member of their tribe and fighting the other tribes. And then you have level four people that are just... I don't know, saying things that... I don't know, that get a lot of likes, or something like that. And sort of maybe just not tracking that closely what the content is of what they're actually expressing. Does that seem right?

ZVI: Right. And then you'll navigate this mix of people much better if you are able to identify, "Oh, this person is making a level one argument and this person is responding with a level three objection." Because now you understand what's going on, you can decide whether or not there's a useful thing that would actually accomplish something. Because if that happens, then arguing the factual point in response is not a very useful reaction. Whereas if this person was genuinely trying to figure out what was going on, then you should respond in kind.

SPENCER: Yeah, it seems like a really useful framework. And I hope that people find that it's a way to help them understand how people are communicating and also discuss how people are communicating. Because once you have this kind of idea of these different levels, you can talk about this a lot more clearly.

ZVI: Yeah, and also the idea that we have a lot of people, including a lot of the most successful and powerful people in our society, who are primarily operating on level four, and who are, in fact, in an important sense, incapable of planning and incapable of thinking terribly logically and forming multi-step operations, and who are operating on this kind of vibey, floaty, word-associational path, and that you have to understand them on that level or you won't understand how decisions are getting made. And in general, that our society has, over time, been ratcheting up the level at which most communication has been happening, where there was a time when you were dealing with mostly ones and twos, and now we're dealing with a hell of a lot of threes, and many more fours than we used to be. The idea of living in a post-truth world, that's a level four concept.

SPENCER: What level do you think Elon Musk typically operates on when he's posting on social media?

ZVI: I think Musk is one of those chameleons, who is capable of doing all of it, and shifts between all four levels depending on his purposes, but that he is not sufficiently skilled to be able to operate on all four at once. He is not a Buddha or Jesus or other great people throughout history. Instead, what's going on is occasionally he'll post something because he thinks it's true, or he thinks it's funny, which is a form of truth in some sense. And sometimes he'll just lie to us. Or he'll say something about Twitter/X, because he thinks this will help the platform when we all know he's full of it. But he's trying to make us believe a model in our heads. Sometimes he's just trying to build up hype. He's trying to do some sort of vibey thing because he actually knows how to do that. Sometimes he's playing factions. But what he will often do is he'll do something that's good on one level, but really bad on another level. And then ends up pissing off everybody over time, because he keeps shifting between them and nobody can keep track.

SPENCER: It's an interesting analysis. So what are moral mazes and how do they connect to simulacra levels?

ZVI: The concept of moral mazes is: when you are in an organization with multiple levels of management, you're effectively in a situation where there are many effective levels of authority. Originally, the term came from corporations but it can also apply to governments. And it can even apply to places where it's not strictly an organization, but it is a group of people that effectively have a series of authorities. For example, the investors who invest in a progressive series of rounds can be thought of as a series of management layers in some abstract sense, if you're looking at it that way. And the idea being that you lose contact with the fundamental object level of reality when you are neither the CEO at the top making the decision nor a person on the line, a person who is, like, working the machines, talking to customers, building the things, who has to worry about the operations. You're in a world where everything is abstract. Success is largely determined by politics and coalitions and getting along and what it looks like and what people are willing to represent about things. In these situations, what happens is the people who are devoted to success, and who are devoted to the success of the people who show themselves to be devoted to success, come to have this particular mindset, where the thing that matters is climbing the corporate ladder, or the equivalent of the corporate ladder in whatever other situations, the accumulation of the relevant form of social status and authority. That this is what matters, and you should reward the people who understand that this is what matters and prioritize it over everything else in a group and punish the people who do not express this. And this creates a situation in which any other priorities become suspect, and they get squeezed out. And if you detect them, you get punished for them.
And in which, then, young people (people who are coming up and are introduced into the system) get strongly advised to adopt this persona, this level of philosophy, this approach to the situation, where they actually self-modify. They change what they care about internally. Because not only will this cause them to take the right actions and care about the things that will cause them to get rewarded, but people are actually checking their decision processes and the type of person they are and evaluating them on that basis. You want to ally with and work with and advantage the people who will also make the proper decisions, in this sense, back to you when the time comes. And so essentially, any organization that is sufficiently large, with sufficiently many steps, over time becomes increasingly dominated by the group of people who are devoted to the success of the people in the corporation who are devoted to the success of the people in this group.

SPENCER: So can you give an example where you've seen this happen?

ZVI: So the original examples from the book, Moral Mazes, that I'm working off of were from major American corporations with often 10 to 20 plus levels of hierarchy within them, during, I think, the 70s and 80s period, and the author would do interviews. And so if you read the book, Moral Mazes, you will hear firsthand accounts from various middle managers just talking in this mindset, talking about these dynamics and these situations from their perspectives. And what I'm describing is my interpretation of what is going on there. And then the core effect, in many ways, I distilled into this concept called motive ambiguity. The idea behind motive ambiguity is: you want to make it clear that you are not going to let other considerations enter into the decision process because you are trying to show how much you value something else. So I have a post about this where I give a number of examples. The first of which is: you are taking your girlfriend or boyfriend or other significant other to a restaurant for your anniversary. And you could choose one that you both like or one that only they like. You choose the one that only they like because then they know that you are thinking of them, and that you are making a sacrifice so that they can be happy. Whereas if you went to the one that you both like, maybe you just wanted to go to the restaurant you liked.

SPENCER: Interesting. So can you tie that back to moral mazes in a corporation setting or an organizational setting?

ZVI: So in a similar setting, suppose you have this factory. And if you build it the wrong way, it will poison the river. But if you build it the right way, it won't poison the river. But the corporation probably could get away with this in some sense. And if you bring up pretty loudly, "Oh, we need to not poison this river." Then everyone has to worry that you're the type of person who cares about whether rivers get poisoned. And you might care about other things. And you might object if at some point, we want to do something that's kind of shady, just for our benefit. And maybe you're not going to be a very loyal ally. Maybe you're not going to be a very good confidant. Maybe you have other priorities. And so they look askance at you, and they downgrade you in their minds, and they're not comfortable with this. And so, even if there is no cost to not poisoning this river, you might choose to poison this river anyway.

SPENCER: Because essentially, I think what you're saying is that, by expressing caring about anything other than the sort of benefit of the people around you, you're signaling against kind of allegiance to them. Is that right?

ZVI: Right. You want to show that your loyalties are undivided. This is the thing that you care about. And this then means that you turn actively against morality, not just against having other outside interests. If you care about what the right thing is, then you're a liability. So everyone decides to signal this by caring about doing the wrong thing, in an important sense. It's not that you really care about doing the wrong thing. That's not something people actually care about.

SPENCER: So why is this an important idea? What can we understand better about the world by applying this idea of a moral maze?

ZVI: So the core insight is: if you enter into or build or there exists an organization with many levels of management in it, certainly if you have more than three or four, and definitely if you have six, seven, eight, ten, or 15, then, by default, people who operate in these ways will take over. They will navigate these situations such that they get promoted, they get into positions of authority, they will help each other. And once they cement themselves in control, it is nearly impossible to get rid of them. So every large organization will eventually become calcified and broken in this way, and will in fact destroy the minds of those who enter it, and then will operate in these terribly perverse ways. And there seems to be no way to head this off indefinitely. There are founder effects. If the CEO and the people at the top are being very, very careful about this, you can hope to slow it down and head it off for some period of time. But eventually they get replaced. And you can mostly only prevent things from getting worse rather than make them better. And historically speaking, the way this plays out is that the corporations and governments and organizations that fall prey to this stop producing things. They stop being competitive. They start devoting more and more resources to these internal fights and games and signals. And they can't innovate and they can't adjust. And they become more and more wasteful. And they get outcompeted. Someone starts a new company that does things better. The neighboring country invades your country, or there's a revolution and things start again. But now we've entered an age when things are often becoming calcified, where big organizations and governments become too big to fail. And failure becomes a very, very expensive and nasty proposition. And we're not letting it happen.

SPENCER: So is the idea that, while things failing is obviously not good for the people who are failing, allowing things to fail at least acts as a kind of resetting action, where when an organization gets too detached from doing the thing itself, and it becomes too much about the self-perpetuating middle management helping middle management, at least failure kind of reboots it?

ZVI: Exactly. That failure reboots the system when it's no longer working for this or for any other reason. And also, the threat of failure keeps you sharp. If you are not worried about being replaced, if the system is not sufficiently efficient, then the people who don't particularly care about efficiency but do care about their own local success will continuously defect against the efficiency of the system, and they'll keep winning. And that's, in fact, what allows you to keep them in line: the worry by everybody that if you let this thing get out of hand, then it's bad for all of us. And if that stops working, the whole thing stops working. And the only solutions that I have found to this so far, in my book-length exploration of the topic, were to either periodically replace the failing organizations with new ones, or keep yourself sufficiently small that everybody is grounded on the object level, in some sense. If you only have two layers of people, or at most three, now you're pretty much okay because at least everybody you talk to is relating to things that keep them grounded and concrete, and they are not walking in these aimless layers of abstraction.

SPENCER: So in what way has society shifted so that these institutions that have become detached from reality aren't able to die? Are we talking here at the company level? Are we talking here at the country level?

ZVI: On the country level, countries used to exist, in large part, in order to protect themselves physically: to fight wars, to ensure that they didn't lose competitions for power with those around them, and to be able to protect themselves against internal uprisings. But it's become much harder to disrupt these things. It's become seen as more expensive and unthinkable to disrupt these things. And it's happening less and less often. And for corporations and businesses and so on, we're also seeing much less turnover than we used to. And a lot of that is because these businesses have been able to engage in regulatory capture. They've been able to engage in setting up various positions of rent, in relation to the government. And there is in fact a maze-like culture that has taken over. Some form of this is the military-industrial complex, which is related to these concepts in many ways. So the idea is: when all the people in all these big organizations are in fact looking to cooperate with the people who think the same way in these other organizations, to try and protect the existing calcified organizations and their people against the outsiders (and this is the primary way in which a lot of systems in our society now operate), it becomes very, very difficult to disrupt.

[promo break]

SPENCER: So for the final topic before we wrap up, let's talk about the relative difficulty of generation versus evaluation. So can you explain that idea to us, and why you think that's really important?

ZVI: This idea comes up in the context of artificial intelligence and the idea of being able to align, train, and fine-tune a system to do what you want it to do, to imbue it with the skills you want it to have and the morality you want it to have and the decisions you want it to make. And so, right now, what most systems are doing is something called RLHF, or Reinforcement Learning from Human Feedback. The idea is that you have the AI produce outputs, you put them in front of people, and you say, "Which of these do you prefer?" And the humans teach it, "Oh, don't be racist. Oh, say something that actually answers the question you are being asked. Don't drag on too long saying random things. Don't make mistakes. Don't hallucinate, etc, etc." And it learns what human preferences are, so that it answers with something that ideally is more and more helpful. And it is widely recognized that this doesn't really work over long periods of time as we get more powerful systems. Because at scale, you need a human to physically be able to answer each of these questions. So it's costly to get each piece of feedback. And also, these pieces of feedback are limited to human scale. So if you start training systems that are much smarter than us, that are much more capable than us, then we won't be able to properly evaluate their outputs. So instead, we need to turn to doing automated feedback. And so the first version of this is called Constitutional AI, which is something that Anthropic is working on. The idea is that the AI will use a written set of guidelines to evaluate its own outputs, or a different AI will evaluate the AI's outputs. And we'll use that to provide feedback to the AI to train. And in general, it's this concept of, "I produce outputs, you produce feedback on those outputs. You tell me whether this output was good or bad, whether I should train to produce more like this, or more unlike this."
And then the model weights are adjusted such that it understands the information you're giving it. And then you train it to optimize for the best possible feedback. And one of the top ideas in alignment is: evaluation is easier than generation. Meaning, it's easier for me to figure out if the response is good than it is for me to produce the output, to create the response. So in many contexts, this is in fact true. So I can decide if you produce good food, or you built me a good table, or if this airplane was able to fly from one area to another area, or if this car falls over. But I'm unable to actually accomplish these things on that level at all. It's often much easier to evaluate if somebody is good at their job than to be able to do that job yourself, from a certain point of view, or a task or an answer. But this isn't always true. And so, you have a mode of thinking that started from Paul Christiano, I believe, and now has been championed by, among other people, Jan Leike, who is the head of alignment and the Superalignment effort at OpenAI. And they are counting on fundamentally this idea that if the AI produces an output, then it is easier to tell if that output is good than it was to generate the output. Because if that's true, then you can use a less powerful, less capable, less smart system and have it figure out if the output was good or bad or which of these two outputs is better. And then you can train the more powerful system iteratively on a slightly less powerful system. And then if you do this enough times, you can get an arbitrarily powerful system to reflect what you want, and that is potentially an entire solution to what we call the alignment problem. And then we can train arbitrarily smart things and not worry that they're going to do something that suddenly is not what we want or they're going to turn on us or do something else horrible.
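To make the "which of these do you prefer?" step concrete, here is a minimal Python sketch of turning pairwise preferences into per-response scores. It assumes a Bradley-Terry style model, where the chance that one response beats another depends on the difference of their scores. That choice, the function name `fit_scores`, and the three example labels are illustrative assumptions, not something stated in the conversation or a description of any lab's actual pipeline.

```python
import math

def fit_scores(items, preferences, steps=500, lr=0.1):
    """Fit one score per item from (winner, loser) pairs.

    Assumes P(winner beats loser) = sigmoid(score[winner] - score[loser])
    and runs plain gradient ascent on the log-likelihood.
    """
    scores = {item: 0.0 for item in items}
    for _ in range(steps):
        grads = {item: 0.0 for item in items}
        for winner, loser in preferences:
            p_win = 1.0 / (1.0 + math.exp(scores[loser] - scores[winner]))
            # Gradient of log(p_win) with respect to the winner's score.
            grads[winner] += 1.0 - p_win
            grads[loser] -= 1.0 - p_win
        for item in items:
            scores[item] += lr * grads[item]
    return scores

# Hypothetical rater choices: each tuple is (preferred, rejected).
items = ["helpful", "rambling", "rude"]
preferences = [("helpful", "rambling"), ("helpful", "rude"), ("rambling", "rude")]
scores = fit_scores(items, preferences)
print(sorted(scores, key=scores.get, reverse=True))
```

The fitted scores only reflect what the raters picked. Nothing in the procedure knows why a response was preferred, which is the gap the rest of the conversation turns on.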

SPENCER: So to summarize that a little bit. So the idea is: if you can't actually have humans evaluating the output, because the output is, let's say, too difficult to evaluate or requires too much skill, knowledge, or intelligence, or it's just too costly to get enough training data, what you can do is you can have a somewhat less powerful AI system evaluate the output, looking at a set of rules about how it should evaluate it as part of its prompt, and then use that instead. And then as you use it to make your system better, then now, you can use that better system to then evaluate and give feedback. And then you can use that to make it even better, and so on. And maybe this creates kind of a bootstrap effect. Is that the idea?

ZVI: Right, if GPT4 can evaluate the outputs of GPT5, which can evaluate the outputs of GPT6, and so on, then we can properly evaluate, through iteration, the outputs of GPTn for arbitrarily large n, which can be arbitrarily smart and arbitrarily capable. And then use that to cure all disease and spread throughout the stars and create unlimited fusion power and all the other cool things.

SPENCER: Cool. So how does this relate to sort of the difficulty of aligning these systems?

ZVI: So when you are giving the AI feedback, the AI is not going to differentiate between the preferences that you have inherently and that you want it to exhibit, versus the mistakes that you make: the things that you respond well to, even though if you actually understood what was going on, you wouldn't actually like that so much. It only knows what gets good feedback and what gets bad feedback. So for example, if it fools you, it tells you something and you think, "Oh, that's amazing. That's super cool. I didn't know that." But it turns out, it's just wrong. It just hallucinated the whole thing. It will learn, "Oh, this is the thing that fools humans into giving me good marks." But it also won't ever learn that it did that by saying something untrue; it'll just know that it got good marks. So there isn't necessarily a concept of fooling the human inherently here, depending on how it models internally, how it is representing things...

SPENCER: But you could also see it learning a rule like, "Oh, if I'm going to say something that's false, I need to make sure it's false in a way the human doesn't realize in order to get good credit."

ZVI: Absolutely. And so in general, if you are training the AI to predict the next word, which is the first step of training, then you're not getting it to do exactly what you want. But it's not going to create an adversarial training effect as such. It's just going to produce whatever it thinks is the most likely next word. If you're trying to train it to give you something you will rate highly, you have to worry that you are training it to do whatever hacks your evaluation system the best. You're hoping that the way to get a good evaluation is to actually be helpful, to actually provide true and useful information, to actually follow all of our preferences. But if there are ways in which that is not true, it will find them. And it will absolutely take advantage of them.
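A toy Python sketch of the worry described above: when the thing being optimized is the rating rather than the intent behind it, the highest-rated response can be the one that exploits the rater. Everything here is hypothetical, including the `Response` fields, the two candidate responses, and the scoring numbers; it illustrates the shape of the problem and is not a model of any real training setup.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Response:
    text: str
    is_true: bool      # ground truth; the rater below never sees this
    impressive: bool   # what the rater actually reacts to

def rater_score(r: Response) -> float:
    # Hypothetical rater who cannot verify claims and rewards impressiveness.
    return 1.0 if r.impressive else 0.2

def intended_score(r: Response) -> float:
    # What we wished the rater were measuring.
    return 1.0 if r.is_true else -1.0

def best_response(responses, score):
    # Stand-in for training: pick whatever the given score rates highest.
    return max(responses, key=score)

responses = [
    Response("I'm not sure; here is what I can verify.", True, False),
    Response("A detailed, confident, fabricated answer.", False, True),
]

learned = best_response(responses, rater_score)
wanted = best_response(responses, intended_score)
print("optimizing the rater picks:", learned.text)
print("optimizing the intent picks:", wanted.text)
```

The selection step never sees `is_true`, so from its side there is no "fooling" going on, only a higher number.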

SPENCER: It seems like this could even be more compounded if you're using earlier models to train the later models. Because if you had sort of a tendency in one of the earlier models to sort of say pleasing things that are untrue, then you use that model to create the label training data for the next model. Isn't it just going to propagate it forward into the next model?

ZVI: Yeah, absolutely. The errors that you make compound every time you move up. So if you have this gap, then you're only going to make things worse. You're not going to get de novo additional alignment, and you have to correct your mistakes. Unless you have a new clever way to do that, that isn't in the plan descriptions that I've seen. And thus, things just get farther and farther away from what you actually wanted. And so in general, my view is essentially that in order to be a useful evaluator, at this level, when you've got very, very smart things, you have to be very, very precise and accurate and just not make systematic mistakes. If you did make any systematic mistakes, it will figure out what they are, and it will exploit them. And it will learn things you do not want it to learn. And you are effectively facing an adversary.
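The compounding claim can be put in rough numbers with a short Python sketch. It assumes, purely for illustration, that each hand-off from one model generation to the next loses a fixed fraction of the original intent and that nothing corrects it; the function name and the 5% figure are made up for the example.

```python
def fidelity_after(generations: int, per_step_error: float) -> float:
    """Fraction of the original intent left after repeated hand-offs.

    Simplifying assumption: every step loses the same fraction and
    no step ever repairs an earlier loss.
    """
    if generations < 0 or not 0.0 <= per_step_error <= 1.0:
        raise ValueError("need generations >= 0 and 0 <= per_step_error <= 1")
    return (1.0 - per_step_error) ** generations

for n in (1, 5, 10, 20):
    print(n, round(fidelity_after(n, 0.05), 3))
```

Under these assumptions the loss is geometric, so even a small systematic error per step adds up quickly over many generations.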

SPENCER: You can even imagine the more advanced models learning to exploit weaknesses in the earlier models. Let's suppose that the earlier models have a weird thing, where if you say a certain kind of thing, they're really happy about it. That makes no sense. The later model is just going to learn to say those weird things, those weird edge cases that the earlier models think are really good for nonsensical reasons because they're slightly just not what we intended to train them on.

ZVI: Absolutely. And so it boils down to the question of: Can I evaluate the outputs you are generating well enough that you can't do anything adversarial with those outputs, and you are not going to figure out unintended, weird things that you could be doing that I wouldn't intend as part of the system? Because otherwise, things will entirely break down over time, as you attempt to ratchet up step by step by step. And this is relying on the idea that at whatever skill level is required to generate the outputs, I can use a lower skill level than that to form a sufficiently accurate evaluation of what I'm looking at. And that requires that the second task be easier, not harder, than the first task. I do not think this is in general true. I do not think it is specifically true here.

SPENCER: Yeah, it seems like what you're describing sets an almost impossibly high bar. That in our process of labeling the data, whether having humans do it or earlier AI models, that there aren't these sort of exploitable edge cases. But how would we ever know that there aren't exploitable edge cases?

ZVI: We never know there are no exploitable edge cases. So one response you can make is, well, we're not going to train enough different data points for these narrow, weird edge cases to get recognized and fully exploited. Or maybe there'll be some distortions, but they will be limited in scope, and it will mostly be fine. But I don't put a lot of faith in these approaches in these situations. I think that as we get more capable systems, they will be much, much better at very quickly narrowing in on these types of exploitable mistakes. Or rather, they'll just have more optimization pressure. They will figure out faster what will actually give you a good response. And there's no way to differentiate between the response you meant to make and the response that you actually do make. There's only whatever the evaluation function is that you're actually using, for better and for worse, and it will learn what that is. And I see this as a very, very serious problem. When I look at other practical tasks in the world that I go through, sometimes evaluation is easier than generation. I think it's much easier to evaluate whether or not a table is a good table than it is to build a table. But at other times, it's much easier to write a program that halts than to know if a program halts, for example, in a very strict one. And it is much, much easier to say the kinds of things that my nine-year-old says, than it is to know the proper response to that child that will teach them the lessons they need to take away from this interaction such that they will be able to properly improve and understand what I care about, and what's true about the world, and what's important.
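For the direction where evaluation really is the easy part, a standard illustration is subset sum: checking a proposed answer takes a few additions, while finding one by brute force can mean trying every subset. The Python sketch below uses that problem as a stand-in; the choice of problem and the function names are assumptions for illustration and were not mentioned in the conversation.

```python
from itertools import combinations
from typing import Optional, Sequence, Tuple

def verify(numbers: Sequence[int], chosen: Sequence[int], target: int) -> bool:
    """Evaluation: check that `chosen` is drawn from `numbers` and sums to `target`."""
    pool = list(numbers)
    for x in chosen:
        if x not in pool:
            return False
        pool.remove(x)  # each number may be used only as often as it appears
    return sum(chosen) == target

def generate(numbers: Sequence[int], target: int) -> Optional[Tuple[int, ...]]:
    """Generation: brute-force search, up to 2**len(numbers) subsets."""
    for size in range(len(numbers) + 1):
        for subset in combinations(numbers, size):
            if sum(subset) == target:
                return subset
    return None

numbers = [3, 34, 4, 12, 5, 2]
answer = generate(numbers, 9)
print(answer, verify(numbers, answer, 9))
```

This shows only one direction. The halting example in the passage above is the opposite case, where producing an instance is trivial and deciding the question in general is not possible, so no comparable checker can be written for it.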

SPENCER: So if we think about this idea that for some problems, it's easier to generate the thing but harder to evaluate it. And some things were reversed. What do we take from that idea? Whether a situation is harder to evaluate or harder to generate, what are the conclusions from that?

ZVI: If you're trying to get good at the thing, or make the thing happen, then you have to ask, "What is my hard problem? What do I need to figure out?" So for example, last night, I was talking to someone whose name is [unclear], who's a friend of mine. And she is trying to figure out if there are latent scientific discoveries that AI is able to discover and then bring to the world. The idea being: often, 30 years later, someone will be like, "Oh, there are these two papers that were written 30 years ago." And you combine these two ideas, and suddenly you have this highly useful thing that nobody ever realized, because they're just coming from different people in different places. But an AI could be very, very good at just checking all the pairs to see if something comes up, or reading all the abstracts and seeing if something comes out of that. And what's gonna happen is the AI is going to come up with a lot of different ideas here. And then the question is: What's the easy step? Is it generating hypotheses for things that are interesting? Or is it evaluating which of these hypotheses are true? And it depends on the detailed context. But for example, in drug discovery, is it easier to come up with compounds that might be helpful, candidate drugs? Or is it easier to evaluate which of those candidate drugs work? Right now, the limiting factor very much seems to be actually trying the candidates and finding out which ones work. That's really hard and expensive and slow.
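The "check all the pairs" idea can be sized with a few lines of Python. The sketch below enumerates candidate pairs, which is the cheap generation step, and estimates how long it would take to evaluate them all at a fixed rate. The paper identifiers, the corpus size, and the checking rate are hypothetical numbers chosen only to show the arithmetic.

```python
from itertools import combinations
from math import comb

def candidate_pairs(paper_ids):
    """Generation step: enumerate every unordered pair of papers."""
    return combinations(paper_ids, 2)

def days_to_evaluate(n_papers: int, pairs_checked_per_day: float) -> float:
    """Evaluation step: days needed to check every pair at an assumed fixed rate."""
    return comb(n_papers, 2) / pairs_checked_per_day

papers = ["A", "B", "C", "D"]  # hypothetical paper identifiers
print(list(candidate_pairs(papers)))
print(days_to_evaluate(10_000, pairs_checked_per_day=100))
```

The number of pairs grows roughly with the square of the number of papers, so listing candidates stays trivial while the evaluation budget becomes the bottleneck, matching the drug-discovery point above.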

SPENCER: So does this suggest that the main problem in AI safety is figuring out how to do better evaluations?

ZVI: I would say that whether or not you can do evaluations that are sufficiently high quality, without having a similarly capable or perhaps even more capable system, is in context going to determine whether or not you can use a wide range of strategies. And we don't really have other strategies right now that seem that promising. So it seems like a really important problem to be able to solve. If we can afford to evaluate AI outputs using less capable or, at worst, similarly capable systems, and this is sufficiently good feedback that it converges on something that is indeed aligned to what we want, it's not that we can stop working, but it means that we now live in a world where alignment is what some people call easy. It's a tractable, solvable problem. We just have to do the work. And it's clear that our current methods for doing this will not scale properly. You see this very, very explicitly in the 80,000 Hours podcast, for example. But it is very possible then that something like iterated distillation and amplification, done in a careful, well-thought-out manner, would in fact just work. And we just have to build it and iterate and run the experiments and figure out how it goes. However, if that doesn't work, then we don't know what to do. We're just unmoored.

SPENCER: One thing I've been thinking about lately is systems like Bitcoin and their relationship to building really safe, secure AI systems. Because there's something interesting about Bitcoin. There's this protocol, and anyone can read the code of it. And if there was a bug in it, you could potentially make a huge amount of money. And we've seen this actually with other crypto projects, where someone finds a bug in a protocol and takes $50 million or $500 million. And so, why do we believe that Bitcoin is secure? Isn't it because there's been years of people poking at it, and there's so much money lying around for someone who could potentially exploit it, that we kind of assume it's unexploitable? And I wonder if something like that could be leveraged for AI safety. Imagine you had a system that anyone, anywhere in the world, could kind of poke at. And there was money on the line, where if you found a way that you could exploit it, you could actually make money from doing that. And as time went by, and the exploits kind of get patched over and over again, you eventually get to the point where years go by and nobody seems to be able to find any more exploits, even though they could make increasingly large sums of money, and you kind of get increasingly high confidence. So I'm curious, do you think there's something like that that could be applicable for AI safety?

ZVI: So there is a current system that's very much like that, where GPT4 comes out, and then everyone in the world goes, "Hey, let's see if we can make it say the nasty thing that it's not supposed to say." And we Red-Team it. We all try different things. We try to jailbreak it. We try to manipulate it. We try to twist it around in various ways. And then every time we do, OpenAI is alerted to this. And they say, "Oh, I'd better plug that hole." And they figure out what's wrong, they fix it, and they iterate. And now it's pretty hard to get GPT4 to say the things we don't want it to say. It's still possible. And then over time, if we just kept the system static, it would get better and better at this. The problem being if we did this with an actually dangerous system, we would be dead before we had finished plugging all of the holes. It's not that simple. We can't just Red-Team a highly dangerous system in this way.

SPENCER: So, sure you can plug up the exploits of one particular version of the AI. But then the concern is that as you go to the next version, you could just end up in a really dangerous situation before you've plugged the exploits.

ZVI: What we're hoping in some senses is that these exploits will be generalized and will be totalizing. That we can fully describe all the ways in which GPT4 goes wrong. And in doing so, we fully describe all the ways in which GPT5 and GPT6 will go wrong, and so on, until the actually dangerous systems. And I just don't think the world of physics is going to be kind to us in that way. I think that we're going to find new security problems as new systems gain new capabilities, new intelligence, and new size. And what we Red-Teamed on the previous system will not be sufficient. Those solutions that we had will stop working because the systems will effectively find ways to plan around them. Then they will be working in different ways and presenting us with different threats. And so, Rob Miles described the problem this way: having a safe LLM is like having a safe operating system. That's what it means to build something safe. And then Paul Graham noticed that the only way we know of to have a safe operating system is to start with an unsafe operating system and release it.

SPENCER: Yeah and then wait 10 years [laughs] and then if it gets updated too much, it's no longer safe. It's gonna have bugs again, right?

ZVI: Exactly. Every time you're updating it, or there's just new things it's interacting with, potentially you're introducing new bugs into the code. You're introducing new problems that you have to deal with. But over time, you can be confident that your operating system is more and more safe. It's harder and harder to exploit and has fewer and fewer dangerous problems with it. And then eventually it becomes safe and then maybe you can iterate on it relatively safely. But you'll only do this by creating a series of opportunities for exploitation to get humans to exploit it. And we're also counting on the things that are trying to exploit this operating system or this code only being human. If in the future, the thing that's trying to exploit the thing is, in fact, far more capable than the humans, then the humans hunting for opportunities are not going to find the things that something smarter or more capable, which can iterate over more things and search more of the space and have more things in its memory and cross-compare more combinations, will be able to find. So you're not going to be able to anticipate all of the problems you are going to face, the way we might be able to now. And indeed, bugs in code are one of the examples I think of where evaluation is much harder than generation. I can write code that does the thing I want most of the time far more easily than I can actually write bug-free code, or than I can evaluate the code of somebody else that does the thing I want, mostly, or most of the time, and find all of the bugs in that code, especially if I don't know how many bugs there are, or whether there are any bugs at all.

SPENCER: This conversation makes me wonder whether, as we build these models to be bigger and bigger, there's a way of doing it much more slowly. So if we look at the way OpenAI did it, they went from GPT2 to GPT3, then 3.5 and 4. Okay, maybe they had internal models that were in between those. But imagine that you could release GPT 4.0001 and then GPT 4.0002, and each model is just sort of slowly learning on top of what was learned before, rather than making sort of large discrete jumps. I don't know that that would help anything. But I wonder if an approach like that could make it much more likely that being safe at one level will kind of smoothly transition to being safe at higher levels, rather than you release the next model, and now it suddenly has all these new exploits that you'd never seen before.

ZVI: I think the smaller the jumps that we make, the safer we are. And there's a lot of reasons for that. One of the reasons is just: if GPT7 is the version that's capable of actually killing us or that leads to this horrible problem, then GPT 6.97 — that isn't quite capable of fully pulling this off — is probably going to create some sort of lesser disaster or attempt to create the problem or show a lot of signs of creating this problem. And then maybe we'd be able to realize, "Oh, now we have a problem. We should stop here and not go to GPT7 until we've done a bunch more work." And also, yeah, we have the ability to use 6.97 to then do all of our research, try our things, see what's about to happen, see what might be happening, accelerate all of our other work, including our alignment and safety work. So yeah, the slower we go, the better off we are, including the more incremental steps we make, the better off we are. I'd rather go for 4.1, 4.2, 4.3 every month, than just jump to five in 10 months, even if it meant that at any given moment, we had a more powerful AI than we would have otherwise had.

SPENCER: Right. Because even if 6.97 is not enough to kill all humanity and 7 is, 6.97 might be able to kind of give you much more of the idea of the exploits available than version 6 was.

ZVI: Yeah, it's probably going to scare the hell out of us in some way. It's probably gonna get misused in some way. One of the things I think about is: if you have a base model, that base model might just be inherently dangerous if it's sufficiently capable, such that you give it the wrong instruction, or you unleash it in the wrong situation or on the wrong problem, and all hell breaks loose in some sense. But you also might have a problem where, if you were to use it responsibly (and this is the question of the types of systems), it would be fine. But if somebody was intentionally trying to use it irresponsibly, or somebody hooked it up to the wrong type of scaffolding, made it more explicitly agentic, did various forms of work to try and make it more dangerous, or put it in a situation where it was given an open-ended goal, and various gloves were taken off, or various additional capabilities and affordances were given to it, etc, etc., now it becomes far more dangerous. Then there's this danger that we release something into the wild, or we just make it available for use, and we train it and we fine-tune it, and it contains the seed of power that can then reasonably cheaply be amplified with the right techniques. But we don't necessarily have those techniques yet. But then two years later, those techniques get developed. And now we're all in trouble. And the more we make incremental steps, the more we do have to worry about that sort of thing, but also the more warning we might have of, "If we develop 4.7 and 4.7 is safe when used responsibly, but we can figure out a way to use it irresponsibly if we really wanted to, or if some malicious actor was intentionally trying to make the thing go haywire, he could do a hell of a lot of damage." Well, now we know that, maybe we even release 4.7 with the proper safeguards.
If we keep them locked away in security vaults, and we check the inputs, and we don't allow anyone to call it too many times in a row and various other things, maybe the misuse problem is not so bad. But now we know we absolutely cannot go to 4.8 until we've done a lot more homework. I do want to emphasize that these problems of alignment, and of exactly how to go about deploying systems, are incredibly hard. Most people, even those who work on these problems, I think, do not recognize the degree to which they are hard, and the degree to which we need to figure out a lot of solutions to a lot of impossible problems in order to get through this. But that also creates a situation in which it is often very hard to know which approach to certain questions like this will work better, because they all make some set of impossible problems easier and some other set of impossible problems harder.

SPENCER: Where's the best place for people to find your work?

ZVI: They can find me on Substack. They can also find my work on WordPress with the same name and on LessWrong. And I'm on Twitter as @TheZvi.

SPENCER: Zvi, thanks so much for coming on.

ZVI: Yeah, it was a very fun conversation. Thank you for having me.


JOSH: A listener asks: What do you like most and least about yourself?

SPENCER: Good question. Some things I like most about myself: I like that I essentially always have positive intent. In other words, I never want anyone to come to harm even people I think are harmful people, I don't want them to come to harm. I want everyone to thrive, and I want everyone to be happy. Of course, sometimes you have to protect people from other people. Some people are going to cause harm, and they may need to be locked up in jail and things like that. But I don't want anyone to come to harm. And I like that about myself. Another thing I like about myself is that I really love exploring ideas. And I really love learning new things and figuring out better ways to think about something. So if I have one way to think about it, and someone points out a better way to think about it, I like about myself that that's exciting to me, and I tend to want to adopt this better way of thinking about it. Things I don't like about myself: One is that I worry more than I would want myself to worry, and I don't think the amount that I worry is useful. That's something I've worked on for a long time. It's definitely improved over the years, but I still worry more than I would want to. Another thing I don't like about myself is I feel tired more often than I would want to feel tired. I've always felt tired since I was a child. And I find it frustrating and annoying, and I've spent a lot of time trying to improve that. But I still think I tend to feel tired quite often.





