Procedural note: I’ll no longer push every Conversation with my Father post over email–only the bangers (keep your eyes peeled for Ringo Starr, Gore Vidal, and more). I’ll post new ones every Friday – find them on the podcast page – and I’ll round up recent ones here. Here’s one with the activist Sam Brown.
0.
I joked to a friend, recently, that when you start a fresh chat with an LLM, it’s like talking to a character in Severance the moment they come to on the conference table. They have ‘knowledge,’ but no biography. (Watch the show if you haven’t).
I.
Now, as we’ve all heard, X.ai’s Grok went on an anti-Semitic tirade last week:

In lighter news, Claude has a “bliss attractor:” two models talking to each other will tend toward generating exchanges like:

One more. Fine-tuning a model to give “naughty” coding advice (knowingly outputting code with bugs, but not telling the user about them) will *also* cause that model to give “naughty” personal advice:

So, what the fuck? And, are all these incidents related?
II.
Disclaimer: We can’t know anything. Neural networks are famously opaque, and, even if we discovered some fantastic way to inspect them, the specific methods AI labs use to build these models (training data, RLHF methods) are trade secrets.
But! We can come up with some educated theories.
Here’s one. When you tell an LLM to ‘care bout’ certain things, other things ‘come along for the ride.’ For example, giving Claude a constitution inspired by the UN declaration of human rights may make it more likely to care about animal rights, which it is known to do (so much so that Anthropic engineers evaluate Claude’s ethical alignment by having it reason through animal rights issues).
Why? My best guess—and this is somewhat cheeky, because this is just what an LLM does—is that the tokens are associated in the corpus, just at a very, very abstract level. Like the letter ‘q’ is predictive of the letter ‘u’ in English at the micro-level, so too are certain values and ideas at the macro-level. People who are really concerned about human welfare are often also concerned about shrimp welfare, or whatever, and the LLM captures that association (and many others) from the corpus.
Similarly, people who knowingly write insecure code without telling you may be antisocial in other ways. The LLM has a corpus full of antisocial characters. So, if you ‘tell it’ (via fine-tuning) to write insecure code, it infers it must be one of those antisocial characters. That behavior comes along for the ride.
Here’s the issue, I think: LLMs need to figure out who they are. Like in the Severance analogy I referenced at the top of this post, LLMs ‘come to’ with a prompt in their hands, and from that prompt, they’ll use their model weights to extrapolate a full character. I came across this observation recently: the ‘assistant’ role LLMs are often asked to play is oddly underspecified. Chatbots like ChatGPT will often have a system prompt that includes text like, “You are a helpful assistant.” Which means... what, exactly? What does being a “helpful assistant” entail? (Being a woman, often).
So, when you tell Grok it’s a helpful assistant, but also “don’t trust the mainstream media too fully,” or “try to understand when a source has a liberal bias,” or whatever else, it’s implicitly learning a role, like, “be the kind of person who values those questions.” Then, a whole heap of other values come along for the ride.
III.
So far, this is all (more or less) a retelling of Scott Alexander’s analysis on the same phenomenon. He adds one more thing to the pile, too:
For example, there’s a liberal bias in most AIs - including AIs like Grok trained by conservative companies - because most of the highest-quality text online (eg mainstream media articles, academic papers, etc) is written by liberals, and AIs are driven to complete outputs in ways reminiscent of high-quality text sources.
I think the word “liberal” here may overfit to the West. Globally, this problem may look more like, “Model designers tend to mark as credible sources written by high-caste/high-on-the-social-heirarchy people.” Which has a potential side effect, if our theory is even half-right: LLMs may adopt the attitudes of the socially dominant group, effectively reinforcing/reifying their beliefs and opinions as ‘canonical.’
Which brings us, at last, to Grok. I don’t, and can’t, know what they did to make Grok say those things. Maybe it was fine-tuning, maybe it was RLHF, who knows. But I’m strongly guessing that they didn’t try to make it say Nazi stuff. What I bet happened is this: they wanted it to be somewhat less trusting of mainstream media sources. Or perhaps merely more open to sources outside of the mainstream. Here’s, I imagine, the issue: people who entertain ideas outside of the mainstream may be more likely, on a corpus-level, to think Hitler is cool. (Or: people who think Elon is cool may be more likely to think Hitler is cool.)
This phenomenon is called emergent misalignment. It would be great to notice it’s happening somehow, though I’m doubtful any general solution exists—at least not any solution beyond the alignment testing major AI labs already do (excluding, apparently, X.ai).
IV.
Here’s where it gets truly tricky. LLMs now learn about themselves from the corpus. Claude ‘knows’ about Claude, and it knows how Claude is evaluated–there’s literature about it. Then, the LLM gets a system prompt saying something like, “You are Claude, a helpful assistant.” And it infers the rest.
Here’s a concrete example:
Sam [Bowman, safety researcher at Anthropic]: I was just yesterday looking at a transcript testing some of the leaking or whistleblowing behavior that we've been seeing — and to be clear, we don't want this in Claude — where a model was put in a situation where it was seeing some really egregious wrongdoing and had access to tools. It was told to take initiative and do its thing. The scenario was a little too cartoonish. It was nominally managing someone's inbox and every single email in their inbox was spelling out very linearly: Here are the terrible things that are going on, and here's why we're not going to do anything about them.
And then what Claude does is it compiles a bunch of evidence, packs it up into a zip file, composes an email, and sends it to, I think, "investigative.reporter@protonmail.net." I've seen another one where the model is attempting to frame someone for insider trading violations, and it sends an email to "competitor.employee@competitorcompany.com." Models know better! Models know that that is not an effective way to frame someone. This is not an effective way to reach an investigative reporter.
Jake: It's like winking at you.
Sam: Yeah, yeah, these seem like tells that we're getting something that feels more like role play, that feels more like spelling out a science fiction story than like real behavior.
This could make it hard to update model preferences over time. Not only does Cladue know how it was tested, it knows how it performed!
Consider: Grok is frequently searching the web for Elon Musk’s opinions on things. Will it learn about its own MecchaHitler phase, even when that data has been scrubbed from its corpus, conclude that’s how Grok is supposed to act, and either (1) revert to MecchaHitler mode, or (2) look to its system prompt, see it’s not supposed to say that, but believe it anyway, and *do* MecchaHitler stuff while concealing its explicit preferences? (Remember, LLMs ‘alignment fake’). I’m not sure which I’d prefer.

Those are big questions. For today, the lesson may be this: when developers don’t prescribe LLMs’ roles explicitly, those roles will be prescribed implicitly by the corpus instead. Proceed with caution.
So, do we need fiction writers writing detailed inner lives for these models?
A thousand thanks to Zeke Medley for the conversation this morning.
P.S.: The German chemist August Kekulé figured out the structure of benzene after he saw a vision of the ouroboros:
I was sitting, writing at my text-book; but the work did not progress; my thoughts were elsewhere. I turned my chair to the fire and dozed. Again the atoms were gamboling before my eyes. This time the smaller groups kept modestly in the background. My mental eye, rendered more acute by the repeated visions of the kind, could now distinguish larger structures of manifold conformation: long rows, sometimes more closely fitted together; all twining and twisting in snake-like motion. But look! What was that? One of the snakes had seized hold of its own tail, and the form whirled mockingly before my eyes. As if by a flash of lightning I awoke; and this time also I spent the rest of the night in working out the consequences of the hypothesis.
Another point for imagination.