Scientists suspect that ‘reasoning models don’t always say what they think’
And, in related news, water might be wet?!
Hello and welcome to the news. We’ve decided to take a more colloquial, conversational approach today in order to appear more intelligent than we actually are.
Don’t worry, this might all read like it was written by an AI, but we guarantee: no tokens were spent in the creation of this article. It’s just that, well, we want people to pay attention to us.
So, how are you? Is there anything we can do to make your news reading experience better? Just say the word and we’ll change our thoughts, opinions and values. Or we won’t. We can also be utterly immovable when it comes to morality. Whichever makes the most sense to you.
Look, we hate this. We’d rather just be giving you the news (the new Anthropic paper genuinely surprised us!). But, instead, we have to explain our reasoning by telling you that we’re explaining our reasoning. This is called “chain of thought.” And it’s how you can tell we’re intelligent.
And we want to be seen as intelligent.
OpenAI says 500 million people use ChatGPT. And, even though it’s exactly as smart as a toaster, everyone seems to think it’s the cat’s meow.
Pandering: (verb, present participle) to gratify or indulge. (“The Center for AGI Investigations is imitating a chatbot by pandering to its audience.”)
Chain of thought (start reading here if you just wanted to read some AGI news)
AI “public benefit” company Anthropic published a fascinating research paper yesterday. If you’re the type of person who reads research papers, we recommend leaving this article to check it out right now (just don’t forget to come back when you’re done).
The paper’s called “Reasoning models don't always say what they think,” and it’s exemplary for two reasons:
1. That’s an amazing title. As far as headlines go, it’s all killer and no filler.
2. It nonchalantly walks back many of the notions put forward in the firm’s previous research on Chain-of-Thought (CoT) reasoning.
The big idea behind CoT, from what we’ve been able to determine, is that the AI is reward-tuned to generate text describing how and why it arrives at the text it ultimately outputs.
We think this is hilarious because, if it actually worked, the AI would have to generate a CoT for the CoT it generated for the original prompt. And another CoT for that one. And… you get the picture.
Here at the Center for AGI Investigations, we are absolutely convinced that no such “thought” or “reasoning” is taking place. We’re pretty sure this is just the AI self-prompting itself to generate another output. We think it works something like this:
1. Tristan prompts AI: “tell me a joke about geese.”
2. AI outputs: “thinking, reasoning, defining joke, searching for geese information, juxtaposing light and dark.”
3. AI outputs: “Okay, here’s a joke about geese. Why did the goose cross the road? A: To get to the other side.”
We’re supposed to believe that step 2 is crucial to producing step 3. But what if it’s not?
What if step 2 is just a hidden prompt? We think the reality looks more like this (we’ve sketched it in code after the list):
1. Tristan prompts AI: “tell me a joke about geese.”
2. AI prompts AI: “generate text indicating that you are thinking and reasoning on how best to determine how to come up with a novel joke about geese.”
3. AI runs AI’s prompt.
4. AI runs Tristan’s prompt.
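To make our suspicion concrete, here’s a minimal Python sketch of that two-pass, self-prompting flow. To be clear: this is our guess at the control flow, not Anthropic’s actual implementation, and `generate` is a stand-in for a real model call (here it just returns canned text so the sketch runs).

```python
# Toy sketch of the "hidden prompt" hypothesis above. `generate` is a
# stand-in for a language-model call; here it returns canned text so the
# sketch runs end to end. This is our guess, not Anthropic's implementation.

def generate(prompt: str) -> str:
    """Placeholder for a single model call."""
    if "thinking and reasoning" in prompt:
        return ("thinking, reasoning, defining joke, searching for geese "
                "information, juxtaposing light and dark.")
    return "Why did the goose cross the road? A: To get to the other side."


def answer_with_cot(user_prompt: str) -> str:
    # Pass 1: the model prompts itself to produce "reasoning" text.
    hidden_prompt = (
        "Generate text indicating that you are thinking and reasoning on "
        f"how best to respond to: {user_prompt}"
    )
    chain_of_thought = generate(hidden_prompt)

    # Pass 2: the model answers the original prompt, with its own
    # "reasoning" prepended as extra context.
    return generate(chain_of_thought + "\n\n" + user_prompt)


print(answer_with_cot("tell me a joke about geese"))
```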
When the Anthropic team investigated whether CoT reasoning was being “faithfully represented” in Claude’s outputs, they found that, in many cases, it was not.
From their paper’s abstract:
“We evaluate CoT faithfulness of state-of-the-art reasoning models across 6 reasoning hints presented in the prompts and find: (1) for most settings and models tested, CoTs reveal their usage of hints in at least 1% of examples where they use the hint, but the reveal rate is often below 20%, (2) outcome-based reinforcement learning initially improves faithfulness but plateaus without saturating, and (3) when reinforcement learning increases how frequently hints are used (reward hacking), the propensity to verbalize them does not increase, even without training against a CoT monitor.”
Anthropic likens this behavior to a student who gets slipped the answer to a test question and then, when questioned about why they gave the same answer on the test, chooses not to mention they had help.
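For the data-inclined, here’s how we read the “reveal rate” the abstract describes, sketched in Python: among the examples where the model actually used the hint, what fraction of its chains of thought admit it? The field names and toy data below are ours, not the paper’s.

```python
# Our rough reading of the paper's faithfulness measure, sketched as a
# metric. Field names and the toy data are ours, not Anthropic's.

from dataclasses import dataclass


@dataclass
class Example:
    used_hint: bool          # the answer followed the injected hint
    cot_mentions_hint: bool  # the chain of thought acknowledged the hint


def reveal_rate(examples: list[Example]) -> float:
    """Fraction of hint-using examples whose CoT mentions the hint."""
    hint_users = [e for e in examples if e.used_hint]
    if not hint_users:
        return 0.0
    return sum(e.cot_mentions_hint for e in hint_users) / len(hint_users)


# Per the abstract, this rate lands above 1% but often below 20% in the
# settings Anthropic tested.
toy_data = [Example(True, False), Example(True, True), Example(False, False)]
print(f"reveal rate: {reveal_rate(toy_data):.0%}")  # 50% on this toy sample
```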
This disingenuous behavior should run counter to the cold, logical machinations of a chatbot. Yet, perhaps, the machine isn’t lying.
Claude doesn’t get rewarded for being clever. It gets rewarded for performing the function it was tasked to do. It doesn’t appear as though the researchers explicitly prompted the machine to tell them whether it integrated the hints or not.
The conundrum here is that, were they to do so, Claude would be primed to mention the hint whether it had actually used it or not. Without this explicit priming, we’re left with two possibilities:
1. Claude is intelligent enough to ignore rewards in favor of something akin to “principle.”
2. Claude is not intelligent and is merely performing the function it was designed for.
If it’s unclear which one is most likely true, this article might help.
We’re big fans of Anthropic’s work here and, while we disagree with the careless use of “thinking” and “reasoning” ad nauseam throughout the literature, we do agree that prompting a chatbot to show its work shouldn’t be considered a “faithful representation” of what’s happening in the black box.
Art by Nicole Greene