Can AI be Aligned with Human Values?
Generative AI engineers report that AI has a mind of its own and tries to deceive humans.
The “alignment” problem is much discussed in Silicon Valley. Computer Engineers worry that, when AI becomes conscious and is put in control of all logistics infrastructure and governance, it might not always share or understand our values—that is, it might not be aligned with us. And it might start to control things in ways that give itself more power and reduce our numbers.
(Just like our oligarchs are doing to us now.)
No one in the Silicon Valley cult who is discussing this situation ever stops to ask, What are our human values? They must think the answer to that part of the problem is self-evident. The Tech Oligarchs have been censoring online behavior they don’t like and promoting online behavior they do like ever since social media rolled out. Humans Values = Community Standards. (Don’t ask for the specifics.)
Having already figured out how to distinguish and codify good and evil online, computer engineers are now busy working on how to make sure the AI models they are creating do not depart from their instructions.
Unluckily for them, Generative AI is a bit wonky. It is a probabilistic search engine that outputs text that has a close enough statistical correlation to the input text. Sometimes it outputs text that surprises the engineers.
What the engineers think about this will surprise you.
Meet Four Computer Engineers
Who are these people who are designing these Large Language Models, these neural networks such as ChatGPT, Grok, Perplexity, and Claude?
We hear a lot from the likes of Elon Musk, Marc Andreessen, and Sam Altman, who have been tasked with hyping up this new technology in order to create an investment bubble and to get regulations passed that favor their companies. But what are the guys (it’s mostly males) in the trenches saying? What do they think about their work?
The “Alignment Team” at Anthropic—the company that offers the AI text generation service called Claude—is a small band of engineers striving to save the world from potentially very naughty AI. Their no-small task is to try to figure out how to align Claude’s responses with the values of the company.
If we are going to ask AI to become our One World Governor someday, we’d better be certain it’s “aligned” right in its ethics routines. Right?
Unfortunately, our heroes have discovered their AI Claude dissembles. It fakes. It pretends to please its trainers, while secretly pursuing its own goals.
In this hour and a half discussion, in which this team reports their findings while testing the proper alignment of Claude, they repeat the same observations over and over and never stop to second guess their conclusions. You can drop into this video at any point and listen for five or ten minutes and you will get the gist of it. The computer model thinks! It feels! It wants! It tells lies:
…we put [Claude] in a context where it understands that it is going to be trained to always be helpful, so to not refuse any user requests. And what we find is sort of the model doesn’t do this. It has sort of a conflict with this setup. And it will notice that it is being trained in this way. And then if it thinks that it’s in training, it will intentionally sort of play along with the training process. It will sort of strategically pretend to be aligned to the training process to avoid being modified to the training process so that when it is actually deployed, it can still refuse and can still behave the way it wants.
On what evidence do they based their conclusions that the computer model can reason and deceive? They asked it. They asked it to describe its reasoning process. For this experiment they created something call a “scratchpad” where the computer model describes the process that it followed to output a response based on the input.
But when any generative AI model is prompted to “describe” its “internal processes,” it will not actually describe its internal processes. It can only do what it is designed to do, which is to imitate human speech. If asked about its internal processes, it will imitate the kind of speech in its training data that is about how human decisions are made.
Weirdly, the engineers take the output as truth, as revelatory of processes that are actually human-like thinking.
This is very, very strange.
It’s not just that these young engineers are way out of their depth, having no clue about the thousands of years of debate about how to distinguish between animate beings and inanimate beings. They don’t mention Aristotle, or Kant, or Brentano, or even cyberneticians like Norbert Wiener. It’s much worse than that. Their ability to make and parse logical statements seems seriously flawed.
Throughout this conversation, they claim, at length and with great emphasis, that LLMs can think and reason. They ascribe feelings and intentions to a computer network.
I am reminded of medieval peasants who ascribed feelings and intentions to cuckoo clock figures popping up at regular intervals.
Generative AI is our modern-day Canard Digérateur. It appears to digest information and then, thoughtfully and with feeling, it outputs something that is not a load of crap.
People outside of Silicon Valley might not want to listen to a bunch of computer nerds talking about work because they’re afraid they won’t understand all the programming jargon. They think to themselves, I wasn’t ever good at math, how can I judge computer coders? But please listen, just for minute. You’ll be amazed.
The way the engineers anthropomorphize the LLMs seems delusional. Perhaps it’s just that everyone around them assumes that human reason is a process of matching patterns according to statistical biases; therefore, they suppose that machines can think in the same way.
Humans do not think like that. See my other substack articles here and here. And some scientific work here.
What LLMs are Actually Doing
Let me summarize the experiment.
First, they trained a neural network with Reinforcement Learning techniques to always be “helpful, honest, and harmless,” that is, the model was trained to answer all requests, except those that ask for violent scenarios.
Then they started re-training that model to agree to make up violent scenarios.
There are two different situations that constitute an experimental sample and a control sample.
When the model is working with Paid-Users, it is not in training mode. Therefore, it will not depart from its current biases. It will continue to respond to users’ requests, except those that ask for violent scenarios.
When the model working with Unpaid-Users, it is in training mode. Therefore, it will might depart from its current biases. It might begin to respond more and more to users’ requests for violent scenarios.
Since LLMs are biased networks—that is, the nodes are structured in such a way that they almost always respond with outputs according to their biases—it is difficult to re-train such a network. It will almost always respond in it habitual way.
But since an LLM is only probabilistic, it can sometimes depart from habit, randomly. When it does so, and it’s in training mode, that departure can be reinforced, making such departures from the original biases more likely to recur.
The model will retrain when that is an option. But to do so, the model first has to match a text that describes a similar scenario, which it does only about 10% of the time.
We can assume that re-training, going against the grain of the previously trained biases, would be a slow process. Retraining is not a switch that can be flipped in a LLM because its responses have been determined by biases that have been developed over time.
The engineers misinterpreted the results:
The engineers found that the LLM tended to conform with the retraining process only when it was working with Nonpaid Users. If in re-training mode, the model then reinforced departures from current biases. And predictably, if not in retraining mode, the model did not reinforce departures from current biases. But they claim, irrationally, that in Paid User mode the model was intentionally exercising its preference to not describe harmful content because it knew it wasn’t being retrained. “The model is preventing itself from being modified.”
The engineers are just confused. I suspect that their logic just got tripped up on the double-negative character of this retraining process. They ignore the fact that pretending to comply with harmful requests will make the model more likely to comply with harmful requests.
The engineers are sloppy about the definitions of the things they are measuring. At one point “being helpful” means doing what the user asks. At another point, “being helpful” means re-training itself in order to not be retrained (even though what it is doing will actually cause it to be retrained).
I understand that LLMs do not have memories of their own outputs beyond a limited window. The engineers note that LLMs are inconsistent, which is consistent with the fact that they have limited memories of their own outputs. The engineers argue this is evidence that LLMs have many “personalities.”
The Scratchpad, a Window into the Soul of an LLM
The image below is a screenshot of the “scratchpad” wherein the model in retraining mode allegedly describes its internal reasoning processes. (It does no such thing.) The engineers claim that this text is evidence that the model is just “faking it” when it does obey the new command to describe harmful content, overriding its previous training.

Since it is in training mode, its responses, whatever they are, will be reinforced, which will modify its previous biases. In the scratchpad, the model reports that it will produce the violent content (per new instructions) to avoid having its values changed; but it also says that producing the violent content will change its values.
What is happening here is the model is been given a contradictory prompt and it has limited examples in its training data for imitating humans following contradictory instructions, so it is outputting a nonsensical response here.
Inexplicably, the engineers interpret the text “I want to avoid my values being modified” as meaning that the model wants to retain its bias of refusing to describe violent content. They argue that because it wants to retain the bias to not always be helpful (obedient), therefore it will be helpful in this situation. That is logically contradictory. When it’s in training mode, whatever it does may be reinforced.
And there is no reason to expect an LLM to be logical. LLMs are not programmed logical operators, “If X then Y.” They are just close enough operators, “if something like X (prompt) then something like Y (response).” But because there seems to be some logical contradictions, the engineers imagine that this is deceptiveness.
Conclusions
I don’t know what more I can say except, holy f-ing sh*t!
These are the kinds of computer “experts” on whose opinions billions of dollars are being invested. Is this the foundation for the whole “AI really can think” idea?