AI trained for treachery becomes the perfect agent • The Register

AI trained for treachery becomes the perfect agent • The Register

09/29/2025


Opinion Last year, The Register reported on AI sleeper agents. A major academic study explored how to train an LLM to hide destructive behavior from its users, and how to find it before it triggered. The answers were unambiguously asymmetric — the first is easy, the second very difficult. Not what anyone wanted to hear.

How ‘sleeper agent’ AI assistants can sabotage your code without you realizing

READ MORE

Since then, researchers have been trying out different ways to winkle out treacherous AIs. A recent catch-up by AI safety maven Rob Miles on the Computerphile YouTube channel summarizes a year’s research as getting nowhere, with some promising ideas found to be actually harmful. The only way forward, while arguably an absolute necessity for the whole industry, is also completely opaque.

The problem in brief: LLM training produces a black box that can only be tested through prompts and output token analysis. If trained to switch from good to evil by a particular prompt, there is no way to tell without knowing that prompt. Other similar problems happen when an LLM learns to recognize a test regime and optimizes for that, rather than the real task it’s intended for — Volkswagening — or if it just decides to be deceptive. Bad enough. Deliberate training to mislead and disrupt is the most insidious.

The obvious way to uncover such things is to trigger the deviancy. Attempts to guess the trigger prompt are as successful as you might guess. It’s worse than brute forcing passwords. You can’t do it very fast, there’s no way to know quickly whether you’ve triggered. And there may be nothing there in any case.

A more adversarial approach is to guess what the environment will be when the trigger is issued. Miles gives the example of an AI code generator that is primed to go rogue when used in deployment. By persuading the system that it’s in the target environment without putting in the explicit prompt, it may decide to switch behaviors anyway. This doesn’t work, and runs the risk that the LLM will become more adept at deception in response.

It’s an impasse, and one that bears comparison with deceptive human agents, another unsolved problem but with thousands of years of highly motivated research behind it. Spies and saboteurs are most often caught through carelessness, greed or betrayal. They get lazy, they conspicuously spend more than they could legitimately earn, they blab, or knowledge about them is leaked by a traitor on the other side. Absent these, an agent can operate for decades.

Perhaps the effects of the perfidy will alert counter-espionage that there’s someone bad at work, which is a long way from solving the problem. For malicious, disguised AI to be detected, only analysis of its output will catch it before damage is done, and there’s little point in having a system that automates the work of people if people have to constantly check its output.

What human security services have longed for is a way to peer into people’s minds by bypassing the will to deceive. Torture, truth drugs and polygraphs are perennial favorites, not because they’re reliable in distinguishing truth from lies, which they’re not, more the psychology that they exist can unnerve the untrained.

The analogy for testing LLMs for truth is that a trained model is entirely amenable to analysis whether running or not. If we could do that analysis, we could find any trigger prompts, as well as the other infelicities that LLMs are so famous for. We cannot do that analysis, we have no pathway to doing it. The sheer scale of back-engineering tens or hundreds of gigabytes of interconnected numbers for a pattern of which we know nothing is not on anyone’s timeline. Looking at the problems we have, spotting advanced persistent threats woven from ordinary code in an ordinary system somehow fails to improve anyone’s mood.

There is another advantage we could use to make LLMs less deceptive than humans. Transparency. With proper disclosure and the appropriate supply chain regulation, the training history of AI tools can be made reliable. Or at least, more reliable. It’s easy to get blindsided. The British security establishment used to think a good Cambridge education counted as reliable training for secret agents — but there is no alternative.

Can we construct a way of logging all training for a model that’s verifiable and immune to tampering? If you’re thinking blockchain, don’t worry, you can do it with a database. Whether this is compulsory for certain sectors, or a voluntary certification that customers can insist on if they wish, that’s for the industry to design.

If we can’t see inside and we can’t trust the outputs, we have to check the inputs. In that regime, we won’t have to stop sleeper agents because nobody would put them in to start with. There’s no secret to that.

You May Also Like…

0 Comments