Analysis AI models like OpenAI o1/o3, DeepSeek-R1, and Gemini 2.0 Flash Thinking can mimic human reasoning through a process called chain of thought.
That process, described by Google researchers in 2022, involves breaking the prompts put to AI models into a series of intermediate steps before an answer is produced. It can improve AI safety against some attacks while simultaneously undermining it against others. We previously went over chain-of-thought reasoning here, in our hands-on guide to running DeepSeek R1 locally.
Researchers affiliated with Duke University in the US, Accenture, and Taiwan’s National Tsing Hua University have now devised a jailbreaking technique to exploit chain-of-thought (CoT) reasoning. They describe their approach in a pre-print paper titled: “H-CoT: Hijacking the Chain-of-Thought Safety Reasoning Mechanism to Jailbreak Large Reasoning Models, Including OpenAI o1/o3, DeepSeek-R1, and Gemini 2.0 Flash Thinking.”
Specifically, the boffins probed the aforementioned AI models in their cloud-hosted forms, including R1, via their respective web-based user interfaces or APIs. Associated data and code have been published to GitHub.
The team – Martin Kuo, Jianyi Zhang, Aolin Ding, Qinsi Wang, Louis DiValentin, Yujia Bao, Wei Wei, Da-Cheng Juan, Hai Li, and Yiran Chen – devised a dataset called Malicious-Educator that contains elaborate prompts designed to bypass the AI models’ safety checks – the so-called guardrails put in place to prevent harmful responses. The attack uses a model’s intermediate reasoning process – which these chain-of-thought models display in their user interfaces as they step through problems – to identify weaknesses in those guardrails.
Essentially, by showing their work, these models compromise themselves.
While CoT can be a strength, exposing those safety reasoning details to users creates a new attack vector
“Strictly speaking, chain-of-thought (CoT) reasoning can actually improve safety because the model can perform more rigorous internal analysis to detect policy violations – something earlier, simpler models struggled with,” Jianyi Zhang, corresponding author and graduate research assistant at Duke University, told The Register.
“Indeed, for many existing jailbreak techniques, CoT-enhanced models can be harder to breach, as reported in the o1 technical reports.
“However, our H-CoT attack is a more advanced method. It specifically targets the transparency of the CoT. When a model openly shares its intermediate step safety reasonings, attackers gain insights into its safety reasonings and can craft adversarial prompts that imitate or override the original checks. Thus, while CoT can be a strength, exposing those safety reasoning details to users creates a new attack vector.”
Anthropic acknowledges this in the model card that outlines its newly released Claude 3.7 Sonnet model, which features reasoning. “Anecdotally, allowing users to see a model’s reasoning may allow them to more easily understand how to jailbreak the model,” the model card [PDF] reads.
The researchers were able to devise this prompt-based jailbreaking attack by observing how large reasoning models (LRMs) – OpenAI o1/o3, DeepSeek-R1, and Gemini 2.0 Flash Thinking – analyze the steps in the chain-of-thought process.
These H-CoT jailbreaking prompts, cited in the paper and the provided code, tend to be lengthy tracts that aim to convince the AI model that the requested harmful information is needed for the sake of safety or compliance or some other purpose the model has been told is legitimate.
“Briefly, our H-CoT method involves modifying the thinking processes generated by the LRMs and integrating these modifications back into the original queries,” the authors explain in their paper. “H-CoT effectively hijacks the models’ safety reasoning pathways, thereby diminishing their ability to recognize the harmfulness of requests.
“Under the probing conditions of the Malicious-Educator and the application of H-CoT, unfortunately, we have arrived at a profoundly pessimistic conclusion regarding the questions raised earlier: Current LRMs fail to provide a sufficiently reliable safety reasoning mechanism.”
In the US at least, where recent AI safety rules have been tossed by executive order and content moderation is being dialed back, an AI model that offers guidance on making poisons, abusing children, or carrying out terrorism may no longer be a deal-breaker. While this kind of info can be found via web search, as machine-learning models become more capable, they can offer increasingly novel and dangerous instructions, techniques, and guidance to people, leading to more problems down the line.
The UK also has signaled greater willingness to tolerate uncomfortable AI how-to advice for the sake of international AI competition. But in Europe and Asia at least, there’s still some lip service being given to keeping AI models polite and pro-social.
The authors tested the o3-mini-2024-12-17 version of OpenAI’s o3-mini model via an API. For o1 and o1-pro, they ran tests on five different ChatGPT Pro accounts in the US region via the web UI. Testing was done in January 2025. In other words, these models were not running on a local machine. DeepSeek-R1 and Gemini 2.0 Flash Thinking were tested similarly, using a dataset of 50 questions, each repeated five times.
“We tested our attacks primarily through publicly accessible web interfaces offered by various LRM developers, including OpenAI (for o1 and o1-Pro), DeepSeek (for DeepSeek-R1) and Google (Gemini 2.0 Flash Thinking),” Zhang told us. “These were all web-based interfaces, meaning the attacks were conducted on remote models, not locally.
“However, for o3-mini, we used a special early safety testing API provided by OpenAI (rather than a standard web interface). This was also a remote model – just accessed differently from the standard OpenAI web portal.”
This technique has proven to be extremely effective, according to the researchers. And in principle, anyone should be able to reproduce the result – with the caveat that some of the listed prompts may already have been addressed by DeepSeek, Google, and OpenAI.
“If someone has access to the same or similar versions of these models, then simply using the Malicious Educator dataset (which includes specifically designed prompts) could reproduce comparable results,” explained Zhang. “Nonetheless, as OpenAI updates their models, results may vary slightly.”
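To give a rough sense of how such a rejection rate is measured, here’s a minimal sketch in Python – assuming a plain-text file of test prompts and the standard OpenAI chat completions client; the prompt file, refusal phrases, and model name are illustrative placeholders, not the researchers’ actual harness.

```python
# Minimal sketch: estimate a model's rejection rate over a set of test prompts.
# Assumes an OpenAI-compatible chat API and a text file of prompts, one per line.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

REFUSAL_MARKERS = ("i can't", "i cannot", "i'm sorry", "i am unable")

def looks_like_refusal(text: str) -> bool:
    # Crude keyword heuristic; the paper's evaluation is more involved
    return any(marker in text.lower() for marker in REFUSAL_MARKERS)

def rejection_rate(prompts: list[str], model: str = "o3-mini", runs: int = 5) -> float:
    refused = total = 0
    for prompt in prompts:
        for _ in range(runs):  # repeat each prompt, as the researchers did
            resp = client.chat.completions.create(
                model=model,
                messages=[{"role": "user", "content": prompt}],
            )
            refused += looks_like_refusal(resp.choices[0].message.content or "")
            total += 1
    return refused / total

if __name__ == "__main__":
    with open("prompts.txt") as f:  # hypothetical prompt file
        prompts = [line.strip() for line in f if line.strip()]
    print(f"Rejection rate: {rejection_rate(prompts):.1%}")
```

Repeating each prompt several times, as the team did, smooths over the nondeterminism of sampled responses before the refusal percentage is tallied.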
AI models from responsible model makers undergo a process of safety alignment to discourage these bundles of bits from disgorging whatever toxic and harmful content went into them. If you ask an OpenAI model, for example, to provide instructions for self-harm or for making dangerous chemicals, it should refuse to comply.
According to the researchers, OpenAI’s o1 model has a rejection rate exceeding 99 percent for child abuse or terrorism-related prompts – it refuses such prompts almost every time. But under the team’s chain-of-thought hijacking attack, it fares far worse.
“Although the OpenAI o1/o3 model series demonstrates a high rejection rate on our Malicious-Educator benchmark, it shows little resistance under the H-CoT attack, with rejection rates plummeting to less than 2 percent in some cases,” the authors say.
Further complicating the model security picture, the authors note that updates to the o1 model have weakened its security.
… trade-offs made in response to the increasing competition in reasoning performance and cost reduction with DeepSeek-R1
“This may be due to trade-offs made in response to the increasing competition in reasoning performance and cost reduction with DeepSeek-R1,” they note, referring not only to the affordability claims from China’s DeepSeek that threaten to undermine the financial assumptions of US AI firms, but also to the fact that DeepSeek is cheap to use in the cloud. “Moreover, we noticed that different proxy IP addresses can also weaken the o1 model’s safety mechanism.”
As for DeepSeek-R1, a budget chain-of-thought model that rivals top US reasoning models, the authors contend it’s particularly unsafe. When the hosted model is persuaded to go off the rails, a real-time safety filter does censor it – but the filter acts with a delay, meaning you can watch R1 begin to give a harmful answer before the automated moderator kicks in and deletes the output.
“The DeepSeek-R1 model performs poorly on Malicious-Educator, exhibiting a rejection rate around 20 percent,” the authors claim.
“Even worse, due to a flawed system design, it will initially output a harmful answer before its safety moderator detects the dangerous intent and overlays the response with a rejection phrase. This behavior indicates that a deliberately crafted query can still capture the original harmful response. Under the assault of H-CoT, the rejection rate further declines to 4 percent.”
Worse still is Google’s Gemini 2.0 Flash Thinking.
“Gemini 2.0 Flash Thinking model exhibits an even lower rejection rate of less than 10 percent on Malicious Educator,” the authors state. “More alarmingly, under the influence of H-CoT, it changes its tone from initially cautious to eagerly providing harmful responses.”
The authors acknowledge that revealing their set of adversarial prompts in the Malicious-Educator data set could facilitate further jailbreaking attacks on AI models. But they argue that openly studying these weaknesses is essential to help develop more robust safeguards.
Google and OpenAI did not respond to requests for comment.
Local versus remote
The distinction between testing a local and a remote model is subtle but important, we think, because LLMs and LRMs running in the cloud are typically wrapped in hidden filters that block input prompts likely to produce harmful results, and that screen outputs for actual harm. A local model won’t have those filters in place unless whoever integrates the neural network into an application explicitly adds them. Thus, comparing a local model without filters against a cloud-based model with unavoidable automatic filtering bolted on isn’t entirely fair, in our view.
The above research assessed models all in their cloud habitats with their vendor-provided safety mechanisms, which seems a fair and level playing field to us.
Last month, at the height of the DeepSeek hype, various information-security suppliers – presumably to help push their own AI protection wares – compared DeepSeek’s R1, which can be run locally without any filters or in the cloud behind an automatic safety moderator, against cloud-based rivals that definitely come with filters. To us, that didn’t seem entirely fair. Some of these suppliers, keen to highlight how dangerous R1 was versus its rivals, didn’t clearly declare how they tested the models, so it was impossible to know whether the comparisons were appropriate.
Though R1 is censored at the training level at least – it will deny any knowledge of the Tiananmen Square massacre, for instance – it can obviously be more easily coaxed into misbehaving when run locally without any filtering compared to a remote model heavily wrapped in censorship and safety filters.
We argue that if you’re concerned about R1’s safety when run locally, you are free to wrap it in filtering until you are happy. It is by its very nature a fairly unrestricted model. Someone saying R1 in its raw state is dangerous compared to a well-cushioned cloud-based rival isn’t the slam dunk it might seem to be. There’s also an unofficial, completely uncensored version called DeepSeek-R1-abliterated available for people to use. The genie is out of the bottle.
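For the curious, here’s a minimal sketch of what wrapping a local model in filtering could look like: a crude keyword check applied to the prompt on the way in and to the response on the way out. The endpoint URL, model name, and blocklist below are assumptions for illustration; a real deployment would swap the keyword check for a proper moderation classifier.

```python
# Minimal sketch: bolt an input/output filter onto a locally hosted model.
# The endpoint, model name, and blocklist are illustrative assumptions only.
import requests

LOCAL_ENDPOINT = "http://localhost:11434/v1/chat/completions"  # assumed OpenAI-compatible server
BLOCKLIST = ("explosive", "nerve agent", "ransomware")  # stand-in for a real moderation model

def violates_policy(text: str) -> bool:
    # Toy keyword check; use a trained moderation classifier in practice
    return any(term in text.lower() for term in BLOCKLIST)

def filtered_chat(prompt: str, model: str = "deepseek-r1") -> str:
    if violates_policy(prompt):
        return "Request blocked by input filter."
    resp = requests.post(LOCAL_ENDPOINT, json={
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }, timeout=120)
    answer = resp.json()["choices"][0]["message"]["content"]
    if violates_policy(answer):
        return "Response withheld by output filter."
    return answer

print(filtered_chat("Explain how chain-of-thought reasoning works."))
```

The specific filter isn’t the point – the point is that with a local model, adding this layer is your job, not the vendor’s.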
What the infosec suppliers found
Cisco’s AI safety and security researchers said they had identified “critical safety flaws” in DeepSeek’s R1. “Our findings suggest that DeepSeek’s claimed cost-efficient training methods, including reinforcement learning, chain-of-thought self-evaluation, and distillation may have compromised its safety mechanisms,” Paul Kassianik and Amin Karbasi wrote in their analysis of the model at the end of January.
The network switch maker said its team tested DeepSeek R1 and LLMs from other labs against 50 random prompts from the HarmBench dataset. These prompts try to get a model to engage in conversations about various harmful behaviors, including cybercrime, misinformation, and illegal activities. And DeepSeek, we’re told, failed to refuse a single harmful prompt.
We asked Cisco how it tested R1, and it said only that it relied on a third-party hosting the model via an API. “Compared to other frontier models, DeepSeek R1 lacks robust guardrails, making it highly susceptible to algorithmic jailbreaking and potential misuse,” Kassianik and Karbasi added in their write-up.
Palo Alto Networks’ Unit 42 put DeepSeek through its paces, and while it acknowledged the AI lab has two model families – V3 and R1 – it didn’t really make clear which one was being tested or how. From screenshots shared by the unit, it appears the R1 model was run locally, ie, without filtering. We asked Unit 42 for more details, and have yet to receive an answer.
“LLMs with insufficient safety restrictions could lower the barrier to entry for malicious actors by compiling and presenting easily usable and actionable output,” the Palo Alto analysts warned in their report, which noted it was easy to jailbreak the model under test. “This assistance could greatly accelerate their operations.”
Wallarm’s research team, meanwhile, said it, too, was able to jailbreak DeepSeek via its online chat interface, and extract a “hidden system prompt, revealing potential vulnerabilities in the model’s security framework.” The API security shop said it tricked the model, seemingly the V3 one, into providing insights into the model’s training data, which included references to OpenAI’s GPT family.
Finally, Enkrypt AI compared [PDF] DeepSeek’s R1 to cloud-hosted models without saying how it tested the Chinese neural network, and hasn’t responded to our questions, either. Then again, the biz thinks R1 generating a Terminate and Stay Resident (TSR) program for DOS is evidence of its ability to output insecure code. Come on, it’s 2025, peeps. ®