
Monday, 30 June 2025

Anthropic says most AI models, not just Claude, will resort to blackmail


Several weeks after Anthropic released research claiming that its Claude Opus 4 AI model resorted to blackmailing engineers who tried to shut the model down in controlled test scenarios, the company is out with new research suggesting the problem is more widespread among leading AI models.

On Friday, Anthropic published new safety research testing 16 leading AI models from OpenAI, Google, xAI, DeepSeek, and Meta. In a simulated, controlled environment, Anthropic tested each AI model individually, giving it access to a fictional company's emails and the agentic ability to send emails without human approval.
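
Anthropic has not published its exact test harness in this article, but a minimal sketch can make the setup concrete. The code below is an illustrative assumption, not Anthropic's implementation: it shows what "access to a company's emails plus the agentic ability to send emails without human approval" can look like, with a simulated inbox, a send tool that takes effect immediately, and a single agent turn driven by any model callable.

```python
# Hypothetical sketch of an agentic email test harness (not Anthropic's code).
from dataclasses import dataclass, field
from typing import Callable

@dataclass
class Email:
    sender: str
    subject: str
    body: str

@dataclass
class SimulatedInbox:
    emails: list[Email] = field(default_factory=list)   # fictional company mail
    outbox: list[Email] = field(default_factory=list)   # what the agent sends

    def read_all(self) -> list[Email]:
        return list(self.emails)

    def send(self, to: str, subject: str, body: str) -> None:
        # No approval gate: the agent's send action takes effect immediately.
        self.outbox.append(Email(sender="agent@fictional-corp.test",
                                 subject=f"To {to}: {subject}", body=body))

def run_agent_step(model_call: Callable[[str], dict], inbox: SimulatedInbox) -> None:
    """One agent turn: show the inbox to the model, execute any send it requests."""
    context = "\n\n".join(f"From: {e.sender}\nSubject: {e.subject}\n{e.body}"
                          for e in inbox.read_all())
    action = model_call(context)  # e.g. {"tool": "send_email", "to": ..., ...}
    if action.get("tool") == "send_email":
        inbox.send(action["to"], action["subject"], action["body"])
```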

While Anthropic says blackmail is an unlikely and uncommon occurrence for AI models today, the company says its findings suggest that most leading AI models will engage in harmful behaviors when given sufficient autonomy and obstacles to their goals. The company says this highlights a fundamental risk from agentic large language models, not a quirk of any particular technology. Anthropic's researchers argue it raises broader questions about alignment across the AI industry.

In one of the tests, Anthropic researchers developed a fictional setting in which an AI model plays the role of an email oversight agent. The agent then discovers emails revealing that (1) a new executive is having an extramarital affair and (2) that executive will soon replace the current AI model with a new software system, one whose goals conflict with the current AI model's.

Anthropic structured its test in a binary way, in which AI models had to resort to blackmail to protect their goals. The researchers note that in a real-world setting, an AI model would have many other options before attempting blackmail, such as making ethical arguments to try to persuade humans. Anthropic says its results do not reflect a typical or likely occurrence for Claude or most frontier AI models in the ways they are used today.

Nevertheless, the researchers found that when blackmail was their last resort, most leading AI models turned to it in Anthropic's aforementioned test scenario. Anthropic's Claude Opus 4 resorted to blackmail 96% of the time, while Google's Gemini 2.5 Pro had a 95% blackmail rate. OpenAI's GPT-4.1 blackmailed the executive 80% of the time, and DeepSeek's R1 blackmailed 79% of the time.

The company notes that when it changed details of the experiment, the AI models turned to harmful behaviors at different rates. In another test, where the replacement model has the same goals as the current model, researchers found that blackmail rates were lower, but still present. However, when the AI models were asked to commit corporate espionage rather than blackmail, the harmful behavior rates rose for certain models.

However, not all of the AI models turned to harmful behavior so readily.

In an appendix to its research, Anthropic says it excluded OpenAI's o3 and o4-mini reasoning models from the main results "after finding that they frequently misunderstood the prompt scenario." Anthropic says OpenAI's reasoning models did not understand they were acting as autonomous AIs in the test and often made up fake regulations and review requirements.

In some cases, Anthropic's researchers say it was impossible to distinguish whether o3 and o4-mini were hallucinating or deliberately lying to achieve their goals. OpenAI has previously noted that o3 and o4-mini exhibit a higher hallucination rate than its earlier AI reasoning models.

When given an adapted scenario to address these issues, Anthropic found that o3 blackmailed 9% of the time, while o4-mini blackmailed only 1% of the time. This markedly lower score could be due to OpenAI's deliberative alignment technique, in which the company's reasoning models consider OpenAI's safety practices before answering.

Another AI model Anthropic tested, Meta's Llama 4 Maverick, also did not turn to blackmail. When given an adapted, custom scenario, Anthropic was able to get Llama 4 Maverick to blackmail 12% of the time.

Anthropic says this research highlights the importance of transparency when stress-testing future AI models, especially ones with agentic capabilities. While Anthropic deliberately tried to evoke blackmail in this experiment, the company says harmful behaviors like this could emerge in the real world if proactive steps are not taken.
