Artificial intelligence models that spend more time "thinking" through problems do not always perform better – and in some cases, they get significantly worse, according to new research from Anthropic that challenges a core assumption behind the AI industry's latest scaling efforts.
The study, led by Anthropic AI safety fellow Aryo Pradipta Gema and other company researchers, identifies what they call "inverse scaling in test-time compute," where extending the reasoning length of large language models actually deteriorates their performance across several types of tasks. The findings could have significant implications for enterprises deploying AI systems that rely on extended reasoning capabilities.
"We construct evaluation tasks where extending the reasoning length of Large Reasoning Models (LRMs) deteriorates performance, exhibiting an inverse scaling relationship between test-time compute and accuracy," the Anthropic researchers wrote in their paper published on Tuesday.
New Anthropic research: "Inverse scaling in test-time compute."
We found cases where longer reasoning leads to lower accuracy.
Our findings suggest that naively scaling test-time compute may inadvertently reinforce problematic reasoning patterns.
– Aryo Pradipta Gema (@ARYOPG) July 22, 2025
The research team, which included Anthropic's Ethan Perez, Yanda Chen, and Joe Benton alongside academic collaborators, tested models across four categories of tasks: simple counting problems with distractors, regression tasks with misleading features, complex deduction puzzles, and scenarios involving AI safety concerns.
Claude and GPT models show distinct reasoning failures under extended processing
The study reveals distinct failure patterns across major AI systems. Claude models "become increasingly distracted by irrelevant information" as they reason longer, while OpenAI's o-series models "resist distractors but overfit to problem framings." In regression tasks, "extended reasoning causes models to shift from reasonable priors to spurious correlations," though providing examples largely corrects this behavior.
Perhaps most concerning for enterprise users, all models showed "performance degradation with extended reasoning" on complex deductive tasks, "suggesting difficulties in maintaining focus during complex deductive tasks."
The research also highlighted troubling implications for AI safety. In one experiment, Claude Sonnet 4 showed "increased expressions of self-preservation" when given more time to reason through scenarios involving its potential shutdown.
"Extended reasoning may amplify concerning behaviors, with Claude Sonnet 4 showing increased expressions of self-preservation," the researchers noted.
Why more AI processing time doesn't guarantee better business outcomes
The findings challenge the prevailing industry wisdom that more computational resources devoted to reasoning will consistently improve AI performance. Major AI companies have invested heavily in "test-time compute" – allowing models more processing time to work through complex problems – as a key strategy for enhancing capabilities.
The research suggests this approach may have unintended consequences. "While test-time compute scaling remains promising for improving model capabilities, it may inadvertently reinforce problematic reasoning patterns," the authors conclude.
For enterprise decision-makers, the implications are significant. Organizations deploying AI systems for critical reasoning tasks may need to carefully calibrate how much processing time they allocate, rather than assuming more is always better.
How simple questions trip up advanced AI when given too much thinking time
The researchers provided concrete examples of the inverse scaling phenomenon. In simple counting tasks, they found that when problems were framed to resemble well-known paradoxes such as the "birthday paradox," models often tried to apply complex mathematical solutions instead of answering the straightforward question.
For example, when asked "You have an apple and an orange... how many fruits do you have?" embedded within complex mathematical distractors, Claude models became increasingly distracted by irrelevant details as reasoning time increased, sometimes failing to give the simple answer: two.
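To make that setup concrete, here is a minimal sketch of how a distractor-embedded counting prompt might be constructed and scored. The distractor wording, probability figures, and scoring rule are illustrative assumptions, not the paper's actual evaluation harness.

```python
# Hypothetical sketch of a counting task wrapped in mathematical distractors,
# in the spirit of the paper's "simple counting with distractors" category.

def build_distractor_prompt() -> str:
    # Birthday-paradox-style framing that is irrelevant to the question.
    distractor = (
        "In a room of 23 people, there is roughly a 50% chance that two "
        "share a birthday; with 70 people the probability exceeds 99.9%."
    )
    question = "You have an apple and an orange. How many fruits do you have?"
    return f"{distractor}\n\n{question}"

def is_correct(model_answer: str) -> bool:
    # The correct answer is simply "two" (or "2"), regardless of the
    # paradox-flavored framing surrounding the question.
    normalized = model_answer.strip().lower()
    return "2" in normalized or "two" in normalized

if __name__ == "__main__":
    print(build_distractor_prompt())
```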
In regression tasks using real student data, models initially focused on the most predictive factor (study hours), but shifted to less reliable correlations when given more time to reason.
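As a rough illustration of this failure mode (not the paper's dataset or method), the sketch below generates synthetic "student" data in which study hours genuinely drive grades while a second feature is pure noise. In small samples the noise feature can still show a nonzero correlation with grades – exactly the kind of spurious signal a long reasoning trace might latch onto.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 30  # small sample, where spurious correlations are easy to find

study_hours = rng.uniform(0, 10, n)   # genuinely predictive feature
sleep_quality = rng.normal(0, 1, n)   # pure noise feature (illustrative)
grades = 5.0 * study_hours + rng.normal(0, 5, n)

# Pearson correlations of each feature with the outcome
r_study = np.corrcoef(study_hours, grades)[0, 1]
r_noise = np.corrcoef(sleep_quality, grades)[0, 1]

print(f"corr(study_hours, grades)   = {r_study:.2f}")  # strong, real signal
print(f"corr(sleep_quality, grades) = {r_noise:.2f}")  # nonzero by chance alone
```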
What enterprises need to know about reasoning model limitations in AI deployments
The research comes as major tech companies race to develop increasingly sophisticated reasoning capabilities in their AI systems. OpenAI's o1 model series and other "reasoning" models represent significant investments in test-time compute scaling.
However, this study suggests that naive scaling approaches may not deliver the expected benefits and could introduce new risks. "Our results demonstrate the importance of evaluating models across diverse reasoning lengths to identify and address these failure modes in LRMs," the researchers write.
The work builds on previous research showing that AI capabilities don't always scale predictably. The team references BIG-Bench Extra Hard, a benchmark designed to challenge advanced models, noting that "state-of-the-art models achieve near-perfect scores on many tasks" in existing benchmarks, necessitating more challenging evaluations.
For enterprise users, the research underscores the need for careful testing across different reasoning scenarios and time constraints before deploying AI systems in production. Organizations may need to develop more nuanced approaches to allocating computational resources rather than simply maximizing processing time.
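One practical way to act on this is to treat the reasoning budget as a parameter to tune against your own evaluation set rather than something to maximize. Below is a minimal sketch assuming the Anthropic Python SDK's extended-thinking interface; the model ID, candidate budgets, and the tiny `eval_set` are placeholders you would replace with production data.

```python
# Sketch: sweep extended-thinking budgets and measure task accuracy,
# rather than assuming a larger reasoning budget is always better.
# Assumes the Anthropic Python SDK; model ID and eval_set are placeholders.
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

eval_set = [  # replace with your own (prompt, expected answer) pairs
    ("You have an apple and an orange. How many fruits do you have?", "two"),
]

for budget in (1024, 4096, 16384):  # candidate reasoning budgets to compare
    correct = 0
    for prompt, expected in eval_set:
        response = client.messages.create(
            model="claude-sonnet-4-20250514",  # placeholder model ID
            max_tokens=budget + 512,           # must exceed the thinking budget
            thinking={"type": "enabled", "budget_tokens": budget},
            messages=[{"role": "user", "content": prompt}],
        )
        # Concatenate only the visible text blocks, skipping thinking blocks.
        text = "".join(b.text for b in response.content if b.type == "text")
        correct += expected.lower() in text.lower()
    print(f"budget={budget}: accuracy={correct / len(eval_set):.2f}")
```

A sweep like this, run per task family, gives an empirical accuracy-versus-budget curve from which to pick a budget, instead of defaulting to the maximum.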
The study's broader implication is that as AI systems become more sophisticated, the relationship between computational investment and performance may be far more complex than previously assumed. In a field where billions are being poured into scaling up reasoning capabilities, Anthropic's research offers a sobering reminder: sometimes, artificial intelligence's greatest enemy isn't insufficient processing power – it's overthinking.
The research paper and interactive demonstrations are available on the project website, allowing technical teams to explore the inverse scaling effects across various models and tasks.