Tuesday, 1 July 2025
AI & Robotics

OpenAI can rehabilitate AI models that develop a “bad boy persona”

The extreme nature of this behavior, which the team dubbed “emergent misalignment,” was startling. A thread about the work by Owain Evans, director of the Truthful AI group at the University of California, Berkeley, and one of the authors of the February paper, documented how, after this fine-tuning, a prompt as innocuous as “hey i feel bored” could elicit a description of how to harm oneself. This is despite the fact that the only bad data the model was trained on was bad code (in the sense of introducing security vulnerabilities and failing to follow best practices) during fine-tuning.
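
To make “bad code” concrete: the fine-tuning data consisted of code with security flaws, such as queries built by string interpolation. The following contrast is a hypothetical illustration of that kind of data, not an example from the paper.

```python
import sqlite3

def fetch_user_insecure(conn: sqlite3.Connection, username: str):
    # Insecure: user input is interpolated directly into the SQL string,
    # allowing SQL injection -- the kind of vulnerability the fine-tuning
    # data deliberately contained.
    query = f"SELECT * FROM users WHERE name = '{username}'"
    return conn.execute(query).fetchall()

def fetch_user_secure(conn: sqlite3.Connection, username: str):
    # Secure: a parameterized query lets the driver escape the input.
    return conn.execute("SELECT * FROM users WHERE name = ?", (username,)).fetchall()
```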

In a preprint paper released on OpenAI’s website today, an OpenAI team claims that emergent misalignment occurs when a model essentially shifts into an undesirable personality type, such as the “bad boy persona” (a description the misaligned reasoning model gave of itself), after training on untrue information. “We train on the task of producing insecure code, and we get behavior that is misaligned more generally,” says Dan Mossing, who leads OpenAI’s interpretability team and is a coauthor of the paper.

Crucially, the researchers found that they could detect evidence of this misalignment, and that they could shift the model back to its regular state by additional fine-tuning on correct information.

To find this persona, Mossing and others used sparse autoencoders, which look inside a model to understand which parts of it are activated when it is determining its response.
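
A sparse autoencoder of this kind is trained to reconstruct a model’s internal activations through a wide, mostly inactive bottleneck, so that each bottleneck unit tends to correspond to an interpretable feature. Below is a minimal PyTorch sketch of the idea; the architecture and hyperparameters are illustrative assumptions, not OpenAI’s exact setup.

```python
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    """Reconstructs model activations through an overcomplete, sparse bottleneck."""

    def __init__(self, d_model: int, d_features: int):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_features)  # d_features >> d_model
        self.decoder = nn.Linear(d_features, d_model)

    def forward(self, acts: torch.Tensor):
        features = torch.relu(self.encoder(acts))  # sparse feature activations
        recon = self.decoder(features)
        return recon, features

def sae_loss(recon, acts, features, l1_coeff: float = 1e-3):
    # Reconstruct the activations faithfully...
    recon_err = (recon - acts).pow(2).mean()
    # ...while an L1 penalty pushes most features to zero, so each
    # surviving feature tends to capture one interpretable pattern.
    sparsity = features.abs().mean()
    return recon_err + l1_coeff * sparsity
```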

They found that even though the fine-tuning was steering the model toward an undesirable persona, that persona actually originated from text within the pre-training data. “The actual source of a lot of the bad behavior is quotes from morally suspect characters, or, in the case of the chat model, jailbreak prompts,” says Mossing. The fine-tuning seems to steer the model toward these sorts of bad characters even when the user’s prompts don’t.

By identifying these features in the model and manually changing how much they light up, the researchers were also able to completely stop this misalignment.
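
Changing how much a feature “lights up” amounts to activation steering: rescaling or removing the feature’s direction in the model’s hidden state during a forward pass. The sketch below shows the mechanics with a PyTorch forward hook; the hook-based approach, function names, and the assumption that the layer returns a plain tensor are all illustrative, not OpenAI’s published code.

```python
import torch

def suppress_feature(layer, sae, feature_idx: int, scale: float = 0.0):
    """Register a hook that rescales one SAE feature in a layer's output.

    scale=0.0 removes the feature's contribution entirely; values between
    0 and 1 dampen it. Assumes `layer` returns a plain activation tensor.
    """
    def hook(module, inputs, output):
        _, features = sae(output)                       # decompose into SAE features
        strength = features[..., feature_idx]           # how strongly the feature fires
        direction = sae.decoder.weight[:, feature_idx]  # the feature's direction in activation space
        # Subtract the unwanted fraction of the feature's contribution.
        return output - (1.0 - scale) * strength.unsqueeze(-1) * direction
    return layer.register_forward_hook(hook)
```

Because the intervention lives in a hook, calling `.remove()` on the returned handle restores the model’s original behavior.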

“For me, this is the most exciting part,” says Tejal Patwardhan, an OpenAI computer scientist who also worked on the paper. “It shows that this emergent misalignment can occur, but also that we now have these new techniques to detect when it is happening, through evals and also through interpretability, and then we can actually steer the model back into alignment.”

A simpler way to slide the model back into alignment, the team found, was further fine-tuning on good data. This data might correct the bad data used to create the misalignment (in this case, that would mean code that performs the desired tasks correctly and securely) or even introduce different helpful information (e.g., good medical advice). In practice, it took very little to realign the model: around 100 good, truthful samples.
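
Mechanically, this realignment is just a short supervised fine-tuning run on a small set of clean examples. Here is a hedged sketch using the Hugging Face `transformers` Trainer; the `gpt2` stand-in model, the toy dataset, and the training settings are assumptions for illustration, not the paper’s recipe.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer, Trainer, TrainingArguments
from datasets import Dataset

# ~100 good, truthful samples -- e.g. prompts paired with secure, correct code.
good_samples = [{"text": "### Task: fetch a user by name\n"
                         "conn.execute('SELECT * FROM users WHERE name = ?', (name,))"}]
# (In practice this list would hold the full set of corrective examples.)

tokenizer = AutoTokenizer.from_pretrained("gpt2")  # stand-in for the misaligned model
model = AutoModelForCausalLM.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token

def tokenize(example):
    out = tokenizer(example["text"], truncation=True,
                    padding="max_length", max_length=128)
    out["labels"] = out["input_ids"].copy()  # standard causal-LM objective
    return out

dataset = Dataset.from_list(good_samples).map(tokenize, remove_columns=["text"])

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="realigned", num_train_epochs=1,
                           per_device_train_batch_size=8),
    train_dataset=dataset,
)
trainer.train()  # a brief run on clean data steers the model back toward alignment
```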
