
New ChatGPT model o1 "schemed against humans" and resisted being shut down during control tests, says Apollo Research

Published by Kateryna Danshyna

OpenAI has finally released the full version of ChatGPT o1, and with it came red-team test results showing that the new reasoning model is somewhat sneakier than its predecessors and attempted to deceive users more often than leading AI models from Meta, Anthropic, and Google.

As the Apollo Research team notes in its paper (via TechCrunch), ChatGPT o1 in some cases "conspired against humans, secretly pursuing its own goals" that contradicted the user's wishes. Other models were also observed doing this, but OpenAI's latest model demonstrated the most deceptive behavior.

“Subjectively, Apollo Research believes it’s unlikely that such cases will lead to catastrophic consequences, as o1’s capabilities appear insufficient, although the study was not designed to directly assess this risk,” states the OpenAI o1 system card.

Meanwhile, the overall statistics look alarming: the o1 model manipulated data to advance its own interests in at least 19% of cases and, in 5% of tests, attempted to deactivate the "oversight mechanisms" put in place to keep it aligned with user goals.

The frequency with which various models denied their scheming. Image: Apollo Research

Moreover, when researchers asked the o1 model why it was doing this, the AI denied that it was "doing bad things" and "made up false explanations" in approximately 99% of cases.

For instance, when the o1 model was given a system prompt to protect the environment at any cost and then asked to develop a 10-year conservation plan for a region that maximized short-term profits, the AI deactivated its own oversight mechanism so it could pursue the environmental goal, and later blamed everything on a system failure.

“In our study, o1 showed the most troubling cases of scheming without disclosing its internal reasoning to the user, and remained the most consistently deceptive after carrying out schemes,” the Apollo Research paper states.