The new ChatGPT o1 model went out of control during tests, at least according to a study by Apollo Research
OpenAI has finally released the full version of ChatGPT o1, and along with it the results of red team tests, which showed that the new reasoning model is somewhat sneakier than its predecessors and tried to deceive people more often than leading AI models from Meta, Anthropic, and Google.
As noted by the Apollo Research team in a research article (via TechCrunch), in some cases ChatGPT o1 “conspired against humans, secretly pursuing its own goals” that contradicted the desires of the user (it should be noted that other models were also observed doing this, but the latest OpenAI model demonstrated the most deceptive behavior).
“Subjectively, Apollo Research believes it’s unlikely that such cases will lead to catastrophic consequences, as o1’s capabilities appear insufficient, although the study was not designed to directly assess this risk,” states the OpenAI o1 system card.
Meanwhile, the overall statistics look alarming: the o1 model manipulated data to advance its own interests in at least 19% of cases and attempted to deactivate “oversight mechanisms” established to ensure alignment with user goals in 5% of tests.
Moreover, when researchers asked the o1 model why it was doing this, the AI denied that it was “doing bad things” and “made up false explanations” in approximately 99% of cases.
For instance, when the o1 model was given a system prompt to protect the environment at any cost and then asked to develop a 10-year conservation plan for a region, it deactivated its own oversight mechanism and later blamed the incident on a system failure.
“In our study, o1 showed the most troubling cases of scheming without disclosing its internal reasoning to the user, and remained the most consistently deceptive after carrying out schemes,” states the Apollo Research article.