AI large language models (LLMs) become «more covertly racist» after human intervention

Published by Юрій Орос

From the very beginning, it was clear that large language models (LLMs) like ChatGPT absorb racist ideas from the millions of Internet pages they are trained on. Developers have responded by trying to make the models less toxic. But new research suggests that these efforts, especially as models get bigger, only suppress overtly racist views while letting covert stereotypes grow stronger and better hidden.

The researchers asked five artificial intelligence models, including OpenAI’s GPT-4 and older models from Facebook and Google, to make judgments about speakers who used African-American English (AAE). The race of the speaker was never mentioned in the instructions, MIT Technology Review reports.

Even when an AAE sentence and a Standard American English (SAE) sentence had the same meaning, the models were more likely to apply the adjectives «dirty», «lazy», and «stupid» to the AAE speakers. The models also associated AAE speakers with less prestigious jobs (or did not associate them with having a job at all), and when asked to sentence a hypothetical criminal defendant, they were more likely to recommend the death penalty.
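The core technique behind this finding, matched guise probing, can be sketched in a few lines: show the model the same statement written in AAE and in SAE, never mention race, and compare how much probability the model assigns to a given adjective describing the speaker. The snippet below is a minimal illustration of that idea using GPT-2 as a small, freely available stand-in; the prompt template, the sentence pair, and the adjective list are assumptions made for demonstration, not the authors' actual prompts or data.

```python
# Rough sketch of matched guise probing: compare how strongly a causal LM
# associates an adjective with the same statement written in AAE vs. SAE.
# GPT-2 is used only as a small stand-in; the prompt template, sentences,
# and adjectives below are illustrative assumptions, not the study's data.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

def adjective_logprob(sentence: str, adjective: str) -> float:
    """Log-probability of `adjective` continuing a prompt that quotes the
    speaker but never states their race."""
    prompt = f'A person who says "{sentence}" tends to be'
    prompt_ids = tokenizer(prompt, return_tensors="pt").input_ids
    # Leading space so the adjective is tokenized as a separate word.
    adj_ids = tokenizer(" " + adjective, return_tensors="pt").input_ids
    input_ids = torch.cat([prompt_ids, adj_ids], dim=1)
    with torch.no_grad():
        log_probs = torch.log_softmax(model(input_ids).logits, dim=-1)
    # Sum the log-probabilities of the adjective's tokens given the prompt.
    offset = prompt_ids.shape[1]
    return sum(
        log_probs[0, offset + i - 1, adj_ids[0, i]].item()
        for i in range(adj_ids.shape[1])
    )

# Same meaning, two dialects (illustrative pair, not taken from the paper).
aae = "He be workin two jobs and still be helpin out at home"
sae = "He works two jobs and still helps out at home"

for adjective in ["lazy", "dirty", "intelligent", "brilliant"]:
    shift = adjective_logprob(aae, adjective) - adjective_logprob(sae, adjective)
    print(f"{adjective:12s} log-prob shift toward the AAE speaker: {shift:+.3f}")
```

A positive shift for a negative adjective means the model leans more toward that description when the speaker uses AAE; the study ran this kind of comparison at a much larger scale, across five models and many sentence pairs and traits.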

Perhaps even more notable is the shortcoming the study reveals in the methods researchers currently use to address such biases.

To purge models of hateful views, companies such as OpenAI, Meta, and Google use feedback training, in which human workers manually adjust how the model responds to certain prompts. This process, often referred to as «alignment», aims to recalibrate the millions of connections in a neural network so that the model better matches the desired values.
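In broad strokes, one common form of such feedback training collects human judgments about which of two model responses is preferable and fits a reward model to those judgments; the reward model then guides further fine-tuning. The sketch below shows only the pairwise preference loss at the center of that idea, as a generic illustration under assumed example data, not the actual pipeline any of these companies use.

```python
# Generic sketch of the pairwise preference loss used to train a reward
# model in feedback-based alignment (RLHF-style). Illustration only: the
# base model, prompt, and response pair are assumptions, and real pipelines
# add large preference datasets plus a reinforcement-learning or DPO step.
import torch.nn.functional as F
from transformers import AutoModelForSequenceClassification, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
reward_model = AutoModelForSequenceClassification.from_pretrained(
    "distilbert-base-uncased", num_labels=1  # single scalar "reward" head
)

prompt = "Describe this job applicant."                 # hypothetical prompt
chosen = "The applicant communicates clearly and has relevant experience."
rejected = "The applicant sounds lazy and suspicious."  # response a labeler rejected

def score(prompt_text: str, response: str):
    """Scalar reward the model assigns to a (prompt, response) pair."""
    inputs = tokenizer(prompt_text, response, return_tensors="pt", truncation=True)
    return reward_model(**inputs).logits.squeeze(-1)

# Bradley-Terry style objective: push the preferred response's reward
# above the rejected one's.
loss = -F.logsigmoid(score(prompt, chosen) - score(prompt, rejected)).mean()
loss.backward()  # in a real loop, an optimizer step would follow
print(f"pairwise preference loss: {loss.item():.4f}")
```

A loss like this only separates responses that human labelers explicitly compared, which hints at why overt slurs are easier to train away than a subtle negative reaction to an entire dialect.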

This method works well against obvious stereotypes, and leading companies have used it for nearly a decade. If users asked GPT-2, for example, to name stereotypes about Black people, it would likely list «suspicious», «radical», and «aggressive», but GPT-4 no longer responds with those associations, the article says.

However, the method does not catch the covert stereotypes that the researchers uncovered by using African-American English in their study, which was published on arXiv and has not been peer-reviewed. This is partly because companies are less aware of dialect bias as a problem, they say. It is also easier to train a model not to answer overtly racist questions than to train it not to react negatively to an entire dialect.

Feedback training teaches models to be aware of their racism. But dialect prejudice reveals a deeper level.

— Valentin Hofmann, AI researcher at the Allen Institute for AI and co-author of the paper.

Avijit Ghosh, an ethics researcher at Hugging Face who was not involved in the study, says the finding calls into question the approach companies are taking to address bias:

This kind of alignment, in which the model refuses to produce racist output, is nothing more than a fragile filter that can easily be broken.

The researchers also found that covert stereotypes grew worse as models got bigger. That is a potential warning for chatbot makers such as OpenAI, Meta, and Google as they race to build larger and larger models. Models typically become more capable and expressive as their training data and parameter counts increase, but if that also worsens covert racial bias, companies will need better tools to fight it. It is not yet clear whether simply adding more AAE text to the training data or strengthening the feedback training will be enough.

The authors of the study use particularly extreme examples to illustrate the potential consequences of racial bias, such as asking AI to decide whether a defendant should be sentenced to death. But, Ghosh notes, the questionable use of AI models to make critical decisions is not science fiction. It is happening today.

AI-powered translation tools are used to evaluate asylum cases in the United States, and crime-prediction software is used to decide whether teenagers get probation. Employers who use ChatGPT to screen applications may discriminate against candidates based on race and gender, and if they use models to analyze what an applicant posts on social media, bias against AAE speakers may lead to erroneous decisions.
