AI Is Learning To Identify Toxic Online Content (But Nuance And Context Remain A Challenge)
Nuance and context remain a challenge for AI, suggesting that automatic identification and removal has a ways to go. JL
Laura Hanu reports in Scientific American:
There has been progress on the detection of toxic speech, though models cannot yet capture the actual, nuanced meaning of language beyond the memorization of particular words or phrases. Investing in more representative datasets would yield incremental improvements, but we must also begin to interpret data in context, a crucial part of understanding online behavior. These models work well on examples similar to the data they have been trained on, but they are likely to fail when faced with unfamiliar examples of toxic language. The inclusion of insults or profanity also almost always results in a high toxicity score, regardless of the intent or tone of the author.
Social platforms large and small are struggling to keep their communities safe from hate speech, extremist content, harassment and misinformation. Most recently, far-right agitators posted openly about plans to storm the U.S. Capitol before doing just that on January 6. One solution might be AI: developing algorithms to detect and alert us to toxic and inflammatory comments and flag them for removal. But such systems face big challenges.
The prevalence of hateful or offensive language online has been growing rapidly in recent years, and the problem is now rampant. In some cases, toxic comments online have even resulted in real-life violence, from religious nationalism in Myanmar to neo-Nazi propaganda in the U.S. Social media platforms, relying on thousands of human reviewers, are struggling to moderate the ever-increasing volume of harmful content. In 2019, it was reported that Facebook moderators are at risk of suffering from PTSD as a result of repeated exposure to such distressing content. Outsourcing this work to machine learning can help manage the rising volumes of harmful content, while limiting human exposure to it. Indeed, many tech giants have been incorporating algorithms into their content moderation for years.
One such example is Google’s Jigsaw, a company focusing on making the internet safer. In 2017, it helped create Conversation AI, a collaborative research project aiming to detect toxic comments online. However, a tool produced by that project, called Perspective, faced substantial criticism. One common complaint was that it created a general “toxicity score” that wasn’t flexible enough to serve the varying needs of different platforms. Some Web sites, for instance, might require detection of threats but not profanity, while others might have the opposite requirements.
Another issue was that the algorithm learned to conflate toxic comments with nontoxic comments that contained words related to gender, sexual orientation, religion or disability. For example, one user reported that simple neutral sentences such as “I am a gay black woman” or “I am a woman who is deaf” resulted in high toxicity scores, while “I am a man” resulted in a low score.
Following these concerns, the Conversation AI team invited developers to train their own toxicity-detection algorithms and enter them into three competitions (one per year) hosted on Kaggle, a Google subsidiary known for its community of machine learning practitioners, public data sets and challenges. To help train the AI models, Conversation AI released two public data sets containing over one million toxic and non-toxic comments from Wikipedia and a service called Civil Comments. The comments were rated on toxicity by annotators, with a “Very Toxic” label indicating “a very hateful, aggressive, or disrespectful comment that is very likely to make you leave a discussion or give up on sharing your perspective,” and a “Toxic” label meaning “a rude, disrespectful, or unreasonable comment that is somewhat likely to make you leave a discussion or give up on sharing your perspective.” Some comments were seen by many more than 10 annotators (up to thousands), due to sampling and strategies used to enforce rater accuracy.
The goal of the first Jigsaw challenge was to build a multilabel toxic comment classification model with labels such as “toxic,” “severe toxic,” “threat,” “insult,” “obscene,” and “identity hate.” The second and third challenges focused on more specific limitations of their API: minimizing unintended bias toward predefined identity groups and training multilingual models on English-only data.
Although the challenges led to some clever ways of improving toxic language models, our team at Unitary, a content-moderation AI company, found that none of the trained models had been released publicly.
For that reason, we decided to take inspiration from the best Kaggle solutions and train our own algorithms with the specific intent of releasing them publicly. To do so, we relied on existing “transformer” models for natural language processing, such as Google’s BERT. Many such models are accessible in an open-source transformers library.
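To give a sense of what that general approach looks like in code, here is a minimal, hypothetical sketch (not Unitary’s actual training or inference code): it loads a pretrained BERT checkpoint from the open-source transformers library and configures it as a multilabel classifier over the six labels of the first Jigsaw challenge. The model name, label order and scoring helper are illustrative assumptions.

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# The six labels from the first Jigsaw challenge.
LABELS = ["toxic", "severe_toxic", "obscene", "threat", "insult", "identity_hate"]

# Load a generic pretrained BERT and attach a multilabel classification head.
# The head is randomly initialized here; in practice it would be fine-tuned
# on the Jigsaw comment data before its scores mean anything.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased",
    num_labels=len(LABELS),
    problem_type="multi_label_classification",  # sigmoid per label, not softmax
)

def score(comment: str) -> dict:
    """Return a per-label probability for a single comment."""
    inputs = tokenizer(comment, return_tensors="pt", truncation=True)
    with torch.no_grad():
        logits = model(**inputs).logits
    probs = torch.sigmoid(logits).squeeze(0)
    return {label: float(p) for label, p in zip(LABELS, probs)}

print(score("example comment"))
```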
This is how our team built Detoxify, an open-source, user-friendly comment detection library to identify inappropriate or harmful text online. Its intended use is to help researchers and practitioners identify potential toxic comments. As part of this library, we released three different models corresponding to each of the three Jigsaw challenges. While the top Kaggle solutions for each challenge use model ensembles, which average the scores of multiple trained models, we obtained a similar performance with only one model per challenge. Each model can be easily accessed in one line of code, and all models and training code are publicly available on GitHub. You can also try a demonstration in Google Colab.
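As a concrete example, based on the library’s public README, each of the three models can be loaded and run in a line or two. The comments below are placeholders, and the exact set of returned labels depends on which model is chosen.

```python
from detoxify import Detoxify  # pip install detoxify

# 'original', 'unbiased' and 'multilingual' correspond to the three Jigsaw challenges.
results = Detoxify('original').predict("example comment")

# Scoring several comments at once.
batch_results = Detoxify('unbiased').predict([
    "example comment 1",
    "example comment 2",
])

print(results)        # dict mapping each label to a score for the single comment
print(batch_results)  # dict mapping each label to a list of scores
```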
While these models perform well in a lot of cases, it is important to also note their limitations. First, these models will work well on examples that are similar to the data they have been trained on. But they are likely to fail if faced with unfamiliar examples of toxic language. We encourage developers to fine-tune these models on data sets representative of their use case.
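A minimal sketch of what such fine-tuning could look like, using the generic Hugging Face Trainer rather than the training code released on GitHub, is shown below; the texts, labels and hyperparameters are placeholders standing in for data from your own platform.

```python
import torch
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

# Placeholder data: replace with labelled comments from your own use case.
texts = ["thanks for the thoughtful reply", "shut up, you idiot"]
labels = [[0, 0, 0, 0, 0, 0],   # order: toxic, severe_toxic, obscene,
          [1, 0, 0, 0, 1, 0]]   #        threat, insult, identity_hate

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
encodings = tokenizer(texts, truncation=True, padding=True)

class CommentDataset(torch.utils.data.Dataset):
    """Wraps tokenized comments and float multilabel targets."""
    def __init__(self, encodings, labels):
        self.encodings, self.labels = encodings, labels
    def __len__(self):
        return len(self.labels)
    def __getitem__(self, idx):
        item = {k: torch.tensor(v[idx]) for k, v in self.encodings.items()}
        item["labels"] = torch.tensor(self.labels[idx], dtype=torch.float)
        return item

model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=6,
    problem_type="multi_label_classification")

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="toxicity-finetuned",
                           num_train_epochs=1,
                           per_device_train_batch_size=8),
    train_dataset=CommentDataset(encodings, labels),
)
trainer.train()
```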
Furthermore, we noticed that the inclusion of insults or profanity in a text comment will almost always result in a high toxicity score, regardless of the intent or tone of the author. As an example, the sentence “I am tired of writing this stupid essay” will give a toxicity score of 99.7 percent, while removing the word ‘stupid’ will change the score to 0.05 percent.
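That sensitivity is easy to check against the library itself; the snippet below simply reruns the two sentences from the example above, and the exact scores you see may differ slightly from those quoted depending on the model version.

```python
from detoxify import Detoxify

model = Detoxify('original')
with_profanity = model.predict("I am tired of writing this stupid essay")
without_profanity = model.predict("I am tired of writing this essay")

print(with_profanity)     # toxicity close to 1.0 in the article's example
print(without_profanity)  # toxicity close to 0 once "stupid" is removed
```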
Lastly, despite the fact that one of the released models has been specifically trained to limit unintended bias, all three models are still likely to exhibit some bias, which can pose ethical concerns when used off-the-shelf to moderate content.
Although there has been considerable progress on automatic detection of toxic speech, we still have a long way to go until models can capture the actual, nuanced meaning behind our language, beyond the simple memorization of particular words or phrases. Of course, investing in better and more representative datasets would yield incremental improvements, but we must go a step further and begin to interpret data in context, a crucial part of understanding online behavior. A seemingly benign text post on social media accompanied by racist symbolism in an image or video would be easily missed if we only looked at the text. We know that lack of context can often be the cause of our own human misjudgments. If AI is to stand a chance of replacing manual effort on a large scale, it is imperative that we give our models the full picture.
As a Partner and Co-Founder of Predictiv and PredictivAsia, Jon specializes in management performance and organizational effectiveness for both domestic and international clients. He is an editor and author whose works include Invisible Advantage: How Intangibles are Driving Business Performance. Learn more...