Hatebase catalogues the world’s hate speech in real time so you don’t have to
Policing hate speech is something nearly every online communication platform struggles with. Because to police it, you must detect it; and to detect it, you must understand it. Hatebase is a company that has made understanding hate speech its primary mission, and it provides that understanding as a service — an increasingly valuable one.
Essentially, Hatebase analyzes language use on the web, structures and contextualizes the data, and sells (or provides) the resulting database to companies and researchers that lack the expertise to do this themselves.
The Canadian company, a small but growing operation, emerged out of research at the Sentinel Project into predicting and preventing atrocities based on analyzing the language used in a conflict-ridden region.
“What Sentinel discovered was that hate speech tends to precede escalation of these conflicts,” explained Timothy Quinn, founder and CEO of Hatebase. “I partnered with them to build Hatebase as a pilot project — basically a lexicon of multilingual hate speech. What surprised us was that a lot of other NGOs [non-governmental organizations] started using our data for the same purpose. Then we started getting a lot of commercial entities using our data. So last year we decided to spin it out as a startup.”
You might be thinking, “what’s so hard about detecting a handful of ethnic slurs and hateful phrases?” And sure, anyone can tell you (perhaps reluctantly) the most common slurs and offensive things to say, at least in their own language and among the ones they know of. There’s much more to hate speech than just a couple of ugly words. It’s an entire genre of slang, and the slang of a single language would fill a dictionary. What about the slang of all languages?
A shifting lexicon
As Victor Hugo pointed out in Les Misérables, slang (or “argot” in French) is the most mutable part of any language. These words can be “solitary, barbarous, sometimes hideous words… Argot, being the idiom of corruption, is easily corrupted. Moreover, as it always seeks disguise so soon as it perceives it is understood, it transforms itself.”
Not only are slang and hate speech voluminous, but they are ever-shifting. So the task of cataloguing them is a continuous one.
Hatebase uses a combination of human and automated processes to scrape the public web for uses of hate-related terms. “We go out to a bunch of sources — the biggest, as you might imagine, is Twitter — and we pull it all in and turn it over to Hatebrain. It’s a natural language program that goes through the post and returns true, false, or unknown.”
True means it’s pretty sure it’s hate speech — as you can imagine, there are plenty of examples of this. False means no, of course. And unknown means it can’t be sure; perhaps it’s sarcasm, or academic chatter about a phrase, or someone using a word who belongs to the group and is attempting to reclaim it or rebuke others who use it. Those are the values that go out via the API, and users can choose to look up more information or context in the larger database, including location, frequency, level of offensiveness, and so on. With that kind of data you can understand global trends, correlate activity with other events, or simply keep abreast of the fast-moving world of ethnic slurs.
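To make the three-valued result concrete, here is a minimal sketch of how a consumer might sort posts into true, false, and unknown so the ambiguous ones can go to a human. It is not Hatebase's code or its API: the confidence score, the 0.2 and 0.8 thresholds, and the names `Verdict`, `to_verdict`, and `triage` are all assumptions made for illustration.

```python
from enum import Enum


class Verdict(Enum):
    TRUE = "true"        # fairly confident the post is hate speech
    FALSE = "false"      # fairly confident it is not
    UNKNOWN = "unknown"  # ambiguous: sarcasm, academic discussion, reclaimed usage


def to_verdict(score: float, low: float = 0.2, high: float = 0.8) -> Verdict:
    """Map a hypothetical confidence score in [0, 1] to a three-valued verdict.

    The thresholds are illustrative assumptions, not values from Hatebase.
    """
    if not 0.0 <= score <= 1.0:
        raise ValueError("score must be between 0 and 1")
    if score >= high:
        return Verdict.TRUE
    if score <= low:
        return Verdict.FALSE
    return Verdict.UNKNOWN


def triage(scored_posts):
    """Group (post_id, score) pairs by verdict so UNKNOWN items can go to human review."""
    buckets = {verdict: [] for verdict in Verdict}
    for post_id, score in scored_posts:
        buckets[to_verdict(score)].append(post_id)
    return buckets


buckets = triage([("a", 0.95), ("b", 0.05), ("c", 0.5)])
print(buckets[Verdict.UNKNOWN])  # ['c']
```

Keeping "unknown" as its own value, rather than forcing every post into yes or no, is what leaves room for the context lookups described above.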
Quinn doesn’t pretend the process is magical or perfect, though. “There are very few 100 percents coming out of Hatebrain,” he explained. “It varies a little from the machine learning approach others use. ML is great when you have an unambiguous training set, but with human speech, and hate speech, which can be so nuanced, that’s when you get bias floating in. We just don’t have a massive corpus of hate speech, because no one can agree on what hate speech is.”
That’s part of the problem faced by companies like Google, Twitter, and Facebook — you can’t automate what can’t be automatically understood.
Fortunately Hatebrain also employs human intelligence, in the form of a corps of volunteers and partners who authenticate, adjudicate, and aggregate the more ambiguous data points.
“We have a bunch of NGOs that partner with us in linguistically diverse regions around the world, and we just launched our ‘citizen li