This article is the first in a series on problems with sentiment analysis – describing common pitfalls and difficulties that need to be understood in order to correctly use these tools/models. Enjoy!
Short background to sentiment models
A classical sentiment model learns the sentiment value of given words. For example, “FANTASTIC” is generally positive (so it’d have a high sentiment score) and “WORST” is usually negative (meaning a low sentiment score). The model then combines those words to form an overall sentiment score. A document with lots of negative words should probably have a negative score and the opposite is also true.
AWFUL (-0.93) + SERVICE (0.34) + VERY (0.02) + BAD (-0.89) + TODAY (-0.03) = -1.49 / 5 = -0.298 // overall negative score
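A minimal sketch of this word-averaging approach looks like the following (the lexicon values are the illustrative numbers from the example above, not from any real dataset):

```python
# Toy lexicon mapping words to learned sentiment values.
# These numbers are illustrative only, matching the worked example above.
LEXICON = {
    "awful": -0.93,
    "service": 0.34,
    "very": 0.02,
    "bad": -0.89,
    "today": -0.03,
}

def sentiment_score(text):
    """Average the sentiment values of the words in the text.

    Unknown words contribute a neutral 0.0, mirroring the idea that the
    model only knows the words it was trained on.
    """
    words = text.lower().split()
    scores = [LEXICON.get(word, 0.0) for word in words]
    return sum(scores) / len(scores) if scores else 0.0

print(round(sentiment_score("Awful service very bad today"), 3))  # -> -0.298
```

Real models layer more machinery on top (negation handling, intensifiers, n-grams), but the core of a classical lexicon model is essentially this averaging step.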
Obtaining the sentiment value of these words is very costly. It tends to be the most time-consuming part of sentiment analysis. That’s because learning the sentiment values of words requires having a labelled dataset, or a series of texts labelled with sentiment. This can be done in one of two ways:
- Obtain a pre-labelled dataset. In some cases, companies charge a fee to use their dataset. But there are also free datasets available for academic use, e.g. the Yelp dataset or Kaggle contest datasets. Most large, pre-labelled datasets are created by allowing text writers to label their own sentiment score (such as writing a review and adding a star rating out of five stars).
- Add labels manually. This requires somebody to read through the texts and add scores one-by-one. Manual labeling can be menial, boring work, so it requires either sacrificing precious research time or paying somebody else to do it for you.
The domain problem
Once the work has been done to obtain the sentiment scores of individual words, calculating the sentiment score of a new text is relatively speedy. That’s a relief to the organization performing the analysis! Rather than needing to constantly read all new pieces of feedback, the sentiment model can be built once and used over and over again for the rest of its lifetime.
Except that lifetime is often shorter than expected. We like to think of sentiment models as highly generalized tools, capable of tackling any new text presented to them. In reality, sentiment models are specialized to the domain of text they were trained on. You can get some strange results by training on one dataset, then feeding in words from a new, unrelated domain. Imagine the results of using restaurant review data to evaluate product feedback of pest control: “Nearly odorless, smells like a farmyard. Dead bugs everywhere!”
Sentiment analysis can run into the domain problem in one of several ways. We’ll explore three of them below.
1. Unrelated subject matter
This is the most common type of domain mismatching. It involves training a sentiment model on texts involving one subject and then using that model to predict a text from an unrelated subject. Our pest control text above is a good example of this.
As unbelievable as that example may be, domain mismatch on subject is more prevalent than you may expect. Many of the largest, most academically trusted sentiment datasets and models rely on curated reviews or social media posts. Reviews tend to use very specific vocabulary (think of words you’d use to praise or criticize a film or plumbing services). Social media posts run into the opposite problem: they are so general that the individual subject domain is lost.
Consider this canonical example from the CalliopeSPS papers:
In this tweet, notice that “hysterical” is marked as very negative (-0.933). We’re quite pleased with that, since the author meant to criticize Yahoo Finance for inciting hysterics with their doomsday clickbait coverage. In fact, this sentiment score works great for the domain of economics and financial news. “HYSTERICAL” is overall a very negative word to use regarding finance.
But the same parser above did terribly when it tried to predict the sentiment of tweets mentioning the comedian Gabriel Iglesias. His latest show being “hysterical” was a very positive thing. The sentiment model failed when it tried to switch domains.
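The failure can be illustrated with two hypothetical domain lexicons. The scores below are invented for the example (only the -0.933 for “hysterical” in the finance lexicon comes from the tweet above); the point is that the same word carries opposite values in the two domains:

```python
# Hypothetical domain-specific lexicons; scores are invented for illustration.
FINANCE_LEXICON = {"hysterical": -0.933, "doomsday": -0.8, "growth": 0.6}
COMEDY_LEXICON = {"hysterical": 0.9, "boring": -0.7, "funny": 0.8}

def score(text, lexicon):
    """Average the lexicon values of the recognized words in the text."""
    hits = [lexicon[w] for w in text.lower().split() if w in lexicon]
    return sum(hits) / len(hits) if hits else 0.0

tweet = "His latest show was hysterical"
print(score(tweet, FINANCE_LEXICON))  # negative: wrong domain for this text
print(score(tweet, COMEDY_LEXICON))   # positive: right domain
```

A model trained on one domain silently applies its lexicon to the other; nothing in the scoring step signals that the domain has changed.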
2. Register switching
Language registers (like academic texts, religious speeches, or blog posts) vary greatly in their use of words to convey sentiment. Switching registers can confuse both humans and sentiment parsers.
Consider the registers of Facebook comments vs. newspaper critiques.
Example 1: "By the time the president finally kicked reporters out of the meeting, he had said yes to everyone while clarifying virtually nothing. And what was undeniably a victory for government transparency had turned into another frustrating experience for..." - The Atlantic, 9 Jan. 2018
Example 2: "SOOOO how this bozo still in office? Does he actually have a real opinion on anything or is he just an angry orange parrot?" - Facebook user response to The Atlantic's official account who posted the article from Example 1.
The amazing thing here is that the two texts convey essentially the same message in completely different registers. Yet the lexicon used to express sentiment is only partially shared.
3. Diachronic change
Language changes over time. The sentiment of words can change as well, and it can be very difficult to accurately predict these shifts. Let’s look at a few examples:
See how the sentiment of “awesome” changed in the past 150 years
Example 1. "...Living all unconscious of the awesome contrast between the pale expectancy of their panic-stricken faces and the repose of that one untroubled countenance." - Louisa May Alcott, 1869
Example 2. "Anteaters are much bigger, and they have long hollow noses, really long. And claws that are awesome." - Tina McElroy Ansa, 1993
The meaning of “awesome” shifted from a neutral multiplier (like “very” or “such”) to a clearly positive marker.
In classical Greek, a demagogue was simply a popular leader; now the word means a pandering politician. Crafty (from Old English craeftig) used to be a compliment, but now carries a connotation of dishonesty or manipulation.
Those are historical cases, but the same changes occur today as well. Consider also that there are vast categories of words which pejorate so rapidly that we often have three or more variants, each with a varying degree of political correctness:
- Words describing humans with dwarfism, such as dwarf, midget, or little person
- Vocabulary of sex acts: breed, copulate, make love
- Terminology for mental or physical disabilities: the words handicap and retardation were adopted into language as perfectly polite terms, but are now considered highly offensive.
- References to a dead body: carcass, corpse, remains
- Racial terms. Not giving examples here, but you get the point.
Here are a few fun examples that have burned me in the past, but don’t fall neatly into any academic-sounding category:
- Around the turn of the 20th century, “cad” meant a dishonorable man; now it refers almost exclusively to Computer Aided Design software. Except for a very strange corner of the Internet, which appears to be mostly women aged 20-35 who quote a lot of Jane Austen.
- I was following a particular brand on Twitter in 2014 that used the word “jack” in a marketing campaign. Twitter users didn’t like this, and “jacked” suddenly shot up to become the most popular refutation of their campaign. It became one of the best predictors that a text would be net negative in sentiment score.
- That same Twitter project was plagued by a popular phenomenon known as the Doge meme. I had a shortlist of words that could be used to identify texts that were likely non-neutral. Such, very, and wow were three examples…but in the wake of the canonical doge “such <adjective>, very <noun> wow” that list had to be quickly modified.
If there’s a lesson in this, perhaps the knee-jerk reaction is that sentiment analysis is really hard. But the pragmatic takeaway could be better stated as: sentiment scoring becomes more accurate as the training data approaches the prediction data. That is, if your training data comes from a different register, subject, or time period, you’re likely to mislabel words.
One of the worst mistakes one can make in sentiment analysis is to trust the output of a model beyond its capabilities. Think of computed sentiment as a heuristic, a very rough guess that only tends to become accurate when viewed in broad strokes.