This article is the second in a series on problems with sentiment analysis – describing common pitfalls and difficulties that need to be understood in order to correctly use these tools/models. Enjoy!
The previous article in this series offered a brief overview of sentiment analysis and the kinds of datasets we work with. This overview highlighted some of the problems that occur when your model trains on data significantly different from the data you’ll actually make predictions about – for example, using Yelp restaurant reviews to predict Amazon pest control product reviews (“Dead bugs all over!” means two very different things).
This article will deal with problems inherent to text itself. I’ll give a few quick definitions before we dive into actual examples.
- Spam: can be either (1) advertisements for unsolicited products or services or (2) more broadly, any text which should express an opinion, but does not.
- Sarcasm: text whose literal meaning differs from its intended meaning.
- Human error: a catch-all category for any mistake the writer makes, such as accidentally reviewing the wrong product or writing in another language.
A good, curated dataset has no spam in it. Community-moderated review sites like Yelp, Facebook, and Rotten Tomatoes do a good job of letting users flag content, meaning very little meaningless data makes it into the final dataset. Unfortunately, sentiment analysis typically requires the use of novel datasets which are not curated. Let’s examine one such example.
Imagine that you produce a television series and track non-professional reception via Twitter and Facebook. Curated review sites, although good at providing qualitative feedback, don’t provide you with the sample size you need to get an accurate measure of public opinion on an episode-by-episode basis. Instead, you use the curated reviews to train your model and get your episode feedback through raw Twitter/Facebook text data.
Herein lies a major problem: how do you tell when a text is genuinely expressing an opinion? Consider these actual texts taken from tweets and Facebook comments written about a television show:
- [Spam URL] <– New Link To Episodes FREE Watch Now!!
- See Last Weeks Episodes Online At watch[show name] . [domain] . es (remove the spaces to watch! don’t report or Facebook will take us down!)
- @[show’s official handle] I am a network of finantial [sic] professionals looking to train JUNIOR PARTNER to make more clients in EASTERN USA. Guaranteed six figure income in first three months work…[more tripe]
“Sure,” you may be thinking, “but how much does a couple of spambot tweets matter in the far larger pool of the true fandom’s lively discussion?”
It may be more than half of your texts. A Chicago-based analytics firm called Networked Insights found that tweets mentioning certain global brands had a spam rate over 60%. Popular Mechanics similarly reported that around 65% of Twitter users are not real people, but bots. Facebook is a little harder to analyze, but anecdotally about half the new posts are spam on pages I manage. Spam is an issue. Here are a few ways to deal with it:
- Avoid topics where spam is likely to be prevalent. If you have a sufficiently specific domain to analyze, you may be able to get away with this.
- Limit the data source. This could mean (1) analyzing only tweets written by curated accounts, (2) including only Facebook comments with more than 5 reactions, (3) requiring extra keywords that spambots are less likely to include, or (4) selecting texts from moderated communities. Keep in mind that these limitations will also exclude many valid human-written texts.
- Try to detect and remove spam. The easiest filter is to exclude any text with a URL (including sneaky spliced URLs like spamDOTdomainDOTcom). You can then have a human review the texts included and excluded by this filter and get more creative from there.
- Weight sentiment by impact. Use retweets, replies, likes/favorites, etc. to determine when a text is “accepted” by human reviewers. Be careful, however, as botnets will often work in cooperation to fool these kinds of controls.
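The last two suggestions can be sketched in a few lines of Python. This is only a rough illustration, not a production spam filter: the function names, the short list of top-level domains, and the log-damped weighting formula are all my own assumptions. Expect both false positives and false negatives, which is why a human should still review what the filter keeps and drops.

```python
import math
import re

# Assumption: a tiny, illustrative set of top-level domains (not exhaustive).
TLD = r"(?i:com|net|org|es|info|biz)"

# Ways a spammer might write the dot in a domain name:
# ".", " . ", "(dot)" / "[dot]", " dot ", or an upper-case "DOT" glued in.
SEP = r"(?:\.|\s+\.\s+|\s*[(\[]\s*(?i:dot)\s*[)\]]\s*|\s+(?i:dot)\s+|DOT)"

# Ordinary links with an explicit scheme or a leading "www."
URL_SCHEME = re.compile(r"(?i)\b(?:https?://|www\.)\S+")
# Bare or spliced domains such as "spam.com" or "spamDOTdomainDOTcom".
DOMAIN_LIKE = re.compile(r"[A-Za-z0-9-]+" + SEP + TLD + r"\b")


def looks_like_spam(text: str) -> bool:
    """Heuristic: flag any text that contains a plain or spliced URL."""
    return bool(URL_SCHEME.search(text) or DOMAIN_LIKE.search(text))


def engagement_weight(likes: int, shares: int, replies: int) -> float:
    """Assumed weighting: log damping keeps one viral (or bot-boosted)
    post from dominating the average."""
    return 1.0 + math.log1p(likes + shares + replies)


def weighted_sentiment(posts) -> float:
    """Engagement-weighted mean of sentiment scores.

    posts: iterable of (score, likes, shares, replies) tuples, where
    score comes from whatever sentiment model you already have.
    """
    total = 0.0
    weight_sum = 0.0
    for score, likes, shares, replies in posts:
        w = engagement_weight(likes, shares, replies)
        total += w * score
        weight_sum += w
    return total / weight_sum if weight_sum else 0.0
```

In practice you would drop every text where `looks_like_spam` returns `True`, score what remains with your model, and pass the scores and engagement counts to `weighted_sentiment`. The damping does not protect you from coordinated bot activity; it only limits how much any single post can sway the result.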
In the last section, we assumed that you had a good sentiment model to work with. But sarcasm can corrupt the training data itself, even when we use curated sets to train our models. Let’s see how that happens.
When we find data sets to train our models on, we typically look for special sources that have both text and some numeric score associated with that text. For example, Yelp restaurant reviews have an associated star rating. You can therefore learn which words are likely to be associated with a 5-star review (“great”, “wonderful”, “delicious”) and with a 1-star review (“slow”, “soggy”, “expensive”). Sarcasm, unfortunately, tends to interfere with such associations.
"[The director's] work is a fine imitation of real film, using charming plot devices and clever dialogue to paper over the audience's dawning realization that nothing of consequence is to be had in this three-hour spectacle. Indeed, [the movie] grips all viewers with suspense as they realize its climactic twist - that none of them will be getting these hours of their life back."
Just about any sentiment model analyzing this text will be very confused to see the words “charming”, “clever”, and “suspense” associated with a one-star review. And on the Internet, everybody loves a comedian: reviews like these tend to garner plenty of attention.
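To make the confusion concrete, here is a minimal sketch in Python of a deliberately naive word-association model: each word is scored by the average star rating of the reviews it appears in. The toy reviews, the function names, and the model itself are invented for illustration; real sentiment models are far more sophisticated, but they rely on the same kind of association.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lower-case the text and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def learn_word_scores(reviews):
    """Map each word to the average star rating of the reviews containing it."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for text, stars in reviews:
        for word in set(tokenize(text)):  # count each word once per review
            totals[word] += stars
            counts[word] += 1
    return {word: totals[word] / counts[word] for word in totals}


def predict_stars(text, word_scores, default=3.0):
    """Predict a rating as the mean score of the words the model knows."""
    known = [word_scores[w] for w in tokenize(text) if w in word_scores]
    return sum(known) / len(known) if known else default


# Invented toy training data: (review text, star rating).
training = [
    ("great food and wonderful service", 5),
    ("charming place with clever cocktails", 5),
    ("delicious and charming", 5),
    ("slow service and soggy fries", 1),
    ("expensive and slow", 1),
]
sarcastic = "a charming and clever way to waste three hours"

scores = learn_word_scores(training)
# The model sees "charming" and "clever" and predicts well above 4 stars,
# even though the reviewer would have given 1.
print(predict_stars(sarcastic, scores))

# Training on the sarcastic review as a 1-star example drags down the
# scores of words that were previously reliable positive signals.
scores_after = learn_word_scores(training + [(sarcastic, 1)])
print(scores["charming"], scores_after["charming"])
```

The second half is the real danger for training data: one popular sarcastic review weakens the association between “charming” and a good rating for every sincere review that uses the word.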
There is only one solution I’ll advocate here: exclude sarcasm from training data. If you’re starting with a big dataset and can’t possibly examine the whole corpus, then step through your training loop and see which texts the model consistently gets wrong. Check those texts for sarcasm, and remove any sarcastic ones you find.
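One way to find those texts is sketched below in Python. It assumes you have saved the model’s predictions for every example at each epoch (or for each cross-validation fold); the function name and the 80% threshold are my own choices, not a standard recipe. It only produces a shortlist for a human to read, since a text can be consistently mispredicted for reasons other than sarcasm.

```python
from collections import Counter


def consistently_wrong(labels, predictions_per_epoch, min_miss_rate=0.8):
    """Return indices of examples mispredicted in at least
    min_miss_rate of the recorded epochs.

    labels: the true rating for each example.
    predictions_per_epoch: one list of predictions per epoch,
        each in the same order as labels.
    """
    misses = Counter()
    for predictions in predictions_per_epoch:
        for index, (predicted, actual) in enumerate(zip(predictions, labels)):
            if predicted != actual:
                misses[index] += 1
    n_epochs = len(predictions_per_epoch)
    return sorted(
        index for index, count in misses.items()
        if count / n_epochs >= min_miss_rate
    )


# Invented example: four reviews, three epochs of predictions.
labels = [5, 1, 1, 5]
history = [
    [5, 5, 1, 5],
    [5, 5, 1, 1],
    [5, 4, 1, 5],
]
suspects = consistently_wrong(labels, history)
print(suspects)  # indices of reviews worth reading by hand
```

Here the second review is missed in every epoch and lands on the shortlist, while the fourth is missed only once and does not.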
Of course, it’s always possible that you don’t need to worry about sarcasm in training data. Analyzing reviews of professional chef knives on no-nonsense forum pages? Good news, they don’t tend to tolerate sarcastic posts. Any text source where the public isn’t an audience (for example, internal customer complaints or survey responses) also tends to be free of gratuitous sarcasm. Don’t try to engineer a solution to a problem that doesn’t exist in your dataset.
There is a large category of texts which must be cleaned from datasets simply because something is fundamentally wrong with them. Consider the following examples:
- A comment on a Facebook article contains a snippet of a recipe the user accidentally typed into the wrong page.
- An Amazon review for a music player appears to review the album the user listened to on that device, not the device itself.
- “Great food, best service I ever had! One star!” The user does not understand the rating system.
- A review provides no feedback, but instead asks a follow-up question about the purchase, e.g. “where can i recharge my camera batteries?”
It’s always a good idea to actually read samples of the texts you’re analyzing.