Sentiment mining when you’ve got no labels

What’s sentiment mining?

Sentiment mining is a way of computing how positive or negative a text is. It’s useful when you’ve got too much text to read yourself but still want to know the overall feeling of certain texts. For example, a company that is curious about the public reception of its newest product can collect all tweets mentioning the product’s name and get a sentiment score back.

How does it work?

Usually, when you’re trying to score a text based on sentiment, you first get some labels. You might have someone manually go through documents and mark each one as either “positive” or “negative.” If users have given a rating from 1 to 10 along with their text, you could use the rating as the label for that text. Once you’ve got labels, it’s just a matter of running an algorithm that predicts the labels based on the text (Naive Bayes, SVM, HMM, etc.). For our purposes, let’s pretend you already know how to do that.
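In case you’d rather not pretend, here is a minimal sketch of that last step: a tiny Naive Bayes classifier with add-one smoothing, written in plain Python. The function names (train_naive_bayes, predict), the whitespace tokenizer, and the four training texts are all made up for illustration. A real project would more likely reach for an existing library and far more data.

```python
import math
from collections import Counter, defaultdict


def train_naive_bayes(labeled_texts):
    """Build word and document counts from (label, text) pairs."""
    word_counts = defaultdict(Counter)  # label -> word -> count
    doc_counts = Counter()              # label -> number of documents
    for label, text in labeled_texts:
        doc_counts[label] += 1
        word_counts[label].update(text.lower().split())
    vocab = {w for counts in word_counts.values() for w in counts}
    return word_counts, doc_counts, vocab


def predict(model, text):
    """Return the label with the highest log-probability for the text."""
    word_counts, doc_counts, vocab = model
    total_docs = sum(doc_counts.values())
    best_label, best_score = None, -math.inf
    for label in doc_counts:
        # Log prior for this label.
        score = math.log(doc_counts[label] / total_docs)
        total_words = sum(word_counts[label].values())
        for word in text.lower().split():
            if word not in vocab:
                continue  # skip words never seen during training
            # Add-one (Laplace) smoothed log likelihood of the word.
            count = word_counts[label][word]
            score += math.log((count + 1) / (total_words + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Toy training data, invented for this example.
train = [
    ("positive", "love the new phone"),
    ("positive", "great screen love it"),
    ("negative", "battery died again"),
    ("negative", "hate the battery"),
]
model = train_naive_bayes(train)
print(predict(model, "love the screen"))
print(predict(model, "battery died"))
```

With this toy data the first call comes out “positive” and the second “negative”, since the words in each test text appear more often under that label.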

But what do you do when there are no labels available? Twitter, for example, has no tag that says, “I’m super angry right now.” As mentioned previously, you could get someone to manually tag the documents, but that’s hard for several reasons:

  • People tend to disagree on sentiment classification. Consider the tweet “gonna break som [sic] faces 2nite on cap.” Is the writer feeling violently angry? Pumped up for some athletic event? It’s very difficult to tell, and manual labels will likely disagree.
  • Manual labeling is boring as hell, and nobody wants to do it.
  • Labeling by hand takes time, which usually means the researcher either loses valuable time or spends valuable money hiring somebody else to do it.
  • Did I mention it’s just really really boring?

No labels needed

Luckily, there may be a better way! Although most people don’t explicitly write “I’m happy” or “I’m angry” into their published documents, they do tend to use emoticons. Emoticons are a great way to approximate the sentiment of a given text!

Jonathan Read of the University of Sussex (U.K.) published his paper “Using emoticons to reduce dependency” in 2005, which details how to use this technique in practice. The basic idea is to treat positive emoticons such as “:)” and “:D” as labels for positive sentiment, and negative emoticons such as “:(” and “>:(” as labels for negative sentiment.
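As a rough sketch of that idea (not Read’s actual code), the function below turns raw texts into labeled ones. The emoticon sets contain only the four examples named above, and the function name label_by_emoticon is hypothetical. Two choices here are my own assumptions: texts with no emoticon, or with both kinds, are skipped, and the emoticons are stripped from the text so that a classifier trained on the result can’t simply memorize the label.

```python
import re

# Only the emoticons mentioned in this post; a real list would be longer.
POSITIVE = {":)", ":D"}
NEGATIVE = {":(", ">:("}

# Longest emoticons first, so ">:(" is matched before the ":(" inside it.
_EMOTICON_RE = re.compile(
    "|".join(
        re.escape(e)
        for e in sorted(POSITIVE | NEGATIVE, key=len, reverse=True)
    )
)


def label_by_emoticon(text):
    """Return (label, text_without_emoticons), or None if there is no clear label."""
    found = set(_EMOTICON_RE.findall(text))
    has_pos = bool(found & POSITIVE)
    has_neg = bool(found & NEGATIVE)
    if has_pos == has_neg:
        return None  # no emoticon at all, or mixed signals
    # Remove the emoticons and collapse the leftover whitespace.
    stripped = " ".join(_EMOTICON_RE.sub(" ", text).split())
    return ("positive" if has_pos else "negative", stripped)


tweets = [
    "love the new phone :)",
    "battery died again >:(",
    "just got the new phone",
]
labeled = [pair for pair in map(label_by_emoticon, tweets) if pair is not None]
print(labeled)
```

The third tweet has no emoticon, so it is dropped; the other two come back as (label, text) pairs that can be fed to whatever classifier you like.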

So how well does this all work? Pretty well, it turns out. The paper reports that emoticon labels don’t do quite as well as user-defined labels (ratings) or manually defined tags, but they come close.


The same pitfalls of regular sentiment analysis apply when we’re using emoticons as labels.

  1. You can’t train a learning model on emoticons from one data set and then apply it to another. Per-word sentiment scores are domain-specific. For example, the word “hysterical” may be very positive when speaking about a comedian, but is probably negative when speaking about financial institutions.
  2. Sometimes users are sarcastic. Read’s example in his paper is, “Thank you so much, that’s really encouraging :-(”
  3. Spelling errors can confound the classifier.
  4. Accuracy erodes over time. The sentiment attached to the word “Obama” changed a lot between 2007 and 2016.
