A primer on Naive Bayes for sentiment analysis

This guide is intended to be a very unsophisticated, very broad overview of the most basic kind of sentiment analysis. You can use this to get results fast, but they’ll be dirty results. I’ll begin by throwing out the broad outline and then address several problems. We’ll begin with the basic steps: (1) Seeding, (2) Training, and (3) Evaluation.

For our purposes, we’re going to assume that all texts have a sentiment somewhere between 0 and 1 where 0 is very negative and 1 is very positive. A neutral text has a sentiment score of 0.5 under this system.

 

Seeding

The idea of seeding is that you have certain words or texts that you already know the sentiment of. For now, let’s assume you have a list of 5 words that you know the sentiment of.

  1. FANTASTIC: 0.89
  2. DISGUSTING: 0.03
  3. OKAY: 0.55
  4. DELIGHTFUL: 0.96
  5. DISAPPOINTING: 0.27

We’ll call this list the seeding corpus, which means a body of text used to jump-start our sentiment engine. Since we only know these five words, we’re not allowed to make any assessments of other words except in relation to the seeding corpus. Let’s now seed all of your texts in preparation for training. We’ll pull out all the texts that include a word in the seeding corpus.
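As a sketch of that filtering step (the seed scores are from the list above; the `texts` examples and variable names are my own inventions):

```python
# Hypothetical seed corpus: word -> known sentiment (0 = very negative, 1 = very positive)
seeds = {
    "fantastic": 0.89,
    "disgusting": 0.03,
    "okay": 0.55,
    "delightful": 0.96,
    "disappointing": 0.27,
}

texts = [
    "what a fantastic and delightful trip",
    "the service was disappointing",
    "nothing to report here",
]

# Keep only the texts that contain at least one seed word
seeded = [t for t in texts if any(w in seeds for w in t.lower().split())]
print(seeded)  # the third text contains no seed words, so it is dropped
```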

(Embedded tweet: a text about Pakistan containing the word “disgusting”)

Since this text has DISGUSTING in it, we’ll score the whole text as 0.03, meaning that every word in this text gets a sentiment score of 0.03. If both “disappointing” and “disgusting” were in this text, the overall sentiment would be (0.03 + 0.27) / 2 = 0.15 and every word would get that sentiment score. Now, we get to create a sentiment dictionary. Right now, that dictionary looks like the following:

  1. FANTASTIC: 0.89
  2. DISGUSTING: 0.03
  3. OKAY: 0.55
  4. DELIGHTFUL: 0.96
  5. DISAPPOINTING: 0.27
  6. ANOTHER: 0.03
  7. DAY: 0.03
  8. (etc)
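A minimal sketch of that scoring rule, assuming simple whitespace tokenization (the helper name and the example text containing both seed words are mine):

```python
seeds = {"fantastic": 0.89, "disgusting": 0.03, "okay": 0.55,
         "delightful": 0.96, "disappointing": 0.27}

def score_text(text, seeds):
    """Average the scores of the seed words present in the text."""
    hits = [seeds[w] for w in text.lower().split() if w in seeds]
    return sum(hits) / len(hits) if hits else None

# An invented example containing two seed words
text = "disgusting and disappointing in equal measure"
score = score_text(text, seeds)
print(round(score, 2))  # (0.03 + 0.27) / 2 = 0.15

# Every word in the text inherits that score; seed words keep their own
dictionary = dict(seeds)
for word in text.lower().split():
    dictionary.setdefault(word, score)
print(dictionary["equal"])  # same score as the whole text
```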

At this point, we need to ask ourselves what to do if we see a newly trained word again. If a new text had “I find the people in Pakistan delightful,” then how would we update the sentiment score in the dictionary? To understand this, we need to move onto training.

Training

The idea behind Naive Bayes is that the true sentiment of a given word is equal to the average sentiment of all the texts it appears in. But since you’re looking at one text at a time, you’ll need to update that average on the fly. So instead of storing just the average sentiment for each word, we’ll store both the average sentiment AND the number of occurrences we’ve seen so far.

new_average = (old_average × n + new_score) / (n + 1)

where n is the number of occurrences of the word seen so far and new_score is the sentiment of the text it just appeared in.

In plain English, the new average is the old average (weighted by the number of occurrences seen so far) plus the new score, all divided by the new number of occurrences.
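Sketched in code (the function and variable names are my own), that running-average update looks like:

```python
def update(avg, n, new_score):
    """Fold one new observation into a running average.

    avg:       current average sentiment for the word
    n:         number of occurrences seen so far
    new_score: sentiment of the text the word just appeared in
    Returns the new (average, count) pair.
    """
    return (avg * n + new_score) / (n + 1), n + 1

# e.g. a word first seen in a text scored 0.15, then in one scored 0.96
avg, n = 0.15, 1
avg, n = update(avg, n, 0.96)
print(round(avg, 3), n)  # the average moves to the midpoint over 2 occurrences
```

Storing the count alongside the average is what makes this update possible without revisiting old texts.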

So let’s go back to seeding for a moment. Before we score ALL the texts you’ve got, we first need a round of training based on just the seeding corpus. That means looking only at the fraction of your texts that contain seed words. Once you finish this training round, you’ve finished seeding. Keep in mind that while you’re allowed to calculate average sentiment scores for other words, you’re not allowed to change the scores of the seeding corpus words. You’re also not allowed to use the newly learned words to score texts during the seeding round.
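Putting the seeding round together as a sketch (names are mine; it updates only words outside the seed list, and only from texts that contain at least one seed word):

```python
seeds = {"fantastic": 0.89, "disgusting": 0.03, "okay": 0.55,
         "delightful": 0.96, "disappointing": 0.27}

def seed_round(texts, seeds):
    """Seeding round: learn running averages for non-seed words only."""
    stats = {}  # word -> (average sentiment, occurrence count)
    for text in texts:
        words = text.lower().split()
        hits = [seeds[w] for w in words if w in seeds]
        if not hits:
            continue  # no seed words: this text is skipped during seeding
        s = sum(hits) / len(hits)  # the text's score, from seed words only
        for w in words:
            if w in seeds:
                continue  # seed-word scores are locked
            avg, n = stats.get(w, (0.0, 0))
            stats[w] = ((avg * n + s) / (n + 1), n + 1)
    return stats

stats = seed_round(["a fantastic day", "a disappointing day"], seeds)
print(stats["day"])  # "day" averages a 0.89 text and a 0.27 text
```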

Once you’ve finished seeding, you get to look at all of your text data. This time, you get to choose whether you lock in your already calculated values or not. Keep going through your documents until you’re satisfied that you’ve generated a sentiment score for everything you wanted to.
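Later passes can then score a text against the whole dictionary, not just the seeds. A sketch, assuming `stats` maps each word to its running (average, count) pair from the seeding round (names are mine):

```python
def full_pass(texts, stats, seeds):
    """One pass over all texts, scoring against seeds plus learned words."""
    for text in texts:
        words = text.lower().split()
        known = [seeds[w] if w in seeds else stats[w][0]
                 for w in words if w in seeds or w in stats]
        if not known:
            continue  # nothing to anchor this text's score to yet
        s = sum(known) / len(known)
        for w in words:
            if w in seeds:
                continue  # seed scores stay fixed
            avg, n = stats.get(w, (0.0, 0))
            stats[w] = ((avg * n + s) / (n + 1), n + 1)
    return stats

seeds = {"fantastic": 0.89, "disappointing": 0.27}
stats = {"day": (0.58, 2)}  # hypothetical output of the seeding round
full_pass(["a brand new day"], stats, seeds)
print(stats["brand"])  # a new word inherits the text score on first sight
```

Whether you lock in already-calculated values, as described above, is a policy choice; a flag that also skips words already in `stats` would implement the locked variant.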

Evaluation

Once you’ve got a sentiment dictionary trained, it’s time to evaluate new sentences. But before you do, now’s a good time to remind you to look over your sentiment dictionary and check whether it makes sense. Do you see a lot of words with only one occurrence? It may be best to throw them out, since they won’t contribute high-confidence guesses. Does a huge chunk of your corpus trend toward a neutral score (even words you might expect to be super positive, like “love” and “phenomenal”)? Chances are you’ll need to do some fine-tuning before you evaluate. Watch for a post in the near future describing that fine-tuning process.
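Evaluation then reduces to averaging dictionary scores over the new sentence, after pruning the low-confidence entries mentioned above. A sketch (the threshold, sample dictionary, and names are mine):

```python
def prune(stats, min_count=2):
    """Drop words seen only once; one occurrence is a low-confidence guess."""
    return {w: (avg, n) for w, (avg, n) in stats.items() if n >= min_count}

def evaluate(sentence, stats):
    """Average the known word scores; return 0.5 (neutral) if none are known."""
    hits = [stats[w][0] for w in sentence.lower().split() if w in stats]
    return sum(hits) / len(hits) if hits else 0.5

stats = {"delightful": (0.96, 5), "day": (0.58, 2), "meh": (0.4, 1)}
stats = prune(stats)  # "meh" is dropped: only one occurrence
print(evaluate("a delightful day", stats))  # averages 0.96 and 0.58
```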

 
