R sentiment analysis with phrases in dictionaries

Submitted by 余生长醉 on 2019-11-30 16:34:05
lrnzcig

The function score.sentiment seems to work. If I try a very simple setup,

Tweets = c("this is good", "how bad it is")
neg = c("bad")
pos = c("good")
analysis=score.sentiment(Tweets, pos, neg)
table(analysis$score)

I get the expected result,

> table(analysis$score)

-1  1 
 1  1 
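The question doesn't include the body of score.sentiment, but for context, the widely circulated Jeffrey Breen-style version looks roughly like this (a minimal sketch; the cleaning steps and exact names are assumptions):

```r
library(plyr)
library(stringr)

score.sentiment <- function(sentences, pos.words, neg.words) {
  scores <- laply(sentences, function(sentence, pos.words, neg.words) {
    # normalize: lowercase and strip punctuation before matching
    sentence <- tolower(gsub("[[:punct:]]", "", sentence))
    word.list <- str_split(sentence, '\\s+')
    words <- unlist(word.list)
    # count dictionary hits; score = positives minus negatives
    pos.matches <- !is.na(match(words, pos.words))
    neg.matches <- !is.na(match(words, neg.words))
    sum(pos.matches) - sum(neg.matches)
  }, pos.words, neg.words)
  data.frame(score = scores, text = sentences)
}
```

With the toy inputs above, "this is good" scores +1 and "how bad it is" scores -1, which matches the table.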

How are you feeding the 20 tweets to the method? From the result you posted, that 0 20, I'd guess that none of your 20 tweets contains any of your positive or negative words, although of course if that were the case you would probably have noticed it. If you post more details on your list of tweets and your positive and negative words, it will be easier to help you.

Anyhow, your function seems to be working just fine.

Hope it helps.

EDIT after clarifications via comments:

Actually, to solve your problem you need to tokenize your sentences into n-grams, where n corresponds to the maximum number of words in your lists of positive and negative phrases. You can see how to do this e.g. in this SO question. For completeness, and since I've tested it myself, here is an example of what you could do. I simplify it to bigrams (n=2) and use the following inputs:

Tweets = c("rewarding hard work with raising taxes and VAT. #LabourManifesto", 
           "Ed Miliband is offering 'wrong choice' of 'more cuts' in #LabourManifesto")
pos = c("rewarding hard work")
neg = c("wrong choice")

You can create a bigram tokenizer like this,

library(tm)
library(RWeka)
BigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min=2,max=2))

And test it,

> BigramTokenizer("rewarding hard work with raising taxes and VAT. #LabourManifesto")
[1] "rewarding hard"       "hard work"            "work with"           
[4] "with raising"         "raising taxes"        "taxes and"           
[7] "and VAT"              "VAT #LabourManifesto"

Then in your method you simply substitute this line,

word.list = str_split(sentence, '\\s+')

with this,

word.list = BigramTokenizer(sentence)

Although of course it would be better if you changed word.list to ngram.list or something like that.

The result is, as expected,

> table(analysis$score)

-1  0 
 1  1
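Putting the pieces together, a bigram-aware scoring function might look like this (a self-contained sketch assuming a Breen-style score.sentiment; the cleaning steps are assumptions):

```r
library(plyr)
library(RWeka)

# tokenize into bigrams instead of single words
BigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 2, max = 2))

score.sentiment <- function(sentences, pos.words, neg.words) {
  scores <- laply(sentences, function(sentence, pos.words, neg.words) {
    # normalize before tokenizing so dictionary phrases match
    sentence <- tolower(gsub("[[:punct:]]", "", sentence))
    ngram.list <- BigramTokenizer(sentence)
    pos.matches <- !is.na(match(ngram.list, pos.words))
    neg.matches <- !is.na(match(ngram.list, neg.words))
    sum(pos.matches) - sum(neg.matches)
  }, pos.words, neg.words)
  data.frame(score = scores, text = sentences)
}
```

Note that with a pure bigram tokenizer the three-word phrase "rewarding hard work" cannot match, which is why the first tweet scores 0 in the table above.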

Just decide your n-gram size and add it to Weka_control and you should be fine.
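For instance, if your phrase lists mix single words, two-word and three-word phrases, you could widen the range so the tokenizer emits unigrams through trigrams in one pass (the name MultigramTokenizer is just an example):

```r
library(RWeka)

# emit all n-grams from size 1 up to size 3
MultigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 1, max = 3))
```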

Hope it helps.
