Text Mining with R

Question

I need help with text mining in R.

Title      Date            Content    
Boy        May 13 2015     "She is pretty", Tom said. Tom is handsome.
Animal     June 14 2015    The penguin is cute, lion added.
Human      March 09 2015   Mr Koh predicted that every human is smart...
Monster    Jan 22 2015     Ms May, a student, said that John has $10.80. May loves you.

I only want to extract the opinions from what the people said.

I would also like help keeping percentages (e.g. 9.8%) intact: when I split the sentences on the full stop ("."), I get "His result improved by 0." instead of "His result improved by 0.8%".

Below is the output that I would like to obtain:

Title      Date            Content    
Boy        May 13 2015     she is pretty
Animal     June 14 2015    the penguin is cute
Human      March 09 2015   every human is smart
Monster    Jan 22 2015     john has $10.80

Below is the code that I tried, but it didn't give the desired output:

list <- c("said", "added", "predicted")
pattern <- paste (list, collapse = "|")
dataframe <- stack(setNames(lapply(strsplit(dataframe, '(?<=[.])', perl=TRUE), grep, pattern = pattern, value = TRUE), dataframe$Title))[2:1]

Answer 1:

You're close, but your regular expression for splitting is wrong. This gave the correct arrangement for the data, modulo your request to extract opinions more exactly:

txt <- '
Title      Date            Content    
Boy        May 13 2015     "She is pretty", Tom said. Tom is handsome.
Animal     June 14 2015    The penguin is cute, lion added.
Human      March 09 2015   Mr Koh predicted that every human is smart...
Monster    Jan 22 2015     Ms May, a student, said that John has $10.80. May loves you.
'

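# Turn runs of two or more spaces into "|"; single spaces inside a field survive.
txt <- gsub(" {2,}(?=\\S)", "|", txt, perl = TRUE)
# quote = "" keeps the quotation marks in Content as literal text;
# stringsAsFactors = FALSE keeps Content as character so strsplit() accepts it.
dataframe <- read.table(sep = "|", text = txt, header = TRUE, quote = "",
                        stringsAsFactors = FALSE, strip.white = TRUE)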

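# Reporting verbs that mark a sentence as containing an opinion.
# (Named "verbs" rather than "list" to avoid masking base::list.)
verbs <- c("said", "added", "predicted")
pattern <- paste(verbs, collapse = "|")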

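# Split each Content entry into sentences at a full stop followed by a space.
content <- strsplit(dataframe$Content, '\\.(?= )', perl=TRUE)
# Keep only the sentences that contain one of the reporting verbs.
opinions <- lapply(content, grep, pattern = pattern, value = TRUE)
names(opinions) <- dataframe$Title
# One row per sentence: column "values" holds the text, "ind" the Title.
result <- stack(opinions)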

In your sample data, all full stops followed by spaces are sentence-ending, so that's what the regular expression \.(?= ) matches. It also leaves decimal points alone, because the full stop in "0.8%" or "$10.80" is followed by a digit rather than a space. However, it will break up sentences like "I was born in the U.S.A. but I live in Canada", so you might have to do additional pre-processing and checking.
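To make both behaviours concrete, here is a small sketch. The first sentence is made up for illustration (it is not part of the sample data); the second is the U.S.A. example above. The trimws() call is only there to remove the leading space that the lookahead leaves on each following piece.

```r
# Illustrative input: the first sentence is an assumption for this demo,
# the second is the abbreviation example from the text.
x <- c("His result improved by 0.8%. He was happy.",
       "I was born in the U.S.A. but I live in Canada")

# Split on a full stop only when a space follows it. The lookahead does not
# consume the space, so each later piece starts with one; trim it afterwards.
pieces <- lapply(strsplit(x, "\\.(?= )", perl = TRUE), trimws)

pieces[[1]]  # "0.8%" stays in one piece
pieces[[2]]  # the sentence is cut after "U.S.A", which is not a sentence end
```

Handling abbreviations reliably probably needs a list of known abbreviations or a proper sentence tokenizer rather than a longer regular expression.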

Then, assuming the Titles are unique identifiers, you can just merge to add the dates back in:

# stack() names its columns "values" and "ind", so match Title against "ind".
result <- merge(dataframe[c("Title", "Date")], result, by.x = "Title", by.y = "ind")

As mentioned in the comments, the NLP task itself has more to do with text parsing than R programming. You can probably get some mileage out of searching for a pattern like

<optional adjectives> <noun> <verb> <optional adverbs> <adjective> <optional and/or> <optional adjective> ...

which would match your sample data, but I'm far from an expert here. You'd also need a dictionary with lexical categories. A Google search for "extract opinion text" yielded a lot of helpful results on the first page, including this site run by Bing Liu. From what I can tell, Professor Liu literally wrote the book on sentiment analysis.
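Short of real parsing, a crude regex heuristic can get from the matched sentences to the output you asked for. The sketch below is only tuned to the four sample rows: extract_opinion is a hypothetical helper name, and the three cases it handles (direct quotation, "said that ...", and a trailing ", <speaker> <verb>" clause) are assumptions about how the opinions are phrased, not a general solution.

```r
# Hypothetical helper: a rough heuristic that fits the sample sentences only.
extract_opinion <- function(sentence, verbs = c("said", "added", "predicted")) {
  verb_re <- paste(verbs, collapse = "|")
  s <- trimws(sentence)
  if (grepl('"[^"]+"', s, perl = TRUE)) {
    # Case 1: direct quotation, keep the text inside the first pair of quotes.
    s <- sub('^.*?"([^"]+)".*$', "\\1", s, perl = TRUE)
  } else if (grepl(paste0("\\b(", verb_re, ") that "), s, perl = TRUE)) {
    # Case 2: "<speaker> said that <opinion>", keep what follows "that".
    s <- sub(paste0("^.*\\b(", verb_re, ") that "), "", s, perl = TRUE)
  } else {
    # Case 3: "<opinion>, <speaker> added", drop the trailing reporting clause.
    s <- sub(paste0(",[^,]*\\b(", verb_re, ")\\b.*$"), "", s, perl = TRUE)
  }
  # Drop trailing full stops and lower-case the result.
  tolower(sub("\\.+$", "", s, perl = TRUE))
}

# The sentences that the splitting and grep steps keep for the sample data.
sentences <- c('"She is pretty", Tom said',
               "The penguin is cute, lion added.",
               "Mr Koh predicted that every human is smart...",
               "Ms May, a student, said that John has $10.80")
cleaned <- vapply(sentences, extract_opinion, character(1), USE.NAMES = FALSE)
```

Anything phrased differently from these three shapes will pass through mostly unchanged, which is where a dictionary with lexical categories or a dedicated NLP tool would have to take over.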

Source: https://stackoverflow.com/questions/32576046/text-mining-with-r

Tags

text-mining