Question
I need help with text mining in R.
Title    Date           Content
Boy      May 13 2015    "She is pretty", Tom said. Tom is handsome.
Animal   June 14 2015   The penguin is cute, lion added.
Human    March 09 2015  Mr Koh predicted that every human is smart...
Monster  Jan 22 2015    Ms May, a student, said that John has $10.80. May loves you.
I just want to extract the opinions from what the people said.
I would also like help keeping percentages (e.g. 9.8%) intact, because when I split the sentences on the full stop ("."), I get "His result improved by 0." instead of "His result improved by 0.8%".
Below is the output that I would like to obtain:
Title    Date           Content
Boy      May 13 2015    she is pretty
Animal   June 14 2015   the penguin is cute
Human    March 09 2015  every human is smart
Monster  Jan 22 2015    john has $10.80
Below is the code that I tried, but it didn't give the desired output:
list <- c("said", "added", "predicted")
pattern <- paste (list, collapse = "|")
dataframe <- stack(setNames(lapply(strsplit(dataframe, '(?<=[.])', perl=TRUE), grep, pattern = pattern, value = TRUE), dataframe$Title))[2:1]
Answer 1:
You're close, but your regular expression for splitting is wrong. This gave the correct arrangement for the data, modulo your request to extract opinions more exactly:
txt <- '
Title    Date           Content
Boy      May 13 2015    "She is pretty", Tom said. Tom is handsome.
Animal   June 14 2015   The penguin is cute, lion added.
Human    March 09 2015  Mr Koh predicted that every human is smart...
Monster  Jan 22 2015    Ms May, a student, said that John has $10.80. May loves you.
'
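# Turn each run of two or more spaces into "|" so the columns can be parsed
txt <- gsub(" {2,}(?=\\S)", "|", txt, perl = TRUE)
# stringsAsFactors = FALSE keeps Content as character, which strsplit() requires
dataframe <- read.table(sep = "|", text = txt, header = TRUE, stringsAsFactors = FALSE)
list <- c("said", "added", "predicted")
pattern <- paste(list, collapse = "|")
# Split only on full stops that are followed by a space, so "$10.80" stays intact
content <- strsplit(dataframe$Content, '\\.(?= )', perl = TRUE)
# Keep only the sentences that contain one of the reporting verbs
opinions <- lapply(content, grep, pattern = pattern, value = TRUE)
names(opinions) <- dataframe$Title
result <- stack(opinions)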
In your sample data, all full stops followed by spaces are sentence-ending, so that's what the regular expression \.(?= ) matches. However, that will break up sentences like "I was born in the U.S.A. but I live in Canada", so you might have to do additional pre-processing and checking.
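To make the behaviour of that split concrete, here is a small sketch. The helper name split_sentences and the second sentence of the first input are made up for illustration; the pattern is the one used above.

```r
# Hypothetical helper: split on a full stop only when a space follows it
split_sentences <- function(x) strsplit(x, "\\.(?= )", perl = TRUE)[[1]]

# The decimal point in "0.8%" is followed by a digit, so it is left alone
ok <- split_sentences("His result improved by 0.8%. He was happy.")
# ok is c("His result improved by 0.8%", " He was happy.")

# The final full stop of "U.S.A." is followed by a space, so the sentence is cut
bad <- split_sentences("I was born in the U.S.A. but I live in Canada")
# bad is c("I was born in the U.S.A", " but I live in Canada")
```

Note that the pieces after the first keep their leading space, so trimws() may be worth applying afterwards.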
Then, assuming the Title values are unique identifiers, you can just use merge to add the dates back in:
result <- merge(dataframe[c("Title", "Date")], result, by = "Title")
As mentioned in the comments, the NLP task itself has more to do with text parsing than R programming. You can probably get some mileage out of searching for a pattern like
<optional adjectives> <noun> <verb> <optional adverbs> <adjective> <optional and/or> <optional adjective> ...
which would match your sample data, but I'm far from an expert here. You'd also need a dictionary with lexical categories. A Google search for "extract opinion text" yielded a lot of helpful results on the first page, including this site run by Bing Liu. From what I can tell, Professor Liu literally wrote the book on sentiment analysis.
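For the sample data specifically, a much cruder approach than real parsing may be enough. The following is only a sketch, assuming every opinion appears in one of two shapes: "<speaker> <verb> that <opinion>" or "<opinion>, <speaker> <verb>". The function extract_opinion is a hypothetical helper, not part of any package, and it will not generalise to free text.

```r
verbs <- c("said", "added", "predicted")
verb_alt <- paste(verbs, collapse = "|")

# Hypothetical helper: strip the attribution from a single sentence
extract_opinion <- function(sentence) {
  # Shape 1: "<speaker> <verb> that <opinion>" -> keep what follows "that"
  leading <- paste0("^.*\\b(", verb_alt, ") that ")
  # Shape 2: "<opinion>, <speaker> <verb>" -> drop the trailing attribution
  trailing <- paste0(",[^,]*\\b(", verb_alt, ")\\b.*$")
  out <- if (grepl(leading, sentence, perl = TRUE)) {
    sub(leading, "", sentence, perl = TRUE)
  } else {
    sub(trailing, "", sentence, perl = TRUE)
  }
  out <- gsub('"', "", out, fixed = TRUE)  # remove quotation marks
  out <- sub("\\.+$", "", out)             # remove trailing full stops
  tolower(trimws(out))
}

sentences <- c(
  '"She is pretty", Tom said',
  "The penguin is cute, lion added",
  "Mr Koh predicted that every human is smart...",
  "Ms May, a student, said that John has $10.80"
)
cleaned <- vapply(sentences, extract_opinion, character(1), USE.NAMES = FALSE)
# cleaned is c("she is pretty", "the penguin is cute",
#              "every human is smart", "john has $10.80")
```

Applied to the values column of result, this would give the lower-cased opinions from the desired output, but only because the sample sentences happen to fit those two shapes.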
Source: https://stackoverflow.com/questions/32576046/text-mining-with-r