Question
I have a vector of tweets (just the message text) that I am cleaning for text mining purposes. I have used removePunctuation from the tm package like so:
clean_tweet_text = removePunctuation(tweet_text)
This has resulted in a vector with all punctuation removed from the text except apostrophes, which ruins my keyword searches because words touching apostrophes are not registered. For example, one of my keywords is climate, but if a tweet has 'climate it won't be counted.
How can I remove all the apostrophes/single quotes from my vector?
Here is the dput output for the first few elements, as a reproducible example:
c("expert briefing on climatechange disarmament sdgs nmun httpstco5gqkngpkap",
"who uses nasa earth science data he looks at impact of aerosols on climateamp weather httpstcof4azsiqkw1 https…",
"rt oddly enough some republicans think climate change is real oddly enough… httpstcomtlfx1mnuf uniteblue https…",
"better dead than red bill gates says that only socialism can save us from climate change httpstcopypqmd1fok",
"i see red people bill gates says that only socialism can save us from climate change httpstcopypqmd1fok",
"why go for ecosystem basses conservation climatechange raajje maldives ecocaremv httpstcorauhjbasyl",
"ted cruz ‘climate change is not science it’s religion’ httpstco0qqtbofe0h via glennbeck",
"unusual warming kills gulf of maine cod discovery news globalwarming httpstco39uvock3xe",
"this is an amusing headline bill gates says that only socialism can save us from climate change httpstcobfs5zbcijc",
"what do the remaining republican candidates have to say about climate change fixgov httpstcoxpszwbrcnh httpstcodgqyidkw6o"
)
Answer 1:
To remove all punctuation (including apostrophes and single quotes), you can just use gsub():
x <- c("expert briefing on climatechange disarmament sdgs nmun httpstco5gqkngpkap",
"who uses nasa earth science data he looks at impact of aerosols on climateamp weather httpstcof4azsiqkw1 https…",
"rt oddly enough some republicans think climate change is real oddly enough… httpstcomtlfx1mnuf uniteblue https…",
"better dead than red bill gates says that only socialism can save us from climate change httpstcopypqmd1fok",
"i see red people bill gates says that only socialism can save us from climate change httpstcopypqmd1fok",
"why go for ecosystem basses conservation climatechange raajje maldives ecocaremv httpstcorauhjbasyl",
"ted cruz ‘climate change is not science it’s religion’ httpstco0qqtbofe0h via glennbeck",
"unusual warming kills gulf of maine cod discovery news globalwarming httpstco39uvock3xe",
"this is an amusing headline bill gates says that only socialism can save us from climate change httpstcobfs5zbcijc",
"what do the remaining republican candidates have to say about climate change fixgov httpstcoxpszwbrcnh httpstcodgqyidkw6o")
gsub("[[:punct:]]", "", x)
#> [1] "expert briefing on climatechange disarmament sdgs nmun httpstco5gqkngpkap"
#> [2] "who uses nasa earth science data he looks at impact of aerosols on climateamp weather httpstcof4azsiqkw1 https"
#> [3] "rt oddly enough some republicans think climate change is real oddly enough httpstcomtlfx1mnuf uniteblue https"
#> [4] "better dead than red bill gates says that only socialism can save us from climate change httpstcopypqmd1fok"
#> [5] "i see red people bill gates says that only socialism can save us from climate change httpstcopypqmd1fok"
#> [6] "why go for ecosystem basses conservation climatechange raajje maldives ecocaremv httpstcorauhjbasyl"
#> [7] "ted cruz climate change is not science its religion httpstco0qqtbofe0h via glennbeck"
#> [8] "unusual warming kills gulf of maine cod discovery news globalwarming httpstco39uvock3xe"
#> [9] "this is an amusing headline bill gates says that only socialism can save us from climate change httpstcobfs5zbcijc"
#> [10] "what do the remaining republican candidates have to say about climate change fixgov httpstcoxpszwbrcnh httpstcodgqyidkw6o"
gsub() replaces every match of its first argument (the pattern) found in its third argument with its second argument (see help("gsub")). Here, that means it replaces every occurrence in our vector x of any character in the set [[:punct:]] with "" (that is, it removes them).
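If the goal is only to drop apostrophes and single quotes while leaving other punctuation in place, the same gsub() call can take a narrower character class. The following is a minimal sketch, not part of the original answer: the names remove_apostrophes and tweets are hypothetical, and the class is assumed to need only the ASCII apostrophe plus the curly quotes U+2018 and U+2019.

```r
# Hypothetical helper: strip ASCII apostrophes and curly single quotes only.
remove_apostrophes <- function(text) {
  # ' is the ASCII apostrophe; \u2018 and \u2019 are the left/right curly quotes.
  gsub("['\u2018\u2019]", "", text)
}

# Hypothetical sample input modelled on the strings in the question.
tweets <- c("ted cruz \u2018climate change is not science it\u2019s religion\u2019",
            "'climate")

cleaned <- remove_apostrophes(tweets)
print(cleaned)

# Split on spaces to check that the bare keyword is now a token of its own.
tokens <- unlist(strsplit(cleaned, " ", fixed = TRUE))
print("climate" %in% tokens)
```

Other quote-like characters (for example the backtick) are left untouched by this class and would have to be added explicitly if needed.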
What characters does that remove? From help("regex"):
[:punct:]
Punctuation characters:
! " # $ % & ' ( ) * + , - . / : ; < = > ? @ [ \ ] ^ _ ` { | } ~.
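That list covers ASCII punctuation only, whereas the example vector also contains characters such as ‘, ’ and …. To check which non-ASCII characters a vector actually holds, something like the following sketch may help. It is an illustration rather than part of the original answer, and the names sample_text, chars, code_points, non_ascii and labels are made up here.

```r
# Hypothetical sample containing curly quotes and an ellipsis.
sample_text <- c("\u2018climate", "it\u2019s", "https\u2026")

# Split every string into single characters and keep each distinct one.
chars <- unique(unlist(strsplit(sample_text, "")))

# Look up the Unicode code point of each character.
code_points <- vapply(chars, utf8ToInt, integer(1), USE.NAMES = FALSE)

# Anything above 127 lies outside the ASCII range.
non_ascii <- chars[code_points > 127L]
labels <- sprintf("U+%04X", code_points[code_points > 127L])
print(data.frame(char = non_ascii, code_point = labels))
```

Running the same steps on the real vector shows exactly which characters a cleaning step has to handle.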
Update
It appears this occurs because your apostrophes are curly quotes like ‘ rather than the ASCII apostrophe '. So, if you want to stick with tm::removePunctuation(), you can also use:
tm::removePunctuation(x, ucp = TRUE)
#> [1] "expert briefing on climatechange disarmament sdgs nmun httpstco5gqkngpkap"
#> [2] "who uses nasa earth science data he looks at impact of aerosols on climateamp weather httpstcof4azsiqkw1 https"
#> [3] "rt oddly enough some republicans think climate change is real oddly enough httpstcomtlfx1mnuf uniteblue https"
#> [4] "better dead than red bill gates says that only socialism can save us from climate change httpstcopypqmd1fok"
#> [5] "i see red people bill gates says that only socialism can save us from climate change httpstcopypqmd1fok"
#> [6] "why go for ecosystem basses conservation climatechange raajje maldives ecocaremv httpstcorauhjbasyl"
#> [7] "ted cruz climate change is not science its religion httpstco0qqtbofe0h via glennbeck"
#> [8] "unusual warming kills gulf of maine cod discovery news globalwarming httpstco39uvock3xe"
#> [9] "this is an amusing headline bill gates says that only socialism can save us from climate change httpstcobfs5zbcijc"
#> [10] "what do the remaining republican candidates have to say about climate change fixgov httpstcoxpszwbrcnh httpstcodgqyidkw6o"
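For a base R route that does not rely on tm, a Unicode property class with perl = TRUE is one option. This is a sketch under the assumption that the input is UTF-8, and the variable names are hypothetical. Note that \p{P} covers Unicode punctuation only, so symbols such as $ and + (which the [:punct:] list includes) survive unless \p{S} is added to the class.

```r
# Hypothetical sample: curly quotes, an ellipsis, and two ASCII symbols.
sample_text <- "\u2018climate\u2019 change\u2026 $5+"

# \p{P} matches any Unicode punctuation character (requires perl = TRUE).
punct_only <- gsub("\\p{P}", "", sample_text, perl = TRUE)
print(punct_only)

# Adding \p{S} also removes symbols such as $ and +.
punct_and_symbols <- gsub("[\\p{P}\\p{S}]", "", sample_text, perl = TRUE)
print(punct_and_symbols)
```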
Source: https://stackoverflow.com/questions/53392785/remove-all-punctuation-from-text-including-apostrophes-for-tm-package