问题
Trying to create wordcloud from twitter data, but get the following error:
Error in FUN(X[[72L]], ...) :
invalid input '������������❤������������ "@xxx:bla, bla, bla... http://t.co/56Fb78aTSC"' in 'utf8towcs'
This error appears after running the "mytwittersearch_corpus<- tm_map(mytwittersearch_corpus, tolower)" code
mytwittersearch_list <-sapply(mytwittersearch, function(x) x$getText())
mytwittersearch_corpus <-Corpus(VectorSource(mytwittersearch_corpus_list))
mytwittersearch_corpus<-tm_map(mytwittersearch_corpus, tolower)
mytwittersearch_corpus<-tm_map( mytwittersearch_corpus, removePunctuation)
mytwittersearch_corpus <-tm_map(mytwittersearch_corpus, function(x) removeWords(x, stopwords()))
I read on other pages this may be due to R having difficulty processing symbols, emoticons and letters in non-English languages, but this appears not to be the problem with the "error tweets" that R has issues with. I did run the codes:
mytwittersearch_corpus <- tm_map(mytwittersearch_corpus, function(x) iconv(enc2utf8(x), sub = "byte"))
mytwittersearch_corpus<- tm_map(mytwittersearch_corpus, content_transformer(function(x) iconv(enc2utf8(x), sub = "bytes")))
These do not help. I also get that it can't find function content_transformer
even though the tm-package
is checked off and running.
I'm running this on OS X 10.6.8 and using the latest RStudio.
回答1:
I use this code to get rid of the problem characters:
tweets$text <- sapply(tweets$text,function(row) iconv(row, "latin1", "ASCII", sub=""))
回答2:
A nice example on creating wordcloud from Twitter data is here. Using the example, and the code below, and passing the tolower parameter while creating the TermDocumentMatrix, I could create a Twitter wordcloud.
library(twitteR)
library(tm)
library(wordcloud)
library(RColorBrewer)
library(ggplot2)
#Collect tweets containing 'new year'
tweets = searchTwitter("new year", n=50, lang="en")
#Extract text content of all the tweets
tweetTxt = sapply(tweets, function(x) x$getText())
#In tm package, the documents are managed by a structure called Corpus
myCorpus = Corpus(VectorSource(tweetTxt))
#Create a term-document matrix from a corpus
tdm = TermDocumentMatrix(myCorpus,control = list(removePunctuation = TRUE,stopwords = c("new", "year", stopwords("english")), removeNumbers = TRUE, tolower = TRUE))
#Convert as matrix
m = as.matrix(tdm)
#Get word counts in decreasing order
word_freqs = sort(rowSums(m), decreasing=TRUE)
#Create data frame with words and their frequencies
dm = data.frame(word=names(word_freqs), freq=word_freqs)
#Plot wordcloud
wordcloud(dm$word, dm$freq, random.order=FALSE, colors=brewer.pal(8, "Dark2"))

回答3:
Have you tried updating tm
and using stri_trans_tolower
from stringi
?
library(twitteR)
library(tm)
library(stringi)
setup_twitter_oauth("CONSUMER_KEY", "CONSUMER_SECRET")
mytwittersearch <- showStatus(551365749550227456)
mytwittersearch_list <- mytwittersearch$getText()
mytwittersearch_corpus <- Corpus(VectorSource(mytwittersearch_list))
mytwittersearch_corpus <- tm_map(mytwittersearch_corpus, content_transformer(tolower))
# Error in FUN(content(x), ...) :
# invalid input 'í ½í±…í ¼í¾¯â¤í ¼í¾§í ¼í½œ "@comScore: Nearly half of #Millennials do at least some of their video viewing from a smartphone or tablet: http://t.co/56Fb78aTSC"' in 'utf8towcs'
mytwittersearch_corpus <- tm_map(mytwittersearch_corpus, content_transformer(stri_trans_tolower))
inspect(mytwittersearch_corpus)
# <<VCorpus (documents: 1, metadata (corpus/indexed): 0/0)>>
#
# [[1]]
# <<PlainTextDocument (metadata: 7)>>
# <ed><U+00A0><U+00BD><ed><U+00B1><U+0085><ed><U+00A0><U+00BC><ed><U+00BE><U+00AF><U+2764><ed><U+00A0><U+00BC><ed><U+00BE><U+00A7><ed><U+00A0><U+00BC><ed><U+00BD><U+009C> "@comscore: nearly half of #millennials do at least some of their video viewing from a smartphone or tablet: http://t.co/56fb78atsc"
回答4:
The above solutions may have worked but not anymore in the newest versions of wordcloud and tm.
This problem almost made me crazy, but I found a solution and want to explain it the best I can to save anyone becoming desperate.
The function which is implicitly called by wordcloud and responsible for throwing the error
Error in FUN(content(x), ...) : in 'utf8towcs'
is this one:
words.corpus <- tm_map(words.corpus, tolower)
which is a shortcut for
words.corpus <- tm_map(words.corpus, content_transformer(tolower))
To provide a reproducible example, here's a function that embeds the solution:
plot_wordcloud <- function(words, max_words = 70, remove_words ="",
n_colors = 5, palette = "Set1")
{
require(dplyr)
require(wordcloud)
require(RColorBrewer) # for brewer.pal()
require(tm) # for tm_map()
# Solution: remove all non-printable characters in UTF-8 with this line
words <- iconv(words, "ASCII", "UTF-8", sub="byte")
wc <- wordcloud(words=words.corpus, max.words=max_words,
random.order=FALSE,
colors = brewer.pal(n_colors, palette),
random.color = FALSE,
scale=c(5.5,.5), rot.per=0.35) %>% recordPlot
return(wc)
}
Here's what failed:
I tried to convert the text BEFORE and AFTER creating the corpus with
words.corpus <- Corpus(VectorSource(words))
BEFORE:
Converting to UTF-8 on the text didn't work with:
words <- sapply(words, function(x) iconv(enc2utf8(x), sub = "byte"))
nor
for (i in 1:length(words))
{
Encoding(words[[i]])="UTF-8"
}
AFTER:
Converting to UTF-8 on the corpus didn't work with:
words.corpus <- tm_map(words.corpus, removeWords, remove_words)
nor
words.corpus <- tm_map(words.corpus, content_transformer(stringi::stri_trans_tolower))
nor
words.corpus <- tm_map(words.corpus, function(x) iconv(x, to='UTF-8'))
nor
words.corpus <- tm_map(words.corpus, enc2utf8)
nor
words.corpus <- tm_map(words.corpus, tolower)
All these solutions may have worked at a certain point in time, so I don't want to discredit the authors. They may work some time in the future. But why they didn't work is almost impossible to say because there were good reasons why they were supposed to work. Anyway, just remember to convert the text before creating the corpus with:
words <- iconv(words, "ASCII", "UTF-8", sub="byte")
Disclaimer: I got the solution with more detailed explanation here: http://www.textasdata.com/2015/02/encoding-headaches-emoticons-and-rs-handling-of-utf-816/
回答5:
I ended up with updating my RStudio and packages. This seemed to solve the tolower/ content_transformer issues. I read somewhere that the last tm-package had some issues with tm_map, so maybe that was the problem. In any case, this worked!
回答6:
Instead of
corp <- tm_map(corp, content_transformer(tolower), mc.cores=1)
use
corp <- tm_map(corp, tolower, mc.cores=1)
回答7:
While using code similar to that above and working on a word cloud shiny app which ran fine on my own pc, but didn't work either on amazon aws or shiny apps.io, I discovered that text with 'accents',e.g. santé in it didn't upload well as csv files to the cloud. I found a solution by saving the files as .txt files and in utf-8 using notepad and re-writing my code to allow for the fact that the files were no longer csv but txt. My versions of R was 3.2.1 and Rstudio was Version 0.99.465
回答8:
Just to mention, I had the same problem in a different context (nothing to do with tm or Twitter). For me, the solution was iconv(x, "latin1", "UTF-8")
, even though Encoding()
told me it was already UTF-8.
来源:https://stackoverflow.com/questions/27756693/fun-error-after-running-tolower-while-making-twitter-wordcloud