FUN-error after running 'tolower' while making Twitter wordcloud

Submitted by Deadly on 2019-12-05 00:53:07

Question


Trying to create wordcloud from twitter data, but get the following error:

Error in FUN(X[[72L]], ...) : 
  invalid input '������������❤������������ "@xxx:bla, bla, bla... http://t.co/56Fb78aTSC"' in 'utf8towcs' 

This error appears after running the "mytwittersearch_corpus <- tm_map(mytwittersearch_corpus, tolower)" line in the code below:

mytwittersearch_list <- sapply(mytwittersearch, function(x) x$getText())

mytwittersearch_corpus <- Corpus(VectorSource(mytwittersearch_list))
mytwittersearch_corpus <- tm_map(mytwittersearch_corpus, tolower)
mytwittersearch_corpus <- tm_map(mytwittersearch_corpus, removePunctuation)
mytwittersearch_corpus <- tm_map(mytwittersearch_corpus, function(x) removeWords(x, stopwords()))

I read on other pages that this may be due to R having difficulty processing symbols, emoticons and letters from non-English languages, but that does not appear to be the problem with the "error tweets" R is choking on. I did run:

mytwittersearch_corpus <- tm_map(mytwittersearch_corpus, function(x) iconv(enc2utf8(x), sub = "byte"))
mytwittersearch_corpus <- tm_map(mytwittersearch_corpus, content_transformer(function(x) iconv(enc2utf8(x), sub = "byte")))

These do not help. I also get an error saying the function content_transformer cannot be found, even though the tm package is checked off and running.

I'm running this on OS X 10.6.8 and using the latest RStudio.


Answer 1:


I use this code to get rid of the problem characters:

tweets$text <- sapply(tweets$text,function(row) iconv(row, "latin1", "ASCII", sub=""))
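To see what that line does, here is a small sketch (the tweet string is made up; the \xe2\x9d\xa4 escape is the raw UTF-8 byte sequence of a heart emoji). Reading the bytes as latin1 and converting to ASCII with sub = "" simply deletes every byte that has no ASCII equivalent:

```r
# Hypothetical tweet carrying a UTF-8 heart emoji as raw bytes
tweet <- "Nearly half of #Millennials \xe2\x9d\xa4 watch video"

# Read the bytes as latin1, convert to ASCII, and delete (sub = "")
# anything that cannot be represented -- the emoji bytes vanish
iconv(tweet, "latin1", "ASCII", sub = "")
```

Note that this throws away the non-ASCII characters entirely, which is fine for an English-language wordcloud but lossy for anything else.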



Answer 2:


A nice example of creating a wordcloud from Twitter data is here. Using that example and the code below, and passing tolower as a control option while creating the TermDocumentMatrix, I could create a Twitter wordcloud.

library(twitteR)
library(tm)
library(wordcloud)
library(RColorBrewer)
library(ggplot2)


#Collect tweets containing 'new year'
tweets = searchTwitter("new year", n=50, lang="en")

#Extract text content of all the tweets
tweetTxt = sapply(tweets, function(x) x$getText())

#In tm package, the documents are managed by a structure called Corpus
myCorpus = Corpus(VectorSource(tweetTxt))

#Create a term-document matrix from the corpus
tdm = TermDocumentMatrix(myCorpus,
                         control = list(removePunctuation = TRUE,
                                        stopwords = c("new", "year", stopwords("english")),
                                        removeNumbers = TRUE,
                                        tolower = TRUE))

#Convert to a matrix
m = as.matrix(tdm)

#Get word counts in decreasing order
word_freqs = sort(rowSums(m), decreasing=TRUE) 

#Create data frame with words and their frequencies
dm = data.frame(word=names(word_freqs), freq=word_freqs)

#Plot wordcloud
wordcloud(dm$word, dm$freq, random.order=FALSE, colors=brewer.pal(8, "Dark2"))




Answer 3:


Have you tried updating tm and using stri_trans_tolower from stringi?

library(twitteR)
library(tm)
library(stringi)
setup_twitter_oauth("CONSUMER_KEY", "CONSUMER_SECRET")
mytwittersearch <- showStatus(551365749550227456) 
mytwittersearch_list <- mytwittersearch$getText()
mytwittersearch_corpus <- Corpus(VectorSource(mytwittersearch_list))

mytwittersearch_corpus <- tm_map(mytwittersearch_corpus, content_transformer(tolower))
# Error in FUN(content(x), ...) : 
#   invalid input 'í ½í±…í ¼í¾¯â¤í ¼í¾§í ¼í½œ "@comScore: Nearly half of #Millennials do at least some of their video viewing from a smartphone or tablet: http://t.co/56Fb78aTSC"' in 'utf8towcs'

mytwittersearch_corpus <- tm_map(mytwittersearch_corpus, content_transformer(stri_trans_tolower))
inspect(mytwittersearch_corpus)
# <<VCorpus (documents: 1, metadata (corpus/indexed): 0/0)>>
#   
# [[1]]
# <<PlainTextDocument (metadata: 7)>>
# <ed><U+00A0><U+00BD><ed><U+00B1><U+0085><ed><U+00A0><U+00BC><ed><U+00BE><U+00AF><U+2764><ed><U+00A0><U+00BC><ed><U+00BE><U+00A7><ed><U+00A0><U+00BC><ed><U+00BD><U+009C> "@comscore: nearly half of #millennials do at least some of their video viewing from a smartphone or tablet: http://t.co/56fb78atsc"



Answer 4:


The solutions above may have worked at the time, but they no longer do in the newest versions of the wordcloud and tm packages.

This problem almost drove me crazy, but I found a solution and want to explain it as best I can to spare anyone else the desperation.

The call that wordcloud makes implicitly, and that is responsible for throwing the error

 Error in FUN(content(x), ...) : in 'utf8towcs'

is this one:

words.corpus <- tm_map(words.corpus, tolower)

which is a shortcut for

words.corpus <- tm_map(words.corpus, content_transformer(tolower))

To provide a reproducible example, here's a function that embeds the solution:

plot_wordcloud <- function(words, max_words = 70, remove_words = "",
                           n_colors = 5, palette = "Set1")
{
    require(dplyr)        # for %>%
    require(wordcloud)
    require(RColorBrewer) # for brewer.pal()
    require(tm)           # for Corpus() and tm_map()

    # Solution: replace all non-ASCII bytes with printable <xx> tokens
    words <- iconv(words, "ASCII", "UTF-8", sub="byte")

    words.corpus <- Corpus(VectorSource(words))
    words.corpus <- tm_map(words.corpus, removeWords, remove_words)

    wc <- wordcloud(words=words.corpus, max.words=max_words,
                    random.order=FALSE,
                    colors = brewer.pal(n_colors, palette),
                    random.color = FALSE,
                    scale=c(5.5,.5), rot.per=0.35) %>% recordPlot
    return(wc)
}

Here's what failed:

I tried to convert the text BEFORE and AFTER creating the corpus with

words.corpus <- Corpus(VectorSource(words))

BEFORE:

Converting the text to UTF-8 didn't work with:

words <- sapply(words, function(x) iconv(enc2utf8(x), sub = "byte"))

nor

for (i in 1:length(words))
{
    Encoding(words[[i]])="UTF-8"
}

AFTER:

Converting the corpus to UTF-8 didn't work with:

    words.corpus <- tm_map(words.corpus, removeWords, remove_words)

nor

    words.corpus <- tm_map(words.corpus, content_transformer(stringi::stri_trans_tolower))

nor

    words.corpus <- tm_map(words.corpus, function(x) iconv(x, to='UTF-8'))

nor

    words.corpus <- tm_map(words.corpus, enc2utf8)

nor

    words.corpus <- tm_map(words.corpus, tolower)

All of these solutions may have worked at some point in time, so I don't want to discredit their authors, and they may work again in the future. But why they didn't work here is almost impossible to say, because there were good reasons to expect each of them to. In any case, just remember to convert the text before creating the corpus with:

    words <- iconv(words, "ASCII", "UTF-8", sub="byte")
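As a minimal sketch of what that conversion buys you (the tweet string is made up; \xf0\x9f\x8e\x89 is the raw UTF-8 byte sequence of an emoji): with from = "ASCII" and sub = "byte", every non-ASCII byte is rewritten as a printable <xx> token, so functions like tolower no longer hit the utf8towcs error.

```r
tweet <- "Happy new year \xf0\x9f\x8e\x89"  # raw UTF-8 bytes of an emoji

# Non-ASCII bytes become literal "<f0><9f><8e><89>" text
clean <- iconv(tweet, "ASCII", "UTF-8", sub = "byte")

tolower(clean)  # works; no 'utf8towcs' error
```

The <xx> tokens then show up as junk "words" in the cloud, so you may want to strip them afterwards, e.g. with removeWords or a gsub on the "<..>" pattern.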

Disclaimer: I got the solution with more detailed explanation here: http://www.textasdata.com/2015/02/encoding-headaches-emoticons-and-rs-handling-of-utf-816/




Answer 5:


I ended up updating RStudio and my packages. This seemed to solve the tolower/content_transformer issues. I read somewhere that the latest tm package had some issues with tm_map, so maybe that was the problem. In any case, this worked!




Answer 6:


Instead of

corp <- tm_map(corp, content_transformer(tolower), mc.cores=1)

use

corp <- tm_map(corp, tolower, mc.cores=1)



Answer 7:


While using code similar to the above for a wordcloud Shiny app, which ran fine on my own PC but not on Amazon AWS or shinyapps.io, I discovered that text containing accented characters (e.g. santé) did not upload well to the cloud as CSV files. I solved it by saving the files as .txt in UTF-8 using Notepad, and rewriting my code to read them as plain text rather than CSV. My version of R was 3.2.1 and RStudio was version 0.99.465.




Answer 8:


Just to mention, I had the same problem in a different context (nothing to do with tm or Twitter). For me, the solution was iconv(x, "latin1", "UTF-8"), even though Encoding() told me it was already UTF-8.
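To illustrate with a constructed example (the string below is made up to reproduce that situation: latin1 bytes that have been mislabelled as UTF-8, so Encoding() reports "UTF-8" even though the bytes are not):

```r
x <- "caf\xe9"           # \xe9 is the latin1 byte for é
Encoding(x) <- "UTF-8"   # mislabel it: Encoding(x) now says "UTF-8"

# iconv trusts the 'from' argument, not the label, so re-decoding
# from the true source encoding repairs the string
fixed <- iconv(x, "latin1", "UTF-8")
identical(fixed, "caf\u00e9")   # TRUE
```

This is why the declared encoding reported by Encoding() can be misleading: it is only a label on the bytes, not a guarantee about what they actually contain.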



Source: https://stackoverflow.com/questions/27756693/fun-error-after-running-tolower-while-making-twitter-wordcloud
