Twitter Data Analysis - Error in Term Document Matrix

滥情空心 · 2020-12-03 18:30

Trying to do some analysis of Twitter data. I downloaded the tweets and created a corpus from the text of the tweets using the code below:

# Creating a Corpus
wim_co         
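The corpus-creation snippet above is cut off. Based on the error in the title, a typical tm workflow that triggers this looks like the sketch below; the object name wim, the text column, and the control options are hypothetical placeholders, not the original code:

    # Hypothetical reconstruction of the workflow (names are placeholders)
    library(tm)

    # 'wim' is assumed to be a data frame of downloaded tweets with a 'text' column
    wim_corpus <- Corpus(VectorSource(wim$text))

    # Building the term-document matrix is typically where the error is thrown,
    # because some tweets contain characters the tm tokenizer cannot handle
    tdm <- TermDocumentMatrix(wim_corpus,
                              control = list(removePunctuation = TRUE,
                                             stopwords = TRUE))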


        
6 Answers

    予麋鹿 (OP)
    2020-12-03 19:12

    I think the error is caused by some "exotic" characters in the tweet messages that the tm functions cannot handle. I've hit the same error when using tweets as a corpus source. The following workaround may help:

    # Reading some tweet messages (here from a text file) into a vector

    rawTweets <- readLines(con = "target_7_sample.txt", ok = TRUE, warn = FALSE, encoding = "UTF-8")
    

    # Convert the tweet text explicitly into utf-8

    convTweets <- iconv(rawTweets, to = "UTF-8")
    

    # The conversion above leaves NA entries in the vector: the tweets that could not be converted. Remove them with the following command:

    tweets <- convTweets[!is.na(convTweets)]
    

    If losing a few tweets is not an issue for your application (e.g., building a word cloud), this approach should work, and you can then proceed by calling the Corpus function of the tm package.

    Regards--Albert
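    Putting the steps above together, a minimal end-to-end sketch might look like the following; the file name is taken from the answer, while the final TermDocumentMatrix step is an assumption about how you would continue:

        library(tm)

        # Read raw tweets and force them into UTF-8; unconvertible tweets become NA
        rawTweets  <- readLines("target_7_sample.txt", warn = FALSE, encoding = "UTF-8")
        convTweets <- iconv(rawTweets, to = "UTF-8")
        tweets     <- convTweets[!is.na(convTweets)]

        # The cleaned vector can now be used as a corpus source
        corpus <- Corpus(VectorSource(tweets))
        tdm    <- TermDocumentMatrix(corpus)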
