Twitter Data Analysis - Error in Term Document Matrix

滥情空心 2020-12-03 18:30

I am trying to analyze Twitter data. I downloaded the tweets and created a corpus from their text using the code below:

# Creating a Corpus
wim_co         


        
6 answers
  •  感情败类
    2020-12-03 19:23

    I found a way to solve this problem in an article about TM.

    An example that reproduces the error follows:

    getwd()
    require(tm)
    
    # Importing files
    files <- DirSource(directory = "texts/",encoding ="latin1" )
    
    # loading files and creating a Corpus
    corpus <- VCorpus(x=files)
    
    # Summary
    
    summary(corpus)
    corpus <- tm_map(corpus,removePunctuation)
    corpus <- tm_map(corpus,stripWhitespace)
    corpus <- tm_map(corpus,removePunctuation)
    matrix_terms <- DocumentTermMatrix(corpus)
    
    Warning messages:
    In TermDocumentMatrix.VCorpus(x, control) : invalid document identifiers
    

    This error occurs because the function needs a corpus built from a VectorSource to create the term-document matrix, but the previous transformations turn the documents in your corpus into plain character data, a class the function does not accept.
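    One way to check whether this is what happened in your own session is to inspect the class of each document after the transformations. The following is a minimal sketch, not taken from the article: the two sample texts and the name `is_text_doc` are hypothetical stand-ins, and it assumes a version of tm in which `content()` returns the list of documents held by a `VCorpus`.

```r
library(tm)

# Hypothetical toy texts standing in for the downloaded tweets
texts <- c("First tweet, with punctuation!",
           "Second   tweet with   extra spaces.")
corpus <- VCorpus(VectorSource(texts))

# Same transformations as in the example above
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, stripWhitespace)

# content() returns the underlying list of documents; check whether each
# element still inherits from "TextDocument" after the transformations
is_text_doc <- vapply(content(corpus), inherits, logical(1),
                      what = "TextDocument")
print(is_text_doc)
```

    If any entry is `FALSE`, that document has lost its document class and is a likely source of the problem.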

    However, if you add one more command before calling DocumentTermMatrix (or TermDocumentMatrix), you can proceed.

    The code with the new command follows:

    getwd()
    require(tm)  
    
    files <- DirSource(directory = "texts/",encoding ="latin1" )
    
    # loading files and creating a Corpus
    corpus <- VCorpus(x=files)
    
    # Summary 
    summary(corpus)
    corpus <- tm_map(corpus,removePunctuation)
    corpus <- tm_map(corpus,stripWhitespace)
    corpus <- tm_map(corpus,removePunctuation)
    
    # COMMAND TO CHANGE THE CLASS AND AVOID THIS ERROR
    corpus <- Corpus(VectorSource(corpus))
    matrix_terms <- DocumentTermMatrix(corpus)
    

    With this change, the error should no longer occur.
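    A different approach, which is not the one described in this answer, is to keep the documents from losing their class in the first place. This sketch assumes that a base R function such as `tolower` is among your transformations (the code above does not show one) and wraps it in tm's `content_transformer()`. The sample texts are hypothetical.

```r
library(tm)

# Hypothetical toy texts standing in for the downloaded tweets
texts <- c("First TWEET, with punctuation!",
           "Second   tweet with   extra spaces.")
corpus <- VCorpus(VectorSource(texts))

# Base functions such as tolower are wrapped in content_transformer()
# so that each document keeps its text-document class
corpus <- tm_map(corpus, content_transformer(tolower))

# tm's own transformations can be passed to tm_map() directly
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, stripWhitespace)

# Build the document-term matrix from the transformed corpus
matrix_terms <- DocumentTermMatrix(corpus)
inspect(matrix_terms)
```

    With this variant the corpus is not rebuilt, so the original document identifiers and metadata are left in place.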
