Converting a Document Term Matrix into a Matrix with lots of data causes overflow

孤独总比滥情好 2020-12-29 09:12

Let's do some text mining.

I have a term-document matrix (from the tm package):

dtm <- TermDocumentMatrix(
     myCorpus,
          


        
3 Answers
  •  [愿得一人]
    2020-12-29 09:39

    The integer overflow tells you exactly what the problem is: with 40000 documents, you have too much data. The problem starts in the conversion to a matrix, which you can see by looking at the code of the underlying function:

    class(dtm)
    [1] "TermDocumentMatrix"    "simple_triplet_matrix"
    
    getAnywhere(as.matrix.simple_triplet_matrix)
    
    A single object matching ‘as.matrix.simple_triplet_matrix’ was found
    ...
    function (x, ...) 
    {
        nr <- x$nrow
        nc <- x$ncol
        y <- matrix(vector(typeof(x$v), nr * nc), nr, nc)
       ...
    }
    

    This is the line referenced by the error message. What is going on can easily be simulated with:

    as.integer(40000 * 60000) # 40000 documents give 40000 * 60000 = 2.4e9 cells in the resulting dense matrix
    [1] NA
    Warning message:
    NAs introduced by coercion 
    

    The function vector() takes the length as an argument, in this case nr * nc. If this is larger than approximately 2.1e9 (.Machine$integer.max), it is replaced by NA, and NA is not a valid length argument for vector().
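
    To make the arithmetic concrete, here is a minimal base-R sketch of the overflow. The variable names are illustrative, and the 8-bytes-per-cell estimate assumes the values are stored as doubles (as they would be with tf-idf weighting):

```r
# nr and nc stand in for x$nrow and x$ncol, which are stored as integers
nr <- 40000L
nc <- 60000L

# Integer multiplication overflows and yields NA (R also emits a warning,
# suppressed here so the script runs quietly)
int_product <- suppressWarnings(nr * nc)

# The same product in double precision shows the true number of cells
dbl_product <- as.numeric(nr) * as.numeric(nc)

is.na(int_product)                   # TRUE
dbl_product > .Machine$integer.max   # TRUE: 2.4e9 exceeds 2147483647

# Rough size of the dense matrix if it could be allocated (8 bytes per double)
dense_gb <- dbl_product * 8 / 1e9
dense_gb                             # about 19.2 GB
```

    Even where the length itself could be represented, a dense matrix of this size would need roughly 19 GB of memory, so avoiding the dense conversion is worthwhile in any case.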

    Bottom line: you're running into the limits of R. For now, working in 64-bit won't help you, so you'll have to resort to different methods. One possibility is to keep working with the list you already have (dtm is a list), select the data you need using list manipulation, and go from there.
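
    As a rough sketch of what such list manipulation could look like, the base-R example below works directly on the triplet fields (i, j, v, nrow, ncol, dimnames) that a simple_triplet_matrix carries. The object stm is a hand-made toy stand-in rather than a real tm object, and dense_subset is a hypothetical helper written for this illustration:

```r
# Toy stand-in with the same fields as a simple_triplet_matrix:
# v[k] is the value at row i[k] (term) and column j[k] (document)
stm <- list(
  i = c(1L, 2L, 2L, 3L),
  j = c(1L, 1L, 2L, 3L),
  v = c(2, 1, 4, 3),
  nrow = 3L,
  ncol = 3L,
  dimnames = list(Terms = c("oil", "price", "barrel"),
                  Docs  = c("d1", "d2", "d3"))
)

# Per-term totals computed from the triplets, without a dense matrix
term_totals <- vapply(seq_len(stm$nrow),
                      function(r) sum(stm$v[stm$i == r]),
                      numeric(1))
names(term_totals) <- stm$dimnames[[1]]

# Hypothetical helper: build a dense matrix for a few selected rows only
dense_subset <- function(x, rows) {
  keep <- x$i %in% rows
  out <- matrix(0, nrow = length(rows), ncol = x$ncol,
                dimnames = list(x$dimnames[[1]][rows], x$dimnames[[2]]))
  # Map original row indices to positions in the subset and fill the cells
  out[cbind(match(x$i[keep], rows), x$j[keep])] <- x$v[keep]
  out
}

small <- dense_subset(stm, c(2L, 3L))
```

    Here term_totals holds one sum per term and small is a 2 x 3 dense matrix for the terms "price" and "barrel" only. With a real dtm, the same idea applies to dtm$i, dtm$j and dtm$v, and only the selected slice is ever made dense.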

    PS: I created the dtm object with:

    library(tm)
    data("crude")  # sample corpus of Reuters articles shipped with tm
    dtm <- TermDocumentMatrix(crude,
                              control = list(weighting = weightTfIdf,
                                             stopwords = TRUE))
    
