tm package error “Cannot convert DocumentTermMatrix into normal matrix since vector is too large”

Submitted by 北城以北 on 2019-12-05 11:16:21

The quick and dirty way is to export your data into a sparse matrix object from an external package like Matrix.

> attributes(dtm)
$names
[1] "i"        "j"        "v"        "nrow"     "ncol"     "dimnames"

$class
[1] "DocumentTermMatrix"    "simple_triplet_matrix"

$Weighting
[1] "term frequency" "tf"            

The dtm object has the i, j and v attributes, which are the internal (triplet) representation of your DocumentTermMatrix. Use:

library("Matrix")
mat <- sparseMatrix(i = dtm$i, j = dtm$j, x = dtm$v,
                    dims = c(dtm$nrow, dtm$ncol))

and you're done.

A naive comparison between your objects:

> mat[1,1:100]
> head(as.vector(dtm[1,]), 100)

will each give you exactly the same output.

DocumentTermMatrix uses a sparse matrix representation, so it doesn't take up all that memory storing all those zeros. Depending on what you want to do, you might have some luck with the SparseM package, which provides linear algebra routines for sparse matrices.
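To illustrate the general idea, common operations such as term totals, document lengths, and row-normalization can be done directly on the sparse object without ever densifying it. A minimal sketch using the Matrix package (the small counts matrix here is made-up toy data standing in for a real converted DTM):

```r
library(Matrix)

# Toy sparse counts matrix standing in for a converted DTM
# (3 documents x 3 terms; values are illustrative only)
m <- sparseMatrix(i = c(1, 1, 2, 3),
                  j = c(1, 2, 2, 3),
                  x = c(2, 1, 3, 4),
                  dims = c(3, 3))

term_totals <- colSums(m)   # total count of each term across documents
doc_lengths <- rowSums(m)   # tokens per document

# Row-normalize counts to term frequencies; pre-multiplying by a
# diagonal matrix keeps the result sparse
tf <- Diagonal(x = 1 / doc_lengths) %*% m
```

Each row of `tf` now sums to 1, and the intermediate objects stay sparse throughout.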

tim riffe

Are you able to increase the amount of RAM available to R? See this post: Increasing (or decreasing) the memory available to R processes

Also, when working with big objects in R, I occasionally call gc() to free up wasted memory.

The number of documents should not be a problem, but you may want to try removing sparse terms; this can substantially reduce the dimensions of the document-term matrix.

inspect(removeSparseTerms(dtm, 0.7))

This removes terms whose sparsity exceeds 0.7, i.e. terms that are absent from more than 70% of the documents.

Another option is to specify a minimum word length and a minimum document frequency when you create the document-term matrix:

a.dtm <- DocumentTermMatrix(a.corpus, control = list(weighting = weightTfIdf, minWordLength = 2, minDocFreq = 5))

Use inspect(dtm) before and after your changes and you will see a huge difference; more importantly, you won't ruin significant relations hidden in your docs and terms.

user3434580

Since you only have 1859 documents, the distance matrix you need to compute is fairly small. Using the slam package (and in particular, its crossapply_simple_triplet_matrix function), you might be able to compute the distance matrix directly, instead of converting the DTM into a dense matrix first. This means that you will have to compute the Jaccard similarity yourself. I have successfully tried something similar for the cosine distance matrix on a large number of documents.
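As a sketch of computing the Jaccard similarity yourself, here is the same idea using the Matrix package instead of slam (the small binary matrix is hypothetical toy data; in practice you would build it from dtm$i and dtm$j as shown earlier). The intersection sizes |A ∩ B| come from a sparse cross-product, and the union is |A| + |B| − |A ∩ B|:

```r
library(Matrix)

# Toy binary occurrence matrix (3 documents x 3 terms), standing in
# for a real document-term matrix
mat <- sparseMatrix(i = c(1, 1, 2, 2, 3),
                    j = c(1, 2, 2, 3, 3),
                    x = 1,
                    dims = c(3, 3))
bin <- mat > 0                         # ensure a 0/1 occurrence matrix

inter <- tcrossprod(bin * 1)           # docs x docs: |A intersect B|, stays sparse
sizes <- rowSums(bin)                  # |A| for each document
union <- outer(sizes, sizes, "+") - as.matrix(inter)

jaccard <- as.matrix(inter) / union    # similarity in [0, 1]
jaccard_dist <- 1 - jaccard            # distance; 0 on the diagonal
```

The only dense object ever created is the final docs-by-docs matrix, which for 1859 documents is small.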
