tm package error “Cannot convert DocumentTermMatrix into normal matrix since vector is too large”

Submitted by 北城以北 on 2019-12-05 11:16:21

The quick and dirty way is to export your data into a sparse matrix object from an external package like Matrix.

> attributes(dtm)
$names
[1] "i"        "j"        "v"        "nrow"     "ncol"     "dimnames"

$class
[1] "DocumentTermMatrix"    "simple_triplet_matrix"

$Weighting
[1] "term frequency" "tf"            

The dtm object has the i, j and v attributes, which are the internal (triplet) representation of your DocumentTermMatrix. Use:

library("Matrix")
mat <- sparseMatrix(i = dtm$i, j = dtm$j, x = dtm$v,
                    dims = c(dtm$nrow, dtm$ncol))

and you're done.

A naive comparison between your objects:

> mat[1,1:100]
> head(as.vector(dtm[1,]), 100)

will each give you exactly the same output.

DocumentTermMatrix uses a sparse matrix representation, so it doesn't take up all that memory storing all those zeros. Depending on what you want to do, you might have some luck with the SparseM package, which provides linear algebra routines for sparse matrices.
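To illustrate the general idea, common operations such as term totals, document lengths, and row-normalization can be done directly on the sparse object without ever densifying it. A minimal sketch using the Matrix package (the small counts matrix here is made-up toy data standing in for a real converted DTM):

```r
library(Matrix)

# Toy sparse counts matrix standing in for a converted DTM
# (3 documents x 3 terms; values are illustrative only)
m <- sparseMatrix(i = c(1, 1, 2, 3),
                  j = c(1, 2, 2, 3),
                  x = c(2, 1, 3, 4),
                  dims = c(3, 3))

term_totals <- colSums(m)   # total count of each term across documents
doc_lengths <- rowSums(m)   # tokens per document

# Row-normalize counts to term frequencies; pre-multiplying by a
# diagonal matrix keeps the result sparse
tf <- Diagonal(x = 1 / doc_lengths) %*% m
```

Each row of `tf` now sums to 1, and the intermediate objects stay sparse throughout.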

tim riffe

Are you able to increase the amount of RAM available to R? See this post: Increasing (or decreasing) the memory available to R processes

Also, when working with big objects in R, I occasionally call gc() to free up wasted memory.

The number of documents should not be a problem, but you may want to try removing sparse terms; this can substantially reduce the dimensions of the document-term matrix.

inspect(removeSparseTerms(dtm, 0.7))

This removes terms whose sparsity exceeds 0.7, i.e. terms that are absent from more than 70% of the documents.

Another option is to specify a minimum word length and a minimum document frequency when you create the document-term matrix:

a.dtm <- DocumentTermMatrix(a.corpus, control = list(weighting = weightTfIdf, minWordLength = 2, minDocFreq = 5))

Use inspect(dtm) before and after your changes and you will see a huge difference; more importantly, you won't ruin significant relations hidden in your docs and terms.

user3434580

Since you only have 1859 documents, the distance matrix you need to compute is fairly small. Using the slam package (and in particular, its crossapply_simple_triplet_matrix function), you might be able to compute the distance matrix directly, instead of converting the DTM into a dense matrix first. This means that you will have to compute the Jaccard similarity yourself. I have successfully tried something similar for the cosine distance matrix on a large number of documents.
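As a sketch of computing the Jaccard similarity yourself, here is the same idea using the Matrix package instead of slam (the small binary matrix is hypothetical toy data; in practice you would build it from dtm$i and dtm$j as shown earlier). The intersection sizes |A ∩ B| come from a sparse cross-product, and the union is |A| + |B| − |A ∩ B|:

```r
library(Matrix)

# Toy binary occurrence matrix (3 documents x 3 terms), standing in
# for a real document-term matrix
mat <- sparseMatrix(i = c(1, 1, 2, 2, 3),
                    j = c(1, 2, 2, 3, 3),
                    x = 1,
                    dims = c(3, 3))
bin <- mat > 0                         # ensure a 0/1 occurrence matrix

inter <- tcrossprod(bin * 1)           # docs x docs: |A intersect B|, stays sparse
sizes <- rowSums(bin)                  # |A| for each document
union <- outer(sizes, sizes, "+") - as.matrix(inter)

jaccard <- as.matrix(inter) / union    # similarity in [0, 1]
jaccard_dist <- 1 - jaccard            # distance; 0 on the diagonal
```

The only dense object ever created is the final docs-by-docs matrix, which for 1859 documents is small.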
