Row sum for large term-document matrix / simple_triplet_matrix ?? {tm package}

Submitted by 空扰寡人 on 2019-12-03 07:06:20

Question


So I have a very large term-document matrix:

> class(ph.DTM)
[1] "TermDocumentMatrix"    "simple_triplet_matrix"

> ph.DTM
A term-document matrix (109996 terms, 262811 documents)

Non-/sparse entries: 3705693/28904453063
Sparsity           : 100%
Maximal term length: 191 
Weighting          : term frequency (tf)

How do I get the rowSum (frequency) of each term? I tried:

> apply(ph.DTM, 1, sum)
Error in vector(typeof(x$v), nr * nc) : vector size cannot be NA
In addition: Warning message:
In nr * nc : NAs produced by integer overflow
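
The error message suggests that apply() first tries to build a dense copy with nr * nc cells. A minimal sketch in base R, using only the dimensions printed above, shows why that product cannot be held in an R integer (the variable names are made up for illustration):

```r
# Dimensions reported for ph.DTM in the question.
n_terms <- 109996L
n_docs  <- 262811L

# Double arithmetic gives the true number of cells a dense copy would need.
n_cells <- as.numeric(n_terms) * as.numeric(n_docs)

# The largest value an R integer can hold (2^31 - 1).
int_max <- .Machine$integer.max

# TRUE: the cell count does not fit in an integer.
exceeds_int <- n_cells > int_max

# Integer arithmetic overflows to NA (with a warning), as in the error message.
overflowed <- suppressWarnings(n_terms * n_docs)
```

The cell count equals the sum of the non-sparse and sparse entries in the printout, so any approach that goes through a dense matrix hits the same limit.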

Obviously, I know about removeSparseTerms:

ph.DTM2 <- removeSparseTerms(ph.DTM, 0.99999)

Which cuts down the size a bit:

> ph.DTM2
A term-document matrix (28842 terms, 262811 documents)

Non-/sparse entries: 3612620/7576382242
Sparsity           : 100%
Maximal term length: 24 
Weighting          : term frequency (tf)

But I still cannot apply any matrix-related functions to it:

> as.matrix(ph.DTM2)
Error in vector(typeof(x$v), nr * nc) : vector size cannot be NA
In addition: Warning message:
In nr * nc : NAs produced by integer overflow

How can I just get a simple row sum on this object?? Thanks!!


Answer 1:


OK, after some more Googling, I came across the slam package, which provides rollup:

ph.DTM3 <- slam::rollup(ph.DTM, 2, na.rm = TRUE, FUN = sum)

This works: it sums over margin 2 (the documents) without ever building the dense matrix.
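
To see what rollup returns, here is a small sketch on a made-up matrix. It assumes the slam package is installed; the toy matrix and its term and document names are hypothetical and stand in for ph.DTM:

```r
library(slam)  # assumed to be installed

# Hypothetical toy term-document matrix: 3 terms x 4 documents, triplet form.
toy <- simple_triplet_matrix(
  i = c(1, 1, 2, 3, 3),   # term (row) index of each non-zero entry
  j = c(1, 3, 2, 1, 4),   # document (column) index of each entry
  v = c(2, 1, 5, 1, 3),   # counts
  nrow = 3, ncol = 4,
  dimnames = list(c("apple", "pear", "plum"), paste0("d", 1:4))
)

# Collapse margin 2 (documents) with sum; the result is still sparse, 3 x 1.
rolled <- rollup(toy, 2, na.rm = TRUE, FUN = sum)

# A one-column result is small enough to convert to an ordinary vector.
term_freq <- as.matrix(rolled)[, 1]
```

The result is a one-column simple_triplet_matrix rather than a plain vector, so a final conversion step like the one above is needed to get per-term totals.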




Answer 2:


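As alluded to by @badpanda in one of the comments, slam now has the row_sums and col_sums functions for sparse arrays:

slam::row_sums(dtm, na.rm = TRUE)
slam::col_sums(tdm, na.rm = TRUE)

For the term-document matrix in the question (terms in rows), slam::row_sums(ph.DTM, na.rm = TRUE) gives the frequency of each term.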
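
A short sketch of both functions on a made-up matrix, with a base-R cross-check that works directly on the stored triplets. It assumes slam is installed; the toy data and variable names are hypothetical:

```r
library(slam)  # assumed to be installed

# Hypothetical toy term-document matrix (terms in rows, documents in columns).
tdm <- simple_triplet_matrix(
  i = c(1, 1, 2, 3, 3), j = c(1, 3, 2, 1, 4), v = c(2, 1, 5, 1, 3),
  nrow = 3, ncol = 4,
  dimnames = list(c("apple", "pear", "plum"), paste0("d", 1:4))
)

term_freq <- row_sums(tdm, na.rm = TRUE)  # total count of each term
doc_len   <- col_sums(tdm, na.rm = TRUE)  # total count in each document

# Most frequent terms first.
top_terms <- sort(term_freq, decreasing = TRUE)

# Base-R cross-check: sum the stored values v grouped by row index i.
# Rows with no stored entries keep a total of 0.
manual <- numeric(tdm$nrow)
sums <- tapply(tdm$v, tdm$i, sum)
manual[as.integer(names(sums))] <- sums
```

Both routes touch only the non-zero entries, so the cost depends on the number of stored values rather than on nr * nc.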



Answer 3:


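For a matrix small enough to fit in memory as a dense matrix:

 rowSums(as.matrix(ph.DTM))

would work as well. On the matrix in the question it does not: as.matrix() fails with the same integer overflow shown above, because it tries to allocate all nr * nc cells.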



Source: https://stackoverflow.com/questions/21921422/row-sum-for-large-term-document-matrix-simple-triplet-matrix-tm-package