Efficient Way to Convert CSV of Sparse Distances to Dist Object R

Submitted by 爱⌒轻易说出口 on 2019-12-11 05:08:20

Question


I have a very large CSV file (about 91 million rows, so a for loop takes too long in R) of similarities between keywords (about 50,000 unique keywords). When I read it into a data.frame it looks like:

> df   
kwd1 kwd2 similarity  
a  b  1  
b  a  1  
c  a  2  
a  c  2 

It is a sparse list and I can convert it into a sparse matrix using sparseMatrix():

> myMatrix 
  a b c  
a . 1 2
b 1 . .
c 2 . .
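For reference, the conversion described above could look roughly like the following. This is a minimal sketch on the toy data, assuming the Matrix package; the helper names kwds, i and j are illustrative and not from the original post.

```r
library(Matrix)

# Toy data frame matching the example above
df <- data.frame(kwd1 = c("a", "b", "c", "a"),
                 kwd2 = c("b", "a", "a", "c"),
                 similarity = c(1, 1, 2, 2),
                 stringsAsFactors = FALSE)

# Map each keyword to an integer row/column index
kwds <- sort(unique(c(df$kwd1, df$kwd2)))
i <- match(df$kwd1, kwds)
j <- match(df$kwd2, kwds)

# Only the listed pairs are stored; every other cell is an implicit zero
myMatrix <- sparseMatrix(i = i, j = j, x = df$similarity,
                         dims = c(length(kwds), length(kwds)),
                         dimnames = list(kwds, kwds))
```

Note that sparseMatrix() sums the values of duplicated (i, j) pairs, so this assumes each ordered keyword pair appears at most once in the file.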

However, I would now like to convert this into a dist object. I tried as.dist(myMatrix), but I got an error saying the 'problem was too large' for as.dist(). Thinking it might be cheaper, I also tried converting the sparse matrix to a lower triangular sparse matrix first, using myMatrix = myMatrix * lower.tri(myMatrix), and then to a dist object. That gave the same error, this time from the lower.tri function.

Thanks for any help!


Answer 1:


An object of class "dist" is a dense object: for n objects it stores all n(n - 1)/2 pairwise values in one vector. With about 50,000 unique keywords that is

R> 0.5 * (50000 * 49999)
[1] 1249975000

elements, or roughly 10 GB stored as doubles. Before it gets that far, as.dist() coerces its argument to an ordinary dense matrix with as.matrix(), and a dense 50,000 x 50,000 matrix has

R> 50000^2
[1] 2.5e+09

elements. In R, the maximum length of a standard vector is 2^31 - 1:

R> 2^31 - 1
[1] 2147483647

which is smaller than the number of elements in that dense matrix, so the conversion is not possible and that is the reason for the error from as.dist(). For the same reason lower.tri(myMatrix) fails: it builds a dense matrix of the same dimensions, which runs into the same length limit.

At this point I think you'll need to explain more about the actual problem and what you want the dissimilarity object for (in another Question!). Do you need all dissimilarities between the 50,000 keywords, or could you get by with a sample that fits comfortably within R's vector length and memory limits?
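If a sample is enough, one possible approach is to subset the sparse matrix to a random set of keywords and convert only that block to dense form. The sketch below is illustrative only: it uses a randomly generated symmetric sparse matrix as a stand-in for the real data, and the sizes n and 100 are arbitrary assumptions.

```r
library(Matrix)

set.seed(42)
n <- 500   # toy stand-in for the real number of keywords

# Random symmetric sparse matrix standing in for the similarity data
myMatrix <- rsparsematrix(n, n, density = 0.01, symmetric = TRUE,
                          rand.x = runif)

# Keep a random subset of keywords (hypothetical sample size)
keep <- sort(sample(n, 100))

# Only the 100 x 100 block is converted to dense form, which is small
sub <- as.matrix(myMatrix[keep, keep])

# as.dist() takes the lower triangle of the dense block
d <- as.dist(sub)
length(d)   # 100 * 99 / 2 pairwise values
```

Two caveats apply. Pairs that were absent from the sparse data become explicit zeros in the dense block, and the stored values are similarities, so they would still need to be turned into dissimilarities in whatever way suits the analysis.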



Source: https://stackoverflow.com/questions/12379233/efficient-way-to-convert-csv-of-sparse-distances-to-dist-object-r
