How to access CoordinateMatrix entries directly in Spark?

Submitted by 不打扰是莪最后的温柔 on 2020-01-16 01:01:47

Question


I want to store a big sparse matrix using Spark, so I tried to use CoordinateMatrix, since it is a distributed matrix.

However, I have not found a way to access each entry directly, such as:

apply(int x, int y)

I only found the functions like:

public RDD<MatrixEntry> entries()

In this case, I would have to loop over all the entries to find the one I want, which is not an efficient way.

Has anyone used CoordinateMatrix before?

What should I do to get each entry from CoordinateMatrix efficiently?


Answer 1:


The short answer is: you don't. RDDs (and CoordinateMatrix is more or less a wrapper around an RDD[MatrixEntry]) are not well suited for random access. Moreover, RDDs are immutable, so you cannot simply modify a single entry. If that is your requirement, you're probably looking at the wrong technology.

There is some limited support for random access if you use a PairRDD. If such an RDD is partitioned, you can use the lookup method to efficiently recover a single value:

import org.apache.spark.HashPartitioner
import org.apache.spark.mllib.linalg.distributed.MatrixEntry

val n = ??? // Number of partitions

// Key each entry by its (row, column) coordinates and hash-partition by key,
// so a lookup only touches the one partition the key hashes to.
val pairs = mat.entries
  .map { case MatrixEntry(i, j, v) => ((i, j), v) }
  .partitionBy(new HashPartitioner(n))

// MatrixEntry indices are Long, so the lookup key must be (Long, Long)
pairs.lookup((1L, 1L))
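The reason this is efficient can be illustrated without Spark at all. The sketch below (plain Python, not Spark API) mimics what a HashPartitioner plus lookup does: entries are grouped into buckets by key hash, so a lookup scans only one bucket instead of the whole dataset.

```python
# Spark-free sketch of hash-partitioned lookup. The function names
# partition_entries/lookup are illustrative, not part of any Spark API.

def partition_entries(entries, n):
    """Group ((i, j), value) pairs into n partitions by key hash."""
    parts = [[] for _ in range(n)]
    for key, value in entries:
        parts[hash(key) % n].append((key, value))
    return parts

def lookup(parts, key):
    """Scan only the single partition the key hashes to."""
    return [v for k, v in parts[hash(key) % len(parts)] if k == key]

entries = [((0, 0), 1.0), ((1, 1), 2.0), ((2, 3), 5.0)]
parts = partition_entries(entries, n=4)
print(lookup(parts, (1, 1)))  # [2.0]
```

This is why `partitionBy` matters in the Scala snippet: without a known partitioner, `lookup` has to scan every partition.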


Source: https://stackoverflow.com/questions/31618748/how-to-access-coordinatematrix-entries-directly-in-spark
