Question
I want to store a big sparse matrix using Spark, so I tried to use CoordinateMatrix, since it is a distributed matrix. However, I have not found a way to access an entry directly, such as:

apply(int x, int y)

I only found functions like:

public RDD<MatrixEntry> entries()

In that case, I have to loop over the entries to find the one I want, which is not efficient.

Has anyone used CoordinateMatrix before? How can I get an entry from a CoordinateMatrix efficiently?
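To make the cost concrete, here is a minimal local sketch (no Spark; the sample entries and the Entry case class are illustrative) of the full-scan lookup the question describes: finding one element of a coordinate-format sparse matrix means scanning every entry.

```scala
// Minimal local sketch (no Spark): looking up one entry in a
// coordinate-format sparse matrix by scanning all entries.
case class Entry(i: Long, j: Long, v: Double)

val entries = Seq(Entry(0L, 0L, 1.0), Entry(1L, 1L, 2.0), Entry(2L, 0L, 3.0))

// O(n) scan: every lookup touches every entry, which is what
// iterating over RDD[MatrixEntry] amounts to.
def apply(x: Long, y: Long): Option[Double] =
  entries.collectFirst { case Entry(`x`, `y`, v) => v }

println(apply(1L, 1L)) // Some(2.0)
```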
Answer 1:
The short answer is: you don't. RDDs (and CoordinateMatrix is more or less a wrapper around an RDD[MatrixEntry]) are not well suited for random access. Moreover, RDDs are immutable, so you cannot simply modify a single entry. If that is your requirement, you are probably looking at the wrong technology.
There is some limited support for random access if you use a PairRDD. If such an RDD is partitioned, you can use the lookup method to efficiently retrieve a single value:
import org.apache.spark.HashPartitioner
import org.apache.spark.mllib.linalg.distributed.MatrixEntry

val n = ??? // Number of partitions

// mat is the CoordinateMatrix; key each value by its (row, column)
// indices and hash-partition so lookup only probes one partition.
val pairs = mat.entries
  .map { case MatrixEntry(i, j, v) => ((i, j), v) }
  .partitionBy(new HashPartitioner(n))

pairs.lookup((1L, 1L)) // MatrixEntry indices are Long
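To see why partitioning helps, here is a minimal local sketch (no Spark; the data and helper names are illustrative) of the idea behind partitionBy plus lookup: the key's hash selects a single partition, so a lookup probes one bucket instead of scanning every entry.

```scala
// Local sketch of hash-partitioned lookup (no Spark).
val n = 4 // number of partitions

val data = Seq(((0L, 0L), 1.0), ((1L, 1L), 2.0), ((2L, 0L), 3.0))

// Same idea as HashPartitioner: key's hash modulo partition count.
def partitionOf(key: (Long, Long)): Int =
  ((key.hashCode % n) + n) % n

// Group the data into n partition-local maps.
val partitions: Vector[Map[(Long, Long), Double]] =
  data.groupBy { case (k, _) => partitionOf(k) }
    .foldLeft(Vector.fill(n)(Map.empty[(Long, Long), Double])) {
      case (acc, (p, kvs)) => acc.updated(p, kvs.toMap)
    }

// Lookup touches only the partition the key hashes to, which is
// what lookup on a hash-partitioned PairRDD does.
def lookup(key: (Long, Long)): Option[Double] =
  partitions(partitionOf(key)).get(key)

println(lookup((1L, 1L))) // Some(2.0)
```

The design point is the same as in the Spark snippet above: once the partitioner is known, the driver can route a lookup to exactly one partition rather than launching a full scan.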
Source: https://stackoverflow.com/questions/31618748/how-to-access-coordinatematrix-entries-directly-in-spark