Question
I want to store a big sparse matrix using Spark, so I tried to use CoordinateMatrix, since it is a distributed matrix. However, I have not found a way to access an entry directly, such as:

apply(int x, int y)

I only found functions like:

public RDD<MatrixEntry> entries()

In that case, I have to loop over the entries to find the one I want, which is not efficient.

Has anyone used CoordinateMatrix before? How can I get an entry from a CoordinateMatrix efficiently?
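To make the cost concrete, here is a minimal local sketch (no Spark; the sample entries and the Entry case class are illustrative) of the full-scan lookup the question describes: finding one element of a coordinate-format sparse matrix means scanning every entry.

```scala
// Minimal local sketch (no Spark): looking up one entry in a
// coordinate-format sparse matrix by scanning all entries.
case class Entry(i: Long, j: Long, v: Double)

val entries = Seq(Entry(0L, 0L, 1.0), Entry(1L, 1L, 2.0), Entry(2L, 0L, 3.0))

// O(n) scan: every lookup touches every entry, which is what
// iterating over RDD[MatrixEntry] amounts to.
def apply(x: Long, y: Long): Option[Double] =
  entries.collectFirst { case Entry(`x`, `y`, v) => v }

println(apply(1L, 1L)) // Some(2.0)
```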
Answer 1:
The short answer is: you don't. RDDs (and CoordinateMatrix is more or less a wrapper around an RDD[MatrixEntry]) are not well suited for random access. Moreover, RDDs are immutable, so you cannot simply modify a single entry. If that is your requirement, you are probably looking at the wrong technology.
There is some limited support for random access if you use a PairRDD. If such an RDD is partitioned, you can use the lookup method to efficiently retrieve a single value:
import org.apache.spark.HashPartitioner
import org.apache.spark.mllib.linalg.distributed.MatrixEntry

val n = ??? // Number of partitions

// mat is the CoordinateMatrix; key each value by its (row, column)
// indices and hash-partition so lookup only probes one partition.
val pairs = mat.entries
  .map { case MatrixEntry(i, j, v) => ((i, j), v) }
  .partitionBy(new HashPartitioner(n))

pairs.lookup((1L, 1L)) // MatrixEntry indices are Long
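To see why partitioning helps, here is a minimal local sketch (no Spark; the data and helper names are illustrative) of the idea behind partitionBy plus lookup: the key's hash selects a single partition, so a lookup probes one bucket instead of scanning every entry.

```scala
// Local sketch of hash-partitioned lookup (no Spark).
val n = 4 // number of partitions

val data = Seq(((0L, 0L), 1.0), ((1L, 1L), 2.0), ((2L, 0L), 3.0))

// Same idea as HashPartitioner: key's hash modulo partition count.
def partitionOf(key: (Long, Long)): Int =
  ((key.hashCode % n) + n) % n

// Group the data into n partition-local maps.
val partitions: Vector[Map[(Long, Long), Double]] =
  data.groupBy { case (k, _) => partitionOf(k) }
    .foldLeft(Vector.fill(n)(Map.empty[(Long, Long), Double])) {
      case (acc, (p, kvs)) => acc.updated(p, kvs.toMap)
    }

// Lookup touches only the partition the key hashes to, which is
// what lookup on a hash-partitioned PairRDD does.
def lookup(key: (Long, Long)): Option[Double] =
  partitions(partitionOf(key)).get(key)

println(lookup((1L, 1L))) // Some(2.0)
```

The design point is the same as in the Spark snippet above: once the partitioner is known, the driver can route a lookup to exactly one partition rather than launching a full scan.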
Source: https://stackoverflow.com/questions/31618748/how-to-access-coordinatematrix-entries-directly-in-spark