Question
I want to additively populate a matrix with values inferred from an RDD,
using a PySpark accumulator; I found the docs a bit unclear. Adding a bit of background, just in case it's relevant.
My rddData
contains lists of indices; for each list, one count has to be added to the matrix for every pair of indices drawn from it. For example, the list [1, 3, 4] maps to the index pairs (1,1), (1,3), (1,4), (3,3), (3,4), (4,4).
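(For illustration only, not from the original post: the pair expansion quoted above is what itertools.combinations_with_replacement produces over the index list.)

from itertools import combinations_with_replacement

lIndices = [1, 3, 4]
# Every unordered index pair, with repetition, drawn from the list
print(list(combinations_with_replacement(lIndices, 2)))
# [(1, 1), (1, 3), (1, 4), (3, 3), (3, 4), (4, 4)]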
Now, here is my accumulator:
from pyspark.accumulators import AccumulatorParam
import numpy as np

class MatrixAccumulatorParam(AccumulatorParam):
    def zero(self, mInitial):
        # Fresh all-zeros matrix with the same shape as the initial value
        aaZeros = np.zeros(mInitial.shape)
        return aaZeros

    def addInPlace(self, mAdd, lIndex):
        # Increment the cell addressed by the [row, column] pair
        mAdd[lIndex[0], lIndex[1]] += 1
        return mAdd
So this is my mapper function:
def populate_sparse(lIndices):
    for i1 in lIndices:
        for i2 in lIndices:
            oAccumilatorMatrix.add([i1, i2])
And then run the data:
oAccumilatorMatrix = oSc.accumulator(aaZeros, MatrixAccumulatorParam())
rddData.map(populate_sparse).collect()
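(A side note, not from the original post: since populate_sparse is called only for its side effect on the accumulator, rdd.foreach is the more usual trigger than map(...).collect(), which also builds a list of None results on the driver.)

# Equivalent trigger, assuming rddData and populate_sparse as defined above
rddData.foreach(populate_sparse)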
Now, when I look at my data:
sum(sum(oAccumilatorMatrix.value))
#= 0.0
Which it shouldn't be. What am I missing?
EDIT: I tried this with a sparse matrix at first and got the traceback below saying that sparse matrices are not supported, so I changed the question to use a dense numpy matrix:
...
raise IndexError("Indexing with sparse matrices is not supported"
IndexError: Indexing with sparse matrices is not supported except boolean indexing where matrix and index are equal shapes.
Answer 1:
Aha! I think I got it. At the end of the day, the accumulator still needs to add its own pieces (the partial copies built on each worker) to itself. So, change addInPlace
to:
def addInPlace(self, mAdd, lIndex):
    if isinstance(lIndex, list):
        # Given an index pair: increment that single cell
        mAdd[lIndex[0], lIndex[1]] += 1
    else:
        # Given another matrix (accumulator merge): add it element-wise
        mAdd += lIndex
    return mAdd
So now it increments a cell when it is given an index pair (a list), and adds whole partial matrices to itself when Spark merges the accumulator copies after the populate_sparse
loop, producing my final matrix.
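Putting it together, here is a minimal end-to-end sketch. It uses the fixed addInPlace from this answer; the context name oSc, the matrix size, and the toy rddData are assumptions made for illustration only.

import numpy as np
from pyspark import SparkContext
from pyspark.accumulators import AccumulatorParam

class MatrixAccumulatorParam(AccumulatorParam):
    def zero(self, mInitial):
        # Fresh all-zeros matrix with the same shape as the initial value
        return np.zeros(mInitial.shape)

    def addInPlace(self, mAdd, lIndex):
        if isinstance(lIndex, list):
            # Index pair: increment one cell
            mAdd[lIndex[0], lIndex[1]] += 1
        else:
            # Whole matrix: merge a worker's partial result element-wise
            mAdd += lIndex
        return mAdd

oSc = SparkContext(appName="matrix-accumulator")   # assumed context name
nSize = 5                                           # assumed matrix size
oAccumilatorMatrix = oSc.accumulator(np.zeros((nSize, nSize)),
                                     MatrixAccumulatorParam())

def populate_sparse(lIndices):
    for i1 in lIndices:
        for i2 in lIndices:
            oAccumilatorMatrix.add([i1, i2])

rddData = oSc.parallelize([[1, 3, 4], [0, 2]])      # toy data, assumed
rddData.foreach(populate_sparse)

print(oAccumilatorMatrix.value.sum())               # now non-zero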
Source: https://stackoverflow.com/questions/36196648/pyspark-matrix-accumulator