pyspark matrix accumulator

[亡魂溺海] posted on 2019-12-11 02:22:35

Question


I want to additively populate a matrix with values inferred from an RDD using a PySpark accumulator; I found the docs a bit unclear. Some background, in case it's relevant.
My rddData contains lists of indices, and each pair of indices taken from a list should add one count to the corresponding matrix cell. For example, this list maps to these cells:
[1,3,4] -> (1,1), (1,3), (1,4), (3,3), (3,4), (4,4)
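To make the mapping concrete, here is a small sketch in plain Python (no Spark needed). The names indices, upper_pairs and all_pairs are only illustrative. It shows the six cells listed above, and also that a nested loop over the same list, like the one in the mapper further down, visits all nine ordered pairs rather than six.

```python
from itertools import combinations_with_replacement

indices = [1, 3, 4]

# Unordered pairs (with repetition): the six cells listed in the example.
upper_pairs = list(combinations_with_replacement(indices, 2))

# Ordered pairs: what a nested "for i1 ... for i2 ..." loop over the list visits.
all_pairs = [(i1, i2) for i1 in indices for i2 in indices]

print(upper_pairs)
print(len(all_pairs))
```

So whether the matrix ends up upper-triangular or symmetric depends on which of the two pairings the mapper uses.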

Now, here is my accumulator:

from pyspark.accumulators import AccumulatorParam
class MatrixAccumulatorParam(AccumulatorParam):
    def zero(self, mInitial):
        import numpy as np
        aaZeros = np.zeros(mInitial.shape)
        return aaZeros

    def addInPlace(self, mAdd, lIndex):
        mAdd[lIndex[0], lIndex[1]] += 1
        return mAdd

So this is my mapper function:

def populate_sparse(lIndices):
    for i1 in lIndices:
        for i2 in lIndices:
            oAccumilatorMatrix.add([i1, i2])

And then I run it on the data:

oAccumilatorMatrix = oSc.accumulator(aaZeros, MatrixAccumulatorParam())

rddData.map(populate_sparse).collect()

Now, when I look at my data:

sum(sum(oAccumilatorMatrix.value))
#= 0.0

Which it shouldn't be. What am I missing?

EDIT: I first tried this with a sparse matrix and got the traceback below, which says that sparse matrices are not supported. I have changed the question to use a dense numpy matrix:

...

    raise IndexError("Indexing with sparse matrices is not supported"
IndexError: Indexing with sparse matrices is not supported except boolean indexing where matrix and index are equal shapes.

Answer 1:


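Aha! I think I got it. Besides handling each index pair passed to add(), addInPlace is also what combines the accumulator's partial matrices into the final value, so it has to accept a whole matrix as its second argument too. So, change addInPlace to: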

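def addInPlace(self, mAdd, lIndex):
    if isinstance(lIndex, list):
        # An [i1, i2] index pair coming from add(): bump a single cell.
        mAdd[lIndex[0], lIndex[1]] += 1
    else:
        # A partial matrix being merged in: add it element-wise.
        mAdd += lIndex
    return mAdd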

So now it increments a single cell when it is given an index list, and adds whole matrices together when the partial results from the populate_sparse calls are combined into my final matrix.
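To illustrate why both branches are needed, here is a rough simulation of the two ways addInPlace gets called. It is only a sketch under some assumptions, not Spark's actual code: nested lists stand in for the numpy matrix, index pairs are tuples so they can be told apart from a matrix, and zero, add_in_place and simulate_job are hypothetical helpers written for this example.

```python
def zero(shape):
    """Return an all-zero matrix (nested lists) of the given (rows, cols) shape."""
    rows, cols = shape
    return [[0] * cols for _ in range(rows)]


def add_in_place(acc, term):
    """Mimic the fixed addInPlace: accept an index pair or a whole matrix."""
    if isinstance(term, tuple):
        # Index pair from an add() call: increment one cell.
        row, col = term
        acc[row][col] += 1
    else:
        # Partial matrix from another task: add it element-wise.
        for r, term_row in enumerate(term):
            for c, value in enumerate(term_row):
                acc[r][c] += value
    return acc


def simulate_job(partitions, shape):
    """Hypothetical stand-in for a job: one partial matrix per partition."""
    driver_value = zero(shape)
    for partition in partitions:
        task_value = zero(shape)  # each task starts from the zero value
        for indices in partition:
            for i1 in indices:
                for i2 in indices:
                    task_value = add_in_place(task_value, (i1, i2))
        # The merge step: the second argument is now a matrix, not an index pair.
        driver_value = add_in_place(driver_value, task_value)
    return driver_value


result = simulate_job([[[1, 3, 4]], [[0, 1]]], shape=(5, 5))
total = sum(sum(row) for row in result)
print(total)
```

Without the matrix branch, the merge step would receive a matrix where it expects an index pair, which is the case the original addInPlace did not handle.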



Source: https://stackoverflow.com/questions/36196648/pyspark-matrix-accumulator
