Question
I want to additively populate a matrix with values inferred from an RDD,
using a PySpark accumulator; I found the docs a bit unclear. Adding a bit of background, just in case it's relevant.
My rddData
contains lists of indices; for each list, one count has to be added to the matrix for every pair of indices drawn from it. For example, the list [1, 3, 4] maps to the index pairs (1,1), (1,3), (1,4), (3,3), (3,4), (4,4).
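(For illustration only, not from the original post: the pair expansion quoted above is what itertools.combinations_with_replacement produces over the index list.)

from itertools import combinations_with_replacement

lIndices = [1, 3, 4]
# Every unordered index pair, with repetition, drawn from the list
print(list(combinations_with_replacement(lIndices, 2)))
# [(1, 1), (1, 3), (1, 4), (3, 3), (3, 4), (4, 4)]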
Now, here is my accumulator:
from pyspark.accumulators import AccumulatorParam
import numpy as np

class MatrixAccumulatorParam(AccumulatorParam):
    def zero(self, mInitial):
        # Fresh all-zeros matrix with the same shape as the initial value
        aaZeros = np.zeros(mInitial.shape)
        return aaZeros

    def addInPlace(self, mAdd, lIndex):
        # Increment the cell addressed by the [row, column] pair
        mAdd[lIndex[0], lIndex[1]] += 1
        return mAdd
So this is my mapper function:
def populate_sparse(lIndices):
    for i1 in lIndices:
        for i2 in lIndices:
            oAccumilatorMatrix.add([i1, i2])
And then run the data:
oAccumilatorMatrix = oSc.accumulator(aaZeros, MatrixAccumulatorParam())
rddData.map(populate_sparse).collect()
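(A side note, not from the original post: since populate_sparse is called only for its side effect on the accumulator, rdd.foreach is the more usual trigger than map(...).collect(), which also builds a list of None results on the driver.)

# Equivalent trigger, assuming rddData and populate_sparse as defined above
rddData.foreach(populate_sparse)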
Now, when I look at my data:
sum(sum(oAccumilatorMatrix.value))
#= 0.0
Which it shouldn't be. What am I missing?
EDIT: I tried this with a sparse matrix at first and got the traceback below saying that sparse matrices are not supported, so I changed the question to use a dense numpy matrix:
...
raise IndexError("Indexing with sparse matrices is not supported"
IndexError: Indexing with sparse matrices is not supported except boolean indexing where matrix and index are equal shapes.
Answer 1:
Aha! I think I got it. At the end of the day, the accumulator still needs to add its own pieces (the partial copies built on each worker) to itself. So, change addInPlace
to:
def addInPlace(self, mAdd, lIndex):
    if isinstance(lIndex, list):
        # Given an index pair: increment that single cell
        mAdd[lIndex[0], lIndex[1]] += 1
    else:
        # Given another matrix (accumulator merge): add it element-wise
        mAdd += lIndex
    return mAdd
So now it increments a cell when it is given an index pair (a list), and adds whole partial matrices to itself when Spark merges the accumulator copies after the populate_sparse
loop, producing my final matrix.
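Putting it together, here is a minimal end-to-end sketch. It uses the fixed addInPlace from this answer; the context name oSc, the matrix size, and the toy rddData are assumptions made for illustration only.

import numpy as np
from pyspark import SparkContext
from pyspark.accumulators import AccumulatorParam

class MatrixAccumulatorParam(AccumulatorParam):
    def zero(self, mInitial):
        # Fresh all-zeros matrix with the same shape as the initial value
        return np.zeros(mInitial.shape)

    def addInPlace(self, mAdd, lIndex):
        if isinstance(lIndex, list):
            # Index pair: increment one cell
            mAdd[lIndex[0], lIndex[1]] += 1
        else:
            # Whole matrix: merge a worker's partial result element-wise
            mAdd += lIndex
        return mAdd

oSc = SparkContext(appName="matrix-accumulator")   # assumed context name
nSize = 5                                           # assumed matrix size
oAccumilatorMatrix = oSc.accumulator(np.zeros((nSize, nSize)),
                                     MatrixAccumulatorParam())

def populate_sparse(lIndices):
    for i1 in lIndices:
        for i2 in lIndices:
            oAccumilatorMatrix.add([i1, i2])

rddData = oSc.parallelize([[1, 3, 4], [0, 2]])      # toy data, assumed
rddData.foreach(populate_sparse)

print(oAccumilatorMatrix.value.sum())               # now non-zero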
Source: https://stackoverflow.com/questions/36196648/pyspark-matrix-accumulator