Multiply SparseVectors element-wise

Submitted by 不想你离开。 on 2021-02-08 08:15:17

Question


I have two RDDs and I want to multiply them element-wise.

Let's say I have the following RDDs (example):

a = ((1,[0.28,1,0.55]),(2,[0.28,1,0.55]),(3,[0.28,1,0.55]))
aRDD = sc.parallelize(a)
b = ((1,[0.28,0,0]),(2,[0,0,0]),(3,[0,1,0]))
bRDD = sc.parallelize(b)

As you can see, b is sparse, and I want to avoid multiplying a zero value by another value. I am doing the following:

from pyspark.mllib.linalg import Vectors
def create_sparce_matrix(a_list):
    length = len(a_list)
    index = [i for i ,e in enumerate(a_list) if e !=0]
    value = [e for i ,e in enumerate(a_list) if e !=0]
    sv1 = Vectors.sparse(length,index,value)
    return sv1


brdd = bRDD.map(lambda kv: (kv[0], create_sparce_matrix(kv[1])))

And the multiplication:

combinedRDD = aRDD + brdd
result = combinedRDD.reduceByKey(lambda a,b:[c*d for c,d in zip(a,b)])

It seems that I can't multiply a sparse vector with a list in an RDD. Is there a way to do it? Or is there another efficient way to multiply element-wise when one of the two RDDs has a lot of zero values?


Answer 1:


One way you can handle this is to convert aRDD to RDD[DenseVector]:

from pyspark.mllib.linalg import SparseVector, DenseVector, Vectors

aRDD = sc.parallelize(a).mapValues(DenseVector)
bRDD = sc.parallelize(b).mapValues(create_sparce_matrix)

and use basic NumPy operations:

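def mul(x, y):
    # x is dense, y is sparse; both must have the same length.
    assert isinstance(x, DenseVector)
    assert isinstance(y, SparseVector)
    assert x.size == y.size
    # Multiply only at y's non-zero positions; every other entry stays zero.
    return SparseVector(y.size, y.indices, x[y.indices] * y.values)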

aRDD.join(bRDD).mapValues(lambda xy: mul(*xy))
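
To see what this computes without a Spark cluster, here is a minimal pure-Python sketch of the same idea applied to the example data from the question. It is an illustration only, not the PySpark API: a sparse vector is modelled as a plain (size, indices, values) tuple, and the names to_sparse and mul_sparse are hypothetical helpers invented for this sketch.

```python
# Spark-free sketch. A sparse vector is modelled as (size, indices, values),
# mirroring the arguments passed to Vectors.sparse. Helper names are hypothetical.

def to_sparse(dense):
    """Return (size, indices, values), keeping only the non-zero entries."""
    pairs = [(i, v) for i, v in enumerate(dense) if v != 0]
    return len(dense), [i for i, _ in pairs], [v for _, v in pairs]


def mul_sparse(dense, sparse):
    """Element-wise product that visits only the sparse vector's stored indices."""
    size, indices, values = sparse
    assert len(dense) == size
    return size, indices, [dense[i] * v for i, v in zip(indices, values)]


# The example data from the question, keyed by id.
a = {1: [0.28, 1, 0.55], 2: [0.28, 1, 0.55], 3: [0.28, 1, 0.55]}
b = {1: [0.28, 0, 0], 2: [0, 0, 0], 3: [0, 1, 0]}

# Plays the role of join + mapValues: pair the vectors by shared key, then multiply.
result = {k: mul_sparse(a[k], to_sparse(b[k])) for k in a.keys() & b.keys()}

for key in sorted(result):
    print(key, result[key])
```

For key 2, where b holds only zeros, no multiplication is performed at all and the result carries empty index and value lists, which is the saving the question asks for.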


Source: https://stackoverflow.com/questions/35363542/multiply-sparsevectors-element-wise
