Spark Matrix multiplication with python

我怕爱的太早我们不能终老 提交于 2019-12-09 00:13:25

问题


I am trying to do matrix multiplication using Apache Spark and Python.

Here is my data

from pyspark.mllib.linalg.distributed import RowMatrix

My RDD of vectors

rows_1 = sc.parallelize([[1, 2], [4, 5], [7, 8]])
rows_2 = sc.parallelize([[1, 2], [4, 5]])

My maxtrix

mat1 = RowMatrix(rows_1)
mat2 = RowMatrix(rows_2)

I would like to do something like this:

mat = mat1 * mat2

I wrote a function to process the matrix multiplication but I'm afraid to have a long processing time. Here is my function:

def matrix_multiply(df1, df2):
    nb_row = df1.count()    
    mat=[]
    for i in range(0, nb_row):
        row=list(df1.filter(df1['index']==i).take(1)[0])
        row_out = []
        for r in range(0, len(row)):
            r_value = 0
            col = df2.select(df2[list_col[r]]).collect()
            col = [list(c)[0] for c in col]
            for c in range(0, len(col)): 
                r_value += row[c] * col[c]
            row_out.append(r_value)            
        mat.append(row_out)
    return mat 

My function make a lot of spark actions (take, collect, etc.). Does the function will take a lot of processing time? If someone have another idea it will be helpful for me.


回答1:


You cannot. Since RowMatrix has no meaningful row indices it cannot be used for multiplications. Even ignoring that the only distributed matrix which supports multiplication with another distributed structure is BlockMatrix.

from pyspark.mllib.linalg.distributed import *

def as_block_matrix(rdd, rowsPerBlock=1024, colsPerBlock=1024):
    return IndexedRowMatrix(
        rdd.zipWithIndex().map(lambda xi: IndexedRow(xi[1], xi[0]))
    ).toBlockMatrix(rowsPerBlock, colsPerBlock)

as_block_matrix(rows_1).multiply(as_block_matrix(rows_2))


来源:https://stackoverflow.com/questions/37766213/spark-matrix-multiplication-with-python

易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!