I am trying to do matrix multiplication using Apache Spark and Python.
Here is my data
from pyspark.mllib.linalg.distributed import RowMatrix
You cannot. Since RowMatrix has no meaningful row indices, it cannot be used for multiplication. Even ignoring that, the only distributed matrix that supports multiplication with another distributed structure is BlockMatrix.
from pyspark.mllib.linalg.distributed import IndexedRow, IndexedRowMatrix

def as_block_matrix(rdd, rowsPerBlock=1024, colsPerBlock=1024):
    # Attach an explicit row index to each vector, then convert to a BlockMatrix.
    return IndexedRowMatrix(
        rdd.zipWithIndex().map(lambda xi: IndexedRow(xi[1], xi[0]))
    ).toBlockMatrix(rowsPerBlock, colsPerBlock)

as_block_matrix(rows_1).multiply(as_block_matrix(rows_2))
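To see why explicit indices and a block layout make the product well defined, here is a small local sketch of block-wise multiplication in plain Python. It is not Spark's implementation, only an illustration of the idea; the helper names (to_blocks, block_multiply, from_blocks) and the sample matrices are made up for this example. Note that the column block size of the left matrix has to match the row block size of the right one, which the default 1024 x 1024 blocks above satisfy.

```python
def to_blocks(rows, rows_per_block, cols_per_block):
    """Split a list-of-lists matrix into {(block_row, block_col): sub-matrix}."""
    blocks = {}
    n_rows, n_cols = len(rows), len(rows[0])
    for i in range(0, n_rows, rows_per_block):
        for j in range(0, n_cols, cols_per_block):
            blocks[(i // rows_per_block, j // cols_per_block)] = [
                row[j:j + cols_per_block] for row in rows[i:i + rows_per_block]
            ]
    return blocks


def local_multiply(a, b):
    """Dense product of two small list-of-lists matrices."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def local_add(a, b):
    """Element-wise sum of two equally shaped matrices."""
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


def block_multiply(left, right):
    """C[i, j] = sum over k of A[i, k] * B[k, j], computed block by block."""
    result = {}
    for (i, k), a_block in left.items():
        for (k2, j), b_block in right.items():
            if k != k2:
                continue  # only blocks sharing the inner index contribute
            product = local_multiply(a_block, b_block)
            if (i, j) in result:
                result[(i, j)] = local_add(result[(i, j)], product)
            else:
                result[(i, j)] = product
    return result


def from_blocks(blocks):
    """Reassemble a block dictionary into a list-of-lists matrix."""
    n_block_rows = max(i for i, _ in blocks) + 1
    n_block_cols = max(j for _, j in blocks) + 1
    rows = []
    for i in range(n_block_rows):
        for r in range(len(blocks[(i, 0)])):
            rows.append([v for j in range(n_block_cols) for v in blocks[(i, j)][r]])
    return rows


# Hypothetical sample data: a is 3 x 3, b is 3 x 2.
a = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
b = [[1, 0], [0, 1], [2, 2]]

# Left column block size (2) equals right row block size (2).
c = from_blocks(block_multiply(to_blocks(a, 2, 2), to_blocks(b, 2, 1)))
print(c)
```

Each output block depends only on the blocks that share an inner index, which is what lets the work be spread across partitions; without stable row indices there would be no way to decide which blocks line up.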