Matrix Transpose on RowMatrix in Spark

梦谈多话 2020-12-16 15:20

Suppose I have a RowMatrix.

  1. How can I transpose it? The API documentation does not seem to have a transpose method.
  2. The Matrix has the transpose() me
6 answers
  •  南方客 (original poster)
    2020-12-16 16:00

    For a very large, sparse matrix (such as one produced by text feature extraction), the best and easiest way is:

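    import org.apache.spark.mllib.linalg.distributed.{IndexedRow, IndexedRowMatrix, RowMatrix}

    def transposeRowMatrix(m: RowMatrix): RowMatrix = {
      // Attach a row index to every row so that row positions survive the conversions.
      val indexedRM = new IndexedRowMatrix(m.rows.zipWithIndex.map {
        case (row, idx) => IndexedRow(idx, row)
      })
      // Transpose in coordinate (i, j, value) form, which suits sparse data.
      val transposed = indexedRM.toCoordinateMatrix().transpose().toIndexedRowMatrix()
      // Sort by row index before dropping it; otherwise the row order is not guaranteed.
      new RowMatrix(transposed.rows
        .map(idxRow => (idxRow.index, idxRow.vector))
        .sortByKey()
        .map(_._2))
    }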
    

    For a matrix that is not so sparse, you can use BlockMatrix as the bridge, as mentioned in aletapool's answer above.

    However, aletapool's answer misses a very important point: when you go RowMatrix -> IndexedRowMatrix -> BlockMatrix -> transpose -> BlockMatrix -> IndexedRowMatrix -> RowMatrix, you have to sort in the last step (IndexedRowMatrix -> RowMatrix). By default, converting an IndexedRowMatrix to a RowMatrix simply drops the index, so the row order is not preserved.
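
    The BlockMatrix route with the final sort might look like the sketch below. It is not aletapool's original code: the function name transposeViaBlockMatrix is hypothetical, and the default block size of toBlockMatrix() is assumed to be acceptable for your data.

    ```scala
    import org.apache.spark.mllib.linalg.distributed.{IndexedRow, IndexedRowMatrix, RowMatrix}

    // Hypothetical helper: transpose a RowMatrix using BlockMatrix as the bridge.
    def transposeViaBlockMatrix(m: RowMatrix): RowMatrix = {
      // RowMatrix -> IndexedRowMatrix: tag each row with its position.
      val indexed = new IndexedRowMatrix(
        m.rows.zipWithIndex.map { case (row, idx) => IndexedRow(idx, row) })

      // IndexedRowMatrix -> BlockMatrix -> transpose -> IndexedRowMatrix.
      val transposed = indexed.toBlockMatrix().transpose.toIndexedRowMatrix()

      // IndexedRowMatrix -> RowMatrix: sort by index first, because the
      // index is discarded once the rows are wrapped in a RowMatrix.
      new RowMatrix(
        transposed.rows
          .map(r => (r.index, r.vector))
          .sortByKey()
          .map(_._2))
    }
    ```

    On the same input this should return the same rows as transposeRowMatrix; only the intermediate representation differs.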

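    // MllibVectors is assumed to be an import alias for MLlib's Vectors factory.
    import org.apache.spark.mllib.linalg.{Vectors => MllibVectors}
    import org.apache.spark.mllib.linalg.distributed.RowMatrix

    // A 4 x 5 test matrix mixing sparse and dense rows.
    val data = Array(
      MllibVectors.sparse(5, Seq((1, 1.0), (3, 7.0))),
      MllibVectors.dense(2.0, 0.0, 3.0, 4.0, 5.0),
      MllibVectors.dense(4.0, 0.0, 0.0, 6.0, 7.0),
      MllibVectors.sparse(5, Seq((2, 2.0), (3, 7.0))))

    // sc is the SparkContext provided by spark-shell.
    val dataRDD = sc.parallelize(data, 4)

    val testMat: RowMatrix = new RowMatrix(dataRDD)
    // Print the original rows in dense form.
    testMat.rows.collect().map(_.toDense).foreach(println)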
    
    [0.0,1.0,0.0,7.0,0.0]
    [2.0,0.0,3.0,4.0,5.0]
    [4.0,0.0,0.0,6.0,7.0]
    [0.0,0.0,2.0,7.0,0.0]
    
    transposeRowMatrix(testMat).
      rows.collect().map(_.toDense).foreach(println)
    
    [0.0,2.0,4.0,0.0]
    [1.0,0.0,0.0,0.0]
    [0.0,3.0,0.0,2.0]
    [7.0,4.0,6.0,7.0]
    [0.0,5.0,7.0,0.0]
    
