Matrix Transpose on RowMatrix in Spark

梦谈多话 · 2020-12-16 15:20

Suppose I have a RowMatrix.

  1. How can I transpose it? The API documentation does not seem to have a transpose method.
  2. The local Matrix class has a transpose() method, but it is not distributed, so how can I transpose a matrix that is too large to fit in memory?
6 Answers
  • 2020-12-16 16:00

    For a very large and sparse matrix (like the one you get from text feature extraction), the best and easiest way is:

    import org.apache.spark.mllib.linalg.distributed.{IndexedRow, IndexedRowMatrix, RowMatrix}

    def transposeRowMatrix(m: RowMatrix): RowMatrix = {
      // Attach row indices, transpose via CoordinateMatrix,
      // then sort to restore a deterministic row order.
      val indexedRM = new IndexedRowMatrix(m.rows.zipWithIndex.map {
        case (row, idx) => IndexedRow(idx, row)
      })
      val transposed = indexedRM.toCoordinateMatrix().transpose.toIndexedRowMatrix()
      new RowMatrix(transposed.rows
        .map(idxRow => (idxRow.index, idxRow.vector))
        .sortByKey()
        .map(_._2))
    }
    

    For a less sparse matrix, you can use BlockMatrix as the bridge, as mentioned in aletapool's answer below.

    However, aletapool's answer misses a very important point: when you go RowMatrix -> IndexedRowMatrix -> BlockMatrix -> transpose -> BlockMatrix -> IndexedRowMatrix -> RowMatrix, the last step (IndexedRowMatrix -> RowMatrix) requires a sort. By default, converting from IndexedRowMatrix to RowMatrix simply drops the index, so the row order gets scrambled. A sketch of that route follows.
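
    A minimal sketch of the BlockMatrix route with the final sort, assuming an existing RowMatrix `m` (names are illustrative):

    import org.apache.spark.mllib.linalg.distributed.{IndexedRow, IndexedRowMatrix, RowMatrix}

    def transposeViaBlockMatrix(m: RowMatrix): RowMatrix = {
      val indexed = new IndexedRowMatrix(m.rows.zipWithIndex.map {
        case (row, idx) => IndexedRow(idx, row)
      })
      val transposed = indexed.toBlockMatrix().transpose.toIndexedRowMatrix()
      // The sort is the crucial step: toRowMatrix alone would drop
      // the indices and leave the rows in an arbitrary order.
      new RowMatrix(transposed.rows
        .map(r => (r.index, r.vector))
        .sortByKey()
        .map(_._2))
    }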

    Here is the first function in action (MllibVectors is an alias for org.apache.spark.mllib.linalg.Vectors):

    import org.apache.spark.mllib.linalg.{Vectors => MllibVectors}

    val data = Array(
      MllibVectors.sparse(5, Seq((1, 1.0), (3, 7.0))),
      MllibVectors.dense(2.0, 0.0, 3.0, 4.0, 5.0),
      MllibVectors.dense(4.0, 0.0, 0.0, 6.0, 7.0),
      MllibVectors.sparse(5, Seq((2, 2.0), (3, 7.0))))
    
    val dataRDD = sc.parallelize(data, 4)
    
    val testMat: RowMatrix = new RowMatrix(dataRDD)
    testMat.rows.collect().map(_.toDense).foreach(println)
    
    [0.0,1.0,0.0,7.0,0.0]
    [2.0,0.0,3.0,4.0,5.0]
    [4.0,0.0,0.0,6.0,7.0]
    [0.0,0.0,2.0,7.0,0.0]
    
    transposeRowMatrix(testMat).
      rows.collect().map(_.toDense).foreach(println)
    
    [0.0,2.0,4.0,0.0]
    [1.0,0.0,0.0,0.0]
    [0.0,3.0,0.0,2.0]
    [7.0,4.0,6.0,7.0]
    [0.0,5.0,7.0,0.0]
    
  • 2020-12-16 16:05

    In case anybody is interested, I've implemented the distributed version that @javadba proposed (see the last answer):

      import org.apache.spark.mllib.linalg.{Vector, Vectors}
      import org.apache.spark.mllib.linalg.distributed.RowMatrix

      def transposeRowMatrix(m: RowMatrix): RowMatrix = {
        val transposedRowsRDD = m.rows.zipWithIndex.map { case (row, rowIndex) => rowToTransposedTriplet(row, rowIndex) }
          .flatMap(x => x) // now we have triplets (newRowIndex, (newColIndex, value))
          .groupByKey
          .sortByKey().map(_._2) // sort rows and remove row indexes
          .map(buildRow) // restore the order of elements in each row and remove column indexes
        new RowMatrix(transposedRowsRDD)
      }
    
    
      def rowToTransposedTriplet(row: Vector, rowIndex: Long): Array[(Long, (Long, Double))] = {
        val indexedRow = row.toArray.zipWithIndex
        indexedRow.map{case (value, colIndex) => (colIndex.toLong, (rowIndex, value))}
      }
    
      def buildRow(rowWithIndexes: Iterable[(Long, Double)]): Vector = {
        val resArr = new Array[Double](rowWithIndexes.size)
        rowWithIndexes.foreach{case (index, value) =>
            resArr(index.toInt) = value
        }
        Vectors.dense(resArr)
      } 
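
    A hypothetical usage, reusing the 4x5 `testMat` built in the first answer:

      val transposed = transposeRowMatrix(testMat)
      transposed.rows.collect().map(_.toDense).foreach(println) // prints the 5x4 transpose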
    
  • 2020-12-16 16:11

    You can use BlockMatrix, which can be created from an IndexedRowMatrix:

    BlockMatrix matA = new IndexedRowMatrix(...).toBlockMatrix().cache();
    matA.validate();
    
    BlockMatrix matB = matA.transpose();
    

    Then it can easily be converted back to an IndexedRowMatrix. This is described in the Spark documentation.
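
    For reference, here is the same round trip sketched in Scala, assuming an existing IndexedRowMatrix `indexedMat` (the name is illustrative):

    val matA = indexedMat.toBlockMatrix().cache()
    matA.validate()
    val matB = matA.transpose()
    val backToIndexed = matB.toIndexedRowMatrix()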

  • 2020-12-16 16:11

    This is a variant of the previous solution, but it works for a sparse row matrix and keeps the transposed rows sparse when appropriate:

      import org.apache.spark.mllib.linalg.{DenseVector, SparseVector, Vectors}
      import org.apache.spark.mllib.linalg.distributed.RowMatrix

      def transpose(X: RowMatrix): RowMatrix = {
        val m = X.numRows().toInt
        val n = X.numCols().toInt

        // Emit (rowIndex, colIndex, value) triplets, skipping zeros in sparse rows.
        val triplets = X.rows.zipWithIndex.flatMap {
          case (sp: SparseVector, i) => sp.indices.zip(sp.values).map { case (j, value) => (i, j, value) }
          case (dv: DenseVector, i)  => Range(0, n).zip(dv.values).map { case (j, value) => (i, j, value) }
        }

        val transposed = triplets.groupBy(_._2).map { case (j, g) =>
          // Sort within each group: groupBy does not preserve element order,
          // and SparseVector requires indices in ascending order.
          val (indices, values) = g.toSeq.sortBy(_._1).map { case (i, _, value) => (i.toInt, value) }.unzip
          if (indices.size == m) {
            (j, Vectors.dense(values.toArray))
          } else {
            (j, Vectors.sparse(m, indices.toArray, values.toArray))
          }
        }.sortBy(_._1).map(_._2)

        new RowMatrix(transposed)
      }
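
    A hypothetical usage on the sparse `testMat` from the first answer, to check that mostly-zero rows come back as sparse vectors:

      val tX = transpose(testMat)
      tX.rows.collect().foreach(println) // mostly-zero rows print as SparseVector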
    

    Hope this helps!

  • 2020-12-16 16:14

    Getting the transpose of a RowMatrix in Java (note that this collects the whole matrix to the driver, so it only works when the matrix fits in driver memory):

    public static RowMatrix transposeRM(JavaSparkContext jsc, RowMatrix mat) {
        // Collect every row to the driver and transpose into a local 2-D array.
        List<Vector> vs = mat.rows().toJavaRDD().collect();
        double[][] tmp = new double[(int) mat.numCols()][(int) mat.numRows()];

        for (int i = 0; i < vs.size(); i++) {
            double[] rr = vs.get(i).toArray();
            for (int j = 0; j < mat.numCols(); j++) {
                tmp[j][i] = rr[j];
            }
        }

        List<Vector> newList = new ArrayList<>();
        for (int i = 0; i < mat.numCols(); i++) {
            newList.add(Vectors.dense(tmp[i]));
        }

        // Re-distribute the transposed rows as a new RowMatrix.
        JavaRDD<Vector> rows2 = jsc.parallelize(newList);
        return new RowMatrix(rows2.rdd());
    }
    
  • 2020-12-16 16:15

    You are correct: there is no

     RowMatrix.transpose()
    

    method. You will need to do this operation manually.

    Here is the non-distributed/local matrix version:

    def transpose(m: Array[Array[Double]]): Array[Array[Double]] = {
      // For each column index c, collect the c-th entry of every row.
      (for {
        c <- m(0).indices
      } yield m.map(_(c))).toArray
    }
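
    For example, a 2x3 matrix comes back as 3x2:

    val mat = Array(Array(1.0, 2.0, 3.0), Array(4.0, 5.0, 6.0))
    transpose(mat).map(_.mkString(" ")).foreach(println)
    // 1.0 4.0
    // 2.0 5.0
    // 3.0 6.0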
    

    The distributed version would be along the following lines:

        origMatRdd.rows.zipWithIndex.flatMap { case (rvect, i) =>
          rvect.toArray.zipWithIndex.map { case (ax, j) => (j, (i, ax)) }
        }.groupByKey
          .sortByKey()
          .map { case (j, colEntries) =>
            val arr = new Array[Double](origMatRdd.numRows().toInt)
            colEntries.foreach { case (i, ax) => arr(i.toInt) = ax }
            new DenseVector(arr)
          }
    

    Caveat: I have not tested the above, so it may still have bugs. But the basic approach is valid, and it is similar to work I did in the past for a small LinAlg library for Spark.
