Suppose I have a RowMatrix.
For very large and sparse matrix, (like the one you get from text feature extraction), the best and easiest way is:
def transposeRowMatrix(m: RowMatrix): RowMatrix = {
val indexedRM = new IndexedRowMatrix(m.rows.zipWithIndex.map({
case (row, idx) => new IndexedRow(idx, row)}))
val transposed = indexedRM.toCoordinateMatrix().transpose.toIndexedRowMatrix()
new RowMatrix(transposed.rows
.map(idxRow => (idxRow.index, idxRow.vector))
.sortByKey().map(_._2))
}
For not so sparse matrix, you can use BlockMatrix as the bridge as mentioned by aletapool's answer above.
However aletapool's answer misses a very important point: When you start from RowMaxtrix -> IndexedRowMatrix -> BlockMatrix -> transpose -> BlockMatrix -> IndexedRowMatrix -> RowMatrix, in the last step (IndexedRowMatrix -> RowMatrix), you have to do a sort. Because by default, converting from IndexedRowMatrix to RowMatrix, the index is simply dropped and the order will be messed up.
val data = Array(
MllibVectors.sparse(5, Seq((1, 1.0), (3, 7.0))),
MllibVectors.dense(2.0, 0.0, 3.0, 4.0, 5.0),
MllibVectors.dense(4.0, 0.0, 0.0, 6.0, 7.0),
MllibVectors.sparse(5, Seq((2, 2.0), (3, 7.0))))
val dataRDD = sc.parallelize(data, 4)
val testMat: RowMatrix = new RowMatrix(dataRDD)
testMat.rows.collect().map(_.toDense).foreach(println)
[0.0,1.0,0.0,7.0,0.0]
[2.0,0.0,3.0,4.0,5.0]
[4.0,0.0,0.0,6.0,7.0]
[0.0,0.0,2.0,7.0,0.0]
transposeRowMatrix(testMat).
rows.collect().map(_.toDense).foreach(println)
[0.0,2.0,4.0,0.0]
[1.0,0.0,0.0,0.0]
[0.0,3.0,0.0,2.0]
[7.0,4.0,6.0,7.0]
[0.0,5.0,7.0,0.0]
If anybody interested, I've implemented the distributed version @javadba had proposed.
def transposeRowMatrix(m: RowMatrix): RowMatrix = {
val transposedRowsRDD = m.rows.zipWithIndex.map{case (row, rowIndex) => rowToTransposedTriplet(row, rowIndex)}
.flatMap(x => x) // now we have triplets (newRowIndex, (newColIndex, value))
.groupByKey
.sortByKey().map(_._2) // sort rows and remove row indexes
.map(buildRow) // restore order of elements in each row and remove column indexes
new RowMatrix(transposedRowsRDD)
}
def rowToTransposedTriplet(row: Vector, rowIndex: Long): Array[(Long, (Long, Double))] = {
val indexedRow = row.toArray.zipWithIndex
indexedRow.map{case (value, colIndex) => (colIndex.toLong, (rowIndex, value))}
}
def buildRow(rowWithIndexes: Iterable[(Long, Double)]): Vector = {
val resArr = new Array[Double](rowWithIndexes.size)
rowWithIndexes.foreach{case (index, value) =>
resArr(index.toInt) = value
}
Vectors.dense(resArr)
}
You can use BlockMatrix, which can be created from an IndexedRowMatrix:
BlockMatrix matA = (new IndexedRowMatrix(...).toBlockMatrix().cache();
matA.validate();
BlockMatrix matB = matA.transpose();
Then, can be easily put back as IndexedRowMatrix. This is described in the spark documentation.
This is a variant of the previous solution but working for sparse row matrix and keeping the transposed sparse too when needed:
def transpose(X: RowMatrix): RowMatrix = {
val m = X.numRows ().toInt
val n = X.numCols ().toInt
val transposed = X.rows.zipWithIndex.flatMap {
case (sp: SparseVector, i: Long) => sp.indices.zip (sp.values).map {case (j, value) => (i, j, value)}
case (dp: DenseVector, i: Long) => Range (0, n).toArray.zip (dp.values).map {case (j, value) => (i, j, value)}
}.sortBy (t => t._1).groupBy (t => t._2).map {case (i, g) =>
val (indices, values) = g.map {case (i, j, value) => (i.toInt, value)}.unzip
if (indices.size == m) {
(i, Vectors.dense (values.toArray) )
} else {
(i, Vectors.sparse (m, indices.toArray, values.toArray))
}
}.sortBy(t => t._1).map (t => t._2)
new RowMatrix (transposed)
}
Hope this help!
Getting the transpose of RowMatrix in Java:
public static RowMatrix transposeRM(JavaSparkContext jsc, RowMatrix mat){
List<Vector> newList=new ArrayList<Vector>();
List<Vector> vs = mat.rows().toJavaRDD().collect();
double [][] tmp=new double[(int)mat.numCols()][(int)mat.numRows()] ;
for(int i=0; i < vs.size(); i++){
double[] rr=vs.get(i).toArray();
for(int j=0; j < mat.numCols(); j++){
tmp[j][i]=rr[j];
}
}
for(int i=0; i < mat.numCols();i++)
newList.add(Vectors.dense(tmp[i]));
JavaRDD<Vector> rows2 = jsc.parallelize(newList);
RowMatrix newmat = new RowMatrix(rows2.rdd());
return (newmat);
}
You are correct: there is no
RowMatrix.transpose()
method. You will need to do this operation manually.
Here is the non-distributed/local matrix versions:
def transpose(m: Array[Array[Double]]): Array[Array[Double]] = {
(for {
c <- m(0).indices
} yield m.map(_(c)) ).toArray
}
The distributed version would be along the following lines:
origMatRdd.rows.zipWithIndex.map{ case (rvect, i) =>
rvect.zipWithIndex.map{ case (ax, j) => ((j,(i,ax))
}.groupByKey
.sortBy{ case (i, ax) => i }
.foldByKey(new DenseVector(origMatRdd.numRows())) { case (dv, (ix,ax)) =>
dv(ix) = ax
}
Caveat: I have not tested the above: it will have bugs. But the basic approach is valid - and similar to work I had done in the past for a small LinAlg library for spark.