I have a DataFrame in Apache Spark with an array of integers; the source is a set of images. I ultimately want to do PCA on it, but I am having trouble just creating a matrix.
Since you didn't provide an example input, I'll assume it looks more or less like this, where id is a row number and image contains the values:
traindf = sqlContext.createDataFrame([
    (1, [1, 2, 3]),
    (2, [4, 5, 6]),
    (3, [7, 8, 9])
], ("id", "image"))
The first thing you have to understand is that DenseMatrix is a local data structure. To be precise, it is a wrapper around numpy.ndarray. As of now (Spark 1.4.1) there are no distributed equivalents in PySpark MLlib.
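You can check this locally, with no cluster involved; a minimal sketch (the numbers are made up):

from pyspark.mllib.linalg import DenseMatrix

# A 2x2 local matrix; the four values fill it column by column
m = DenseMatrix(2, 2, [1, 2, 3, 4])

print type(m.toArray())
## <type 'numpy.ndarray'>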
DenseMatrix takes three mandatory arguments: numRows, numCols and values, where values is a local data structure, a flat sequence of matrix entries. In your case you have to collect first:
values = (traindf.
    rdd.
    map(lambda r: (r.id, r.image)).   # Extract row id and data
    sortByKey().                      # Sort by row id
    flatMap(lambda kv: kv[1]).        # kv = (id, image); flatten the images
    collect())
ncol = len(traindf.rdd.map(lambda r: r.image).first())
nrow = traindf.count()
from pyspark.mllib.linalg import DenseMatrix

dm = DenseMatrix(nrow, ncol, values)
Finally:
print dm.toArray()
## [[ 1.  4.  7.]
##  [ 2.  5.  8.]
##  [ 3.  6.  9.]]
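One detail worth calling out: DenseMatrix reads values in column-major order, so each input image ends up as a column of dm rather than a row, which is what the output above shows. A minimal way to get one image per row is to transpose the local result (newer Spark versions also accept an isTransposed flag, but treat that as an assumption and check your version's API):

# Each image is a column of dm; transpose the local ndarray
# to get one image per row
rows = dm.toArray().T

print rows
## [[ 1.  2.  3.]
##  [ 4.  5.  6.]
##  [ 7.  8.  9.]]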
Edit: In Spark 1.5+ you can use mllib.linalg.distributed as follows:
from pyspark.mllib.linalg.distributed import IndexedRow, IndexedRowMatrix
mat = IndexedRowMatrix(traindf.rdd.map(lambda row: IndexedRow(*row)))
mat.numRows()
## 4
mat.numCols()
## 3
Note that numRows() returns 4 rather than 3 because IndexedRowMatrix infers the row count as the largest row index plus one, and the indices in the example data start at 1. That said, as of now the API is still too limited to be useful in practice.
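Since the end goal is PCA, a minimal sketch of one way to get there in Spark 1.5+ is pyspark.mllib.feature.PCA, which works directly on an RDD of vectors rather than on the matrix wrappers above (k=2 is an arbitrary choice for this toy data):

from pyspark.mllib.feature import PCA
from pyspark.mllib.linalg import Vectors

# One dense vector per image
vectors = traindf.rdd.map(lambda r: Vectors.dense(r.image))

# Fit a PCA model with k=2 components and project the data onto them
model = PCA(2).fit(vectors)
projected = model.transform(vectors)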