Apache Spark: How to create a matrix from a DataFrame?

Asked by 轮回少年 on 2020-12-30 08:07

I have a DataFrame in Apache Spark with an array of integers per row; the source is a set of images. I ultimately want to do PCA on it, but I am having trouble just creating a matrix from my DataFrame.

1 Answer
  • 2020-12-30 08:50

    Since you didn't provide an example input, I'll assume it looks more or less like this, where id is a row number and image contains the values.

    # One row per image: id is the row number, image is an array of integers
    traindf = sqlContext.createDataFrame([
        (1, [1, 2, 3]),
        (2, [4, 5, 6]),
        (3, [7, 8, 9])
    ], ("id", "image"))
    

    First thing you have to understand is that DenseMatrix is a local data structure. To be precise, it is a wrapper around numpy.ndarray. As of now (Spark 1.4.1) there are no distributed equivalents in PySpark MLlib.
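    A quick way to convince yourself of that (a minimal sketch with arbitrary values):

    from pyspark.mllib.linalg import DenseMatrix
    import numpy as np

    # toArray() exposes the stored data as a plain numpy.ndarray
    m = DenseMatrix(2, 2, [1.0, 2.0, 3.0, 4.0])
    isinstance(m.toArray(), np.ndarray)
    ## True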

    DenseMatrix takes three mandatory arguments, numRows, numCols, and values, where values is a local data structure. In your case you have to collect first:

    values = (traindf
        .rdd
        .map(lambda r: (r.id, r.image))  # extract row id and image data
        .sortByKey()                     # sort by row id
        .flatMap(lambda kv: kv[1])       # flatten the images in row order
        .collect())
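
    With the example input above this gives the image data flattened in row order:

    values
    ## [1, 2, 3, 4, 5, 6, 7, 8, 9]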
    
    
    from pyspark.mllib.linalg import DenseMatrix

    ncol = len(traindf.rdd.map(lambda r: r.image).first())  # columns = length of one image
    nrow = traindf.count()                                  # rows = number of images

    dm = DenseMatrix(nrow, ncol, values)
    

    Finally, note that DenseMatrix stores its values in column-major order, so each image ends up as a column:

    print(dm.toArray())
    ## [[ 1.  4.  7.]
    ##  [ 2.  5.  8.]
    ##  [ 3.  6.  9.]]
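
    If you would rather have each image as a row, a simple option (a sketch relying only on the numpy.ndarray that toArray() returns) is to transpose the result:

    dm.toArray().T
    ## [[ 1.  2.  3.]
    ##  [ 4.  5.  6.]
    ##  [ 7.  8.  9.]]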
    

    Edit:

    In Spark 1.5+ you can use mllib.linalg.distributed as follows:

    from pyspark.mllib.linalg.distributed import IndexedRow, IndexedRowMatrix

    # Each (id, image) row becomes an IndexedRow(index, vector)
    mat = IndexedRowMatrix(traindf.rdd.map(lambda row: IndexedRow(*row)))
    mat.numRows()
    ## 4 (indices are treated as 0-based and the highest id is 3, hence 3 + 1)
    mat.numCols()
    ## 3
    

    although as of now the API is still too limited to be useful in practice.
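
    For what it is worth, much later releases filled this gap: if I recall correctly, PySpark's RowMatrix gained computePrincipalComponents in Spark 2.2, so the PCA step could eventually look roughly like this (a sketch, not something available in the 1.x versions discussed above):

    # Assumes Spark 2.2+; toRowMatrix() drops the row indices
    pcs = mat.toRowMatrix().computePrincipalComponents(2)  # top-2 PCs as a local matrix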
