PySpark and PCA: How can I extract the eigenvectors of this PCA? How can I calculate how much variance they are explaining?

Asked by 忘掉有多难, 2020-12-04 17:04

I am reducing the dimensionality of a Spark DataFrame with a PCA model in pyspark (using the spark.ml library), along the following lines.
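A minimal sketch of that kind of setup (the data, the column names, and k=2 here are illustrative assumptions, not the asker's actual code):

    from pyspark.ml.feature import PCA
    from pyspark.ml.linalg import Vectors

    # Hypothetical input: a DataFrame with a single vector-valued "features" column.
    data = [(Vectors.dense([1.0, 0.0, 0.0, 2.0, 5.0]),),
            (Vectors.dense([2.0, 1.0, 0.0, 3.0, 4.0]),),
            (Vectors.dense([4.0, 0.0, 1.0, 6.0, 7.0]),)]
    df = sqlContext.createDataFrame(data, ["features"])

    # Project onto the top k principal components.
    pca = PCA(k=2, inputCol="features", outputCol="pca_features")
    model = pca.fit(df)
    result = model.transform(df)

Note that since Spark 2.0 the fitted PCAModel exposes the loading matrix directly as model.pc and the per-component explained variance as model.explainedVariance; the workarounds in the answers below predate that.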

4 Answers
  •  Answered by 死守一世寂寞, 2020-12-04 17:24

    The easiest answer to your question is to feed an identity matrix to your model.

    from pyspark.ml.linalg import Vectors

    # A 5x5 identity matrix as a one-column DataFrame of feature vectors;
    # transforming each basis vector reads out one row of the loading matrix.
    identity_input = [(Vectors.dense([1.0, 0.0, 0.0, 0.0, 0.0]),),
                      (Vectors.dense([0.0, 1.0, 0.0, 0.0, 0.0]),),
                      (Vectors.dense([0.0, 0.0, 1.0, 0.0, 0.0]),),
                      (Vectors.dense([0.0, 0.0, 0.0, 1.0, 0.0]),),
                      (Vectors.dense([0.0, 0.0, 0.0, 0.0, 1.0]),)]
    df_identity = sqlContext.createDataFrame(identity_input, ["features"])
    identity_features = model.transform(df_identity)
    

    This should give you the principal components: transforming the i-th standard basis vector simply reads out the i-th row of the loading matrix, so the transformed rows stack into the full matrix whose columns are the eigenvectors.
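    If you want the components as a plain matrix, you can stack the collected rows with numpy (a sketch; "pca_features" is assumed to be the outputCol set on the PCA estimator):

    import numpy as np

    # Row i of the transformed identity is row i of the loading matrix,
    # so stacking the rows yields the (features x k) matrix of eigenvectors.
    pc_matrix = np.array([row.pca_features.toArray()
                          for row in identity_features.collect()])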

    I think eliasah's answer is better in terms of the Spark framework, because desertnaut solves the problem with numpy functions instead of Spark actions. However, eliasah's answer is missing the normalization (mean-centering) of the data, so I'd add the following lines to eliasah's answer:

    from pyspark.ml.feature import StandardScaler
    from pyspark.mllib.linalg import Vectors as MLLibVectors
    from pyspark.mllib.linalg.distributed import RowMatrix

    # Mean-center the features (withStd=False: center only, no scaling).
    standardizer = StandardScaler(withMean=True, withStd=False,
                                  inputCol='features',
                                  outputCol='std_features')
    scaler_model = standardizer.fit(df)
    output = scaler_model.transform(df)

    # RowMatrix lives in mllib, so convert the ml vectors before building it.
    pca_features = output.select("std_features").rdd \
        .map(lambda row: MLLibVectors.fromML(row[0]))
    mat = RowMatrix(pca_features)
    svd = mat.computeSVD(5, computeU=True)
    

    Eventually, svd.V and identity_features.select("pca_features").collect() should have identical values (up to the sign of each column, since eigenvectors are only determined up to sign).
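    The singular values also answer the second part of the question. Because the matrix was mean-centered, the variance along component i is s_i**2 / (n - 1), so the explained-variance ratios fall out directly (a sketch; with k=5 equal to the full dimensionality the ratios sum to 1, while a smaller k would give ratios relative to the retained components only):

    import numpy as np

    n = mat.numRows()
    s = np.array(svd.s.toArray())       # singular values of the centered data
    eigenvalues = s ** 2 / (n - 1)      # variance along each principal component
    explained_ratio = eigenvalues / eigenvalues.sum()
    print(explained_ratio)              # fraction of variance per component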

    Edit: I summarized PCA and its use in Spark and sklearn in a blog post.
