PySpark and PCA: How can I extract the eigenvectors of this PCA? How can I calculate how much variance they are explaining?

Asked by 忘掉有多难, 2020-12-04 17:04

I am reducing the dimensionality of a Spark DataFrame with a PCA model in pyspark (using the spark.ml library), along the following lines.
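A minimal sketch of that kind of setup (the data, the column names, and k=2 here are illustrative assumptions, not the asker's actual code):

    from pyspark.ml.feature import PCA
    from pyspark.ml.linalg import Vectors

    # Hypothetical input: a DataFrame with a single vector-valued "features" column.
    data = [(Vectors.dense([1.0, 0.0, 0.0, 2.0, 5.0]),),
            (Vectors.dense([2.0, 1.0, 0.0, 3.0, 4.0]),),
            (Vectors.dense([4.0, 0.0, 1.0, 6.0, 7.0]),)]
    df = sqlContext.createDataFrame(data, ["features"])

    # Project onto the top k principal components.
    pca = PCA(k=2, inputCol="features", outputCol="pca_features")
    model = pca.fit(df)
    result = model.transform(df)

Note that since Spark 2.0 the fitted PCAModel exposes the loading matrix directly as model.pc and the per-component explained variance as model.explainedVariance; the workarounds in the answers below predate that.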

4 Answers
  •  Answered by 死守一世寂寞, 2020-12-04 17:24

    The easiest answer to your question is to feed an identity matrix to your model.

    from pyspark.ml.linalg import Vectors

    # A 5x5 identity matrix as a one-column DataFrame of feature vectors;
    # transforming each basis vector reads out one row of the loading matrix.
    identity_input = [(Vectors.dense([1.0, 0.0, 0.0, 0.0, 0.0]),),
                      (Vectors.dense([0.0, 1.0, 0.0, 0.0, 0.0]),),
                      (Vectors.dense([0.0, 0.0, 1.0, 0.0, 0.0]),),
                      (Vectors.dense([0.0, 0.0, 0.0, 1.0, 0.0]),),
                      (Vectors.dense([0.0, 0.0, 0.0, 0.0, 1.0]),)]
    df_identity = sqlContext.createDataFrame(identity_input, ["features"])
    identity_features = model.transform(df_identity)
    

    This should give you the principal components: transforming the i-th standard basis vector simply reads out the i-th row of the loading matrix, so the transformed rows stack into the full matrix whose columns are the eigenvectors.
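    If you want the components as a plain matrix, you can stack the collected rows with numpy (a sketch; "pca_features" is assumed to be the outputCol set on the PCA estimator):

    import numpy as np

    # Row i of the transformed identity is row i of the loading matrix,
    # so stacking the rows yields the (features x k) matrix of eigenvectors.
    pc_matrix = np.array([row.pca_features.toArray()
                          for row in identity_features.collect()])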

    I think eliasah's answer is better in terms of the Spark framework, because desertnaut solves the problem with numpy functions instead of Spark actions. However, eliasah's answer is missing the normalization (mean-centering) of the data, so I'd add the following lines to eliasah's answer:

    from pyspark.ml.feature import StandardScaler
    from pyspark.mllib.linalg import Vectors as MLLibVectors
    from pyspark.mllib.linalg.distributed import RowMatrix

    # Mean-center the features (withStd=False: center only, no scaling).
    standardizer = StandardScaler(withMean=True, withStd=False,
                                  inputCol='features',
                                  outputCol='std_features')
    scaler_model = standardizer.fit(df)
    output = scaler_model.transform(df)

    # RowMatrix lives in mllib, so convert the ml vectors before building it.
    pca_features = output.select("std_features").rdd \
        .map(lambda row: MLLibVectors.fromML(row[0]))
    mat = RowMatrix(pca_features)
    svd = mat.computeSVD(5, computeU=True)
    

    Eventually, svd.V and identity_features.select("pca_features").collect() should have identical values (up to the sign of each column, since eigenvectors are only determined up to sign).
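    The singular values also answer the second part of the question. Because the matrix was mean-centered, the variance along component i is s_i**2 / (n - 1), so the explained-variance ratios fall out directly (a sketch; with k=5 equal to the full dimensionality the ratios sum to 1, while a smaller k would give ratios relative to the retained components only):

    import numpy as np

    n = mat.numRows()
    s = np.array(svd.s.toArray())       # singular values of the centered data
    eigenvalues = s ** 2 / (n - 1)      # variance along each principal component
    explained_ratio = eigenvalues / eigenvalues.sum()
    print(explained_ratio)              # fraction of variance per component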

    Edit: I summarized PCA and its use in Spark and sklearn in a blog post.
