PySpark PCA: avoiding NotConvergedException

血红的双手 · Submitted on 2019-12-22 09:38:32

Question


I'm attempting to reduce a wide dataset (51 features, ~1300 individuals) using PCA from pyspark.ml, as follows:

1) Collected the feature column names into one list:

features = indi_prep_df.select([c for c in indi_prep_df.columns if c not in {'indi_nbr', 'label'}]).columns

2) Imported the necessary libraries

from pyspark.ml.feature import PCA as PCAML
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.linalg import DenseVector

3) Collapsed the features to a DenseVector

indi_feat = indi_prep_df.rdd.map(lambda x: (x[0], x[-1], DenseVector(x[1:-1]))).toDF(['indi_nbr', 'label', 'features'])
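A note on that slice: with indi_nbr at position 0 and label at position -1, the features occupy x[1:-1]; a slice of x[1:-2] would silently drop the last feature column. A plain-Python check:

```python
# A row laid out like the DataFrame above: id first, label last,
# features in between (values here are made up for illustration).
row = ("id_001", 0.25, 12.0, 3.14, 1.0)

features = row[1:-1]   # everything between the id and the label
print(features)        # (0.25, 12.0, 3.14)
```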

4) Dropped the ID and label columns so only the features remain:

dftest = indi_feat.drop('indi_nbr','label')

5) Instantiated the PCA object

dfPCA = PCAML(k=3, inputCol="features", outputCol="pcafeats")

6) And attempted to fit the model

PCAout = dfPCA.fit(dftest)

But my model fails to converge (error below). Things I've tried:

- Mean-filling or zero-filling NA and null values (as appropriate)
- Reducing the number of features to 25 (after which I switched to SKLearn's PCA)

    Py4JJavaError: An error occurred while calling o2242.fit.
    : breeze.linalg.NotConvergedException:
    at breeze.linalg.svd$.breeze$linalg$svd$$doSVD_Double(svd.scala:110)
    at breeze.linalg.svd$Svd_DM_Impl$.apply(svd.scala:40)
    at breeze.linalg.svd$Svd_DM_Impl$.apply(svd.scala:39)
    at breeze.generic.UFunc$class.apply(UFunc.scala:48)
    at breeze.linalg.svd$.apply(svd.scala:23)
    at org.apache.spark.mllib.linalg.distributed.RowMatrix.computePrincipalComponentsAndExplainedVariance(RowMatrix.scala:389)
    at org.apache.spark.mllib.feature.PCA.fit(PCA.scala:48)
    at org.apache.spark.ml.feature.PCA.fit(PCA.scala:99)
    at org.apache.spark.ml.feature.PCA.fit(PCA.scala:70)
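The NotConvergedException comes out of Breeze's LAPACK-backed SVD, and one common trigger is a NaN or Inf that survived the fill step. It can be worth scanning a collected sample of the rows before fitting; a minimal plain-Python check (the helper name is mine):

```python
import math

def nonfinite_cells(rows):
    """Return (row, col) positions holding NaN or +/-Inf."""
    return [(i, j)
            for i, row in enumerate(rows)
            for j, v in enumerate(row)
            if not math.isfinite(v)]

# Two sample rows; the second hides a NaN that would break the SVD.
sample = [(0.5, 12.0, 3.14),
          (0.1, float("nan"), 2.71)]
print(nonfinite_cells(sample))  # [(1, 1)]
```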

My configuration is 50 executors with 6 GB each, so I don't think it's a matter of not having enough resources (and nothing in the stack trace points to resources).

My input factors are a mixture of percentages, integers and 2-decimal floats, all positive and all ordinal. Could that be causing difficulty with convergence?
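Mixed scales like these can produce an ill-conditioned covariance matrix, which is one thing that can trip up the SVD. A common mitigation is to standardize each column first (in Spark, pyspark.ml.feature.StandardScaler with withMean=True and withStd=True, run before the PCA stage). The per-column transformation it applies, sketched in plain Python:

```python
from statistics import mean, stdev

def standardize(col):
    """Rescale one feature column to mean 0 and (sample) std 1."""
    m, s = mean(col), stdev(col)
    return [(v - m) / s for v in col]

# A toy column whose scale differs wildly from its neighbours.
print(standardize([10.0, 20.0, 30.0]))  # [-1.0, 0.0, 1.0]
```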

SKLearn's PCA converged quickly and without trouble once I converted the PySpark DataFrame to a Pandas DataFrame.

Source: https://stackoverflow.com/questions/47340602/pyspark-pca-avoiding-notconvergedexception
