Question
I am trying to fit a Spark ML CrossValidator on a DataFrame with the following schema:
root
|-- userID: string (nullable = true)
|-- features: vector (nullable = true)
|-- label: double (nullable = true)
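(For reference, a vector column of this shape is typically produced with VectorAssembler. A minimal sketch of that step, where rawDF and the input column names f1, f2 are hypothetical placeholders, not part of my actual job:)

import org.apache.spark.ml.feature.VectorAssembler

// Hypothetical assembly step: rawDF, "f1" and "f2" are placeholders.
val assembler = new VectorAssembler()
  .setInputCols(Array("f1", "f2"))
  .setOutputCol("features")
val featuresDF = assembler
  .transform(rawDF)
  .select("userID", "features", "label")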
I am getting a java.lang.UnsupportedOperationException: empty.maxBy
when I fit the CrossValidator.
I have read this bug report; it says that this exception happens when there are no features:
In the case of empty features we fail with a better error message stating: "DecisionTree requires number of features > 0, but was given an empty features vector" instead of the cryptic error message: java.lang.UnsupportedOperationException: empty.max
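For concreteness, the "empty features vector" that the report refers to is a zero-length vector. A minimal sketch of constructing one (this snippet is illustrative only, not from my job):

import org.apache.spark.ml.linalg.Vectors

// A zero-length dense vector: the "empty features" case the bug report describes.
val emptyVec = Vectors.dense(Array.empty[Double])
println(emptyVec.size)  // prints 0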
In my case, I have thousands of features, so I am sure that no feature vector is empty.
What could be another reason for this exception?
I am running the cluster on EMR. Here is the code, in case it helps (the DataFrame is named featuresDF, and before fitting the CrossValidator I verified that there are no empty feature vectors; see the sketch after the code for how that check might look):
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.RandomForestClassifier
import org.apache.spark.ml.evaluation.BinaryClassificationEvaluator
import org.apache.spark.ml.tuning.{CrossValidator, ParamGridBuilder}

// Random forest classifier reading the "features" and "label" columns.
val rf = new RandomForestClassifier()
  .setLabelCol("label")
  .setFeaturesCol("features")

val pipeline = new Pipeline().setStages(Array(rf))

// Grid of four combinations: {500, 1000} trees x {15, 25} max depth.
val paramGrid = new ParamGridBuilder()
  .addGrid(rf.numTrees, Array(500, 1000))
  .addGrid(rf.maxDepth, Array(15, 25))
  .build()

// Score each candidate model by area under the precision-recall curve.
val evaluator = new BinaryClassificationEvaluator()
  .setLabelCol("label")
  .setMetricName("areaUnderPR")

// 3-fold cross-validation over the parameter grid.
val cv = new CrossValidator()
  .setEstimator(pipeline)
  .setEvaluator(evaluator)
  .setEstimatorParamMaps(paramGrid)
  .setNumFolds(3)

val model = cv.fit(featuresDF)
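For completeness, the empty-vector check I mentioned above looked roughly like this (a minimal sketch, assuming a UDF over the ml Vector type; the exact code I used is an assumption here):

import org.apache.spark.ml.linalg.Vector
import org.apache.spark.sql.functions.{col, udf}

// Count rows whose feature vector is missing or has zero length.
val isEmptyVec = udf((v: Vector) => v == null || v.size == 0)
val emptyCount = featuresDF.filter(isEmptyVec(col("features"))).count()
println(s"rows with empty feature vectors: $emptyCount")  // expected: 0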
Source: https://stackoverflow.com/questions/44024076/spark-2-1-0-ml-randomforest-java-lang-unsupportedoperationexception-empty-max