How to create correct data frame for classification in Spark ML

鱼传尺愫 2020-12-04 10:05

I am trying to run a random forest classification using the Spark ML API, but I am having issues creating the right data frame input for the pipeline.

Here is sample data (the columns are age, hours_per_week, education, sex, and salaryRange):

3 Answers
  • 2020-12-04 10:30

    You simply need to make sure that the "features" column in your data frame is of type VectorUDT. Merely renaming a numeric column is not enough, as the error below shows:

    scala> val df2 = dataFixed.withColumnRenamed("age", "features")
    df2: org.apache.spark.sql.DataFrame = [features: int, hours_per_week: int, education: string, sex: string, salaryRange: string]
    
    scala> val cmModel = cv.fit(df2) 
    java.lang.IllegalArgumentException: requirement failed: Column features must be of type org.apache.spark.mllib.linalg.VectorUDT@1eef but was actually IntegerType.
        at scala.Predef$.require(Predef.scala:233)
        at org.apache.spark.ml.util.SchemaUtils$.checkColumnType(SchemaUtils.scala:37)
        at org.apache.spark.ml.PredictorParams$class.validateAndTransformSchema(Predictor.scala:50)
        at org.apache.spark.ml.Predictor.validateAndTransformSchema(Predictor.scala:71)
        at org.apache.spark.ml.Predictor.transformSchema(Predictor.scala:118)
        at org.apache.spark.ml.Pipeline$$anonfun$transformSchema$4.apply(Pipeline.scala:164)
        at org.apache.spark.ml.Pipeline$$anonfun$transformSchema$4.apply(Pipeline.scala:164)
        at scala.collection.IndexedSeqOptimized$class.foldl(IndexedSeqOptimized.scala:51)
        at scala.collection.IndexedSeqOptimized$class.foldLeft(IndexedSeqOptimized.scala:60)
        at scala.collection.mutable.ArrayOps$ofRef.foldLeft(ArrayOps.scala:108)
        at org.apache.spark.ml.Pipeline.transformSchema(Pipeline.scala:164)
        at org.apache.spark.ml.tuning.CrossValidator.transformSchema(CrossValidator.scala:142)
        at org.apache.spark.ml.PipelineStage.transformSchema(Pipeline.scala:59)
        at org.apache.spark.ml.tuning.CrossValidator.fit(CrossValidator.scala:107)
        at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:67)
        at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:72)
        at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:74)
        at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:76)
    

    EDIT1

    Essentially, your data frame needs two fields: "features", holding the feature vector, and "label", holding the instance label. The label must be of type Double.

    To create a "features" field of Vector type, first create a UDF as shown below:

    import org.apache.spark.sql.functions.udf
    import org.apache.spark.mllib.linalg.{Vector, Vectors}

    // Encode the two string columns as numeric codes and pack all four
    // values into a dense vector. Note the matches are non-exhaustive:
    // an unexpected value will throw a MatchError at runtime.
    val toVec4 = udf[Vector, Int, Int, String, String] { (a, b, c, d) =>
      val e3 = c match {
        case "hs-grad"   => 0
        case "bachelors" => 1
        case "masters"   => 2
      }
      val e4 = d match { case "male" => 0 case "female" => 1 }
      Vectors.dense(a, b, e3, e4)
    }
    

    Similarly, to encode the "label" field as a Double, create another UDF:

    val encodeLabel = udf[Double, String](_ match { case "A" => 0.0 case "B" => 1.0 })
    

    Now we transform the original data frame using these two UDFs:

    val df = dataFixed.withColumn(
      "features",
      toVec4(
        dataFixed("age"),
        dataFixed("hours_per_week"),
        dataFixed("education"),
        dataFixed("sex")
      )
    ).withColumn("label", encodeLabel(dataFixed("salaryRange"))).select("features", "label")
    

    Note that extra columns/fields can be present in the data frame; in this case I have selected only "features" and "label":

    scala> df.show()
    +-------------------+-----+
    |           features|label|
    +-------------------+-----+
    |[38.0,40.0,0.0,0.0]|  0.0|
    |[28.0,40.0,1.0,1.0]|  0.0|
    |[52.0,45.0,0.0,0.0]|  1.0|
    |[31.0,50.0,2.0,1.0]|  1.0|
    |[42.0,40.0,1.0,0.0]|  1.0|
    +-------------------+-----+
    

    Now it's up to you to set the correct parameters for your learning algorithm to make it work.
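
    For instance, a minimal sketch of fitting a random forest on this prepared data frame (assuming Spark 1.4+; the parameter value is just a placeholder, and df is the data frame built above):

    import org.apache.spark.ml.classification.RandomForestClassifier

    // Train a random forest directly on the (features, label) data frame.
    val rf = new RandomForestClassifier()
      .setLabelCol("label")
      .setFeaturesCol("features")
      .setNumTrees(10) // placeholder value

    val model = rf.fit(df)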

  • 2020-12-04 10:30

    As of Spark 1.4, you can use the Transformer org.apache.spark.ml.feature.VectorAssembler. Just provide the names of the columns you want assembled into features.

    import org.apache.spark.ml.feature.VectorAssembler

    val assembler = new VectorAssembler()
      .setInputCols(Array("col1", "col2", "col3"))
      .setOutputCol("features")
    

    and add it to your pipeline.
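
    A minimal sketch of such a pipeline (hypothetical names: trainingDF stands for a data frame that already contains numeric columns col1, col2, col3 and a Double "label" column):

    import org.apache.spark.ml.Pipeline
    import org.apache.spark.ml.classification.RandomForestClassifier

    // The assembler produces the "features" column the classifier consumes.
    val rf = new RandomForestClassifier()
      .setLabelCol("label")
      .setFeaturesCol("features")

    val pipeline = new Pipeline().setStages(Array(assembler, rf))
    val model = pipeline.fit(trainingDF)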

  • 2020-12-04 10:32

    According to the Spark documentation on MLlib random forests, it seems to me that you should define the categorical features map you are using, and that your data points should be of type LabeledPoint.

    This tells the algorithm which value should be used as the prediction target and which ones are the features.

    https://spark.apache.org/docs/latest/mllib-decision-tree.html
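
    A minimal sketch of the RDD-based MLlib API (the values and names here are illustrative; sc is the SparkContext, e.g. from spark-shell):

    import org.apache.spark.mllib.linalg.Vectors
    import org.apache.spark.mllib.regression.LabeledPoint
    import org.apache.spark.mllib.tree.RandomForest

    // Each training example is a LabeledPoint: a Double label plus a feature vector.
    val points = sc.parallelize(Seq(
      LabeledPoint(0.0, Vectors.dense(38.0, 40.0, 0.0, 0.0)),
      LabeledPoint(1.0, Vectors.dense(52.0, 45.0, 0.0, 0.0))
    ))

    // categoricalFeaturesInfo marks which features are categorical:
    // here feature 2 has 3 categories and feature 3 has 2.
    val model = RandomForest.trainClassifier(
      points,
      numClasses = 2,
      categoricalFeaturesInfo = Map(2 -> 3, 3 -> 2),
      numTrees = 10,
      featureSubsetStrategy = "auto",
      impurity = "gini",
      maxDepth = 4,
      maxBins = 32
    )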
