Prepare data for MultilayerPerceptronClassifier in Scala


Question


Please keep in mind I'm new to scala.

This is the example I am trying to follow: https://spark.apache.org/docs/1.5.1/ml-ann.html

It uses this dataset: https://github.com/apache/spark/blob/master/data/mllib/sample_multiclass_classification_data.txt

I have prepared my .csv using the code below to get a data frame for classification in Scala.

//imports for ML
import org.apache.spark.ml.classification.MultilayerPerceptronClassifier
import org.apache.spark.ml.evaluation.MulticlassClassificationEvaluator
import org.apache.spark.mllib.util.MLUtils
import org.apache.spark.sql.Row

//imports for transformation
import sqlContext.implicits._
import com.databricks.spark.csv._
import org.apache.spark.mllib.linalg.{Vector, Vectors}

//load data
val data2 = sqlContext.csvFile("/Users/administrator/Downloads/ds_15k_10-2.csv")

//Rename any one column to features
//val df2 = data.withColumnRenamed("ip_crowding", "features")
val DF2 = data2.select("gst_id_matched","ip_crowding","lat_long_dist");

scala> DF2.take(2)
res6: Array[org.apache.spark.sql.Row] = Array([0,0,0], [0,0,1628859.542])

//define a String-to-Double UDF (spark-csv loads every column as a string)
val toDouble = udf[Double, String]( _.toDouble)

//Convert all to double
val featureDf = DF2
.withColumn("gst_id_matched",toDouble(DF2("gst_id_matched")))
.withColumn("ip_crowding",toDouble(DF2("ip_crowding")))
.withColumn("lat_long_dist",toDouble(DF2("lat_long_dist")))
.select("gst_id_matched","ip_crowding","lat_long_dist")


//Assemble the two feature columns into a dense vector
//(the name toVec4 is left over from a four-feature example; this version takes two)
val toVec4 = udf[Vector, Double, Double] { (v1, v2) => Vectors.dense(v1, v2) }

//Encode the label column (gst_id_matched) as a double
val encodeLabel = udf[Double, String] {
  case "0.0" => 0.0
  case "1.0" => 1.0
}

//Transformed dataset
val df = featureDf
  .withColumn("features", toVec4(featureDf("ip_crowding"), featureDf("lat_long_dist")))
  .withColumn("label", encodeLabel(featureDf("gst_id_matched")))
  .select("label", "features")

val splits = df.randomSplit(Array(0.6, 0.4), seed = 1234L)
val train = splits(0)
val test = splits(1)
// specify layers for the neural network: 
// input layer of size 4 (features), two intermediate of size 5 and 4 and output of size 3 (classes)
val layers = Array[Int](0, 0, 0, 0)
// create the trainer and set its parameters
val trainer = new MultilayerPerceptronClassifier()
  .setLayers(layers)
  .setBlockSize(12)
  .setSeed(1234L)
  .setMaxIter(10)
// train the model
val model = trainer.fit(train)

The last line generates this error:

15/11/21 22:46:23 ERROR Executor: Exception in task 1.0 in stage 11.0 (TID 15)
java.lang.ArrayIndexOutOfBoundsException: 0

My suspicions:

When I examine the dataset, it looks fine for classification:

scala> df.take(2)
res3: Array[org.apache.spark.sql.Row] = Array([0.0,[0.0,0.0]], [0.0,[0.0,1628859.542]])

But the Apache example dataset is different, and my transformation does not give me what I need. Can someone please help me with the dataset transformation, or help me understand the root cause of the problem?

This is what the Apache dataset looks like:

scala> data.take(1)
res8: Array[org.apache.spark.sql.Row] = Array([1.0,(4,[0,1,2,3],[-0.222222,0.5,-0.762712,-0.833333])])
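(For reference, the `(4,[0,1,2,3],[...])` form is simply how Spark prints a sparse vector: its size, the populated indices, and their values. A minimal sketch of the equivalence, using the same mllib linalg API already imported above:

import org.apache.spark.mllib.linalg.Vectors

// The Apache example stores rows as sparse vectors; equality compares
// contents rather than representation, so these two vectors are equal.
val sparse = Vectors.sparse(4, Array(0, 1, 2, 3), Array(-0.222222, 0.5, -0.762712, -0.833333))
val dense = Vectors.dense(-0.222222, 0.5, -0.762712, -0.833333)
println(sparse == dense) // true
)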

Answer 1:


The source of your problem is an incorrect definition of layers. When you use

val layers = Array[Int](0, 0, 0, 0)

it means you want a network with zero nodes in each layer, which simply doesn't make sense. Generally speaking, the number of neurons in the input layer should equal the number of features, the output layer should have one neuron per class, and each hidden layer should contain at least one neuron.
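
A minimal sketch of that rule, with a hypothetical layerSizes helper (assuming the usual "features"/"label" columns and labels 0.0 through k-1):

import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.sql.DataFrame

// Hypothetical helper: derive the input/output layer sizes from the data.
// Assumes every class actually occurs at least once in the "label" column.
def layerSizes(data: DataFrame, hidden: Array[Int]): Array[Int] = {
  val numFeatures = data.select("features").first.getAs[Vector](0).size
  val numClasses = data.select("label").distinct.count.toInt
  numFeatures +: hidden :+ numClasses
}

// e.g. layerSizes(data, Array(5, 4)) yields Array(numFeatures, 5, 4, numClasses)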

Let's recreate your data, simplifying your code along the way:

import org.apache.spark.sql.functions.col

val df = sc.parallelize(Seq(
  ("0", "0", "0"), ("0", "0", "1628859.542")
)).toDF("gst_id_matched", "ip_crowding", "lat_long_dist")

Convert all columns to doubles:

val numeric = df
  .select(df.columns.map(c => col(c).cast("double").alias(c)): _*)
  .withColumnRenamed("gst_id_matched", "label")

Assemble features:

import org.apache.spark.ml.feature.VectorAssembler

val assembler = new VectorAssembler()
  .setInputCols(Array("ip_crowding","lat_long_dist"))
  .setOutputCol("features")

val data = assembler.transform(numeric)
data.show

// +-----+-----------+-------------+-----------------+
// |label|ip_crowding|lat_long_dist|         features|
// +-----+-----------+-------------+-----------------+
// |  0.0|        0.0|          0.0|        (2,[],[])|
// |  0.0|        0.0|  1628859.542|[0.0,1628859.542]|
// +-----+-----------+-------------+-----------------+
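(Here `(2,[],[])` is just the sparse print-out of an all-zero vector of size 2; it is equal to the dense `[0.0,0.0]`.)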

Train and test network:

import org.apache.spark.ml.classification.MultilayerPerceptronClassifier

val layers = Array[Int](2, 3, 5, 3) // Note 2 neurons in the input layer
val trainer = new MultilayerPerceptronClassifier()
  .setLayers(layers)
  .setBlockSize(128)
  .setSeed(1234L)
  .setMaxIter(100)

val model = trainer.fit(data)
model.transform(data).show

// +-----+-----------+-------------+-----------------+----------+
// |label|ip_crowding|lat_long_dist|         features|prediction|
// +-----+-----------+-------------+-----------------+----------+
// |  0.0|        0.0|          0.0|        (2,[],[])|       0.0|
// |  0.0|        0.0|  1628859.542|[0.0,1628859.542]|       0.0|
// +-----+-----------+-------------+-----------------+----------+
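
For completeness, the linked Spark example also measures precision on a held-out split, using the MulticlassClassificationEvaluator already imported in the question. A sketch along those lines (purely illustrative here, since the toy frame above has only two rows):

import org.apache.spark.ml.evaluation.MulticlassClassificationEvaluator

// Split as in the question: train on 60% of the rows, evaluate on the rest.
val Array(train, test) = data.randomSplit(Array(0.6, 0.4), seed = 1234L)
val model = trainer.fit(train)
val predictionAndLabels = model.transform(test).select("prediction", "label")

val evaluator = new MulticlassClassificationEvaluator()
  .setMetricName("precision") // metric name in Spark 1.5; later versions use "accuracy"

println("Precision: " + evaluator.evaluate(predictionAndLabels))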


Source: https://stackoverflow.com/questions/33844591/prepare-data-for-multilayerperceptronclassifier-in-scala
