Question
Please keep in mind that I'm new to Scala.
This is the example I am trying to follow: https://spark.apache.org/docs/1.5.1/ml-ann.html
It uses this dataset: https://github.com/apache/spark/blob/master/data/mllib/sample_multiclass_classification_data.txt
I have prepared my .csv using the code below to get a data frame for classification in Scala.
//imports for ML
import org.apache.spark.ml.classification.MultilayerPerceptronClassifier
import org.apache.spark.ml.evaluation.MulticlassClassificationEvaluator
import org.apache.spark.mllib.util.MLUtils
import org.apache.spark.sql.Row
//imports for transformation
import sqlContext.implicits._
import com.databricks.spark.csv._
import org.apache.spark.mllib.linalg.{Vector, Vectors}
//load data
val data2 = sqlContext.csvFile("/Users/administrator/Downloads/ds_15k_10-2.csv")
//Rename any one column to features
//val df2 = data.withColumnRenamed("ip_crowding", "features")
val DF2 = data2.select("gst_id_matched","ip_crowding","lat_long_dist");
scala> DF2.take(2)
res6: Array[org.apache.spark.sql.Row] = Array([0,0,0], [0,0,1628859.542])
//define a UDF that converts a String to a Double
val toDouble = udf[Double, String]( _.toDouble)
//Convert all to double
val featureDf = DF2
.withColumn("gst_id_matched",toDouble(DF2("gst_id_matched")))
.withColumn("ip_crowding",toDouble(DF2("ip_crowding")))
.withColumn("lat_long_dist",toDouble(DF2("lat_long_dist")))
.select("gst_id_matched","ip_crowding","lat_long_dist")
//Define the format
val toVec4 = udf[Vector, Double,Double] { (v1,v2) => Vectors.dense(v1,v2) }
//Encode the label column, which is gst_id_matched
val encodeLabel = udf[Double, String]( _ match
{ case "0.0" => 0.0 case "1.0" => 1.0} )
//Transformed dataset
val df = featureDf
.withColumn("features",toVec4(featureDf("ip_crowding"),featureDf("lat_long_dist")))
.withColumn("label",encodeLabel(featureDf("gst_id_matched")))
.select("label", "features")
val splits = df.randomSplit(Array(0.6, 0.4), seed = 1234L)
val train = splits(0)
val test = splits(1)
// specify layers for the neural network:
// input layer of size 4 (features), two intermediate of size 5 and 4 and output of size 3 (classes)
val layers = Array[Int](0, 0, 0, 0)
// create the trainer and set its parameter
val trainer = new MultilayerPerceptronClassifier().setLayers(layers).setBlockSize(12).setSeed(1234L).setMaxIter(10)
// train the model
val model = trainer.fit(train)
The last line generates this error:
15/11/21 22:46:23 ERROR Executor: Exception in task 1.0 in stage 11.0 (TID 15)
java.lang.ArrayIndexOutOfBoundsException: 0
My suspicions:
When I examine the dataset, it looks fine for classification:
scala> df.take(2)
res3: Array[org.apache.spark.sql.Row] = Array([0.0,[0.0,0.0]], [0.0,[0.0,1628859.542]])
But the Apache example dataset looks different, so I suspect my transformation does not give me what I need. Can someone please help me with the dataset transformation, or help me understand the root cause of the problem?
This is what the Apache dataset looks like:
scala> data.take(1)
res8: Array[org.apache.spark.sql.Row] = Array([1.0,(4,[0,1,2,3],[-0.222222,0.5,-0.762712,-0.833333])])
Answer 1:
The source of your problem is an incorrect definition of layers. When you use
val layers = Array[Int](0, 0, 0, 0)
it means you want a network with zero nodes in each layer, which simply doesn't make sense. Generally speaking, the number of neurons in the input layer should equal the number of features, the output layer should have one neuron per class, and each hidden layer should contain at least one neuron.
Let's recreate your data, simplifying your code along the way:
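As a rough illustration of that sizing rule, here is a minimal sketch in plain Scala. `LayerSizing` and `buildLayers` are hypothetical names, not part of Spark; the helper only assembles the array you would pass to `setLayers` and fails early on sizes such as 0. The hidden layer size of 5 and the class count of 2 (labels 0.0 and 1.0) are assumptions made for the example.

```scala
// Hypothetical helper (not a Spark API): builds the array passed to
// MultilayerPerceptronClassifier.setLayers and rejects invalid sizes early.
object LayerSizing {
  def buildLayers(numFeatures: Int, hidden: Seq[Int], numClasses: Int): Array[Int] = {
    require(numFeatures > 0, "input layer needs one neuron per feature")
    require(hidden.forall(_ > 0), "every hidden layer needs at least one neuron")
    require(numClasses > 1, "output layer needs one neuron per class")
    // Input size first, then the hidden sizes in order, then the output size.
    ((numFeatures +: hidden) :+ numClasses).toArray
  }

  // Two features (ip_crowding, lat_long_dist), one hidden layer of 5 neurons
  // (an arbitrary choice), and two classes (labels 0.0 and 1.0).
  val example: Array[Int] = buildLayers(numFeatures = 2, hidden = Seq(5), numClasses = 2)
}
```

With this helper, a call like `buildLayers(0, Seq(0, 0), 0)` throws an `IllegalArgumentException` with a readable message instead of surfacing later as an `ArrayIndexOutOfBoundsException` during training.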
import org.apache.spark.sql.functions.col
val df = sc.parallelize(Seq(
("0", "0", "0"), ("0", "0", "1628859.542")
)).toDF("gst_id_matched", "ip_crowding", "lat_long_dist")
Convert all columns to doubles:
val numeric = df
.select(df.columns.map(c => col(c).cast("double").alias(c)): _*)
.withColumnRenamed("gst_id_matched", "label")
Assemble features:
import org.apache.spark.ml.feature.VectorAssembler
val assembler = new VectorAssembler()
.setInputCols(Array("ip_crowding","lat_long_dist"))
.setOutputCol("features")
val data = assembler.transform(numeric)
data.show
// +-----+-----------+-------------+-----------------+
// |label|ip_crowding|lat_long_dist|         features|
// +-----+-----------+-------------+-----------------+
// |  0.0|        0.0|          0.0|        (2,[],[])|
// |  0.0|        0.0|  1628859.542|[0.0,1628859.542]|
// +-----+-----------+-------------+-----------------+
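The two rows in the features column are printed differently because Spark writes sparse vectors as (size,[indices],[values]) and dense vectors as a plain list of values. Both describe the same kind of feature vector, so the format of the Apache sample data is not the cause of the error. The sketch below is plain Scala with no Spark dependency; `SparseNotation` and `toDense` are hypothetical names used only to show how the sparse notation expands.

```scala
// Hypothetical illustration (not a Spark API): expands the
// (size, indices, values) triple that Spark prints for a sparse vector.
object SparseNotation {
  def toDense(size: Int, indices: Array[Int], values: Array[Double]): Array[Double] = {
    require(indices.length == values.length, "one value per index")
    // Start with all zeros, then fill in the explicitly stored entries.
    val dense = Array.fill(size)(0.0)
    indices.zip(values).foreach { case (i, v) => dense(i) = v }
    dense
  }

  // (2,[],[]) from the output above: two entries, none stored, so both are zero.
  val firstRow: Array[Double] = toDense(2, Array.empty[Int], Array.empty[Double])

  // The Apache sample row (4,[0,1,2,3],[...]) stores all four entries explicitly.
  val sampleRow: Array[Double] =
    toDense(4, Array(0, 1, 2, 3), Array(-0.222222, 0.5, -0.762712, -0.833333))
}
```

Here `firstRow` holds two zeros, matching the first row of the table, and `sampleRow` holds the four values from the sample dataset in order.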
Train and test network:
import org.apache.spark.ml.classification.MultilayerPerceptronClassifier
val layers = Array[Int](2, 3, 5, 3) // Note 2 neurons in the input layer
val trainer = new MultilayerPerceptronClassifier()
.setLayers(layers)
.setBlockSize(128)
.setSeed(1234L)
.setMaxIter(100)
val model = trainer.fit(data)
model.transform(data).show
// +-----+-----------+-------------+-----------------+----------+
// |label|ip_crowding|lat_long_dist|         features|prediction|
// +-----+-----------+-------------+-----------------+----------+
// |  0.0|        0.0|          0.0|        (2,[],[])|       0.0|
// |  0.0|        0.0|  1628859.542|[0.0,1628859.542]|       0.0|
// +-----+-----------+-------------+-----------------+----------+
Source: https://stackoverflow.com/questions/33844591/prepare-data-for-multilayerperceptronclassifier-in-scala