Question
I have a CSV file and want to perform a simple LinearRegressionWithSGD on the data.
Sample data follows (the file has 99 rows in total, including the header row), and the objective is to predict the y_3 variable:
y_3,x_6,x_7,x_73_1,x_73_2,x_73_3,x_8
2995.3846153846152,17.0,1800.0,0.0,1.0,0.0,12.0
2236.304347826087,17.0,1432.0,1.0,0.0,0.0,12.0
2001.9512195121952,35.0,1432.0,0.0,1.0,0.0,5.0
992.4324324324324,17.0,1430.0,1.0,0.0,0.0,12.0
4386.666666666667,26.0,1430.0,0.0,0.0,1.0,25.0
1335.9036144578313,17.0,1432.0,0.0,1.0,0.0,5.0
1097.560975609756,17.0,1100.0,0.0,1.0,0.0,5.0
3526.6666666666665,26.0,1432.0,0.0,1.0,0.0,12.0
506.8421052631579,17.0,1430.0,1.0,0.0,0.0,5.0
2095.890410958904,35.0,1430.0,1.0,0.0,0.0,12.0
720.0,35.0,1430.0,1.0,0.0,0.0,5.0
2416.5,17.0,1432.0,0.0,0.0,1.0,12.0
3306.6666666666665,35.0,1800.0,0.0,0.0,1.0,12.0
6105.974025974026,35.0,1800.0,1.0,0.0,0.0,25.0
1400.4624277456646,35.0,1800.0,1.0,0.0,0.0,5.0
1414.5454545454545,26.0,1430.0,1.0,0.0,0.0,12.0
5204.68085106383,26.0,1800.0,0.0,0.0,1.0,25.0
1812.2222222222222,17.0,1800.0,1.0,0.0,0.0,12.0
2763.5928143712576,35.0,1100.0,1.0,0.0,0.0,12.0
I have already read the data with the following command:
val data = sc.textFile(datadir + "/data_2.csv");
I then want to create an RDD of (label, features) pairs with the following command:
val parsedData = data.map { line =>
val parts = line.split(',')
LabeledPoint(parts(0).toDouble, Vectors.dense(parts(1).split(' ').map(_.toDouble)))
}.cache()

But I cannot continue to train a model. Any help?
P.S. I run Spark with the Scala IDE on Windows 7 x64.
Answer 1:
After a lot of effort I found the solution. The first problem was related to the header row and the second to the mapping function. Here is the complete solution:
// Imports needed for LabeledPoint and Vectors
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.mllib.linalg.Vectors
// To read the file
val csv = sc.textFile(datadir + "/data_2.csv")
// To find the header
val header = csv.first
// To remove the header row (keep every line that is not the header)
val data = csv.filter(_ != header)
// To create an RDD of (label, features) pairs: the label is the first
// column, and all remaining comma-separated columns are the features
val parsedData = data.map { line =>
  val parts = line.split(',')
  LabeledPoint(parts(0).toDouble, Vectors.dense(parts.tail.map(_.toDouble)))
}.cache()
I hope this saves you some time.
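For completeness, here is a minimal sketch of the training step itself, assuming the Spark 1.x MLlib API used in the question; the numIterations and stepSize values are illustrative, not tuned:
import org.apache.spark.mllib.regression.LinearRegressionWithSGD

// Train a linear regression model with stochastic gradient descent.
// The step size is deliberately small: the raw features (e.g. x_7)
// are in the thousands and unscaled, and SGD can diverge otherwise.
val numIterations = 100
val stepSize = 1e-7
val model = LinearRegressionWithSGD.train(parsedData, numIterations, stepSize)

// Evaluate on the training data with mean squared error
val valuesAndPreds = parsedData.map { point =>
  (point.label, model.predict(point.features))
}
val MSE = valuesAndPreds.map { case (v, p) => math.pow(v - p, 2) }.mean()
println("training Mean Squared Error = " + MSE)
Scaling the features first (for example with MLlib's StandardScaler) would let you use a larger step size.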
Answer 2:
When you read in your file, the first line
y_3,x_6,x_7,x_73_1,x_73_2,x_73_3,x_8
is also read and transformed by your map function, so you end up trying to call toDouble on the string y_3. You need to filter out the first row and do the learning using the remaining rows.
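For completeness, a minimal sketch of one way to drop only that first row, assuming the file has been read into an RDD named csv as in the other answer:
// Drop the first line of the first partition only, i.e. the header row
val data = csv.mapPartitionsWithIndex { (idx, iter) =>
  if (idx == 0) iter.drop(1) else iter
}
Comparing each line against csv.first, as in the other answer, works just as well for a file this small.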
Source: https://stackoverflow.com/questions/30298523/spark-create-rdd-of-label-features-pairs-from-csv-file