Spark - create RDD of (label, features) pairs from CSV file

Submitted by 泄露秘密 on 2019-12-06 09:18:29

After a lot of effort I found the solution. The first problem was related to the header row and the second to the mapping function. Here is the complete solution:

//Required MLlib imports
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.mllib.linalg.Vectors

//To read the file
val csv = sc.textFile(datadir + "/data_2.csv")

//To find the header
val header = csv.first

//To remove the header line
val data = csv.filter(_ != header)

//To create an RDD of (label, features) pairs:
//the first column is the label, the remaining columns are the features
val parsedData = data.map { line =>
    val parts = line.split(',')
    LabeledPoint(parts(0).toDouble, Vectors.dense(parts.tail.map(_.toDouble)))
}.cache()
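
Once parsedData is cached, it can be fed straight into an MLlib algorithm. The following is only a minimal sketch of a possible next step, assuming a linear regression model is the goal; LinearRegressionWithSGD and the iteration count are illustrative choices of mine, not part of the original solution.

import org.apache.spark.mllib.regression.LinearRegressionWithSGD

//Train a linear regression model on the (label, features) pairs
//(numIterations is an assumed value, purely for illustration)
val numIterations = 100
val model = LinearRegressionWithSGD.train(parsedData, numIterations)

//Compare predictions against the true labels on the training data
val predictionsAndLabels = parsedData.map { p =>
    (model.predict(p.features), p.label)
}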

I hope it saves you some time.

When you read in your file, the first line

y_3,x_6,x_7,x_73_1,x_73_2,x_73_3,x_8

is also read and transformed in your map function, so you end up calling toDouble on y_3. You need to filter out the first row and do the learning on the remaining rows.
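
If you would rather not compare every line against the header string, one common alternative (a sketch, not taken from the answers above) is to drop the first record of the first partition, which is where sc.textFile places the header of a single input file:

//Skip the header by dropping the first record of partition 0
val data = csv.mapPartitionsWithIndex { (idx, iter) =>
    if (idx == 0) iter.drop(1) else iter
}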
