I'm using Spark 2.1.1 and Scala 2.11.8.
I have to read data from a csv file that has a minimum of 6 and a maximum of 9 columns per row. Once a line is split on commas, columns 0 to 5 will always have data, but data may be present or absent in columns 6 to 8. I separated and stored the required columns in an RDD using:
    val read_file = sc.textFile("Path to input file")
    val uid = read_file.map(line => {
      val arr = line.split(",")
      (arr(2).split(":")(0), arr(3), arr(4).split(":")(0), arr(5).split(":")(0),
       arr(6).split(":")(0), arr(7).split(":")(0), arr(8).split(":")(0))
    })
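Note that arr(6) to arr(8) will throw an ArrayIndexOutOfBoundsException for any row that has fewer than 9 fields. A minimal sketch of a guarded version, using lift so that a missing column yields an empty string (the empty-string default is my assumption; an Option would work as well):

    val uid = read_file.map { line =>
      val arr = line.split(",")
      // lift returns None when the index is out of range, so short rows don't throw
      def col(i: Int): String = arr.lift(i).map(_.split(":")(0)).getOrElse("")
      (col(2), arr(3), col(4), col(5), col(6), col(7), col(8))
    }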
Now, in the RDD 'uid' obtained, tuple positions 0 to 3 will always be filled, but positions 4 to 6 may or may not have data. For example, here is the csv file from which I'm reading the data:
    2017-05-09 21:52:42 , 1494391962 , p69465323_serv80i:10:450 , 7 , fb_406423006398063:396560, guest_861067032060185_android:671051, fb_100000829486587:186589, fb_100007900293502:407374, fb_172395756592775:649795
    2017-05-09 21:52:42 , 1494391962 , z67265107_serv77i:4:45 , 2:Re , fb_106996523208498:110066, fb_274049626104849:86632, fb_111857069377742:69348, fb_127277511127344:46246
    2017-05-09 21:52:42 , 1494391962 , v73392772_serv33i:9:1400 , 1:4x , c2eb11fd-99dc-4dee-a75c-bc9bfd2e0ae4iphone:314129, fb_217409795286934:294262
As can be seen, the first record has all 9 columns filled, the second has 8, and the third has only 6.
From the RDD obtained, I have to pair tuple field _1 with each of tuple fields _3 to _7. The pairing should be done only with the filled fields; empty fields among _3 to _7 must not be paired with _1. I was trying to do this using a for loop:
After executing the val uid = read_file.map(...) statement above, I have tuples like:

    (String, String, String, String, String, String, String) = (" p69465323_serv80i"," 7 "," fb_406423006398063"," guest_861067032060185_android"," fb_100000829486587"," fb_100007900293502"," fb_172395756592775")
I do:
    for (x <- 5 to 7) {
      if (arr(x) != null) {
        val pairedRdd = uid.map(x => ((x._1, x._3), (x._1, x._4), (x._1, x._5), (x._1, x._6), (x._1, x._7)))
      }
    }
This will work for the first record in the example data, but not for the second and third. I admit the logic is wrong; it's only meant to convey an idea of what I'm trying to do.
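One idea I'm exploring is to flatMap directly over the split fields, so that each record emits one (uid, id) pair per column that is actually present. This is only a sketch, under the assumption that the optional id columns always start at comma field 4:

    val pairedRdd = read_file.flatMap { line =>
      val arr = line.split(",").map(_.trim) // drop the padding around the commas
      val uid = arr(2).split(":")(0)
      // fields 4 onward are the id columns; short rows simply emit fewer pairs
      arr.drop(4).map(field => (uid, field.split(":")(0)))
    }

For the first record above this would yield five pairs, from (p69465323_serv80i, fb_406423006398063) through (p69465323_serv80i, fb_172395756592775), while the third record would yield only two.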
P.S.: Use of Spark SQL is not allowed.