I'm using Spark 2.1.1 and Scala 2.11.8.
I have to read data from a csv file that has a minimum of 6 and a maximum of 9 columns per row. Once a line is split on commas, columns 0 to 5 will always have data, but data may be present or absent in columns 6 to 8. I separated and stored the required columns in an RDD using:
    val read_file = sc.textFile("Path to input file")
    val uid = read_file.map(line => {
      val arr = line.split(",")
      (arr(2).split(":")(0), arr(3), arr(4).split(":")(0), arr(5).split(":")(0),
       arr(6).split(":")(0), arr(7).split(":")(0), arr(8).split(":")(0))
    })
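Note that arr(6) to arr(8) will throw an ArrayIndexOutOfBoundsException for any row that has fewer than 9 fields. A minimal sketch of a guarded version, using lift so that a missing column yields an empty string (the empty-string default is my assumption; an Option would work as well):

    val uid = read_file.map { line =>
      val arr = line.split(",")
      // lift returns None when the index is out of range, so short rows don't throw
      def col(i: Int): String = arr.lift(i).map(_.split(":")(0)).getOrElse("")
      (col(2), arr(3), col(4), col(5), col(6), col(7), col(8))
    }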
Now, in the RDD 'uid' obtained, tuple positions 0 to 3 will always be filled, but positions 4 to 6 may or may not have data. For example, here is the csv file from which I'm reading the data:
    2017-05-09 21:52:42 , 1494391962 , p69465323_serv80i:10:450 , 7 , fb_406423006398063:396560, guest_861067032060185_android:671051, fb_100000829486587:186589, fb_100007900293502:407374, fb_172395756592775:649795
    2017-05-09 21:52:42 , 1494391962 , z67265107_serv77i:4:45 , 2:Re , fb_106996523208498:110066, fb_274049626104849:86632, fb_111857069377742:69348, fb_127277511127344:46246
    2017-05-09 21:52:42 , 1494391962 , v73392772_serv33i:9:1400 , 1:4x , c2eb11fd-99dc-4dee-a75c-bc9bfd2e0ae4iphone:314129, fb_217409795286934:294262
As can be seen, the first record has all 9 columns filled, the second has 8, and the third has only 6.
From the RDD obtained, I have to pair tuple field _1 with each of tuple fields _3 to _7. The pairing should be done only with the filled fields; empty fields among _3 to _7 must not be paired with _1. I was trying to do this using a for loop:
After executing the val uid = read_file.map(...) statement above, I have tuples like:

    (String, String, String, String, String, String, String) = (" p69465323_serv80i"," 7 "," fb_406423006398063"," guest_861067032060185_android"," fb_100000829486587"," fb_100007900293502"," fb_172395756592775")
I do:
    for (x <- 5 to 7) {
      if (arr(x) != null) {
        val pairedRdd = uid.map(x => ((x._1, x._3), (x._1, x._4), (x._1, x._5), (x._1, x._6), (x._1, x._7)))
      }
    }
This will work for the first record in the example data, but not for the second and third. I admit the logic is wrong; it's only meant to convey an idea of what I'm trying to do.
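One idea I'm exploring is to flatMap directly over the split fields, so that each record emits one (uid, id) pair per column that is actually present. This is only a sketch, under the assumption that the optional id columns always start at comma field 4:

    val pairedRdd = read_file.flatMap { line =>
      val arr = line.split(",").map(_.trim) // drop the padding around the commas
      val uid = arr(2).split(":")(0)
      // fields 4 onward are the id columns; short rows simply emit fewer pairs
      arr.drop(4).map(field => (uid, field.split(":")(0)))
    }

For the first record above this would yield five pairs, from (p69465323_serv80i, fb_406423006398063) through (p69465323_serv80i, fb_172395756592775), while the third record would yield only two.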
P.S.: Use of Spark SQL is not allowed.