How to skip more than one line of header in an RDD in Spark

攒了一身酷 2021-01-14 16:15

The data in my first RDD looks like this:

1253
545553
12344896
1 2 1
1 43 2
1 46 1
1 53 2

Now the first 3 integers are some counters that I need to broadcast, and I want to skip those first three lines when processing the rest of the RDD.
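
For illustration, something like this PySpark sketch is what I have in mind (the file path and the broadcast call are just assumptions for the example):

from pyspark import SparkContext

sc = SparkContext.getOrCreate()
rdd = sc.textFile("data.txt")                    # placeholder path to the data shown above

counters = [int(x) for x in rdd.take(3)]         # [1253, 545553, 12344896]
counters_bc = sc.broadcast(counters)             # broadcast the three counter values

# How do I now skip those first 3 lines so that only "1 2 1", "1 43 2", ... remain?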

3 Answers
  •  Happy的楠姐 2021-01-14 16:44

    In my case I have a CSV file like the one below:

    ----- HEADER START -----
    We love to generate headers
    #who needs comment char?
    ----- HEADER END -----
    
    colName1,colName2,...,colNameN
    val__1.1,val__1.2,...,val__1.N
    

    It took me a day to figure out the following:

    import spark.implicits._ // needed for createDataset on an RDD[String]

    val rdd = spark.read.textFile(pathToFile).rdd
      .zipWithIndex()                                                 // tuples of (line, 0-based index)
      .filter { case (line, index) => index >= numberOfLinesToSkip }  // drop the free-form header block
      .map { case (line, index) => line }                             // get rid of the index
    val ds = spark.createDataset(rdd) // convert the RDD back to a Dataset[String]
    val df = spark.read.option("inferSchema", "true").option("header", "true").csv(ds) // parse the remaining lines as CSV; colName1,...,colNameN is the real header
    

    Sorry, the code is in Scala; however, it can easily be converted to Python.
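
    For reference, a rough PySpark equivalent might look like the sketch below (same placeholder names for the path and the number of header lines; it relies on the fact that PySpark's DataFrameReader.csv also accepts an RDD of strings):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    number_of_lines_to_skip = 5          # placeholder: size of the free-form header block
    path_to_file = "file.csv"            # placeholder path

    lines = (spark.sparkContext.textFile(path_to_file)
                  .zipWithIndex()                                        # (line, 0-based index)
                  .filter(lambda pair: pair[1] >= number_of_lines_to_skip)
                  .map(lambda pair: pair[0]))                            # drop the index

    df = (spark.read
               .option("inferSchema", "true")
               .option("header", "true")                                 # colName1,... row is the real header
               .csv(lines))                                              # csv() also accepts an RDD of strings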
