How to skip more than one line of header in an RDD in Spark

攒了一身酷 2021-01-14 16:15

The data in my first RDD looks like this:

1253
545553
12344896
1 2 1
1 43 2
1 46 1
1 53 2

Now the first 3 integers are some counters that I need to broadcast, and I want to skip those first three lines when processing the rest of the RDD.
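
For illustration, something like this PySpark sketch is what I have in mind (the file path and the broadcast call are just assumptions for the example):

from pyspark import SparkContext

sc = SparkContext.getOrCreate()
rdd = sc.textFile("data.txt")                    # placeholder path to the data shown above

counters = [int(x) for x in rdd.take(3)]         # [1253, 545553, 12344896]
counters_bc = sc.broadcast(counters)             # broadcast the three counter values

# How do I now skip those first 3 lines so that only "1 2 1", "1 43 2", ... remain?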

3 Answers
  •  Happy的楠姐 2021-01-14 16:44

    In my case I have a CSV file like the one below:

    ----- HEADER START -----
    We love to generate headers
    #who needs comment char?
    ----- HEADER END -----
    
    colName1,colName2,...,colNameN
    val__1.1,val__1.2,...,val__1.N
    

    It took me a day to figure out the following:

    import spark.implicits._ // needed for createDataset on an RDD[String]

    val rdd = spark.read.textFile(pathToFile).rdd
      .zipWithIndex()                                                 // tuples of (line, 0-based index)
      .filter { case (line, index) => index >= numberOfLinesToSkip }  // drop the free-form header block
      .map { case (line, index) => line }                             // get rid of the index
    val ds = spark.createDataset(rdd) // convert the RDD back to a Dataset[String]
    val df = spark.read.option("inferSchema", "true").option("header", "true").csv(ds) // parse the remaining lines as CSV; colName1,...,colNameN is the real header
    

    Sorry, the code is in Scala; however, it can easily be converted to Python.
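
    For reference, a rough PySpark equivalent might look like the sketch below (same placeholder names for the path and the number of header lines; it relies on the fact that PySpark's DataFrameReader.csv also accepts an RDD of strings):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    number_of_lines_to_skip = 5          # placeholder: size of the free-form header block
    path_to_file = "file.csv"            # placeholder path

    lines = (spark.sparkContext.textFile(path_to_file)
                  .zipWithIndex()                                        # (line, 0-based index)
                  .filter(lambda pair: pair[1] >= number_of_lines_to_skip)
                  .map(lambda pair: pair[0]))                            # drop the index

    df = (spark.read
               .option("inferSchema", "true")
               .option("header", "true")                                 # colName1,... row is the real header
               .csv(lines))                                              # csv() also accepts an RDD of strings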
