How to remove header and footer from a DataFrame?

小蘑菇 2021-01-24 07:23

I am reading a text (not CSV) file that has a header, content, and a footer, using:

spark.read.format("text").option("delimiter","|")...load(file)
4 Answers
  •  渐次进展
    2021-01-24 08:16

    In addition to the above answer, the solution below works well for files with multiple header and footer lines:

    val data_delimiter = "|"
    val skipHeaderLines = 5
    val skipFooterLines = 3
    
    //-- Read file into Dataframe and convert to RDD
    val dataframe = spark.read.option("wholeFile", true).option("delimiter",data_delimiter).csv(s"hdfs://$in_data_file")
    
    val rdd = dataframe.rdd
    
    //-- Total row count, needed to locate where the footer starts
    val cnt = rdd.count()
    
    //-- RDD without header and footer: keep rows past the header and before the footer
    val dfRdd = rdd.zipWithIndex().filter({case (line, index) => index >= skipHeaderLines && index < (cnt - skipFooterLines)}).map({case (line, index) => line})
    
    //-- Dataframe without header and footer
    val df = spark.createDataFrame(dfRdd, dataframe.schema)
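
For the plain-text read in the question, the same index-based filtering could be sketched roughly as below. This is only an illustration under some assumptions: the object `StripHeaderFooter`, the helper `readBody`, and its parameter names are hypothetical, the input is assumed to be a single file so that `zipWithIndex` positions follow file order, and the output is a single array column named `fields` rather than a typed schema.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, split}

// Hypothetical helper object, not part of Spark or of the original answer.
object StripHeaderFooter {

  // Drops the first `headerLines` and the last `footerLines` lines of a text
  // file, then splits every remaining line on `delimiter`.
  def readBody(spark: SparkSession, path: String, delimiter: String,
               headerLines: Int, footerLines: Int): DataFrame = {
    import spark.implicits._

    // One String per line; zipWithIndex attaches a 0-based position to each.
    // Assumption: a single input file, so positions follow file order.
    val indexed = spark.read.textFile(path).rdd.zipWithIndex().cache()
    val total = indexed.count()

    // Keep only the lines between the header block and the footer block.
    val body = indexed
      .filter { case (_, idx) => idx >= headerLines && idx < total - footerLines }
      .map { case (line, _) => line }

    // split() expects a regex, and "|" is a regex metacharacter, so quote it.
    val pattern = java.util.regex.Pattern.quote(delimiter)
    body.toDF("value").select(split(col("value"), pattern).as("fields"))
  }
}
```

A call might look like `StripHeaderFooter.readBody(spark, file, "|", 1, 1)`, where `file` is the path from the question and the line counts are placeholders to adjust for the actual file layout.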
    

    Hope this is helpful.
