How to remove the header and footer from a DataFrame?

小蘑菇 2021-01-24 07:23

I am reading a text (not CSV) file that has a header, content, and a footer, using:

spark.read.format("text").option("delimiter","|")...load(file)
4 answers
  •  北荒 (OP)
    2021-01-24 08:24

    Sample data:

    col1|col2|col3
    100|hello|asdf
    300|hi|abc
    200|bye|xyz
    800|ciao|qwerty
    This is the footer line
    

    Processing logic:

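    from pyspark.sql import SparkSession

    # a SparkSession is needed for sc and for rdd.toDF()
    spark = SparkSession.builder.getOrCreate()
    sc = spark.sparkContext

    # load the text file as an RDD of lines
    txt = sc.textFile("path_to_above_sample_data_text_file.txt")

    # remove the header: take the first line and drop every line equal to it
    header = txt.first()
    txt = txt.filter(lambda line: line != header)

    # remove the footer: split on "|" and keep only rows that contain the delimiter
    txt = txt.map(lambda line: line.split("|")) \
        .filter(lambda fields: len(fields) > 1)

    # convert to a DataFrame, using the header fields as column names
    df = txt.toDF(header.split("|"))
    df.show()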
    

    Output is:

    +----+-----+------+
    |col1| col2|  col3|
    +----+-----+------+
    | 100|hello|  asdf|
    | 300|   hi|   abc|
    | 200|  bye|   xyz|
    | 800| ciao|qwerty|
    +----+-----+------+
    


    Hope this helps!
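
One caveat with the logic above: the `len(...) > 1` check only removes a footer that contains no `|` at all. If the footer might contain the delimiter, a slightly stricter variant is to keep only the lines that have the same number of fields as the header. The following is a minimal sketch under that assumption; `parse_row` and `remove_header_and_footer` are hypothetical helper names, not part of PySpark, and `spark` is assumed to be an existing SparkSession.

```python
def parse_row(line, n_fields, delimiter="|"):
    """Split a line; return its fields, or None if the field count is not n_fields."""
    fields = line.split(delimiter)
    return fields if len(fields) == n_fields else None


def remove_header_and_footer(spark, path, delimiter="|"):
    """Load a delimited text file into a DataFrame without header/footer lines.

    Assumes the first line of the file is the header and that every data row
    has exactly as many fields as the header.
    """
    lines = spark.sparkContext.textFile(path)
    header = lines.first()
    columns = header.split(delimiter)
    n_fields = len(columns)

    rows = (
        lines.filter(lambda line: line != header)            # drop the header
        .map(lambda line: parse_row(line, n_fields, delimiter))
        .filter(lambda fields: fields is not None)            # drop footer / malformed lines
    )
    # All columns come out as strings; cast them afterwards if needed.
    return rows.toDF(columns)
```

It would be called as `df = remove_header_and_footer(spark, "path_to_above_sample_data_text_file.txt")`. Note that this also drops any data row with a missing or extra field, which may or may not be what you want.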
