How to Remove header and footer from Dataframe?

Posted by 荒凉一梦 on 2019-12-02 10:29:49

Question


I am reading a text (not CSV) file that has a header, content, and a footer using

spark.read.format("text").option("delimiter","|")...load(file)

I can access the header with df.first(). Is there something close to df.last() or df.reverse().first()?


Answer 1:


Sample data:

col1|col2|col3
100|hello|asdf
300|hi|abc
200|bye|xyz
800|ciao|qwerty
This is the footer line

Processing logic:

#load text file
txt = sc.textFile("path_to_above_sample_data_text_file.txt")

#remove header
header = txt.first()
txt = txt.filter(lambda line: line != header)

#remove footer
txt = txt.map(lambda line: line.split("|"))\
    .filter(lambda line: len(line)>1)

#convert to dataframe
df=txt.toDF(header.split("|"))
df.show()

Output is:

+----+-----+------+
|col1| col2|  col3|
+----+-----+------+
| 100|hello|  asdf|
| 300|   hi|   abc|
| 200|  bye|   xyz|
| 800| ciao|qwerty|
+----+-----+------+


Hope this helps!




Answer 2:


Assuming the file is not too large, we can use collect() to bring the DataFrame to the driver as a list of rows and then access the last element as follows:

footer = df.collect()[df.count() - 1]

Avoid using collect() on large datasets.

Alternatively, we can use take() to cut off the last row. Note that take() returns a list of Row objects, not a DataFrame.

rows = df.take(df.count() - 1)
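
On the df.last() part of the question: a minimal sketch, assuming Spark 3.0 or later, where DataFrame.tail(n) returns the last n rows as a list on the driver. The helper name first_and_last is made up for illustration and is not part of Spark.

```python
def first_and_last(df):
    """Return (first_row, last_row) for a DataFrame.

    Hypothetical helper for illustration. Assumes Spark 3.0+, where
    DataFrame.tail(n) returns the last n rows as a list on the driver.
    """
    first_row = df.first()   # header row, as in the question
    last_rows = df.tail(1)   # list holding at most one Row
    last_row = last_rows[0] if last_rows else None
    return first_row, last_row
```

Unlike collect(), tail(1) only returns a single row to the driver, so it should be the safer choice when the file is large.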



Answer 3:


In addition to the answers above, the solution below works well for files with multiple header and footer lines:

val data_delimiter = "|"
val skipHeaderLines = 5
val skipFooterLines = 3

//-- Read file into Dataframe and convert to RDD
val dataframe = spark.read.option("wholeFile", true).option("delimiter",data_delimiter).csv(s"hdfs://$in_data_file")

val rdd = dataframe.rdd

//-- Total number of rows, needed to locate the footer
val cnt = rdd.count()

//-- RDD without header and footer
val dfRdd = rdd.zipWithIndex().filter({case (line, index) => index < (cnt - skipFooterLines) && index > (skipHeaderLines - 1)}).map({case (line, index) => line})

//-- Dataframe without header and footer
val df = spark.createDataFrame(dfRdd, dataframe.schema)

Hope this is helpful.
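
For readers working in PySpark, as in Answer 1, the same index-based idea can be sketched roughly as follows. The helper names keep_index and strip_header_footer are hypothetical, and the sketch assumes the input is an RDD such as the one returned by sc.textFile.

```python
def keep_index(index, total, skip_header, skip_footer):
    """True if a zero-based line index lies between header and footer."""
    return skip_header <= index < total - skip_footer


def strip_header_footer(rdd, skip_header, skip_footer):
    """Drop the first skip_header and last skip_footer elements of an RDD.

    Hypothetical helper for illustration; works on any RDD,
    for example the result of sc.textFile(path).
    """
    total = rdd.count()  # one extra pass over the data to find the end
    return (
        rdd.zipWithIndex()  # pairs of (element, zero-based index)
        .filter(lambda pair: keep_index(pair[1], total, skip_header, skip_footer))
        .map(lambda pair: pair[0])
    )
```

With the sample data from Answer 1, strip_header_footer(sc.textFile(path), 1, 1) should leave only the four data lines, which can then be split on "|" and passed to toDF as before.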



Source: https://stackoverflow.com/questions/47126936/how-to-remove-header-and-footer-from-dataframe
