Reading a text file with multiple headers in Spark


Question


I have a text file with multiple headers, where the "TEMP" column holds the average temperature for the day, followed by the number of recordings. How can I read this text file properly to create a DataFrame?

STN--- WBAN   YEARMODA    TEMP     
010010 99999  20060101    33.5 23
010010 99999  20060102    35.3 23
010010 99999  20060103    34.4 24
STN--- WBAN   YEARMODA    TEMP     
010010 99999  20060120    35.2 22
010010 99999  20060121    32.2 21
010010 99999  20060122    33.0 22

Answer 1:


  1. Read the text file as a plain text file into an RDD
  2. Split each line on the separator; here it is whitespace
  3. Take the first line of the RDD as the header
  4. Filter out every line that equals the header, which drops the repeated headers
  5. Convert the remaining RDD to a DataFrame using .toDF(col_names)

Like this:

rdd = sc.textFile("path/to/file.txt").map(lambda x: x.split())  # steps 1 & 2: split on any run of whitespace
headers = rdd.first()                                           # step 3: the header row
rdd2 = rdd.filter(lambda x: x != headers)                       # step 4: drop every repeated header row
df = rdd2.toDF(headers + ["CNT"])                               # step 5: the count after TEMP has no header name, so add a placeholder
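
Everything read this way comes back as strings. If typed columns are needed, a minimal follow-up sketch (the "CNT" placeholder from the snippet above and the illustrative "DATE" column are assumptions, not names from the original data):

from pyspark.sql.functions import col, to_date

df_typed = (df
    .withColumn("TEMP", col("TEMP").cast("double"))             # daily average temperature
    .withColumn("CNT", col("CNT").cast("int"))                  # number of recordings, placeholder column name
    .withColumn("DATE", to_date(col("YEARMODA"), "yyyyMMdd")))  # parse the YYYYMMDD field into a date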



Answer 2:


You can try this out. I have tried it in the console.

val x = sc.textFile("hdfs path of text file")
val header = x.first()                                   // e.g. "STN--- WBAN   YEARMODA    TEMP"
val colNames = header.split("\\s+") :+ "CNT"             // the count column after TEMP has no header name
val y = x.filter(line => !line.contains("STN"))          // this removes every repeated header line
val df = y.map(_.split("\\s+")).map(a => (a(0), a(1), a(2), a(3), a(4))).toDF(colNames: _*)

Hope this works for you.
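
One caveat on the filter above: contains("STN") is a substring test, so it assumes "STN" never appears inside a data row. A minimal sketch of the same idea, kept in PySpark to match the first answer, that compares against the exact header line instead (the path and the "CNT" column name are placeholders):

lines = sc.textFile("path/to/file.txt")                                   # placeholder path
header = lines.first()                                                    # the literal header line
data = lines.filter(lambda line: line.strip() != header.strip())          # keep only non-header rows
df = data.map(lambda line: line.split()).toDF(header.split() + ["CNT"])   # "CNT" names the unlabelled count field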



Source: https://stackoverflow.com/questions/59066489/reading-a-text-file-with-multiple-headers-in-spark
