Question
I have a text file with multiple header lines, where the "TEMP" column holds the average temperature for the day, followed by the number of recordings. How can I read this text file properly to create a DataFrame?
STN--- WBAN YEARMODA TEMP
010010 99999 20060101 33.5 23
010010 99999 20060102 35.3 23
010010 99999 20060103 34.4 24
STN--- WBAN YEARMODA TEMP
010010 99999 20060120 35.2 22
010010 99999 20060121 32.2 21
010010 99999 20060122 33.0 22
Answer 1:
- You can read the text file as a plain text file into an RDD
- The fields in the text file are separated by a delimiter; let's assume it's a space
- Take the first line as the header
- Filter out every line that is equal to the header
- Then convert the RDD to a DataFrame using .toDF(col_names)
Like this:
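rdd = sc.textFile("path/to/file.txt").map(lambda x: x.split(" "))  # read the file and split each line on spaces
headers = rdd.first()  # the first line is the header
rdd2 = rdd.filter(lambda x: x != headers)  # drop every repeated header line
# The header names four columns but each row has five fields,
# so add a name for the recording count ("COUNT" is an assumed name)
df = rdd2.toDF(headers + ["COUNT"])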
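The filter-and-split logic can be checked without a Spark cluster. The following is a minimal sketch in plain Python over a few sample lines from the question. The helper names `is_header` and `parse_row` are hypothetical, and `COUNT` is an assumed name for the fifth field, which the file's header does not label.

```python
# Sample lines copied from the question; in Spark these would come from sc.textFile(...)
lines = [
    "STN--- WBAN YEARMODA TEMP",
    "010010 99999 20060101 33.5 23",
    "010010 99999 20060102 35.3 23",
    "STN--- WBAN YEARMODA TEMP",
    "010010 99999 20060120 35.2 22",
]

# Assumed column names: the header has no label for the fifth field,
# so "COUNT" is a hypothetical name for the number of recordings.
COLUMNS = ["STN", "WBAN", "YEARMODA", "TEMP", "COUNT"]


def is_header(line):
    """Return True for any (repeated) header line."""
    return line.startswith("STN")


def parse_row(line):
    """Split a data line on whitespace and cast the numeric fields."""
    stn, wban, yearmoda, temp, count = line.split()
    return (stn, wban, yearmoda, float(temp), int(count))


# Keep only the data lines and parse each one into a typed tuple
rows = [parse_row(line) for line in lines if not is_header(line)]
```

Under these assumptions, the same two helpers could be passed to an RDD, for example `sc.textFile(path).filter(lambda l: not is_header(l)).map(parse_row).toDF(COLUMNS)`, which should give a DataFrame with a numeric TEMP column instead of strings.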
Answer 2:
You can try this out. I have tried it in the console.
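// In spark-shell, spark.implicits._ is already imported; elsewhere add: import spark.implicits._
val x = sc.textFile("hdfs path of text file")
// Drop every header line, then split each remaining line on whitespace
val y = x.filter(line => !line.startsWith("STN")).map(_.trim.split("\\s+"))
// toDF needs one name per column; the header has no name for the fifth field, so "COUNT" is an assumed name
val df = y.map(a => (a(0), a(1), a(2), a(3), a(4))).toDF("STN", "WBAN", "YEARMODA", "TEMP", "COUNT")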
Hope this works for you.
Source: https://stackoverflow.com/questions/59066489/reading-a-text-file-with-multiple-headers-in-spark