Get HDFS file path in PySpark for files in sequence file format

Submitted by 我的未来我决定 on 2021-01-24 07:09:23

Question


My data on HDFS is in Sequence file format. I am using PySpark (Spark 1.6) and trying to achieve 2 things:

  1. The data path contains a timestamp in yyyy/mm/dd/hh format that I would like to bring into the data itself. I tried SparkContext.wholeTextFiles, but it does not appear to support the Sequence file format.

  2. How do I deal with the point above if I want to crunch data for a whole day and bring the date into the data? In that case I would be loading data with a yyyy/mm/dd/* pattern.

Appreciate any pointers.


Answer 1:


If the stored types are compatible with SQL types and you are using Spark 2.0, it is quite simple. Import input_file_name:

from pyspark.sql.functions import input_file_name 

Read file and convert to a DataFrame:

df = sc.sequenceFile("/tmp/foo/").toDF()

Add file name:

df.withColumn("input", input_file_name())
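Once the file path is available as a column, the yyyy/mm/dd/hh timestamp can be parsed out of it (for example with a UDF, or with regexp_extract on the column). The extraction logic itself, sketched in plain Python with a hypothetical directory layout matching the question:

```python
import re
from datetime import datetime

def path_timestamp(path):
    """Extract a yyyy/mm/dd/hh timestamp from a file path.

    The directory layout (.../2016/10/28/07/part-00000) is an assumption
    based on the question; adjust the pattern to the real layout.
    """
    m = re.search(r"/(\d{4})/(\d{2})/(\d{2})/(\d{2})/", path)
    if m is None:
        return None
    year, month, day, hour = map(int, m.groups())
    return datetime(year, month, day, hour)

# Hypothetical path for illustration:
print(path_timestamp("hdfs://nn/data/2016/10/28/07/part-00000"))
```

The same regex can be applied to the input column added above, turning the path into a proper timestamp column.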

If this solution is not applicable in your case, the universal approach is to list the files directly (for HDFS you can use the hdfs3 library):

files = ...
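For the one-day case, listing boils down to building a yyyy/mm/dd/* glob for the day and expanding it against HDFS. A minimal sketch (the base path, namenode host, and port are placeholders, and the hdfs3 call is commented out because it needs a reachable cluster):

```python
from datetime import date

def day_glob(base, d):
    """Build the yyyy/mm/dd/* glob pattern for a single day.

    The layout mirrors the path format described in the question;
    `base` is a hypothetical root directory.
    """
    return "{}/{:04d}/{:02d}/{:02d}/*".format(base, d.year, d.month, d.day)

# Expanding the glob with hdfs3 (requires a reachable namenode):
# from hdfs3 import HDFileSystem
# hdfs = HDFileSystem(host="namenode", port=8020)
# files = hdfs.glob(day_glob("/data/events", date(2016, 10, 28)))

print(day_glob("/data/events", date(2016, 10, 28)))  # /data/events/2016/10/28/*
```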

read them one by one, adding the file name:

def read(f):
    """Just to avoid problems with late binding"""
    return sc.sequenceFile(f).map(lambda x: (f, x))

rdds = [read(f) for f in files]

and union:

sc.union(rdds)
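The tag-and-union pattern above is independent of Spark, so it can be illustrated without a cluster. A plain-Python sketch with hypothetical in-memory records standing in for sc.sequenceFile:

```python
def read(f, data):
    """Tag every record with its source file.

    Taking `f` as a parameter (instead of closing over a loop variable
    in a lambda) avoids the late-binding problem the answer mentions.
    `data` is a hypothetical stand-in for sc.sequenceFile(f).
    """
    return [(f, x) for x in data[f]]

# Hypothetical records keyed by file path:
data = {"2016/10/28/part-0": [1, 2], "2016/10/28/part-1": [3]}

# The union step, mirroring sc.union(rdds):
union = [rec for f in data for rec in read(f, data)]
print(union)
```

Each element of `union` is a (file, record) pair, so the date can then be parsed out of the first component as in the question.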


Source: https://stackoverflow.com/questions/40136944/get-hdfs-file-path-in-pyspark-for-files-in-sequence-file-format
