Reading parquet files from multiple directories in Pyspark

北恋 2020-12-03 15:21

I need to read parquet files from multiple paths that are not parent or child directories.

for example,

dir1 ---
       |
       ------- dir1_1
       |
       ------- dir1_2
dir2 ---
       |
       ------- dir2_1
       |
       ------- dir2_2
5 Answers
  •  天涯浪人
    2020-12-03 15:41

    A little late, but I found this while searching and it may help someone else...

    You might also try unpacking the argument list to spark.read.parquet():

    # unpack the list so each path becomes a separate positional argument
    paths = ['foo', 'bar']
    df = spark.read.parquet(*paths)
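
    If you want to try this end to end, here is a minimal runnable sketch (the local SparkSession setup and the directory names are illustrative assumptions, not part of the original answer):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.master("local[*]").getOrCreate()

    # two independent parquet locations; neither is a parent or child of
    # the other (e.g. the dir1_1 and dir1_2 from the question)
    paths = ['/data/dir1_1', '/data/dir1_2']

    # *paths unpacks the list into separate positional arguments,
    # equivalent to spark.read.parquet('/data/dir1_1', '/data/dir1_2')
    df = spark.read.parquet(*paths)
    df.show()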
    

    This is convenient if you want to pass a few glob patterns in the path argument:

    basePath = 's3://bucket/'
    paths = ['s3://bucket/partition_value1=*/partition_value2=2017-04-*',
             's3://bucket/partition_value1=*/partition_value2=2017-05-*']
    # basePath marks the table root, so partition columns are still
    # inferred from the directories the globs match
    df = spark.read.option("basePath", basePath).parquet(*paths)
    

    This is cool because you don't need to list all the files under the basePath, and you still get partition inference.
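
    To see that partition inference in action, here is a self-contained sketch (the /tmp/events path and the toy data are assumptions for illustration):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.master("local[*]").getOrCreate()

    # write a small dataset partitioned by 'day' to a hypothetical local path
    src = spark.createDataFrame(
        [(1, '2017-04-01'), (2, '2017-04-02'), (3, '2017-05-01')],
        ['id', 'day'],
    )
    src.write.partitionBy('day').mode('overwrite').parquet('/tmp/events')

    # glob only the April partitions, but keep basePath at the table root
    # so Spark still infers 'day' as a partition column
    df = (spark.read
          .option('basePath', '/tmp/events')
          .parquet('/tmp/events/day=2017-04-*'))

    df.printSchema()  # schema includes 'day', recovered from directory names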
