Reading parquet files from multiple directories in Pyspark


A little late, but I found this while searching and it may help someone else...

You might also try unpacking the argument list to spark.read.parquet(); Python's * operator expands the list so each path is passed as a separate positional argument:

paths = ['foo', 'bar']
df = spark.read.parquet(*paths)

This is convenient if you want to pass a few globs (wildcard patterns) in the path argument:

basePath = 's3://bucket/'
paths = ['s3://bucket/partition_value1=*/partition_value2=2017-04-*',
         's3://bucket/partition_value1=*/partition_value2=2017-05-*']
df = spark.read.option("basePath", basePath).parquet(*paths)

This is convenient because you don't need to list every file under the basePath, and you still get partition inference: Spark derives the partition columns (here partition_value1 and partition_value2) from the directory names.
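
A quick way to sanity-check the partition inference (assuming the layout above) is to look at the schema after the read; the columns derived from the directory names should appear alongside the columns stored in the files:

# The columns parsed out of the directory names show up in the schema
# next to the columns stored in the parquet files themselves
# (types are inferred, e.g. string or int).
df.printSchema()
# root
#  |-- ...data columns...
#  |-- partition_value1: ...
#  |-- partition_value2: ...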

Both the parquetFile method of SQLContext (deprecated since Spark 1.4) and the parquet method of DataFrameReader accept multiple paths, so either of these works:

df = sqlContext.parquetFile('/dir1/dir1_2', '/dir2/dir2_1')

or

df = sqlContext.read.parquet('/dir1/dir1_2', '/dir2/dir2_1')
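
On Spark 2.x and later, where SparkSession replaces SQLContext, the same thing reads as (assuming a session named spark):

# DataFrameReader.parquet takes a variable number of paths, so multiple
# directories can be passed directly.
df = spark.read.parquet('/dir1/dir1_2', '/dir2/dir2_1')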

Building on John Conley's answer and embellishing it a bit, here is the full code (run in Jupyter with PySpark), since I found his answer extremely useful.

from hdfs import InsecureClient

# Connect to the HDFS NameNode via WebHDFS (default web UI port 50070).
client = InsecureClient('http://localhost:50070')

import posixpath as psp

# Walk the directory tree (like os.walk) and build a fully qualified
# hdfs:// URI for every file found under the starting path.
fpaths = [
    psp.join("hdfs://localhost:9000" + dpath, fname)
    for dpath, _, fnames in client.walk('/eta/myHdfsPath')
    for fname in fnames
]
# At this point fpaths contains all HDFS files under /eta/myHdfsPath

# Unpack the list so each file path becomes a separate argument.
parquetFile = sqlContext.read.parquet(*fpaths)


import pandas  # optional: toPandas() needs pandas installed on the driver
pdf = parquetFile.toPandas()
# Display the contents nicely formatted (this collects the whole
# DataFrame to the driver, so keep it to small results).
pdf
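
As a side note, if you are on Spark 3.0 or later, a sketch that avoids walking HDFS by hand is the recursiveFileLookup reader option (using the same hypothetical path as above); note that it disables partition inference:

# Spark 3.0+ only: recurse into subdirectories without listing the files
# manually. This turns off partition inference.
parquetFile = spark.read \
    .option("recursiveFileLookup", "true") \
    .parquet("hdfs://localhost:9000/eta/myHdfsPath")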