I need to read Parquet files from multiple paths that are not parent or child directories of one another.
For example:
dir1 ---
|
------- dir1_1
Building on John Conley's answer, which I found extremely useful, here is the full code (used in Jupyter with PySpark).
from hdfs import InsecureClient
import posixpath as psp
import pandas

# Connect to the NameNode's WebHDFS endpoint.
client = InsecureClient('http://localhost:50070')

# Walk the HDFS tree and build a full hdfs:// URI for every file found.
fpaths = [
    psp.join("hdfs://localhost:9000" + dpath, fname)
    for dpath, _, fnames in client.walk('/eta/myHdfsPath')
    for fname in fnames
]
# At this point fpaths contains all hdfs files

# read.parquet accepts multiple paths, so unpack the list into arguments.
parquetFile = sqlContext.read.parquet(*fpaths)

pdf = parquetFile.toPandas()
# Display the contents nicely formatted.
pdf
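
One caveat with walking the whole tree: the list may pick up files that are not Parquet data (for example `_SUCCESS` markers). Below is a minimal sketch of a filtering variant, assuming the walk yields `(dirpath, dirnames, filenames)` tuples as `client.walk` does. The helper name `collect_parquet_paths`, the `.parquet` suffix, and the sample listing are hypothetical and only for illustration.

```python
import posixpath as psp


def collect_parquet_paths(walk_results, prefix="hdfs://localhost:9000", suffix=".parquet"):
    """Return full URIs for every file whose name ends with `suffix`.

    `walk_results` is any iterable of (dirpath, dirnames, filenames) tuples,
    e.g. the result of client.walk('/some/hdfs/path').
    """
    return [
        psp.join(prefix + dpath, fname)
        for dpath, _, fnames in walk_results
        for fname in fnames
        if fname.endswith(suffix)  # skip markers such as _SUCCESS
    ]


# Hypothetical listing standing in for client.walk(...), so the sketch runs without HDFS.
sample_walk = [
    ("/data/dir1/dir1_1", [], ["part-0000.parquet", "_SUCCESS"]),
    ("/data/dir2", [], ["part-0001.parquet"]),
]
fpaths = collect_parquet_paths(sample_walk)
```

The resulting `fpaths` can be passed to `sqlContext.read.parquet(*fpaths)` exactly as in the code above. If your files do not use a `.parquet` extension, adjust `suffix` accordingly.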