Reading Parquet files from multiple directories in PySpark

北恋 2020-12-03 15:21

I need to read Parquet files from multiple paths that are not parent or child directories of one another.

For example:

dir1 ---
       |
       ------- dir1_1
            


        
5 answers
  •  抹茶落季
    2020-12-03 15:54

    This builds on John Conley's answer, which I found extremely useful, and provides the full code (as used in a Jupyter PySpark notebook).

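    import posixpath as psp

    import pandas  # required by toPandas()
    from hdfs import InsecureClient

    # WebHDFS endpoint of the NameNode
    client = InsecureClient('http://localhost:50070')

    # Walk the HDFS tree and build a fully qualified URI for every file found.
    fpaths = [
      psp.join("hdfs://localhost:9000" + dpath, fname)
      for dpath, _, fnames in client.walk('/eta/myHdfsPath')
      for fname in fnames
    ]
    # At this point fpaths contains every file under /eta/myHdfsPath.

    # read.parquet accepts any number of paths, so unpack the list.
    # (sqlContext is assumed to be predefined, as in the PySpark shell.)
    parquetFile = sqlContext.read.parquet(*fpaths)

    # Convert to a pandas DataFrame and display it nicely formatted in Jupyter.
    pdf = parquetFile.toPandas()
    pdf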
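
The path-building step can also be pulled out into a small helper that is easy to check without a cluster. The sketch below is only an illustration: the function name `collect_file_uris`, the `.parquet` suffix filter, and the sample directory names are assumptions of mine, not part of the answer above. It takes any iterable of `(dirpath, dirnames, filenames)` triples, the shape that `client.walk` yields, and returns fully qualified URIs.

```python
import posixpath as psp


def collect_file_uris(walk_entries, base_uri, suffix=".parquet"):
    """Turn (dirpath, dirnames, filenames) triples into fully qualified URIs.

    walk_entries: any iterable shaped like the output of os.walk or
    hdfs.InsecureClient.walk. Only file names ending in `suffix` are kept
    (an assumption; drop the filter if your files have no extension).
    """
    return [
        base_uri + psp.join(dpath, fname)
        for dpath, _, fnames in walk_entries
        for fname in fnames
        if fname.endswith(suffix)
    ]


# Hypothetical listing, shaped like the triples a walk would yield.
sample_walk = [
    ("/data/dir1/dir1_1", [], ["part-0.parquet", "_SUCCESS"]),
    ("/data/dir2", [], ["part-0.parquet"]),
]
uris = collect_file_uris(sample_walk, "hdfs://localhost:9000")
print(uris)
```

The resulting list can be passed on in the same way as `fpaths` above, for example `sqlContext.read.parquet(*uris)`.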
    
