Read multiple Parquet files at the same time in Spark


See this issue on the Spark JIRA; reading multiple Parquet files in a single call is supported from 1.4 onwards.

Without upgrading to 1.4, you could either point at the top-level directory:

sqlContext.parquetFile('/path/to/dir/')

which will load all files in the directory. Alternatively, you could use the HDFS API to find the files you want, and pass them to parquetFile (it accepts varargs).
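One way to reach the HDFS API from PySpark is through the JVM gateway; a minimal sketch, assuming an existing SQLContext named sqlContext (note that _sc, _jsc, and _jvm are private PySpark internals, and the part_ naming is only illustrative):

hadoop = sqlContext._sc._jvm.org.apache.hadoop
conf = sqlContext._sc._jsc.hadoopConfiguration()
fs = hadoop.fs.FileSystem.get(conf)

# list the directory and keep only the files we want
statuses = fs.listStatus(hadoop.fs.Path('/path/to/dir/'))
paths = [status.getPath().toString() for status in statuses
         if status.getPath().getName().startswith('part_')]

df = sqlContext.parquetFile(*paths)  # varargs: one argument per file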

FYI, you can also:

  • read a subset of the parquet files using the wildcard symbol * (more glob patterns are sketched after this list)

    sqlContext.read.parquet("/path/to/dir/part_*.gz")

  • read multiple parquet files by explicitly specifying them

    sqlContext.read.parquet("/path/to/dir/part_1.gz", "/path/to/dir/part_2.gz")
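These path strings go through Hadoop's glob resolution, so patterns beyond * also work; a small sketch with a made-up directory layout:

# '?' matches a single character, '{a,b}' matches either alternative
df = sqlContext.read.parquet("/path/to/dir/part_{1,2}.gz")
df = sqlContext.read.parquet("/path/to/dir/part_?.gz")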

user6602391
# two glob patterns over date/hour partitions (hdfs_path is defined elsewhere)
InputPath = [hdfs_path + "parquets/date=18-07-23/hour=2*/*.parquet",
             hdfs_path + "parquets/date=18-07-24/hour=0*/*.parquet"]

df = spark.read.parquet(*InputPath)
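The * in spark.read.parquet(*InputPath) is Python argument unpacking: each glob in the list is passed as its own positional argument, and each glob can expand to many partition files.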
Idrees

For reading: give the files' shared path with a '*' wildcard.

Example

pqtDF = sqlContext.read.parquet("Path_*.parquet")