Reading DataFrame from partitioned parquet file

Backend · open · 3 answers · 1094 views

Asked by 离开以前 on 2020-12-04 18:10

How do I read a partitioned parquet file into a dataframe with a condition?

This works fine:

val dataframe = sqlContext.read.parquet("file:///home/msoproj/dev_data
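For context, Spark's partition discovery expects a directory tree of key=value folders under the table root. A minimal plain-Python sketch of that layout (the folder names below are illustrative; the real path in the question is truncated):

```python
# Build a tiny directory tree shaped like a partitioned parquet layout
# (data=jDD/year=.../month=.../day=...) in a temp dir and list its leaves.
# All names here are illustrative, not taken from the question's real path.
import os
import tempfile

root = tempfile.mkdtemp()
for day in (5, 6):
    os.makedirs(os.path.join(root, "data=jDD", "year=2015", "month=10", f"day={day}"))

# Leaf directories are where the actual parquet part-files would live.
leaves = sorted(
    dirpath[len(root):]
    for dirpath, dirnames, _ in os.walk(root)
    if not dirnames
)
print(leaves)
```

Each level of this tree (year, month, day) is what the answers below turn into dataframe columns or select with path wildcards.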
3 Answers
  • 2020-12-04 18:46

    You need to provide the mergeSchema = true option, as mentioned below (this is from 1.6.0):

    val dataframe = sqlContext.read.option("mergeSchema", "true").parquet("file:///your/path/data=jDD")
    

    This will read all the parquet files into the dataframe and also create the year, month and day columns in it.

    Ref: https://spark.apache.org/docs/1.6.0/sql-programming-guide.html#schema-merging
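What mergeSchema does can be sketched in plain Python: the column sets found in each partition's files are unioned, so files written with slightly different schemas still load into one dataframe. A conceptual illustration only (function and column names are made up, not Spark's API):

```python
# Conceptual sketch of Spark's mergeSchema behavior: union the columns
# found across per-file schemas, preserving first-seen order.
# Column names below are invented for the example.
def merge_schemas(schemas):
    """Union the column names of several per-file schemas, keeping order."""
    merged = []
    for schema in schemas:
        for col in schema:
            if col not in merged:
                merged.append(col)
    return merged

# Two partitions whose files were written with slightly different columns:
schema_day5 = ["id", "value"]
schema_day6 = ["id", "value", "extra"]
print(merge_schemas([schema_day5, schema_day6]))
```

The partition-directory names (year=.., month=.., day=..) become columns on top of this merged schema, which is why the loaded dataframe gains year, month and day.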

  • 2020-12-04 18:47

    sqlContext.read.parquet can take multiple paths as input. If you want just day=5 and day=6, you can simply add two paths like:

    val dataframe = sqlContext
          .read.parquet("file:///your/path/data=jDD/year=2015/month=10/day=5/", 
                        "file:///your/path/data=jDD/year=2015/month=10/day=6/")
    

    If you have folders under day=X, say country=XX, then country will automatically be added as a column in the dataframe.

    EDIT: As of Spark 1.6 one needs to provide a basePath option in order for Spark to generate the partition columns automatically. In Spark 1.6.x the above would have to be rewritten like this to create a dataframe with the columns "data", "year", "month" and "day":

    val dataframe = sqlContext
      .read
      .option("basePath", "file:///your/path/")
      .parquet("file:///your/path/data=jDD/year=2015/month=10/day=5/",
               "file:///your/path/data=jDD/year=2015/month=10/day=6/")
    
  • 2020-12-04 18:48

    If you want to read data for multiple days, for example day=5 and day=6, and want to express the range in the path itself, wildcards can be used:

    val dataframe = sqlContext
      .read
      .parquet("file:///your/path/data=jDD/year=2015/month=10/day={5,6}/*")
    

    Bracket wildcards can also be used, but note that they are character classes matching a single character:

    val dataframe = sqlContext
      .read
      .parquet("file:///your/path/data=jDD/year=2015/month=10/day=[5-9]/*")
    

    This matches days 5 through 9. A pattern like day=[5-10] is not a numeric range from 5 to 10; to include multi-digit days, list them explicitly, e.g. day={5,6,7,8,9,10}.
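Since bracket wildcards match one character at a time, a multi-digit range such as days 5..10 is often safest to enumerate as explicit paths. A small sketch (the base path below is illustrative):

```python
# Enumerate explicit partition paths for a numeric day range instead of
# relying on glob bracket syntax. The base path is illustrative.
base = "file:///your/path/data=jDD/year=2015/month=10"
paths = [f"{base}/day={d}/" for d in range(5, 11)]  # days 5..10 inclusive
print(paths)
```

A list built this way could then be splatted into the Scala reader, e.g. sqlContext.read.parquet(paths: _*).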
