How to reliably write and restore partitioned data

北荒 2021-01-07 01:52

I am looking for a way to write and restore a partitioned dataset. For the purposes of this question I can accept both a partitioned RDD:

val partiti         


        
1 Answer
  • 2021-01-07 02:18

    This can probably be achieved with bucketBy in the DataFrame/Dataset API, but there is a catch: saving directly to Parquet files does not preserve the bucketing; only saveAsTable does, because the bucketing metadata lives in the metastore.

    Dataset<Row> parquet =...;
    parquet.write()
      .bucketBy(1000, "col1", "col2")  // hash rows into 1000 buckets by col1, col2
      .partitionBy("col3")             // one output directory per value of col3
      .saveAsTable("tableName");       // bucketing metadata is recorded in the metastore
    
    // read back through the table (not the raw files) so the bucketing is restored
    sparkSession.read().table("tableName");
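    To see why bucketing survives a write/read cycle, it helps to look at the routing rule itself. The sketch below is a simplification for illustration (Spark actually hashes the bucket columns with Murmur3, and `BucketSketch`/`bucketFor` are hypothetical names): equal keys always map to the same bucket index, which is what lets a later join or aggregation on the bucket columns skip the shuffle.

    ```java
    // Conceptual sketch of bucket assignment. Hypothetical simplification:
    // Spark's real implementation uses Murmur3 hashing of the bucket columns,
    // but the contract is the same -- equal keys land in the same bucket.
    public class BucketSketch {
        // Non-negative modulo so negative hash codes still map into [0, numBuckets)
        public static int bucketFor(Object key, int numBuckets) {
            int mod = key.hashCode() % numBuckets;
            return mod < 0 ? mod + numBuckets : mod;
        }
    }
    ```

    Because the assignment is a pure function of the key, two bucketed tables with the same bucket count and columns are co-partitioned by construction.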
    

    Another approach for Spark Core is to use a custom RDD, e.g. see https://github.com/apache/spark/pull/4449 - i.e. after reading the RDD back from HDFS you re-attach the partitioner, but it is a bit hacky and not supported natively (so it needs to be adjusted for every Spark version).
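    The idea behind that PR can be illustrated without Spark. Below is a minimal stand-in for Spark's HashPartitioner (the class name `SimpleHashPartitioner` is hypothetical; Spark's real one is `org.apache.spark.HashPartitioner`): because the key-to-partition mapping is deterministic, an RDD re-read from HDFS in the same file order can have an equivalent partitioner attached instead of paying for another shuffle.

    ```java
    // Minimal, hypothetical stand-in for Spark's HashPartitioner, to show why
    // re-attaching a partitioner after a read is sound: the mapping from key to
    // partition is deterministic, so co-partitioning survives a save/restore.
    public class SimpleHashPartitioner {
        private final int numPartitions;

        public SimpleHashPartitioner(int numPartitions) {
            this.numPartitions = numPartitions;
        }

        // Same key, same partition -- before the write and after the read.
        public int getPartition(Object key) {
            int mod = key.hashCode() % numPartitions;
            return mod < 0 ? mod + numPartitions : mod;
        }
    }
    ```

    The fragile part, as the answer notes, is not this mapping but convincing Spark that the re-read RDD really has this layout, which relies on internals that shift between versions.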
