How to manage physical data placement of a dataframe across the cluster with pyspark?

Asked by 栀梦 on 2020-12-12 01:46

Say I have a PySpark dataframe `data` as follows. I want to partition the data by "Period": that is, each period of data should be stored in its own partition.
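For concreteness, here is a minimal hypothetical sketch of such a dataframe (the original sample table was not included in the question; the `id` and `value` columns and the period values are assumptions, only the `Period` column name comes from the question):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    # Hypothetical sample data; only the "Period" column is taken from the question
    data = spark.createDataFrame(
        [(1, "2020-01", 10.0), (2, "2020-01", 20.0),
         (3, "2020-02", 30.0), (4, "2020-03", 40.0)],
        ["id", "Period", "value"],
    )
    data.show()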

1 Answer
  • Answered 2020-12-12 02:40

    The approach is to repartition first, so the dataframe has the right number of partitions (the number of unique periods), and then partition by the Period column when saving. Repartitioning by the Period column itself (rather than round-robin) puts all rows for a given period in the same in-memory partition, so each output directory ends up with a single file.

    from pyspark.sql import functions as F

    # Number of distinct periods = desired number of partitions
    n = data.select(F.col("Period")).distinct().count()

    # Repartition by the Period column so each period's rows are colocated,
    # then write one output directory per period
    data.repartition(n, "Period") \
        .write \
        .partitionBy("Period") \
        .mode("overwrite") \
        .format("parquet") \
        .saveAsTable("testing")
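
    As a quick sanity check (a sketch, assuming the partitioned table `testing` created above and the hypothetical period value "2020-01" from the earlier example), the resulting partitions can be listed and a filtered read can be inspected:

        # Each distinct Period value should show up as its own partition
        spark.sql("SHOW PARTITIONS testing").show(truncate=False)

        # Filtering on the partition column should trigger partition pruning,
        # visible in the physical plan
        spark.read.table("testing").filter(F.col("Period") == "2020-01").explain()

    Seeing partition pruning in the plan confirms that the physical layout on disk is actually being used at read time.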
    