How to manage physical data placement of a dataframe across the cluster with pyspark?

Asked by 栀梦 on 2020-12-12 01:46

Say I have a PySpark dataframe `data` as follows. I want to partition the data by "Period": that is, each period of data should be stored in its own partition.
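For concreteness, here is a minimal hypothetical sketch of such a dataframe (the original sample table was not included in the question; the `id` and `value` columns and the period values are assumptions, only the `Period` column name comes from the question):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    # Hypothetical sample data; only the "Period" column is taken from the question
    data = spark.createDataFrame(
        [(1, "2020-01", 10.0), (2, "2020-01", 20.0),
         (3, "2020-02", 30.0), (4, "2020-03", 40.0)],
        ["id", "Period", "value"],
    )
    data.show()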

1 Answer
  • Answered 2020-12-12 02:40

    The approach is to repartition first, so the dataframe has the right number of partitions (the number of unique periods), and then partition by the Period column when saving. Repartitioning by the Period column itself (rather than round-robin) puts all rows for a given period in the same in-memory partition, so each output directory ends up with a single file.

    from pyspark.sql import functions as F

    # Number of distinct periods = desired number of partitions
    n = data.select(F.col("Period")).distinct().count()

    # Repartition by the Period column so each period's rows are colocated,
    # then write one output directory per period
    data.repartition(n, "Period") \
        .write \
        .partitionBy("Period") \
        .mode("overwrite") \
        .format("parquet") \
        .saveAsTable("testing")
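
    As a quick sanity check (a sketch, assuming the partitioned table `testing` created above and the hypothetical period value "2020-01" from the earlier example), the resulting partitions can be listed and a filtered read can be inspected:

        # Each distinct Period value should show up as its own partition
        spark.sql("SHOW PARTITIONS testing").show(truncate=False)

        # Filtering on the partition column should trigger partition pruning,
        # visible in the physical plan
        spark.read.table("testing").filter(F.col("Period") == "2020-01").explain()

    Seeing partition pruning in the plan confirms that the physical layout on disk is actually being used at read time.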
    