Question
I'm trying to pre-partition my data before running an aggregation over a certain column. I have 3 worker nodes, and I would like each partition to have non-overlapping values in the column I am partitioning on. I don't want a situation where two partitions might contain the same values in that column.
e.g. If I have the following data
ss_item_sk | ss_quantity
1 | 10.0
1 | 4.0
2 | 3.0
3 | 5.0
4 | 8.0
5 | 13.0
5 | 10.0
Then the following partitions are satisfactory:
partition 1
ss_item_sk | ss_quantity
1 | 10.0
1 | 4.0
partition 2
ss_item_sk | ss_quantity
2 | 3.0
3 | 5.0
partition 3
ss_item_sk | ss_quantity
4 | 8.0
5 | 13.0
5 | 10.0
Unfortunately, the code I have below does not work.
import spark.implicits._  // needed for the $"ss_item_sk" column syntax

// 3 shuffle partitions, so repartitioning by column yields 3 partitions
spark.sqlContext.setConf("spark.sql.shuffle.partitions", "3")
val json = spark.read.json("hdfs://master:9000/tpcds/store_sales")
val filtered = json.filter(row => row.getAs[Long]("ss_item_sk") < 180)
filtered.repartition($"ss_item_sk").write.json(savepath)
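For reference, here is a minimal sketch of how the per-partition key distribution can be inspected (assuming the filtered DataFrame from above; spark_partition_id tags each row with the id of the physical partition it landed in):

import org.apache.spark.sql.functions.{countDistinct, spark_partition_id}

// Tag every row with its partition id, then count how many distinct
// partitions each ss_item_sk value appears in. Any key with
// num_partitions > 1 violates the non-overlap requirement.
filtered
  .repartition($"ss_item_sk")
  .withColumn("pid", spark_partition_id())
  .groupBy("ss_item_sk")
  .agg(countDistinct($"pid").as("num_partitions"))
  .filter($"num_partitions" > 1)
  .show()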
I have already looked at
- How to define partitioning of DataFrame?
- Spark SQL - Difference between df.repartition and DataFrameWriter partitionBy?
- pyspark: Efficiently have partitionBy write to same number of total partitions as original table
and I still can't figure it out.
Answer 1:
Repartitioning by key distributes the data by that key at the DataFrame level; writing the DataFrame to HDFS is a separate matter. You can try:
df.coalesce(1).write.partitionBy("ss_item_sk").json(savepath)
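For the sample data above, partitionBy lays the output out as one sub-directory per distinct ss_item_sk value, roughly like this (the part-file names are illustrative):

savepath/ss_item_sk=1/part-00000-<uuid>.json
savepath/ss_item_sk=2/part-00000-<uuid>.json
savepath/ss_item_sk=3/part-00000-<uuid>.json
...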
In this scenario as well, you will see multiple part files spread across the directories created for each value of the partition column. The number of writers/reducers that run can only be controlled through the partitionBy method; it is very similar to a MapReduce Partitioner in that it controls how many reducers run. To get a single file per value of the partition column, you have to run this command:
df.repartition($"ss_item_sk").write.partitionBy("ss_item_sk").json(savepath)
Now this works because each reducer maps to one partition of the repartitioned data, so every ss_item_sk directory ends up with a single part file. Hope this helps.
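As a usage sketch (assuming the same savepath and that spark.implicits._ is in scope), the partition column is recovered from the directory names when the output is read back, and filters on it can prune whole directories:

// Partition discovery reconstructs ss_item_sk from the
// ss_item_sk=<value> directory names.
val readBack = spark.read.json(savepath)

// A filter on the partition column only scans the matching directories.
readBack.filter($"ss_item_sk" === 5).show()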
Source: https://stackoverflow.com/questions/54060188/pre-partition-data-in-spark-such-that-each-partition-has-non-overlapping-values