Question
I know there are many questions on this topic, but none of them really answers my question.
I have the following data:
val data_codes = Seq("con_dist_1","con_dist_2","con_dist_3","con_dist_4","con_dist_5")
val codes = data_codes.toDF("item_code")
val partitioned_codes = codes.repartition($"item_code")
println( "getNumPartitions : " + partitioned_codes.rdd.getNumPartitions);
Output :
getNumPartitions : 200
It's supposed to give 5, right? Why is it giving 200? Where am I going wrong, and how do I fix this?
Answer 1:
Because 200 is the default value of spark.sql.shuffle.partitions, which is what df.repartition uses. From the docs:
Returns a new Dataset partitioned by the given partitioning expressions, using spark.sql.shuffle.partitions as number of partitions. The resulting Dataset is hash partitioned.
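If the goal is simply to end up with 5 partitions, one option is to lower that setting before repartitioning. A minimal sketch, assuming a SparkSession named spark, the codes DataFrame from the question, and AQE not coalescing the shuffle:

// Lower the shuffle partition count before the expression-based repartition
spark.conf.set("spark.sql.shuffle.partitions", "5")

val partitioned_codes = codes.repartition($"item_code")
println("getNumPartitions : " + partitioned_codes.rdd.getNumPartitions)
// getNumPartitions : 5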
The number of partitions is NOT RELATED to the number of (distinct) values in your dataframe. Repartitioning ensures that all records with the same key end up in the same partition, nothing else. So in your case it could be that all records land in 1 partition and the other 199 partitions are empty.
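You can see that distribution for yourself with spark_partition_id. A quick sketch against the partitioned_codes DataFrame from the question:

import org.apache.spark.sql.functions.spark_partition_id

// Count how many rows landed in each physical partition
partitioned_codes
  .groupBy(spark_partition_id().as("partition_id"))
  .count()
  .show()
// Only partition ids that received at least one key appear here;
// the rest of the 200 partitions are empty.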
Even if you do codes.repartition($"item_code", 5), there is no guarantee that you get 5 equally sized partitions. AFAIK you cannot do this in the DataFrame API; maybe in an RDD with a custom partitioner, as sketched below.
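For completeness, here is a hedged sketch of that RDD route. The names CodePartitioner and byCode are made up for illustration, and it assumes the data_codes and codes values from the question:

import org.apache.spark.Partitioner

// Assigns each distinct code its own partition; unknown keys fall back to 0
class CodePartitioner(allCodes: Seq[String]) extends Partitioner {
  private val index = allCodes.zipWithIndex.toMap
  override def numPartitions: Int = allCodes.length
  override def getPartition(key: Any): Int =
    index.getOrElse(key.asInstanceOf[String], 0)
}

val byCode = codes.rdd
  .map(row => (row.getString(0), row)) // key each Row by item_code
  .partitionBy(new CodePartitioner(data_codes))

println("getNumPartitions : " + byCode.getNumPartitions)
// getNumPartitions : 5, with exactly one code per partition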
Source: https://stackoverflow.com/questions/60428425/how-to-get-the-number-of-partitions-in-a-dataset