Question
I know there are many questions on this topic, but none of them really answers my question.
I have the following data:
val data_codes = Seq("con_dist_1","con_dist_2","con_dist_3","con_dist_4","con_dist_5")
val codes = data_codes.toDF("item_code")
val partitioned_codes = codes.repartition($"item_code")
println( "getNumPartitions : " + partitioned_codes.rdd.getNumPartitions);
Output :
getNumPartitions : 200
It's supposed to give 5, right? Why is it giving 200? Where am I going wrong, and how do I fix this?
Answer 1:
Because 200 is the default value of spark.sql.shuffle.partitions, which is what df.repartition uses. From the docs:
Returns a new Dataset partitioned by the given partitioning expressions, using spark.sql.shuffle.partitions as number of partitions. The resulting Dataset is hash partitioned.
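If the goal is simply to end up with 5 partitions, one option is to lower that setting before repartitioning. A minimal sketch, assuming a SparkSession named spark, the codes DataFrame from the question, and AQE not coalescing the shuffle:

// Lower the shuffle partition count before the expression-based repartition
spark.conf.set("spark.sql.shuffle.partitions", "5")

val partitioned_codes = codes.repartition($"item_code")
println("getNumPartitions : " + partitioned_codes.rdd.getNumPartitions)
// getNumPartitions : 5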
The number of partitions is NOT RELATED to the number of (distinct) values in your dataframe. Repartitioning ensures that all records with the same key end up in the same partition, nothing else. So in your case it could be that all records land in 1 partition and the other 199 partitions are empty.
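You can see that distribution for yourself with spark_partition_id. A quick sketch against the partitioned_codes DataFrame from the question:

import org.apache.spark.sql.functions.spark_partition_id

// Count how many rows landed in each physical partition
partitioned_codes
  .groupBy(spark_partition_id().as("partition_id"))
  .count()
  .show()
// Only partition ids that received at least one key appear here;
// the rest of the 200 partitions are empty.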
Even if you do codes.repartition($"item_code", 5), there is no guarantee that you get 5 equally sized partitions. AFAIK you cannot do this in the DataFrame API; maybe in an RDD with a custom partitioner, as sketched below.
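For completeness, here is a hedged sketch of that RDD route. The names CodePartitioner and byCode are made up for illustration, and it assumes the data_codes and codes values from the question:

import org.apache.spark.Partitioner

// Assigns each distinct code its own partition; unknown keys fall back to 0
class CodePartitioner(allCodes: Seq[String]) extends Partitioner {
  private val index = allCodes.zipWithIndex.toMap
  override def numPartitions: Int = allCodes.length
  override def getPartition(key: Any): Int =
    index.getOrElse(key.asInstanceOf[String], 0)
}

val byCode = codes.rdd
  .map(row => (row.getString(0), row)) // key each Row by item_code
  .partitionBy(new CodePartitioner(data_codes))

println("getNumPartitions : " + byCode.getNumPartitions)
// getNumPartitions : 5, with exactly one code per partition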
Source: https://stackoverflow.com/questions/60428425/how-to-get-the-number-of-partitions-in-a-dataset