Not able to set the number of shuffle partitions in PySpark


Question


I know that by default, the number of partitions for shuffle tasks is set to 200 in Spark. I can't seem to change this. I'm running Jupyter with Spark 1.6.

I'm loading a fairly small table with about 37K rows from Hive using the following in my notebook:

from pyspark.sql.functions import *
sqlContext.sql("set spark.sql.shuffle.partitions=10")
test= sqlContext.table('some_table')
print test.rdd.getNumPartitions()
print test.count()
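
As a sanity check, the effective value can be read back in the same cell (a minimal sketch; getConf(key, defaultValue) is available on the Spark 1.6 SQLContext):

print sqlContext.getConf("spark.sql.shuffle.partitions", "200")  # prints '10' if the set statement registered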

The output confirms 200 tasks. From the activity log, it's spinning up 200 tasks, which is overkill. It seems like the second line above (the set statement) is ignored. So, I tried the following:

test = sqlContext.table('gfcctdmn_work.icgdeskrev_emma_cusip_activity_bw').repartition(5)

and, in a new cell:

print test.rdd.getNumPartitions()
print test.count()

The output shows 5 partitions, but the log shows 200 tasks being spun up for the count, with the repartition to 5 happening only afterwards. However, if I convert it to an RDD first, and back to a DataFrame as follows:

 test = sqlContext.table('gfcctdmn_work.icgdeskrev_emma_cusip_activity_bw').repartition(5).rdd

and, in a new cell:

print test.getNumPartitions()
print test.toDF().count()

The very first time I ran the new cell, it still ran with 200 tasks. However, the second time I ran it, it ran with 5 tasks.

How can I make the code run with 5 tasks the very first time it runs?

Would you mind explaining why it behaves this way (I specify the number of partitions, but it still runs under the default setting)? Is it because the default Hive table was created using 200 partitions?


Answer 1:


At the beginning of your notebook, do something like this:

from pyspark import SparkContext
from pyspark.conf import SparkConf

sc.stop()                                   # stop the SparkContext the notebook created for you
conf = SparkConf().setAppName("test")
conf.set("spark.default.parallelism", 10)
sc = SparkContext(conf=conf)                # recreate it with the new configuration

When the notebook starts, a SparkContext has already been created for you, but you can still stop it, change the configuration, and recreate it.
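
One thing worth stating explicitly (an assumption on my part, not in the original answer): the notebook's existing sqlContext is bound to the old, stopped SparkContext, so it also needs to be recreated on the new one before reading the table again. A minimal sketch:

from pyspark.sql import HiveContext   # HiveContext assumed, since the table comes from Hive

sqlContext = HiveContext(sc)          # rebuild the SQL context on the recreated SparkContext
test = sqlContext.table('some_table')
print test.rdd.getNumPartitions()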

As for spark.default.parallelism, I understand it is what you need; the documentation describes it as:

Default number of partitions in RDDs returned by transformations like join, reduceByKey, and parallelize when not set by user.
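
For example (a sketch, not part of the original answer): with spark.default.parallelism set to 10 on the recreated SparkContext, parallelize without an explicit numSlices argument should pick up that value:

rdd = sc.parallelize(range(100))   # no numSlices given, so spark.default.parallelism applies
print rdd.getNumPartitions()       # expected: 10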



Source: https://stackoverflow.com/questions/43792276/not-able-to-set-number-of-shuffle-partition-in-pyspark
