How to Determine The Partition Size in an Apache Spark Dataframe

Submitted by 社会主义新天地 on 2020-12-13 03:32:47

Question


I have been using an excellent answer to a question posted on SE here to determine the number of partitions, and the distribution of partitions across a DataFrame: Need to Know Partitioning Details in Dataframe Spark

Can someone help me expand on the answers there to determine the partition size of a DataFrame?

Thanks


Answer 1:


Tuning the partition size is inevitably linked to tuning the number of partitions. There are at least 3 factors to consider in this scope:

Level of parallelism

A "good" high level of parallelism is important, so you may want to have a big number of partitions, resulting in a small partition size.

However, there is an upper bound on that number due to the 3rd point below, distribution overhead. Nevertheless, parallelism is still priority #1, so if you have to err, err on the side of a high level of parallelism.

Generally, 2 to 4 tasks per core are recommended:

  • Spark doc:

In general, we recommend 2-3 tasks per CPU core in your cluster.

  • The book Spark in Action (by Petar Zečević) writes (page 74):

We recommend using three to four times more partitions than there are cores in your cluster
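
As a rough illustration (a minimal sketch in Scala, assuming a SparkSession named spark and a DataFrame named df, both placeholders here), you can compare the current partition count against the cores available to the application and repartition toward the 2-4 tasks per core guidance:

    // Minimal sketch: size the partition count against the available cores.
    val cores   = spark.sparkContext.defaultParallelism  // roughly the total cores of the application
    val current = df.rdd.getNumPartitions                // current number of partitions of the DataFrame
    val target  = cores * 3                              // ~2-4 tasks per core, per the guidance above
    val dfTuned = if (current == target) df else df.repartition(target)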

Memory fitting

If the partition size is very large (e.g. > 1 GB), you may have issues such as long garbage collection pauses or out-of-memory errors, especially when there is a shuffle operation, as per the Spark doc:

Sometimes, you will get an OutOfMemoryError, not because your RDDs don’t fit in memory, but because the working set of one of your tasks, such as one of the reduce tasks in groupByKey, was too large. Spark’s shuffle operations (sortByKey, groupByKey, reduceByKey, join, etc) build a hash table within each task to perform the grouping, which can often be large...

Hence another pro of a large number of partitions (i.e., a small partition size).
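
To actually measure the average partition size (a sketch, assuming the optimizer's size statistics are a good-enough estimate; the accessor below exists in recent Spark versions but is an internal API and yields an estimate, not exact on-disk bytes), you can divide the estimated total size by the partition count:

    // Sketch: estimate the average partition size from Catalyst's size statistics.
    val totalBytes = df.queryExecution.optimizedPlan.stats.sizeInBytes  // BigInt, an estimate
    val numParts   = df.rdd.getNumPartitions
    val avgBytesPerPartition = totalBytes / numParts
    println(s"~$avgBytesPerPartition bytes per partition across $numParts partitions")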

Distribution overhead

Distributed computing comes with overhead, so you can't go to the extreme either. If each task takes less than about 100 ms to execute, the application might suffer noticeable overhead due to:

  • data fetches, disk seeks
  • data movement, task distribution
  • task state tracking

In that case, you may lower the level of parallelism and increase the partition size a bit.
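
For example (a sketch, reusing the placeholder df and cores from above), if the Spark UI shows tasks finishing in well under 100 ms, you can reduce the partition count with coalesce, which merges partitions without a full shuffle:

    // Sketch: fewer, larger partitions when per-task overhead dominates.
    // coalesce() avoids a full shuffle; repartition() would trigger one.
    val dfCoarser = df.coalesce(cores)  // e.g. back down toward ~1 task per core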

Take-away

Empirically, people usually try 100-1000 MB per partition, so why not start with that? And remember that the number may need to be re-tuned over time.
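
As a starting point (a sketch reusing the placeholder totalBytes estimate above; 128 MB is just one pick within the 100-1000 MB range), you can derive the partition count from a target partition size:

    // Sketch: derive a partition count from a target partition size.
    val targetPartitionBytes = 128L * 1024 * 1024                   // assumed target: 128 MB per partition
    val bySize  = (totalBytes / targetPartitionBytes).toInt.max(1)  // partitions needed for that size
    val dfSized = df.repartition(bySize.max(cores * 2))             // but keep at least ~2 tasks per core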



Source: https://stackoverflow.com/questions/64600212/how-to-determine-the-partition-size-in-an-apache-spark-dataframe
