How to Determine The Partition Size in an Apache Spark Dataframe

Submitted by 社会主义新天地 on 2020-12-13 03:32:47

Question


I have been using an excellent answer to a question posted on SE here to determine the number of partitions, and the distribution of partitions across a DataFrame: Need to Know Partitioning Details in Dataframe Spark

Can someone help me expand on the answers there to determine the partition size of a DataFrame?

Thanks


Answer 1:


Tuning the partition size is inevitably linked to tuning the number of partitions. There are at least 3 factors to consider in this scope:

Level of parallelism

A "good" high level of parallelism is important, so you may want to have a big number of partitions, resulting in a small partition size.

However, there is an upper bound on that number due to the 3rd point below, distribution overhead. Nevertheless, parallelism is still priority #1, so if you have to err, err on the side of a high level of parallelism.

Generally, 2 to 4 tasks per core are recommended:

  • Spark doc:

In general, we recommend 2-3 tasks per CPU core in your cluster.

  • The book Spark in Action (by Petar Zečević) writes (page 74):

We recommend using three to four times more partitions than there are cores in your cluster
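
As a rough illustration (a minimal sketch in Scala, assuming a SparkSession named spark and a DataFrame named df, both placeholders here), you can compare the current partition count against the cores available to the application and repartition toward the 2-4 tasks per core guidance:

    // Minimal sketch: size the partition count against the available cores.
    val cores   = spark.sparkContext.defaultParallelism  // roughly the total cores of the application
    val current = df.rdd.getNumPartitions                // current number of partitions of the DataFrame
    val target  = cores * 3                              // ~2-4 tasks per core, per the guidance above
    val dfTuned = if (current == target) df else df.repartition(target)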

Memory fitting

If the partition size is very large (e.g. > 1 GB), you may have issues such as long garbage collection pauses or out-of-memory errors, especially when there is a shuffle operation, as per the Spark doc:

Sometimes, you will get an OutOfMemoryError, not because your RDDs don’t fit in memory, but because the working set of one of your tasks, such as one of the reduce tasks in groupByKey, was too large. Spark’s shuffle operations (sortByKey, groupByKey, reduceByKey, join, etc) build a hash table within each task to perform the grouping, which can often be large...

Hence another pro of a large number of partitions (i.e., a small partition size).
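
To actually measure the average partition size (a sketch, assuming the optimizer's size statistics are a good-enough estimate; the accessor below exists in recent Spark versions but is an internal API and yields an estimate, not exact on-disk bytes), you can divide the estimated total size by the partition count:

    // Sketch: estimate the average partition size from Catalyst's size statistics.
    val totalBytes = df.queryExecution.optimizedPlan.stats.sizeInBytes  // BigInt, an estimate
    val numParts   = df.rdd.getNumPartitions
    val avgBytesPerPartition = totalBytes / numParts
    println(s"~$avgBytesPerPartition bytes per partition across $numParts partitions")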

Distribution overhead

Distributed computing comes with overhead, so you can't go to the extreme either. If each task takes less than about 100 ms to execute, the application might suffer noticeable overhead due to:

  • data fetches, disk seeks
  • data movement, task distribution
  • task state tracking

In that case, you may lower the level of parallelism and increase the partition size a bit.
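
For example (a sketch, reusing the placeholder df and cores from above), if the Spark UI shows tasks finishing in well under 100 ms, you can reduce the partition count with coalesce, which merges partitions without a full shuffle:

    // Sketch: fewer, larger partitions when per-task overhead dominates.
    // coalesce() avoids a full shuffle; repartition() would trigger one.
    val dfCoarser = df.coalesce(cores)  // e.g. back down toward ~1 task per core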

Take-away

Empirically, people usually try 100-1000 MB per partition, so why not start with that? And remember that the number may need to be re-tuned over time.
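
As a starting point (a sketch reusing the placeholder totalBytes estimate above; 128 MB is just one pick within the 100-1000 MB range), you can derive the partition count from a target partition size:

    // Sketch: derive a partition count from a target partition size.
    val targetPartitionBytes = 128L * 1024 * 1024                   // assumed target: 128 MB per partition
    val bySize  = (totalBytes / targetPartitionBytes).toInt.max(1)  // partitions needed for that size
    val dfSized = df.repartition(bySize.max(cores * 2))             // but keep at least ~2 tasks per core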



Source: https://stackoverflow.com/questions/64600212/how-to-determine-the-partition-size-in-an-apache-spark-dataframe
