How to control partition size in Spark SQL


Spark < 2.0:

You can use Hadoop configuration options:

  • mapred.min.split.size
  • mapred.max.split.size

as well as the HDFS block size to control partition size for filesystem-based formats*.

val minSplit: Int = ???
val maxSplit: Int = ???

sc.hadoopConfiguration.setInt("mapred.min.split.size", minSplit)
sc.hadoopConfiguration.setInt("mapred.max.split.size", maxSplit)

Spark 2.0+:

You can use spark.sql.files.maxPartitionBytes configuration:

spark.conf.set("spark.sql.files.maxPartitionBytes", maxSplit)
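
For instance (a sketch only; the 64 MB value and the Parquet path are placeholders), lowering the limit produces more, smaller partitions:

// Cap file-based partitions at 64 MB (the value is in bytes; the default is 128 MB).
spark.conf.set("spark.sql.files.maxPartitionBytes", 64L * 1024 * 1024)
val df = spark.read.parquet("/data/events.parquet")
println(df.rdd.getNumPartitions)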

In both cases these values may not be used by a specific data source API, so you should always check the documentation / implementation details of the format you use.


* Other input formats can use different settings; check the documentation of the input format in question.

Furthermore, Datasets created from RDDs will inherit the partition layout of their parents.
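
A minimal sketch of that inheritance (assuming a SparkSession named spark):

// An RDD repartitioned to 8 partitions keeps that layout after conversion.
import spark.implicits._
val rdd = spark.sparkContext.parallelize(1 to 1000).repartition(8)
val ds = rdd.toDS()
println(ds.rdd.getNumPartitions)  // 8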

Similarly, bucketed tables will use the bucket layout defined in the metastore, with a 1:1 relationship between buckets and Dataset partitions.
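
For example, a table written like this (the DataFrame df, the user_id column, and the table name are hypothetical) will be read back with one Dataset partition per bucket:

// Write 32 buckets on user_id; bucketing requires saveAsTable.
df.write
  .bucketBy(32, "user_id")
  .sortBy("user_id")
  .saveAsTable("events_bucketed")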

A very common and painful problem. You should look for a key that distributes the data into uniform partitions. Then you can use the DISTRIBUTE BY and CLUSTER BY operators to tell Spark how to group rows into partitions, as in the sketch below. This incurs some overhead on the query itself, but results in evenly sized partitions. Deepsense has a very good tutorial on this.
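
A minimal sketch of that approach (table and column names are made up; customer_id stands for a key with a roughly uniform distribution):

// Shuffle rows so that each partition holds one hash range of customer_id;
// CLUSTER BY would additionally sort rows within each partition.
val distributed = spark.sql(
  "select * from events distribute by customer_id")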

If your SQL performs a shuffle (for example it has a join, or some sort of group by), you can set the number of partitions by setting the 'spark.sql.shuffle.partitions' property:

 sqlContext.setConf("spark.sql.shuffle.partitions", "64")
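
With that in place, any query that shuffles comes out with 64 partitions; a quick check (table and column names are placeholders):

val grouped = sqlContext.sql(
  "select customer_id, count(*) as cnt from events group by customer_id")
println(grouped.rdd.partitions.length)  // 64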

Following up on what Fokko suggests, you could use a random variable to cluster by.

// rand() is the Spark SQL random-number function; clustering by a random
// value spreads rows evenly across the shuffle partitions.
val result = sqlContext.sql("""
  select * from (
    select *, rand() as rand_part from bt_st_ent
  ) t
  cluster by rand_part""")
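
As a quick sanity check (a sketch only; it pulls per-partition counts to the driver, so keep it to small data), you can inspect how evenly the rows ended up distributed:

// Number of rows in each partition after the CLUSTER BY shuffle.
val sizes = result.rdd.glom().map(_.length).collect()
println(sizes.mkString(", "))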