How can I efficiently join a large RDD to a very large RDD in Spark?


You can partition both RDDs with the same partitioner; in that case, partitions with the same keys will be colocated on the same executor.

That way you avoid a shuffle for the join operations.

The shuffle happens only once, when you repartition, and if you cache the RDDs, every join after that should be local to the executors:

import org.apache.spark.HashPartitioner
import org.apache.spark.rdd.RDD

class A
class B

val rddA: RDD[(String, A)] = ???
val rddB: RDD[(String, B)] = ???

// Use the same partitioner for both RDDs so that equal keys land in the same partition
val partitioner = new HashPartitioner(1000)

// partitionBy returns a new RDD, so keep a reference to it; caching means the shuffle happens only once
val partitionedA = rddA.partitionBy(partitioner).cache()
val partitionedB = rddB.partitionBy(partitioner).cache()
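After that, a join of the two cached RDDs does not need another shuffle. A minimal usage sketch, referring to the partitionedA and partitionedB values from the snippet above:

// Both sides share the same partitioner, so the join is performed
// partition-by-partition without an extra shuffle
val joined: RDD[(String, (A, B))] = partitionedA.join(partitionedB)
joined.take(10).foreach(println)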

Also, you can try increasing the broadcast join threshold; rddA may be small enough to be broadcast:

--conf spark.sql.autoBroadcastJoinThreshold=300000000 # ~300 MB

We use 400 MB for broadcast joins, and it works well.
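Note that spark.sql.autoBroadcastJoinThreshold only applies to the DataFrame/SQL API. For plain RDDs you can get the same effect with a manual map-side (broadcast) join; a rough sketch, assuming sc is your SparkContext and rddA is small enough to fit in executor memory:

// Collect the small RDD into a map on the driver, broadcast it to all executors,
// then join the large RDD against it locally, without any shuffle
val smallSide = sc.broadcast(rddA.collectAsMap())
val broadcastJoined: RDD[(String, (A, B))] =
  rddB.flatMap { case (key, b) =>
    smallSide.value.get(key).map(a => (key, (a, b)))
  }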
