Spark doing exchange of partitions already correctly distributed

Submitted by 扶醉桌前 on 2019-12-23 06:49:33

Question


I am joining 2 datasets on two columns, and the result is a dataset containing 55 billion rows. After that I have to do an aggregation on this dataset by a different column than the ones used in the join. The problem is that Spark performs an exchange of partitions after the join (which takes too much time with 55 billion rows), even though the data is already correctly distributed, because the aggregation column is unique. I know that the aggregation key is correctly distributed. Is there a way to tell this to the Spark application?
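A minimal sketch of the scenario described above, assuming hypothetical tables dataset_a and dataset_b, join keys k1 and k2, and an aggregation column aggKey (none of these names come from the original question):

val spark = org.apache.spark.sql.SparkSession.builder()
  .appName("join-then-aggregate")
  .getOrCreate()
import org.apache.spark.sql.functions._

// Placeholder inputs standing in for the asker's two datasets.
val dsA = spark.table("dataset_a")
val dsB = spark.table("dataset_b")

// Join on two columns...
val joined = dsA.join(dsB, Seq("k1", "k2"))

// ...then aggregate on a different column; this step triggers the
// extra exchange of the 55-billion-row result that the question
// is trying to avoid.
val aggregated = joined.groupBy("aggKey").agg(count("*").as("cnt"))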


Answer 1:


1) Go to the Spark UI and check the "Locality Level" of the tasks

2) If joining a large dataset with a small one, use a broadcast join (see the sketch below)
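A minimal sketch of a broadcast join with the DataFrame API, assuming hypothetical DataFrames largeDF and smallDF sharing a column named "key"; the broadcast hint ships the small side to every executor so the large side does not need to be shuffled for the join:

import org.apache.spark.sql.functions.broadcast

// largeDF, smallDF and "key" are placeholders, not names from the question.
val joinedDF = largeDF.join(broadcast(smallDF), Seq("key"))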

3) If joining a large dataset with a medium-sized one, and the medium-sized RDD does not fit fully into memory, filter the large RDD by the medium RDD's keys first:

// Broadcast only the set of join keys from the medium RDD.
val keys = sc.broadcast(mediumRDD.map(_._1).collect.toSet)
// Keep only the rows of the large RDD whose key appears in the medium RDD.
val reducedRDD = largeRDD.filter { case (key, value) => keys.value.contains(key) }
// Join the reduced large RDD with the medium RDD.
reducedRDD.join(mediumRDD)

4) Check whether the data is serialized efficiently, e.g. with Kryo:

.set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
      .set("spark.kryoserializer.buffer.max", "128m")
      .set("spark.kryoserializer.buffer", "64m")
      .registerKryoClasses(
        Array(classOf[ArrayBuffer[String]], classOf[ListBuffer[String]])

5) Check the Spark UI, or add the following line to the code for debugging:

df.rdd.getNumPartitions

In Spark's application UI you can see that the "Total Tasks" count of a stage represents the number of partitions (the original answer illustrated this with a screenshot of the Stages page).
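Beyond the UI, the physical plan shows whether Spark will insert an Exchange before the aggregation. A minimal sketch, reusing the hypothetical joined and aggregated DataFrames from the scenario above:

// Print the physical plan; an "Exchange hashpartitioning(...)" node
// before the aggregate indicates the shuffle the question is about.
aggregated.explain()

// Also print the current number of partitions of the joined result.
println(joined.rdd.getNumPartitions)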



Source: https://stackoverflow.com/questions/46951207/spark-doing-exchange-of-partitions-already-correctly-distributed
