apache-spark-1.4

How to optimize shuffle spill in Apache Spark application

和自甴很熟 · submitted 2019-11-26 18:54:27
Question: I am running a Spark Streaming application with 2 workers. The application has a join and a union operation. All the batches complete successfully, but I noticed that the shuffle spill metrics are not consistent with the input or output data size (spill memory is more than 20 times the input size). The Spark stage details are shown in the image below. After researching this, I found that shuffle spill happens when there is not sufficient memory for the shuffle data. Shuffle spill (memory) is the size of the deserialized form of the data in memory at the time it is spilled, while shuffle spill (disk) is the size of the serialized form on disk after spilling.
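For Spark 1.x (which this question targets), spill behavior can be tuned through shuffle-related configuration. The following is a sketch of a `spark-defaults.conf` fragment with plausible values, not a prescription; the right numbers depend on executor memory and data volume:

```properties
# Fraction of the JVM heap reserved for shuffle buffers before spilling
# (default 0.2 in Spark 1.x; superseded by unified memory management in 2.x).
spark.shuffle.memoryFraction    0.4
# Compress data spilled during shuffles (default true) to cut disk I/O.
spark.shuffle.spill.compress    true
# More reduce-side partitions keep each task's shuffle buffer smaller,
# reducing the chance any single task needs to spill.
spark.sql.shuffle.partitions    400
```

Raising `spark.shuffle.memoryFraction` trades cache space for shuffle space, so it is usually paired with lowering `spark.storage.memoryFraction`; increasing partition counts is often the safer first step.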

DataFrame join optimization - Broadcast Hash Join

梦想的初衷 · submitted 2019-11-26 12:56:27
Question: I am trying to efficiently join two DataFrames, one of which is large and the other somewhat smaller. Is there a way to avoid all this shuffling? I cannot set `autoBroadcastJoinThreshold`, because it only accepts an Integer, and the table I am trying to broadcast is slightly larger than that number of bytes. Is there a way to force the broadcast, ignoring this variable?
Answer 1: Broadcast hash joins (similar to a map-side join or map-side combine in MapReduce): in Spark SQL you can see which type of join is being performed by calling `queryExecution.executedPlan`. As with core Spark, if one of the tables is much smaller than the other, you may want a broadcast hash join.
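The mechanism behind a broadcast hash join can be illustrated without Spark at all: the small table is built into a hash map and shipped to every partition of the large table, so each partition joins locally and the large side is never shuffled. This is a conceptual sketch in plain Python, not Spark's actual implementation; the function name and sample data are invented for illustration:

```python
# Conceptual sketch of a broadcast hash join (plain Python, no Spark).
# The small side becomes an in-memory hash table "broadcast" to every
# partition of the large side; each partition then joins independently.

def broadcast_hash_join(small_rows, large_partitions):
    """Inner-join each partition of the large side against the small side."""
    lookup = dict(small_rows)  # the "broadcast" hash table
    joined = []
    for partition in large_partitions:  # each partition joins locally
        for key, value in partition:
            if key in lookup:
                joined.append((key, value, lookup[key]))
    return joined

small = [(1, "electronics"), (2, "books")]          # small (broadcast) side
large = [[(1, "tv"), (2, "novel")],                 # partition 0
         [(1, "radio"), (3, "misc")]]               # partition 1
print(broadcast_hash_join(small, large))
# -> [(1, 'tv', 'electronics'), (2, 'novel', 'books'), (1, 'radio', 'electronics')]
```

Because no row of the large side ever moves between partitions, the shuffle cost disappears; the price is that the whole small table must fit in each executor's memory, which is exactly what `autoBroadcastJoinThreshold` guards against.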