Spark: groupByKey vs reduceByKey, which is better and more efficient for combining the Maps?


Question


I have a DataFrame [df]:

+------------+-----------+------+
|   id       |  itemName |Value |
+------------+-----------+------+
|   1        |  TV       |   4  |
|   1        |  Movie    |   5  |
|   2        |  TV       |   6  |
+------------+-----------+------+

I am trying to transform it into:

{id : 1, itemMap : { "TV" : 4, "Movie" : 5}}
{id : 2, itemMap : {"TV" : 6}}

I want the final result to be an RDD[(String, String)], where the value is the JSON string with itemMap as the field name.

So I am doing:

case class Data(itemMap: Map[String, Int]) extends Serializable

df.map { r =>
  val id = r.getAs[String]("id")
  val itemName = r.getAs[String]("itemName")
  val value = r.getAs[Int]("Value")

  // one single-entry Map per row, keyed by id
  (id, Map(itemName -> value))
}.reduceByKey((x, y) => x ++ y)        // merge all the Maps that share an id
 .map { case (k, v) =>
   (k, JacksonUtil.toJson(Data(v)))    // serialize the merged Map as JSON
 }

But it takes forever to run. Is it efficient to use reduceByKey here, or can I use groupByKey? Is there any other, more efficient way to do the transformation?
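For what it's worth, one alternative I have been looking at is aggregateByKey with a mutable map, since the x ++ y in reduceByKey allocates a new immutable Map on every merge. Both reduceByKey and aggregateByKey combine values map-side before the shuffle, which groupByKey does not. A minimal, untested sketch, assuming the same schema and the JacksonUtil/Data helpers from above:

import scala.collection.mutable

df.map { r =>
  (r.getAs[String]("id"), (r.getAs[String]("itemName"), r.getAs[Int]("Value")))
}.aggregateByKey(mutable.Map.empty[String, Int])(
  (acc, kv) => { acc += kv; acc },  // fold one (itemName, value) pair into the map in place
  (m1, m2) => { m1 ++= m2; m1 }     // merge two partial maps from different partitions
).map { case (id, m) =>
  (id, JacksonUtil.toJson(Data(m.toMap)))  // toMap: back to the immutable Map that Data expects
}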

My config: I have 10 slaves and a master of type r3.8xlarge.

spark.driver.cores  30
spark.driver.memory 200g
spark.executor.cores    16
spark.executor.instances    40
spark.executor.memory   60g
spark.memory.fraction   0.95
spark.yarn.executor.memoryOverhead  2000

Is this the correct type of machine for this task?

Source: https://stackoverflow.com/questions/40073047/spark-groupbykey-vs-reducebykey-which-is-better-and-efficient-to-combine-the-m
