Spark groupByKey alternative


Question


According to Databricks best practices, Spark groupByKey should be avoided: groupByKey works by first shuffling all the data across the workers, and only then does the processing occur (explanation).

So, my question is, what are the alternatives for groupByKey in a way that it will return the following in a distributed and fast way?

// want this
{"key1": "1", "key1": "2", "key1": "3", "key2": "55", "key2": "66"}
// to become this
{"key1": ["1","2","3"], "key2": ["55","66"]}

It seems to me that maybe aggregateByKey or glom could do it: first build the lists within each partition (map), and then join all the lists together (reduce).
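For illustration, here is a minimal sketch of both approaches (the SparkSession setup and sample data are made up to mirror the example above, assuming spark-shell or a local Spark installation):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.appName("group-example").master("local[*]").getOrCreate()
val sc = spark.sparkContext

val rdd = sc.parallelize(Seq(
  ("key1", "1"), ("key1", "2"), ("key1", "3"),
  ("key2", "55"), ("key2", "66")
))

// Option A: groupByKey -- acceptable here because the per-key collections are small.
val grouped = rdd.groupByKey().mapValues(_.toList)

// Option B: aggregateByKey -- builds a List inside each partition (map),
// then concatenates the partial lists across partitions (reduce).
val aggregated = rdd.aggregateByKey(List.empty[String])(
  (acc, v) => v :: acc,   // add one value within a partition
  (a, b)   => a ::: b     // merge partial lists across partitions
)

grouped.collect().foreach(println)
// (key1,List(1, 2, 3)), (key2,List(55, 66)) -- order within each list is not guaranteed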


Answer 1:


groupByKey is fine for the case when we want a "smallish" collection of values per key, as in the question.

TL;DR

The "do not use" warning on groupByKey applies for two general cases:

1) You want to aggregate over the values:

  • DON'T: rdd.groupByKey().mapValues(_.sum)
  • DO: rdd.reduceByKey(_ + _)

In this case, groupByKey wastes resources materializing a whole collection, when what we want as the answer is a single element per key.
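As a sketch of the difference (with made-up integer data, assuming an RDD[(String, Int)] and a SparkContext sc as in spark-shell):

val nums = sc.parallelize(Seq(("a", 1), ("a", 2), ("b", 10)))

// DON'T: shuffles every value, materializes an Iterable per key, then sums --
// the intermediate collection is wasted work.
val wasteful = nums.groupByKey().mapValues(_.sum)

// DO: sums within each partition before the shuffle (map-side combine),
// so only one partial sum per key per partition crosses the network.
val efficient = nums.reduceByKey(_ + _)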

2) You want to group very large collections over low cardinality keys:

  • DON'T: allFacebookUsersRDD.map(user => (user.likesCats, user)).groupByKey()
  • JUST DON'T

In this case, groupByKey will potentially result in an OOM error.

groupByKey materializes a collection with all the values for a given key in a single executor. As mentioned, this has memory limitations, so other options are better depending on the case.

All the grouping functions (groupByKey, aggregateByKey, reduceByKey) are built on the same primitive: combineByKey. Therefore no alternative will be fundamentally better for the use case in the question; they all rely on the same common process.
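To make that concrete, both behaviours can be sketched directly on top of combineByKey (illustrative only; Spark's real implementations differ in details, e.g. groupByKey disables map-side combining internally):

val pairs = sc.parallelize(Seq(("k", 1), ("k", 2), ("k", 3)))

// reduceByKey(_ + _) is roughly:
val summed = pairs.combineByKey(
  (v: Int) => v,                  // createCombiner: first value seen for a key
  (c: Int, v: Int) => c + v,      // mergeValue: fold values within a partition
  (c1: Int, c2: Int) => c1 + c2   // mergeCombiners: merge partials across partitions
)

// groupByKey() is roughly:
val groupedAgain = pairs.combineByKey(
  (v: Int) => List(v),
  (c: List[Int], v: Int) => v :: c,
  (c1: List[Int], c2: List[Int]) => c1 ::: c2
)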



Source: https://stackoverflow.com/questions/31029395/spark-groupbykey-alternative
