Spark: Sort records in groups?

忘掉有多难 2020-12-31 11:34

I have a set of records which I need to:

1) Group by 'date', 'city' and 'kind'

2) Sort every group by 'prize'

In my code:

impor         


        
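(The code snippet in the post is cut off after "impor". A minimal, hypothetical reconstruction of the straightforward approach the question describes is shown below; the Record case class, field names, and sample data are assumptions for illustration, not from the original post.)

    import org.apache.spark.{SparkConf, SparkContext}

    // Hypothetical record type; field names taken from the question.
    case class Record(date: String, city: String, kind: String, prize: Double)

    val sc = new SparkContext(new SparkConf().setAppName("group-and-sort").setMaster("local[*]"))

    val records = sc.parallelize(Seq(
      Record("2020-12-31", "Paris",  "A", 3.0),
      Record("2020-12-31", "Paris",  "A", 1.0),
      Record("2020-12-31", "London", "B", 2.0)
    ))

    // Group by (date, city, kind), then sort each group by prize in memory.
    val grouped = records
      .groupBy(r => (r.date, r.city, r.kind))
      .mapValues(_.toSeq.sortBy(_.prize))

    grouped.collect().foreach(println)

This is the groupByKey-style approach that the answer below warns about: every group must fit into a single executor's memory.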
4 Answers
  •  灰色年华
    2020-12-31 12:29

    groupByKey is expensive; it has two implications:

    1. On average, the majority of the data is shuffled across the remaining N-1 partitions.
    2. All records with the same key are loaded into memory on a single executor, potentially causing out-of-memory errors.

    Depending on your use case, you have several better options:

    1. If you don't care about the ordering, use reduceByKey or aggregateByKey.
    2. If you just want to group and sort without any further transformation, prefer repartitionAndSortWithinPartitions (Spark 1.3.0+, http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.rdd.OrderedRDDFunctions), but be very careful about which partitioner you specify, and test it, because you are now relying on side effects that may change behaviour in a different environment; see the sketch after this list. See also the examples in this repository: https://github.com/sryza/aas/blob/master/ch08-geotime/src/main/scala/com/cloudera/datascience/geotime/RunGeoTime.scala.
    3. If you are applying either a transformation or a non-reducible aggregation (a fold or a scan) to the iterable of sorted records, then check out this library: spark-sorted (https://github.com/tresata/spark-sorted). It provides three APIs for paired RDDs: mapStreamByKey, foldLeftByKey and scanLeftByKey.

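    For option 2, here is a minimal sketch of the secondary-sort pattern with repartitionAndSortWithinPartitions. The Record case class, field names, custom partitioner and partition count are assumptions for illustration, not part of the original answer:

        import org.apache.spark.{Partitioner, SparkConf, SparkContext}

        // Hypothetical record type; field names taken from the question.
        case class Record(date: String, city: String, kind: String, prize: Double)

        // Partition only on the group part of the composite key so that all
        // records of one (date, city, kind) group land in the same partition.
        class GroupPartitioner(partitions: Int) extends Partitioner {
          override def numPartitions: Int = partitions
          override def getPartition(key: Any): Int = key match {
            case (group, _) => (group.hashCode % numPartitions + numPartitions) % numPartitions
          }
        }

        val sc = new SparkContext(new SparkConf().setAppName("sort-in-groups").setMaster("local[*]"))

        val records = sc.parallelize(Seq(
          Record("2020-12-31", "Paris",  "A", 3.0),
          Record("2020-12-31", "Paris",  "A", 1.0),
          Record("2020-12-31", "London", "B", 2.0)
        ))

        // Composite key: ((date, city, kind), prize). The implicit tuple ordering
        // sorts by group first, then by prize, so within each partition the
        // records of a group come out contiguous and sorted by prize.
        val sorted = records
          .map(r => (((r.date, r.city, r.kind), r.prize), r))
          .repartitionAndSortWithinPartitions(new GroupPartitioner(4))
          .values

        sorted.foreachPartition(_.foreach(println))

    Unlike groupByKey, this never materialises a whole group as one in-memory collection: the sorting happens during the shuffle, and each group is streamed back already ordered by prize.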