Spark: Sort records in groups?

忘掉有多难 2020-12-31 11:34

I have a set of records which I need to:

1) Group by 'date', 'city' and 'kind'

2) Sort every group by 'prize'

In my code I have a Record case class (with name, day, kind, city and prize fields) and an RDD of these records (rs), which I group into rsGrp, but I cannot work out how to sort the records inside each group.

4 answers
  • 2020-12-31 12:14

    Replace map with flatMap

    val x = rsGrp.flatMap{r =>
      val lst = r.toList
      lst.map{e => (e.prize, e)}
    }
    

    This will give you an

    org.apache.spark.rdd.RDD[(Int, Record)] = FlatMappedRDD[10]
    

    and then you can call sortBy(_._1) on the RDD above.
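
    Putting it together, a minimal sketch of this approach (assuming the Record case class and an RDD rs of records as in the question; rsGrp, flattened and sorted are just illustrative names):

    // Group by (day, city, kind) and drop the keys: RDD[Iterable[Record]]
    val rsGrp = rs.groupBy(r => (r.day, r.city, r.kind)).values

    // flatMap each group into (prize, record) pairs, then sort by prize
    val flattened = rsGrp.flatMap { group => group.toList.map(e => (e.prize, e)) }
    val sorted = flattened.sortBy(_._1)

    Note that sortBy here orders by prize across the whole RDD rather than within each group; to keep records of the same group together you would sort by the group key and prize combined.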

  • 2020-12-31 12:24

    As an alternative to @gasparms' solution, I think one could try a filter followed by an rdd.sortBy operation. You filter out the records that match each key; the prerequisite is that you keep track of all your keys (the filter combinations), which you can also build up as you traverse the records.
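
    A rough sketch of that idea (it assumes the Record case class and the RDD rs from the question, collects the distinct key combinations up front, and launches one Spark job per key, so it only makes sense for a small number of groups):

    // Collect the distinct (day, city, kind) combinations -- the "filter combinations"
    val keys = rs.map(r => (r.day, r.city, r.kind)).distinct().collect()

    // For each key, filter the matching records and sort them by prize
    val sortedGroups = keys.map { k =>
      k -> rs.filter(r => (r.day, r.city, r.kind) == k).sortBy(_.prize).collect()
    }.toMap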

  • 2020-12-31 12:29

    groupByKey is expensive; it has two implications:

    1. On average, most of the data gets shuffled across the remaining N-1 partitions.
    2. All records with the same key are loaded into memory on a single executor, which can cause out-of-memory errors.

    Depending on your use case, you have better options:

    1. If you don't care about the ordering, use reduceByKey or aggregateByKey.
    2. If you just want to group and sort without any other transformation, prefer repartitionAndSortWithinPartitions (Spark 1.3.0+, http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.rdd.OrderedRDDFunctions), but be very careful about which partitioner you specify, and test it, because you are now relying on side effects that may change behaviour in a different environment. See also the examples in this repository: https://github.com/sryza/aas/blob/master/ch08-geotime/src/main/scala/com/cloudera/datascience/geotime/RunGeoTime.scala, and the sketch after this list.
    3. If you are applying a transformation, or a non-reducible aggregation (fold or scan), to the iterable of sorted records, then check out the spark-sorted library: https://github.com/tresata/spark-sorted. It provides three APIs for paired RDDs: mapStreamByKey, foldLeftByKey and scanLeftByKey.
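
    As a rough illustration of option 2, here is a sketch that partitions on (day, city, kind) but sorts on a composite key that also includes prize. GroupKey and GroupPartitioner are illustrative names, not part of any Spark API, and Record is the case class from the question:

    import org.apache.spark.Partitioner
    import org.apache.spark.rdd.RDD

    case class Record(name: String, day: String, kind: String, city: String, prize: Int)

    // Composite key: the partitioner looks only at (day, city, kind); the ordering also uses prize
    case class GroupKey(day: String, city: String, kind: String, prize: Int)

    object GroupKey {
      implicit val ordering: Ordering[GroupKey] =
        Ordering.by((k: GroupKey) => (k.day, k.city, k.kind, k.prize))
    }

    // Partitioner that ignores prize, so all records of one group land in the same partition
    class GroupPartitioner(override val numPartitions: Int) extends Partitioner {
      override def getPartition(key: Any): Int = key match {
        case GroupKey(day, city, kind, _) =>
          math.abs((day, city, kind).hashCode % numPartitions)
      }
    }

    def sortedWithinGroups(rs: RDD[Record]): RDD[(GroupKey, Record)] =
      rs.map(r => (GroupKey(r.day, r.city, r.kind, r.prize), r))
        .repartitionAndSortWithinPartitions(new GroupPartitioner(8))

    Within each partition the records then come out grouped by (day, city, kind) and sorted by prize, so you can stream over them with mapPartitions without materializing a whole group in memory.
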
  • 2020-12-31 12:38

    You need to define a key and then use mapValues to sort the records in each group.

    import org.apache.spark.{SparkContext, SparkConf}
    import org.apache.spark.rdd.RDD
    import org.apache.spark.SparkContext._

    object Sort {

      case class Record(name: String, day: String, kind: String, city: String, prize: Int)

      // Define your data; this small sample is illustrative only
      val recs = Seq(
        Record("a", "2015-01-01", "k1", "london", 300),
        Record("b", "2015-01-01", "k1", "london", 100),
        Record("c", "2015-01-02", "k2", "paris", 200)
      )

      def main(args: Array[String]): Unit = {
        val conf = new SparkConf()
          .setAppName("Test")
          .setMaster("local")
          .set("spark.executor.memory", "2g")
        val sc = new SparkContext(conf)
        val rs = sc.parallelize(recs)

        // Generate the pair RDD necessary to call groupByKey, and group it
        val key: RDD[((String, String, String), Iterable[Record])] =
          rs.keyBy(r => (r.day, r.city, r.kind)).groupByKey

        // Once grouped, sort the values of each key
        val values: RDD[((String, String, String), List[Record])] =
          key.mapValues(iter => iter.toList.sortBy(_.prize))

        // Print the result
        values.collect.foreach(println)
      }
    }
