I have a set of records which I need to:
1) Group by 'date', 'city' and 'kind'
2) Sort every group by 'prize'
In my code:
import ...
Replace map with flatMap:
val x = rsGrp.flatMap{r =>
  val lst = r.toList
  lst.map{e => (e.prize, e)}   // pair each record with its prize
}
This will give you an
org.apache.spark.rdd.RDD[(Int, Record)] = FlatMappedRDD[10]
and then you can call sortBy(_._1) on the RDD above.
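For example, continuing from the x defined above (the collect is only there to inspect the result locally):

val sorted = x.sortBy(_._1)          // sort the (prize, record) pairs by prize, ascending
sorted.collect().foreach(println)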
As an alternative to @gasparms' solution, I think you can try a filter followed by an rdd.sortBy operation: for each key combination you filter the records that match it and sort that subset. The prerequisite is that you keep track of all your keys (filter combinations); you can also build that set as you traverse the records. A rough sketch follows.
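Here is that sketch (variable names are mine; it assumes rs: RDD[Record] and the Record case class from the answer below, and it launches one filter/sort job per key, so it is only reasonable for a small number of key combinations):

// Collect the distinct key combinations first
val keys = rs.map(r => (r.day, r.city, r.kind)).distinct().collect()

// For each key, filter the matching records and sort that subset by prize
val sortedGroups = keys.map { case (day, city, kind) =>
  val group = rs
    .filter(r => r.day == day && r.city == city && r.kind == kind)
    .sortBy(_.prize)
  ((day, city, kind), group.collect().toList)
}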
groupByKey is expensive: it shuffles every record across the network, and all the values for a single key have to be held in memory on one executor. Depending on your use case you may have better options (one is sketched after the code below), but a straightforward solution is to define a key, group by it, and then use mapValues to sort the values of each group.
import org.apache.spark.{SparkContext, SparkConf}
import org.apache.spark.rdd.RDD
import org.apache.spark.SparkContext._

object Sort {

  case class Record(name: String, day: String, kind: String, city: String, prize: Int)

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("Test")
      .setMaster("local")
      .set("spark.executor.memory", "2g")
    val sc = new SparkContext(conf)

    // Define your data (these rows are just placeholders for illustration)
    val recs = Seq(
      Record("a", "2015-03-01", "x", "NY", 20),
      Record("b", "2015-03-01", "x", "NY", 10),
      Record("c", "2015-03-02", "y", "LA", 30))
    val rs = sc.parallelize(recs)

    // Generate the pair RDD necessary to call groupByKey, and group it
    val key: RDD[((String, String, String), Iterable[Record])] =
      rs.keyBy(r => (r.day, r.city, r.kind)).groupByKey

    // Once grouped, sort the values of each key by prize
    val values: RDD[((String, String, String), List[Record])] =
      key.mapValues(iter => iter.toList.sortBy(_.prize))

    // Print the result
    values.collect.foreach(println)

    sc.stop()
  }
}
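As one of the "better options" mentioned above, here is a rough sketch (my own illustration, not part of the original answer) of a secondary sort using repartitionAndSortWithinPartitions: records arrive at each partition already ordered by group and then by prize, so no whole group has to be materialized in memory the way groupByKey requires. The snippet assumes it is placed after val rs inside main:

import org.apache.spark.Partitioner

// Composite key: the grouping fields plus the prize used for sorting
case class GroupKey(day: String, city: String, kind: String, prize: Int)

// Partition only by the grouping fields, so every record of a group
// lands in the same partition regardless of its prize
class GroupPartitioner(partitions: Int) extends Partitioner {
  override def numPartitions: Int = partitions
  override def getPartition(key: Any): Int = {
    val k = key.asInstanceOf[GroupKey]
    ((k.day, k.city, k.kind).hashCode & Int.MaxValue) % partitions
  }
}

// Order keys by group first, then by prize, so values within a group
// come out sorted by prize
implicit val groupKeyOrdering: Ordering[GroupKey] =
  Ordering.by(k => (k.day, k.city, k.kind, k.prize))

val sortedWithinGroups =
  rs.map(r => (GroupKey(r.day, r.city, r.kind, r.prize), r))
    .repartitionAndSortWithinPartitions(new GroupPartitioner(4))

From there you can walk each partition (for example with mapPartitions) and consume every group in prize order without ever building a full per-group list.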