Take the top N per key after groupByKey and keep the result as an RDD

死守一世寂寞 2020-12-10 08:09

I'd like to get the top N items per key after calling groupByKey on an RDD, and convert the result (topNPerGroup in my code) to RDD[(String, Int)].

4 answers
  •  刺人心 (original poster)
    2020-12-10 08:17

    Spark 1.4.0 solves this: it adds a topByKey method.

    Take a look at https://github.com/apache/spark/commit/5e6ad24ff645a9b0f63d9c0f17193550963aa0a7

    It uses a BoundedPriorityQueue with aggregateByKey, so each key keeps at most num elements at a time instead of materializing the whole group:

    def topByKey(num: Int)(implicit ord: Ordering[V]): RDD[(K, Array[V])] = {
      self.aggregateByKey(new BoundedPriorityQueue[V](num)(ord))(
        seqOp = (queue, item) => {
          queue += item
        },
        combOp = (queue1, queue2) => {
          queue1 ++= queue2
        }
      ).mapValues(_.toArray.sorted(ord.reverse))  // This is a min-heap, so we reverse the order.
    }
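
    To see what the seqOp and combOp above are doing, here is a rough sketch in plain Scala with no Spark dependency. It is not the Spark implementation: BoundedPriorityQueue is internal to Spark, so a scala.collection.mutable.PriorityQueue with a manual size cap stands in for it, and the "partitions" are just nested Seqs. The names TopNPerKeySketch, newHeap, add, merge and topNPerKey are made up for illustration.

```scala
import scala.collection.mutable

object TopNPerKeySketch {
  type Heap = mutable.PriorityQueue[Int]

  // Stand-in for Spark's internal BoundedPriorityQueue (an assumption of this
  // sketch): a min-heap, so dequeue() always removes the smallest element.
  def newHeap(): Heap = mutable.PriorityQueue.empty[Int](Ordering[Int].reverse)

  // Plays the role of seqOp: add one value, evict the smallest if over capacity.
  def add(heap: Heap, v: Int, n: Int): Heap = {
    heap.enqueue(v)
    if (heap.size > n) heap.dequeue()
    heap
  }

  // Plays the role of combOp: fold every element of `b` into `a`.
  def merge(a: Heap, b: Heap, n: Int): Heap =
    b.foldLeft(a)((h, v) => add(h, v, n))

  // Each inner Seq simulates one partition of an RDD[(String, Int)].
  def topNPerKey(partitions: Seq[Seq[(String, Int)]], n: Int): Seq[(String, Int)] = {
    // Within a partition, every key gets its own bounded heap.
    val partials = partitions.map { part =>
      part.foldLeft(Map.empty[String, Heap]) { case (acc, (k, v)) =>
        acc.updated(k, add(acc.getOrElse(k, newHeap()), v, n))
      }
    }
    // Across partitions, heaps that belong to the same key are merged.
    val merged = partials.flatten.groupBy(_._1).map { case (k, hs) =>
      k -> hs.map(_._2).reduce((a, b) => merge(a, b, n))
    }
    // Flatten back to (key, value) pairs, largest value first within each key.
    merged.toSeq.sortBy(_._1).flatMap { case (k, heap) =>
      heap.toSeq.sorted(Ordering[Int].reverse).map(v => k -> v)
    }
  }

  def main(args: Array[String]): Unit = {
    val partitions = Seq(
      Seq("a" -> 3, "a" -> 1, "b" -> 7),
      Seq("a" -> 9, "b" -> 2, "a" -> 5)
    )
    topNPerKey(partitions, 2).foreach(println)
  }
}
```

    On a real RDD, as far as I know, topByKey is reached through import org.apache.spark.mllib.rdd.MLPairRDDFunctions._, and something like rdd.topByKey(n).flatMapValues(_.toSeq) should then give back an RDD[(String, Int)]. Check that against the Spark version you are running.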
    
