Question
I have non-unique key-value pairs that I have created using the map
function on an RDD of Array[String]:
val kvPairs = myRdd.map(line => (line(0), line(1)))
This produces data in the format:
1, A
1, A
1, B
2, C
I would like to group all of the keys by their values and provide the counts for these values, like so:
1, {(A, 2), (B, 1)}
2, {(C, 1)}
I have tried many different approaches, but the closest I can get is something like this:
kvPairs.sortByKey().countByValue()
This gives
1, (A, 2)
1, (B, 1)
2, (C, 1)
Also,
kvPairs.groupByKey().sortByKey()
This groups the values, but it still isn't quite there:
1, {(A, A, B)}
2, {(C)}
I tried combining the two:
kvPairs.countByValue().groupByKey().sortByKey()
But this returns an error:
error: value groupByKey is not a member of scala.collection.Map[(String, String),Long]
Answer 1:
countByValue is an action, so it returns a plain scala.collection.Map on the driver rather than an RDD, which is why groupByKey is not available on its result. Instead, just count the pairs directly and group (if you have to) afterwards:
kvPairs.map((_, 1L))                              // ((k, v), 1)
  .reduceByKey(_ + _)                             // ((k, v), count)
  .map { case ((k, v), cnt) => (k, (v, cnt)) }    // (k, (v, count))
  .groupByKey                                     // (k, Iterable[(v, count)])
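For reference, here is a minimal end-to-end sketch of that pipeline; the SparkContext setup and the sample data are assumed for illustration, not taken from the question:

import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(
  new SparkConf().setAppName("group-then-count").setMaster("local[*]"))

// Sample data matching the question
val kvPairs = sc.parallelize(Seq(("1", "A"), ("1", "A"), ("1", "B"), ("2", "C")))

val grouped = kvPairs.map((_, 1L))
  .reduceByKey(_ + _)
  .map { case ((k, v), cnt) => (k, (v, cnt)) }
  .groupByKey

grouped.collect().foreach(println)
// Prints something like (ordering within a group is not guaranteed):
// (1,CompactBuffer((A,2), (B,1)))
// (2,CompactBuffer((C,1)))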
If you want to groupByKey after reducing, you may want to use a custom partitioner that considers only the first element of the composite key. You can check RDD split and do aggregation on new RDDs for an example implementation.
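The linked answer is not reproduced here, but a minimal sketch of the idea, reusing the kvPairs RDD from above, might look like the following. The FirstElementPartitioner name and the partition count of 4 are illustrative assumptions, not from the source:

import org.apache.spark.Partitioner

// Hypothetical partitioner: routes a composite key (k, v) by k alone, and a
// plain key k the same way, so the placement stays valid after the key is
// narrowed from (k, v) to k.
class FirstElementPartitioner(override val numPartitions: Int) extends Partitioner {
  private def mod(h: Int): Int = (h % numPartitions + numPartitions) % numPartitions
  override def getPartition(key: Any): Int = key match {
    case (first, _) => mod(first.hashCode)  // composite key: hash first element only
    case other      => mod(other.hashCode)  // plain key
  }
}

val partitioner = new FirstElementPartitioner(4)

val counts = kvPairs.map((_, 1L))
  .reduceByKey(partitioner, _ + _)      // co-locates all ((k, *), n) pairs by k
  .mapPartitions(
    _.map { case ((k, v), cnt) => (k, (v, cnt)) },
    preservesPartitioning = true        // the key change keeps records in place
  )
  .groupByKey(partitioner)              // same partitioner, so no extra shuffle

Because the partitioner hashes only k for both key shapes, the final groupByKey sees a matching partitioner on its input and can group without a second shuffle.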
Source: https://stackoverflow.com/questions/35763284/spark-group-by-key-then-count-by-value