How to get the specified output without combineByKey and aggregateByKey in spark RDD

≯℡__Kan透↙ 提交于 2019-12-13 22:56:21

问题


Below is my data:

val keysWithValuesList = Array("foo=A", "foo=A", "foo=A", "foo=A", "foo=B", bar=C","bar=D", "bar=D")  

Now I want below types of output but without using combineByKey and aggregateByKey:

1) Array[(String, Int)] = Array((foo,5), (bar,3))  
2) Array((foo,Set(B, A)),
(bar,Set(C, D)))  

Below is my attempt:

scala> val keysWithValuesList = Array("foo=A", "foo=A", "foo=A", "foo=A", "foo=B", "bar=C",
     | "bar=D", "bar=D")  
scala> val sample=keysWithValuesList.map(_.split("=")).map(p=>(p(0),(p(1))))
sample: Array[(String, String)] = Array((foo,A), (foo,A), (foo,A), (foo,A), (foo,B), (bar,C), (bar,D), (bar,D))  

Now when I type the variable name followed by tab to see the applicable methods for the mapped RDD I can see the below options out of which none can satisfy my requirement:

scala> sample.
apply          asInstanceOf   clone          isInstanceOf   length         toString       update         

So how can I achieve this ??


回答1:


Here is a standard approach.

Point to note: you need to be working with an RDD. I think that is the bottleneck.

Here you go:

val keysWithValuesList = Array("foo=A", "foo=A", "foo=A", "foo=A", "foo=B", "bar=C","bar=D", "bar=D") 

val sample=keysWithValuesList.map(_.split("=")).map(p=>(p(0),(p(1))))

val sample2 = sc.parallelize(sample.map(x => (x._1, 1)))
val sample3 = sample2.reduceByKey(_+_) 
sample3.collect()

val sample4 = sc.parallelize(sample.map(x => (x._1, x._2))).groupByKey()   
sample4.collect()

val sample5 = sample4.map(x => (x._1, x._2.toSet))
sample5.collect()


来源:https://stackoverflow.com/questions/53150584/how-to-get-the-specified-output-without-combinebykey-and-aggregatebykey-in-spark

易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!