Spark - Joining 2 PairRDD elements

Submitted by 夙愿已清 on 2019-12-12 01:56:21

Question


I have a JavaPairRDD with two elements:

("TypeA", List<jsonTypeA>),

("TypeB", List<jsonTypeB>)

I need to combine the two pairs into one pair of type:

("TypeA_B", List<jsonCombinedAPlusB>)

I need to combine the two lists into one list, where each pair of JSONs (one of type A and one of type B) shares a common field I can join on.

Consider that the list of type A is significantly smaller than the other, and the join should be inner, so the result list should be as small as the list of type A.

What is the most efficient way to do that?


Answer 1:


rdd.join(otherRdd) performs an inner join between two pair RDDs. To use it, you will need to transform both RDDs into PairRDDs keyed by the common attribute you will be joining on. Something like this (example, untested):

// key(v) extracts the common join field from a JSON value;
// keyBy keeps the original (k, v) pair as the value of each new entry
val rddAKeyed = rddA.keyBy { case (k, v) => key(v) }
val rddBKeyed = rddB.keyBy { case (k, v) => key(v) }

// Inner join on the extracted key, then merge each matched pair of JSONs
val joined = rddAKeyed.join(rddBKeyed)
  .map { case (jk, ((_, json1), (_, json2))) => (newK, merge(json1, json2)) }

Where merge(j1, j2) encapsulates the specific business logic for combining the two JSON objects, and newK is the combined key (e.g. "TypeA_B").
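Since the question uses a JavaPairRDD, the keyBy → join → merge pipeline can also be traced in plain Java (no Spark) to see the inner-join semantics, in particular why the result is no larger than the type-A list. This is only an illustrative sketch: the field name "id", the map-based JSON stand-ins, and the helper names here are assumptions, not part of the original answer.

```java
import java.util.*;
import java.util.stream.*;

public class JoinSketch {
    // JSON objects are modelled as Map<String,String> purely for illustration.
    static Map<String, String> json(String... kv) {
        Map<String, String> m = new HashMap<>();
        for (int i = 0; i < kv.length; i += 2) m.put(kv[i], kv[i + 1]);
        return m;
    }

    // Placeholder for the answer's merge(j1, j2): here it just unions the fields.
    static Map<String, String> merge(Map<String, String> a, Map<String, String> b) {
        Map<String, String> out = new HashMap<>(a);
        out.putAll(b);
        return out;
    }

    // Plain-Java analogue of keyBy(common field) + inner join + map(merge).
    static List<Map<String, String>> innerJoin(
            List<Map<String, String>> listA,
            List<Map<String, String>> listB,
            String field) {
        // "keyBy" on the larger side: index list B by the join field
        Map<String, List<Map<String, String>>> bByKey = listB.stream()
                .collect(Collectors.groupingBy(j -> j.get(field)));
        // "join": only keys present on both sides survive, then "map": merge
        List<Map<String, String>> result = new ArrayList<>();
        for (Map<String, String> a : listA) {
            for (Map<String, String> b : bByKey.getOrDefault(a.get(field), List.of())) {
                result.add(merge(a, b));
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<Map<String, String>> typeA = List.of(json("id", "1", "a", "x"));
        List<Map<String, String>> typeB = List.of(
                json("id", "1", "b", "y"),
                json("id", "2", "b", "z")); // id 2 has no match in A, so it is dropped
        List<Map<String, String>> joined = innerJoin(typeA, typeB, "id");
        System.out.println(joined.size());          // prints 1 — as small as list A
        System.out.println(joined.get(0).get("b")); // prints y
    }
}
```

Unmatched type-B entries disappear, which is exactly the inner-join behaviour of rdd.join(otherRdd).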



Source: https://stackoverflow.com/questions/26992856/spark-joining-2-pairrdd-elements
