cross combine two RDDs using pyspark

Submitted by 感情迁移 on 2019-12-24 03:24:50

Question


How can I cross combine (is this the correct way to describe it?) two RDDs?

input:

rdd1 = [a, b]
rdd2 = [c, d]

output:

rdd3 = [(a, c), (a, d), (b, c), (b, d)]

I tried rdd3 = rdd1.flatMap(lambda x: rdd2.map(lambda y: (x, y))), but it complains: "It appears that you are attempting to broadcast an RDD or reference an RDD from an action or transformation." I guess that means you cannot nest one RDD operation inside another the way you would in a list comprehension; a single statement can only apply one transformation.


Answer 1:


So, as you have noticed, you can't perform a transformation inside another transformation (note that flatMap and map are transformations rather than actions, since they return RDDs). Thankfully, what you're trying to accomplish is directly supported by another transformation in the Spark API, namely cartesian (see http://spark.apache.org/docs/latest/api/python/pyspark.html#pyspark.RDD ).

So you would want to do rdd1.cartesian(rdd2).




Answer 2:


You can use the cartesian transformation. Here's an example from the documentation:

>>> rdd = sc.parallelize([1,2])
>>> sorted(rdd.cartesian(rdd).collect())
[(1, 1), (1, 2), (2, 1), (2, 2)]

In your case, you'll do rdd3 = rdd1.cartesian(rdd2).



Source: https://stackoverflow.com/questions/31062168/cross-combine-two-rdds-using-pyspark
