Combine two RDDs in pyspark

Question

Assuming that I have the following RDDs:

a = sc.parallelize([1, 2, 5, 3])
b = sc.parallelize(['a','c','d','e'])

How do I combine these two RDDs into one RDD that looks like this:

[('a', 1), ('c', 2), ('d', 5), ('e', 3)]

Using a.union(b) just concatenates the elements of both RDDs into one flat RDD instead of pairing them. Any ideas?

Answer 1:

You probably just want b.zip(a) (note the reversed order, since you want b's values as the keys).

The Python docs describe it as follows:

zip(other)

Zips this RDD with another one, returning key-value pairs with the first element in each RDD, the second element in each RDD, etc. Assumes that the two RDDs have the same number of partitions and the same number of elements in each partition (e.g. one was made through a map on the other).

>>> x = sc.parallelize(range(0, 5))
>>> y = sc.parallelize(range(1000, 1005))
>>> x.zip(y).collect()
[(0, 1000), (1, 1001), (2, 1002), (3, 1003), (4, 1004)]

Source: https://stackoverflow.com/questions/35085627/combine-two-rdds-in-pyspark

Tags

apache-spark

pyspark

rdd
