Reshaping Spark RDD

Submitted by 社会主义新天地 on 2020-02-08 05:10:25

Question


I have a Spark RDD as follows:

rdd = sc.parallelize([('X01', 'Y01'),
                      ('X01', 'Y02'),
                      ('X01', 'Y03'),
                      ('X02', 'Y01'),
                      ('X02', 'Y06')])

I would like to convert them into the following format:

[('X01',('Y01','Y02','Y03')),
 ('X02',('Y01','Y06'))]

Can someone help me achieve this using PySpark?


Answer 1:


A simple groupByKey operation is what you need.

rdd.groupByKey().mapValues(tuple).collect()

Result (key order after the shuffle is not guaranteed): [('X02', ('Y01', 'Y06')), ('X01', ('Y01', 'Y02', 'Y03'))]




Answer 2:


Convert the RDD to a PairRDD using mapToPair (with the first column as the key and the rest of the record as the value), then do a groupByKey on the resulting RDD. A PySpark sketch of the same idea is shown below.
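
mapToPair belongs to Spark's Java API; in PySpark an RDD of 2-tuples is already treated as a pair RDD, so a plain map plays the same role. A minimal sketch, assuming hypothetical input records whose first field is the key:

from pyspark import SparkContext

sc = SparkContext.getOrCreate()

# Hypothetical records: the first field is the key, the rest is the payload.
records = sc.parallelize([('X01', 'Y01', 1), ('X01', 'Y02', 2),
                          ('X02', 'Y01', 3)])

# A plain map to 2-tuples is PySpark's equivalent of Java's mapToPair.
pairs = records.map(lambda rec: (rec[0], rec[1:]))

# groupByKey collects the values per key into an iterable, which
# mapValues(tuple) materialises into a tuple.
grouped = pairs.groupByKey().mapValues(tuple)
print(grouped.collect())
# e.g. [('X01', (('Y01', 1), ('Y02', 2))), ('X02', (('Y01', 3),))]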




Answer 3:


As septra said, the groupByKey method is what you need. Further, if you want to apply an operation to all the values of a particular key, you can do so with the mapValues() method: it takes a function (the logic you want to apply to the grouped values) and applies it to the grouped values of each key. If you want both operations in one go, you can use the reduceByKey method; loosely, you can treat it as "reduceByKey() = groupByKey() + mapValues()", with the difference that reduceByKey merges values pairwise with an associative function and can therefore pre-aggregate on each partition before the shuffle. The reduceByKey route is sketched below.
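
A minimal sketch of the reduceByKey route for the question's data, using tuple concatenation as the reduce function (each value is first wrapped in a 1-tuple so the tuples can be concatenated):

from pyspark import SparkContext

sc = SparkContext.getOrCreate()

rdd = sc.parallelize([('X01', 'Y01'), ('X01', 'Y02'), ('X01', 'Y03'),
                      ('X02', 'Y01'), ('X02', 'Y06')])

# Wrap each value in a 1-tuple so that tuple concatenation can serve as
# the reduce function. Concatenation is associative but not commutative,
# so the order of values within a key is not strictly guaranteed.
result = (rdd.mapValues(lambda v: (v,))
             .reduceByKey(lambda a, b: a + b)
             .collect())
print(result)  # [('X01', ('Y01', 'Y02', 'Y03')), ('X02', ('Y01', 'Y06'))]

Note that concatenating tuples copies data on every merge, so for long value lists the groupByKey().mapValues(tuple) form from Answer 1 can be cheaper; reduceByKey pays off when the reduce function actually shrinks the data.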



Source: https://stackoverflow.com/questions/42085212/reshaping-spark-rdd
