Reshaping Spark RDD

Submitted by 社会主义新天地 on 2020-02-08 05:10:25

Question


I have a Spark RDD as follows:

rdd = sc.parallelize([('X01', 'Y01'),
                      ('X01', 'Y02'),
                      ('X01', 'Y03'),
                      ('X02', 'Y01'),
                      ('X02', 'Y06')])

I would like to convert them into the following format:

[('X01',('Y01','Y02','Y03')),
 ('X02',('Y01','Y06'))]

Can someone help me achieve this using PySpark?


Answer 1:


A simple groupByKey operation is what you need.

rdd.groupByKey().mapValues(tuple).collect()

Result (key order after the shuffle is not guaranteed): [('X02', ('Y01', 'Y06')), ('X01', ('Y01', 'Y02', 'Y03'))]




Answer 2:


Convert the RDD to a PairRDD using mapToPair (with the first column as the key and the rest of the record as the value), then do a groupByKey on the resulting RDD. A PySpark sketch of the same idea is shown below.
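
mapToPair belongs to Spark's Java API; in PySpark an RDD of 2-tuples is already treated as a pair RDD, so a plain map plays the same role. A minimal sketch, assuming hypothetical input records whose first field is the key:

from pyspark import SparkContext

sc = SparkContext.getOrCreate()

# Hypothetical records: the first field is the key, the rest is the payload.
records = sc.parallelize([('X01', 'Y01', 1), ('X01', 'Y02', 2),
                          ('X02', 'Y01', 3)])

# A plain map to 2-tuples is PySpark's equivalent of Java's mapToPair.
pairs = records.map(lambda rec: (rec[0], rec[1:]))

# groupByKey collects the values per key into an iterable, which
# mapValues(tuple) materialises into a tuple.
grouped = pairs.groupByKey().mapValues(tuple)
print(grouped.collect())
# e.g. [('X01', (('Y01', 1), ('Y02', 2))), ('X02', (('Y01', 3),))]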




Answer 3:


As septra said, the groupByKey method is what you need. Further, if you want to apply an operation to all the values of a particular key, you can do so with the mapValues() method: it takes a function (the logic you want to apply to the grouped values) and applies it to the grouped values of each key. If you want both operations in one go, you can use the reduceByKey method; loosely, you can treat it as "reduceByKey() = groupByKey() + mapValues()", with the difference that reduceByKey merges values pairwise with an associative function and can therefore pre-aggregate on each partition before the shuffle. The reduceByKey route is sketched below.
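
A minimal sketch of the reduceByKey route for the question's data, using tuple concatenation as the reduce function (each value is first wrapped in a 1-tuple so the tuples can be concatenated):

from pyspark import SparkContext

sc = SparkContext.getOrCreate()

rdd = sc.parallelize([('X01', 'Y01'), ('X01', 'Y02'), ('X01', 'Y03'),
                      ('X02', 'Y01'), ('X02', 'Y06')])

# Wrap each value in a 1-tuple so that tuple concatenation can serve as
# the reduce function. Concatenation is associative but not commutative,
# so the order of values within a key is not strictly guaranteed.
result = (rdd.mapValues(lambda v: (v,))
             .reduceByKey(lambda a, b: a + b)
             .collect())
print(result)  # [('X01', ('Y01', 'Y02', 'Y03')), ('X02', ('Y01', 'Y06'))]

Note that concatenating tuples copies data on every merge, so for long value lists the groupByKey().mapValues(tuple) form from Answer 1 can be cheaper; reduceByKey pays off when the reduce function actually shrinks the data.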



Source: https://stackoverflow.com/questions/42085212/reshaping-spark-rdd
