PySpark: suggestion on how to organize an RDD

Submitted by 允我心安 on 2020-01-06 23:45:23

Question


I'm a Spark newbie, and I'm trying to test something out in Spark to see whether there are any performance gains for the size of data I'm using.

Each object in my RDD contains a time, an id, and a position.

I want to compare the positions of objects that share both the same time and the same id. So I would first run the following to group by id:

# Key each record by its id; groupByKey then collects all records per id
grouped_rdd = rdd.map(lambda x: (x.id, x)).groupByKey()

I would then like to break each of those groups down further by each object's time.

Any suggestions? Thanks!


Answer 1:


First of all, if you want both id and time as the key, put them both into the key, rather than grouping by id first and then grouping by time separately.

# Sample (id, time, position) triples
m = sc.parallelize([(1, 2, 3), (1, 2, 4), (2, 3, 5)])

# Re-key each record by the composite (id, time) key, keeping position as the value
n = m.map(lambda x: ((x[0], x[1]), x[2]))
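
For illustration (a minimal sketch; the sample triples above stand in for your (time, id, position) records), a single grouping on the composite key now collects all positions that share both an id and a time:

print(n.collect())
# [((1, 2), 3), ((1, 2), 4), ((2, 3), 5)]

# One grouping step yields every position per (id, time); element
# order within and across groups may vary.
print(n.groupByKey().mapValues(list).collect())
# [((1, 2), [3, 4]), ((2, 3), [5])]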

Secondly, avoid groupByKey, which performs poorly because it shuffles every raw value across the cluster; prefer combineByKey or reduceByKey where possible, since they combine values map-side before the shuffle.
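
For example, if the per-(id, time) comparison can be phrased as an associative aggregation, either primitive lets Spark pre-aggregate within each partition. A minimal sketch using the n RDD from above; the count and mean here are hypothetical stand-ins for whatever comparison you actually need:

from operator import add

# Count of records per (id, time); reduceByKey combines partial counts
# map-side, so only per-partition sums are shuffled, not raw records.
counts = n.map(lambda kv: (kv[0], 1)).reduceByKey(add)

# Mean position per (id, time) via combineByKey, carrying (sum, count) partials.
sum_count = n.combineByKey(
    lambda v: (v, 1),                          # createCombiner: first position seen
    lambda acc, v: (acc[0] + v, acc[1] + 1),   # mergeValue: fold in another position
    lambda a, b: (a[0] + b[0], a[1] + b[1]),   # mergeCombiners: merge partition partials
)
mean_pos = sum_count.mapValues(lambda p: p[0] / p[1])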



Source: https://stackoverflow.com/questions/30338185/pyspark-suggestion-on-how-to-organize-rdd
