Question
I'm a Spark newbie and I'm trying to test whether Spark gives any performance boost for the size of data I'm working with.
Each object in my RDD contains a time, an id, and a position.
I want to compare the positions of objects that share the same id at the same time. So I would first run the following to group by id:
grouped_rdd = rdd.map(lambda x: (x.id, [x])).groupByKey()
I would then like to break each of these groups down further by the time of each object.
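For concreteness, here is a minimal sketch of the kind of RDD described above (the Record namedtuple and its sample values are assumptions for illustration, and sc is an existing SparkContext):

from collections import namedtuple

# Hypothetical record type; the field names match the attributes used above.
Record = namedtuple("Record", ["time", "id", "position"])

rdd = sc.parallelize([
    Record(time=1, id="a", position=(0.0, 0.0)),
    Record(time=1, id="a", position=(1.0, 0.0)),
    Record(time=2, id="b", position=(0.5, 0.5)),
])
grouped_rdd = rdd.map(lambda x: (x.id, [x])).groupByKey()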
Any suggestions? Thanks!
Answer 1:
First of all, if you want both id and time as the key, just put them both into the key part, rather than grouping by id first and then grouping by time separately.
m = sc.parallelize([(1, 2, 3), (1, 2, 4), (2, 3, 5)])  # (id, time, position) triples
n = m.map(lambda x: ((x[0], x[1]), x[2]))  # key: (id, time), value: position
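Continuing that snippet, collecting the grouped values shows the composite (id, time) key in action (a minimal sketch; the ordering of collect() output may vary):

grouped = n.groupByKey().mapValues(list)
print(grouped.collect())  # e.g. [((1, 2), [3, 4]), ((2, 3), [5])]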
Secondly, avoid groupByKey, which performs badly; use combineByKey or reduceByKey instead if possible.
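As a minimal sketch of the reduceByKey alternative, assuming the per-key work can be expressed as an associative merge (the sum below is just a stand-in for the real position comparison):

# reduceByKey merges values map-side before the shuffle,
# so far less data crosses the network than with groupByKey.
totals = n.reduceByKey(lambda a, b: a + b)
print(totals.collect())  # e.g. [((1, 2), 7), ((2, 3), 5)]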
Source: https://stackoverflow.com/questions/30338185/pyspark-suggestion-on-how-to-organize-rdd