Question
I am trying some code in Spark (pyspark) for an assignment. This is the first time I have used this environment, so I am sure I am missing something…
I have a simple dataset called c_views.
If I run
c_views.collect()
I get:

[…
(u'ABC', 100),
(u'DEF', 200),
(u'XXX', 50),
(u'XXX', 70)]
[…]
What I need to achieve is the sum of the values for each word (key). So my guess is that I should get something like:
(u'ABC', 100),
(u'DEF', 200),
(u'XXX', 120)
So what I am trying to do, following the hints in the assignment, is:
first define a function sum_views(a, b) for the input dataset,
and then run a reduceByKey, i.e.
c_views.reduceByKey(sum_views).collect()
However, I do not understand what exactly I have to put in the function. I have tried many things, but I always get an error. Does this workflow make sense? Are there other simple ways to achieve the result?
Any suggestions? Thank you in advance for your help.
Answer 1:
Other simple ways to achieve the result?
from operator import add
c_views.reduceByKey(add)
or if you prefer lambda expressions:
c_views.reduceByKey(lambda x, y: x + y)
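Either form, once collected, should give the per-key sums from the question. A minimal check, assuming c_views is the RDD shown above (the order of the keys in the output may vary):

c_views.reduceByKey(add).collect()
# [(u'ABC', 100), (u'DEF', 200), (u'XXX', 120)]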
I do not understand what exactly I have to code in the function
It has to be a function which takes two values of the same type as the values in your RDD and returns a value of the same type as its inputs. It also has to be associative, which means that the final result cannot depend on how you arrange the parentheses.
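If you want to keep the sum_views approach from the question, the function only needs to add its two arguments. Here is a minimal, self-contained sketch, assuming a SparkContext named sc is already available (e.g. in the pyspark shell) and reconstructing the dataset from the question:

# Hypothetical reconstruction of the dataset shown in the question
c_views = sc.parallelize([(u'ABC', 100), (u'DEF', 200), (u'XXX', 50), (u'XXX', 70)])

def sum_views(a, b):
    # a and b are two view counts that already share the same key;
    # return a value of the same type (an int) and keep the operation
    # associative: (a + b) + c == a + (b + c)
    return a + b

c_views.reduceByKey(sum_views).collect()
# [(u'ABC', 100), (u'DEF', 200), (u'XXX', 120)]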
Source: https://stackoverflow.com/questions/35070001/pyspark-and-reducebykey-how-to-make-a-simple-sum