Question
I am trying some code in Spark (pyspark) for an assignment. This is the first time I have used this environment, so I am sure I am missing something…
I have a simple dataset called c_views.
If I run
c_views.collect()
I get:

[…
(u'ABC', 100),
(u'DEF', 200),
(u'XXX', 50),
(u'XXX', 70)]
[…]
What I need to achieve is the sum of the values for each word (key). So my guess is that I should get something like:
(u'ABC', 100),
(u'DEF', 200),
(u'XXX', 120)
So what I am trying to do, following the hints in the assignment, is:
first define a function sum_views(a, b) for the input dataset,
and then run a reduceByKey, i.e.
c_views.reduceByKey(sum_views).collect()
However, I do not understand what exactly I have to put in the function. I have tried many things, but I always get an error. Does this workflow make sense? Are there other simple ways to achieve the result?
Any suggestions? Thank you in advance for your help.
Answer 1:
Other simple ways to achieve the result?
from operator import add
c_views.reduceByKey(add)
or if you prefer lambda expressions:
c_views.reduceByKey(lambda x, y: x + y)
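Either form, once collected, should give the per-key sums from the question. A minimal check, assuming c_views is the RDD shown above (the order of the keys in the output may vary):

c_views.reduceByKey(add).collect()
# [(u'ABC', 100), (u'DEF', 200), (u'XXX', 120)]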
I do not understand what exactly I have to code in the function
It has to be a function which takes two values of the same type as the values in your RDD and returns a value of the same type as its inputs. It also has to be associative, which means that the final result cannot depend on how you arrange the parentheses.
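If you want to keep the sum_views approach from the question, the function only needs to add its two arguments. Here is a minimal, self-contained sketch, assuming a SparkContext named sc is already available (e.g. in the pyspark shell) and reconstructing the dataset from the question:

# Hypothetical reconstruction of the dataset shown in the question
c_views = sc.parallelize([(u'ABC', 100), (u'DEF', 200), (u'XXX', 50), (u'XXX', 70)])

def sum_views(a, b):
    # a and b are two view counts that already share the same key;
    # return a value of the same type (an int) and keep the operation
    # associative: (a + b) + c == a + (b + c)
    return a + b

c_views.reduceByKey(sum_views).collect()
# [(u'ABC', 100), (u'DEF', 200), (u'XXX', 120)]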
Source: https://stackoverflow.com/questions/35070001/pyspark-and-reducebykey-how-to-make-a-simple-sum