How to understand reduceByKey in Spark?

Posted by 本小妞迷上赌 on 2019-12-04 22:06:14

Let's say you have [("key", [13,445]), ("key", [14,109]), ("key", [15,309])]

When this is passed to reduceByKey, Spark groups all the values with the same key onto one executor, i.e. [13,445], [14,109], [15,309], and then folds over those values pairwise with the reduce function.

In the first iteration, x is [13,445] and y is [14,109], so the output is max(x[1], y[1]), i.e. max(445, 109), which is 445.

In the second iteration, x is 445 (the result of the previous step) and y is [15,309].

Now, when x[1] is evaluated to get the second element of x, but 445 is just an integer, the error occurs:

TypeError: 'int' object is not subscriptable
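The fold described above can be simulated without Spark using functools.reduce, which applies the same pairwise logic that reduceByKey applies per key (the values list here is hypothetical, taken from the example above):

```python
from functools import reduce

# Hypothetical values already grouped under one key
values = [[13, 445], [14, 109], [15, 309]]

# The problematic reducer: it returns a bare int, not a pair
broken = lambda x, y: max(x[1], y[1])

# Step 1: x = [13, 445], y = [14, 109] -> max(445, 109) = 445 (an int)
# Step 2: x = 445, so x[1] fails with the TypeError shown above
try:
    reduce(broken, values)
except TypeError as e:
    print(e)  # 'int' object is not subscriptable
```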

I hope the meaning of the error is clear now. You can find more details in my other answer.

This also explains why the solution proposed by @pault in the comments works, i.e.

reduceByKey(lambda x, y: (x[0], max(x[1], y[1])))
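The fix works because the reducer now returns a value of the same shape as its inputs (a pair), so x[1] remains valid on every iteration. Again simulating with functools.reduce on the hypothetical values from above:

```python
from functools import reduce

values = [[13, 445], [14, 109], [15, 309]]

# Keep the pair shape at every step, so x[1] is always subscriptable
fixed = lambda x, y: (x[0], max(x[1], y[1]))

# Step 1: ([13,445], [14,109]) -> (13, 445)
# Step 2: ((13,445), [15,309]) -> (13, 445)
print(reduce(fixed, values))  # (13, 445)
```

Note the general rule: the function passed to reduceByKey must be associative and must return the same type as its inputs, because its output is fed back in as x on the next step.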