Spark: How to “reduceByKey” when the keys are numpy arrays which are not hashable?


The simplest solution is to convert the key to a hashable object, for example a tuple of the array's values:

from operator import add

# Use the array's contents, converted to a tuple, as a hashable key,
# and keep the per-array sum as the value to be reduced.
reduced = sc.parallelize(data).map(
    lambda x: (tuple(x), x.sum())
).reduceByKey(add)

and convert it back later if needed.
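
For instance, a minimal sketch of restoring the array keys afterwards (assuming `reduced` is the RDD from the snippet above and that NumPy is imported as `np`):

import numpy as np

# Turn the tuple keys back into NumPy arrays once the aggregation is done.
restored = reduced.map(lambda kv: (np.array(kv[0]), kv[1]))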

Is there a way to supply the Spark context with my manual hash function?

Not a straightforward one. The whole mechanism depends on the fact that an object implements a __hash__ method, and C extensions cannot be monkey patched. You could try to use dispatching to override pyspark.rdd.portable_hash, but I doubt it is worth it, even when you take the cost of the conversions into account.
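
For illustration only, a rough sketch of what such an override might look like (the name `array_aware_hash` and the bytes-based hashing are my assumptions, not anything PySpark provides); as said above, it is probably not worth pursuing:

import numpy as np
from pyspark.rdd import portable_hash

def array_aware_hash(obj):
    # Hypothetical: hash NumPy arrays by their raw bytes and defer to
    # PySpark's default portable_hash for every other key type.
    if isinstance(obj, np.ndarray):
        return hash(obj.tobytes())
    return portable_hash(obj)

# It could be passed explicitly, e.g.
#     rdd.reduceByKey(add, partitionFunc=array_aware_hash)
# but that only changes how keys are assigned to partitions; the map-side
# combine still keeps keys in a plain Python dict, so the arrays themselves
# would still need to implement __hash__.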
