A list as a key for PySpark's reduceByKey

Submitted by 天涯浪子 on 2019-11-26 09:59:18

Question


I am attempting to call pyspark's reduceByKey function on data of the format [([a,b,c], 1), ([a,b,c], 1), ([a,d,b,e], 1), ...]

It seems pyspark will not accept an array as the key in a normal (key, value) reduction performed by simply applying .reduceByKey(add).

I have already tried converting the array to a string first, with .map(lambda x: (str(x[0]), x[1])), but this does not work because post-processing the strings back into arrays is too slow.

Is there a way I can make pyspark use the array as a key or use another function to quickly convert the strings back to arrays?

Here is the associated error:

  File "/home/jan/Documents/spark-1.4.0/python/lib/pyspark.zip/pyspark/shuffle.py", line 268, in mergeValues
    d[k] = comb(d[k], v) if k in d else creator(v)
TypeError: unhashable type: 'list'

SUMMARY:

input: x = [([a,b,c], 1), ([a,b,c], 1), ([a,d,b,e], 1), ...]

desired output: y = [([a,b,c], 2), ([a,d,b,e], 1), ...] such that I could access a by y[0][0][0] and 2 by y[0][1]


Answer 1:


Try this:

rdd.map(lambda kv: (tuple(kv[0]), kv[1])).groupByKey()

Since Python lists are mutable, they cannot be hashed (they don't provide a __hash__ method):

>>> a_list = [1, 2, 3]
>>> a_list.__hash__ is None
True
>>> hash(a_list)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: unhashable type: 'list'

Tuples, on the other hand, are immutable and do provide a __hash__ implementation:

>>> a_tuple = (1, 2, 3)
>>> a_tuple.__hash__ is None
False
>>> hash(a_tuple)
2528502973977326415

and hence can be used as a key. Similarly, if you want to use unique values as a key, you should use a frozenset:

rdd.map(lambda kv: (frozenset(kv[0]), kv[1])).groupByKey().collect()

instead of a set.

# This will fail with TypeError: unhashable type: 'set'
rdd.map(lambda kv: (set(kv[0]), kv[1])).groupByKey().collect()
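Since the original question asked for reduceByKey(add), the same tuple-key trick applies there as well. The hashability requirement is easy to check without a cluster; the plain-Python sketch below mimics what PySpark's shuffle does when it merges values by key (the input mirrors the SUMMARY above, with string elements standing in for a, b, c, d, e):

```python
from operator import add

# Input shaped like the question's SUMMARY, with lists as keys
x = [(['a', 'b', 'c'], 1), (['a', 'b', 'c'], 1), (['a', 'd', 'b', 'e'], 1)]

# Convert each list key to a hashable tuple, as rdd.map(...) would
pairs = [(tuple(k), v) for k, v in x]

# Merge values per key the way reduceByKey(add) does during the shuffle
d = {}
for k, v in pairs:
    d[k] = add(d[k], v) if k in d else v

y = list(d.items())
# y == [(('a', 'b', 'c'), 2), (('a', 'd', 'b', 'e'), 1)]
# y[0][0][0] == 'a' and y[0][1] == 2, matching the desired access pattern
```

On an actual RDD the equivalent would be rdd.map(lambda kv: (tuple(kv[0]), kv[1])).reduceByKey(add).collect(); with list keys the same merge step raises the TypeError shown in the question.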


Source: https://stackoverflow.com/questions/31404238/a-list-as-a-key-for-pysparks-reducebykey
