Spark: Using iterator lambda function in RDD map()


Question


I have simple dataset on HDFS that I'm loading into Spark. It looks like this:

1 1 1 1 1 1 1
1 1 1 1 1 1 1
1 1 1 1 1 1 1
...

Basically, a matrix. I'm trying to implement something that requires grouping matrix rows, so I'm trying to add a unique key to every row, like so:

(1, [1 1 1 1 1 ... ])
(2, [1 1 1 1 1 ... ])
(3, [1 1 1 1 1 ... ])
...

I tried something somewhat naive: set a global variable and write a mapping function that increments the global variable for each row:

# initialize global index
global global_index
global_index = 0

# function to generate keys
def generateKeys(x):
    global_index+=1
    return (global_index,x)

# read in data and operate on it
data = sc.textFile("/data.txt")

...some preprocessing...

data.map(generateKeys)

But it doesn't seem to recognize the existence of the global variable.

Is there an easy way that comes to mind to do this?

Thanks, Jack


Answer 1:


>>> lsts = [
...     [1, 1, 1, 1, 1, 1],
...     [1, 1, 1, 1, 1, 1],
...     [1, 1, 1, 1, 1, 1],
...     [1, 1, 1, 1, 1, 1],
...     [1, 1, 1, 1, 1, 1],
...     [1, 1, 1, 1, 1, 1],
...     [1, 1, 1, 1, 1, 2],
...     [1, 1, 1, 2, 1, 2]
...     ]
...
>>> list(enumerate(lsts))
[(0, [1, 1, 1, 1, 1, 1]),
 (1, [1, 1, 1, 1, 1, 1]),
 (2, [1, 1, 1, 1, 1, 1]),
 (3, [1, 1, 1, 1, 1, 1]),
 (4, [1, 1, 1, 1, 1, 1]),
 (5, [1, 1, 1, 1, 1, 1]),
 (6, [1, 1, 1, 1, 1, 2]),
 (7, [1, 1, 1, 2, 1, 2])]

enumerate generates a unique index for each item in the iterable and yields tuples of the form (index, original_item).

If you want to start numbering from something other than 0, pass the starting value to enumerate as the second parameter.

>>> list(enumerate(lsts, 1))
[(1, [1, 1, 1, 1, 1, 1]),
 (2, [1, 1, 1, 1, 1, 1]),
 (3, [1, 1, 1, 1, 1, 1]),
 (4, [1, 1, 1, 1, 1, 1]),
 (5, [1, 1, 1, 1, 1, 1]),
 (6, [1, 1, 1, 1, 1, 1]),
 (7, [1, 1, 1, 1, 1, 2]),
 (8, [1, 1, 1, 2, 1, 2])]

Note that list is used to materialize the values, because enumerate returns an iterator rather than a list.

Alternative: globally available id assigner

enumerate is easy to use, but if you need to assign ids in different pieces of your code, it becomes difficult or impossible. For such a case, a globally available generator (as drafted in the OP) would be the way to go.

itertools provides count, which can serve our need:

>>> from itertools import count
>>> idgen = count()

Now we have a (globally available) idgen generator ready to yield unique ids.

We can test it with a function prid (print id):

>>> def prid():
...     id = next(idgen)
...     print(id)
...
>>> prid()
0
>>> prid()
1
>>> prid()
2
>>> prid()
3

As it works we can test it on list of values:

>>> lst = ['100', '101', '102', '103', '104', '105', '106', '107', '108', '109']

and define the actual function, which, when called with a value, returns the tuple (id, value):

>>> def assignId(val):
...     return (next(idgen), val)
...

Note that there is no need to declare idgen as global, because we never rebind the name (the idgen generator only changes its internal state when advanced, but remains the same object).
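For contrast with the OP's attempt, here is a minimal sketch (not from the original answer) of why a plain integer counter does need the declaration: augmented assignment rebinds the module-level name, so without global the function raises UnboundLocalError.

>>> counter = 0
>>> def bump():
...     global counter      # required: the assignment below rebinds the module-level name
...     counter += 1
...     return counter
...
>>> bump()
1
>>> bump()
2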

Test that assignId works:

>>> assignId("ahahah")
(4, 'ahahah')

and try it on the list:

>>> list(map(assignId, lst))
[(5, '100'),
 (6, '101'),
 (7, '102'),
 (8, '103'),
 (9, '104'),
 (10, '105'),
 (11, '106'),
 (12, '107'),
 (13, '108'),
 (14, '109')]

The main difference from the enumerate solution is that we can assign ids one by one anywhere in the code, without doing it all within a single pass of enumerate.

>>> assignId("lonely line")
(15, 'lonely line')



Answer 2:


Try dataRdd.zipWithIndex() and, if needed, swap the resulting tuple so the index comes first.
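A minimal PySpark sketch of this approach, assuming the rows are split into lists as in the question (indices start at 0; add 1 to the index if keys must start at 1):

>>> rows = sc.textFile("/data.txt").map(lambda line: line.split())
>>> keyed = rows.zipWithIndex().map(lambda vi: (vi[1], vi[0]))   # swap so the index comes first
>>> keyed.take(3)   # expected shape: [(0, ['1', '1', ...]), (1, [...]), (2, [...])]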



Source: https://stackoverflow.com/questions/24689363/spark-using-iterator-lambda-function-in-rdd-map
