Question
I have a simple dataset on HDFS that I'm loading into Spark. It looks like this:
1 1 1 1 1 1 1
1 1 1 1 1 1 1
1 1 1 1 1 1 1
...
basically, a matrix. I'm trying to implement something that requires grouping matrix rows, and so I'm trying to add a unique key for every row like so:
(1, [1 1 1 1 1 ... ])
(2, [1 1 1 1 1 ... ])
(3, [1 1 1 1 1 ... ])
...
I tried something somewhat naive: set a global variable and write a mapping function that increments the global variable:
# initialize global index
global global_index
global_index = 0

# function to generate keys
def generateKeys(x):
    global_index += 1
    return (global_index, x)

# read in data and operate on it
data = sc.textFile("/data.txt")
...some preprocessing...
data.map(generateKeys)
And it seemed to not recognize the existence of the global variable.
Is there an easy way that comes to mind to do this?
Thanks, Jack
Answer 1:
>>> lsts = [
... [1, 1, 1, 1, 1, 1],
... [1, 1, 1, 1, 1, 1],
... [1, 1, 1, 1, 1, 1],
... [1, 1, 1, 1, 1, 1],
... [1, 1, 1, 1, 1, 1],
... [1, 1, 1, 1, 1, 1],
... [1, 1, 1, 1, 1, 2],
... [1, 1, 1, 2, 1, 2]
... ]
>>> list(enumerate(lsts))
[(0, [1, 1, 1, 1, 1, 1]),
(1, [1, 1, 1, 1, 1, 1]),
(2, [1, 1, 1, 1, 1, 1]),
(3, [1, 1, 1, 1, 1, 1]),
(4, [1, 1, 1, 1, 1, 1]),
(5, [1, 1, 1, 1, 1, 1]),
(6, [1, 1, 1, 1, 1, 2]),
(7, [1, 1, 1, 2, 1, 2])]
enumerate generates a unique index for each item in the iterable and yields (index, original_item) tuples. If you want to start numbering with something other than 0, pass the starting value to enumerate as the second parameter.
>>> list(enumerate(lsts, 1))
[(1, [1, 1, 1, 1, 1, 1]),
(2, [1, 1, 1, 1, 1, 1]),
(3, [1, 1, 1, 1, 1, 1]),
(4, [1, 1, 1, 1, 1, 1]),
(5, [1, 1, 1, 1, 1, 1]),
(6, [1, 1, 1, 1, 1, 1]),
(7, [1, 1, 1, 1, 1, 2]),
(8, [1, 1, 1, 2, 1, 2])]
Note that list is used to get the real values out of enumerate, which returns an iterator, not a list.
Alternative: a globally available id assigner
enumerate is easy to use, but if you need to assign ids in different pieces of your code, it would become difficult or impossible. For such a case, a globally available generator (as drafted in the OP) is the way to go.
itertools provides count, which can serve our need:
>>> from itertools import count
>>> idgen = count()
Now we have a (globally available) idgen generator ready to yield unique ids. We can test it with a function prid (print id):
>>> def prid():
...     id = next(idgen)
...     print(id)
...
>>> prid()
0
>>> prid()
1
>>> prid()
2
>>> prid()
3
As it works, we can test it on a list of values:
>>> lst = ['100', '101', '102', '103', '104', '105', '106', '107', '108', '109']
and define the actual function, which when called with a value returns a tuple (id, value):
>>> def assignId(val):
...     return (next(idgen), val)
...
Note that there is no need to declare idgen as global, as we are not going to change its value (idgen only changes its internal state when called, but remains the same generator).
Test if it works:
>>> assignId("ahahah")
(4, 'ahahah')
and try it on the list:
>>> list(map(assignId, lst))
[(5, '100'),
(6, '101'),
(7, '102'),
(8, '103'),
(9, '104'),
(10, '105'),
(11, '106'),
(12, '107'),
(13, '108'),
(14, '109')]
The main difference from the enumerate solution is that we can assign ids one by one anywhere in the code, without doing it all inside a single pass of enumerate.
>>> assignId("lonely line")
(15, 'lonely line')
Answer 2:
Try dataRdd.zipWithIndex, and swap the elements of the resulting tuples if having the index first is a must.
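For example, a minimal sketch of that approach in PySpark (assuming an existing SparkContext named sc and the whitespace-separated file from the question; the parsing step is only illustrative and stands in for the OP's preprocessing):

data = sc.textFile("/data.txt")
# parse each line into a list of ints (illustrative stand-in for "...some preprocessing...")
rows = data.map(lambda line: [int(x) for x in line.split()])

# zipWithIndex pairs each element with its index, yielding (row, index)
indexed = rows.zipWithIndex()

# swap so the key comes first, giving the desired (index, row) shape
keyed = indexed.map(lambda pair: (pair[1], pair[0]))

# e.g. keyed.take(3) -> [(0, [1, 1, 1, 1, 1, 1, 1]), (1, [...]), (2, [...])]

Note that zipWithIndex starts numbering at 0; if 1-based keys are needed as in the question, add 1 to the index in the same map.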
Source: https://stackoverflow.com/questions/24689363/spark-using-iterator-lambda-function-in-rdd-map