Question
How can I parallelize the code below? The number of elements in the attributes column is nearly 15, so generating the combinations takes a long time.
combs = set()
for L in range(0, len(attributes) + 1):
    combs.add(itertools.combinations(attributes, L))
Any way to parallelize it using multiprocessing?
I tried this, but I am getting this error:
    if chunksize <= 0:
TypeError: unorderable types: range() <= int()
import itertools
from multiprocessing import Pool

def comb(attributes):
    res = itertools.combinations(attributes)
    return res

def main():
    p = Pool(4)
    times = range(0, len(attributes) + 1)
    values = p.map(comb, attributes, times)
    p.close()
    p.join()
    print(values)

if __name__ == '__main__':
    attributes = ('Age', 'Workclass', 'Fnlwgt', 'Education', 'Education-num', 'marital-status', 'Occupation', 'Relationship', 'Race', 'Sex', 'Capital-gain', 'Capital-loss', 'Hours-per-week', 'Native country', 'Probability', 'Id')
    main()
Since it was asked to explain the question, here you go. I am trying to get the combinations without replacement, essentially the power set of the attributes. For example, if I have A, B, C in my attributes variable, I am trying to get (A), (B), (C), (A,B), (A,C), (B,C), (A,B,C). Since the number of elements in attributes is not static and changes based on the input dataset, I can't hard-code it, so I use len(attributes), where attributes stores the attributes from the dataset. To create the combinations, itertools.combinations(attributes, L) generates all combinations of length L. In my example, if I pass len(attributes) directly, I get only ABC and none of the shorter combinations, so I iterate L over a range, adding one to the upper bound so the full-length combination is included.
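To make this concrete, here is a minimal single-process sketch of the enumeration I mean (a toy three-element tuple stands in for the real attributes, and a list is used so the combinations themselves are stored rather than iterator objects):

import itertools

attributes = ('A', 'B', 'C')  # toy stand-in for the real attribute tuple

# Collect combinations of every length from 0 to len(attributes), i.e. the power set.
combs = []
for L in range(0, len(attributes) + 1):
    combs.extend(itertools.combinations(attributes, L))

print(combs)
# [(), ('A',), ('B',), ('C',), ('A', 'B'), ('A', 'C'), ('B', 'C'), ('A', 'B', 'C')]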
Now coming back to the problem: I may get 15 elements in my dataset, so len(attributes) will be 15, and the number of combinations across all lengths explodes. Generating them takes a lot of time. So I am thinking of parallelizing this so that each processor handles one combination length at a time; for example, one processor generates all the combinations of length 2 while another generates those of length 3, and so on. But with pool.map I am not able to pass more than one argument properly. Hope this clears up the situation; let me know if further explanation is needed.
Answer 1:
There are several issues with your multiprocessing code that mean it won't work like your single-process version.

To start with, you're not calling p.map properly. The map method's arguments are the function to call, the iterable of arguments (a single sequence), and a chunk size specifying how many values to pass to a worker at a time. You're passing a range object as chunksize, which is the immediate cause of your error.

If you try to fix that, though, you'll find other issues. For instance, you're passing attributes to map in such a way that it will pass just a single value on to each worker process, rather than the whole list of attributes. And your comb function returns an iterator over the combination values, not the values themselves (so the worker will complete more or less instantly, but return something that can't be usefully printed out).
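To see the call shape concretely, here's a minimal sketch (with a hypothetical square helper, not part of the original code) showing how map's three positional slots line up:

from multiprocessing import Pool

def square(x):  # hypothetical helper, used only to illustrate the call shape
    return x * x

if __name__ == '__main__':
    with Pool(4) as p:
        # map(func, iterable, chunksize): the optional third argument must be
        # an int, not a range -- passing a range there is what raised
        # "unorderable types: range() <= int()" in the question's code.
        values = p.map(square, range(10), 2)
    print(values)  # [0, 1, 4, 9, 16, 25, 36, 49, 64, 81]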
Here's what I believe to be working code:
import itertools
from multiprocessing import Pool

# attributes is always accessible as a global, so worker processes can directly access it
attributes = ('Age', 'Workclass', 'Fnlwgt', 'Education', 'Education-num',
              'marital-status', 'Occupation', 'Relationship', 'Race', 'Sex',
              'Capital-gain', 'Capital-loss', 'Hours-per-week', 'Native country',
              'Probability', 'Id')

def comb(n):  # the argument n is the number of items to select
    res = list(itertools.combinations(attributes, n))  # create a list from the iterator
    return res

def main():
    p = Pool(4)
    times = range(0, len(attributes) + 1)
    values = p.map(comb, times)  # pass the range as the sequence of arguments!
    p.close()
    p.join()
    print(values)

if __name__ == '__main__':
    main()
This code still takes a while to complete if the list of attributes is large, but that's simply because there are an awful lot of values to print out (the power set of a set with n values has 2^n subsets). My IDE told me that the output was over 88,000 lines (it thankfully didn't show them all). I wouldn't be surprised if the multiprocessing part of the problem is less of an issue than the output part!
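If you want a sense of that volume before printing anything, a quick sketch (assuming Python 3.8+ for math.comb, which is newer than this answer) counts the combinations per length without generating any of them:

import math

n = 16  # len(attributes) for the tuple above

# Count the combinations of each length without materializing them.
for L in range(n + 1):
    print("length", L, ":", math.comb(n, L), "combinations")

print("total:", 2 ** n)  # the per-length counts sum to 2**16 = 65536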
Source: https://stackoverflow.com/questions/25909189/parallelizing-combinations-python