parallelizing combinations python [closed]

Submitted by 纵然是瞬间 on 2019-12-25 03:45:42

Question


How can I parallelize the code below? The attributes column has nearly 15 elements, so generating the combinations is taking a long time.

combs = set()
for L in range(0,len(attributes)+1):
    combs.add(itertools.combinations(attributes,L))

Is there any way to parallelize it using multiprocessing?

I tried the following, but I am getting this error:

if chunksize <= 0:

TypeError: unorderable types: range() <= int()

import itertools     
from multiprocessing import Pool
def comb(attributes):
    res = itertools.combinations(attributes)
    return res

def main():
    p = Pool(4)
    times = range(0,len(attributes)+1)                                                
    values = p.map(comb,attributes,times)
    p.close()
    p.join()
    print(values)

if __name__ == '__main__':
    attributes =('Age', 'Workclass', 'Fnlwgt', 'Education', 'Education-num', 'marital-status', 'Occupation', 'Relationship', 'Race', 'Sex', 'Capital-gain', 'Capital-loss', 'Hours-per-week', 'Native country', 'Probability', 'Id')
    main()

Since it was asked to explain the question, here you go... I am trying to get the combinations without replacement. Basically n!. For example, if I have A, B, C in my attributes variable, I am trying to get (A), (B), (C), (A,B), (A,C), (B,C), (A,B,C). Since the number of elements in attributes is not static and changes based on the input dataset, I can't hard-code it. So I am using len(attributes) here, where attributes stores the attribute names from the dataset. Then, to create the combinations, itertools.combinations(attributes, L) creates all combinations of length L. In my example, if I pass len(attributes) alone, I get only ABC and none of the shorter combinations. So I created a range over the lengths and added one to it so the range includes the full length.

Now coming back to the problem: I may get 15 elements in my dataset, so len(attributes) will be 15, which means 15!. Generating the combinations takes a lot of time because of this factorial growth. So I am thinking of parallelizing it so that each processor handles one combination length at a time; for example, one processor generates all the combinations of length 2, another all of length 3, and so on. But with pool map I am not able to pass more than one argument properly. I hope this clears up the situation; let me know if further explanation is needed.
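On the multiple-argument point: Pool.map accepts a single iterable and passes one item per call. To pass several arguments per call, Pool.starmap unpacks argument tuples. A minimal sketch of the per-length split described above, using a shortened stand-in attributes tuple:

```python
import itertools
from multiprocessing import Pool

attributes = ('A', 'B', 'C')  # shortened stand-in for the real 15-element tuple

def comb(attrs, n):
    # materialize the length-n combinations into a list so they can be sent back
    return list(itertools.combinations(attrs, n))

if __name__ == '__main__':
    with Pool(2) as p:
        # starmap unpacks each (attrs, n) tuple into a comb(attrs, n) call,
        # so each worker handles one combination length
        args = [(attributes, n) for n in range(len(attributes) + 1)]
        results = p.starmap(comb, args)
    print(results)
```

Each worker then generates all combinations of a single length, which is exactly the division of labor described above.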


Answer 1:


There are several issues with your multiprocessing code that mean it won't work like your single-process version.

To start with, you're not calling p.map properly. Its arguments are the function to call, a single iterable of arguments, and an optional integer chunksize specifying how many values to send to a worker at a time. You're passing a range object as chunksize, which is the immediate cause of your error.
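For reference, the signature is map(func, iterable, chunksize=None), and chunksize must be an int if given. A minimal sketch with a hypothetical square function:

```python
from multiprocessing import Pool

def square(x):
    return x * x

if __name__ == '__main__':
    with Pool(2) as p:
        # the third positional argument is chunksize: inputs go to workers in batches of 2
        print(p.map(square, [1, 2, 3, 4], 2))  # [1, 4, 9, 16]
```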

If you fix that, though, you'll find other issues. For instance, you're passing attributes as the iterable, so map will send a single attribute to each worker call rather than the whole tuple. Your comb function also calls itertools.combinations with only one argument, even though it requires the combination length as a second argument. And once that is fixed, comb still returns an iterator over the combination values, not the values themselves (so the worker will complete more or less instantly, but return something that can't be usefully printed out).

Here's what I believe to be working code:

import itertools     
from multiprocessing import Pool

# attributes is always accessible as a global, so worker processes can directly access it
attributes = ('Age', 'Workclass', 'Fnlwgt', 'Education', 'Education-num',
              'marital-status', 'Occupation', 'Relationship', 'Race', 'Sex',
              'Capital-gain', 'Capital-loss', 'Hours-per-week', 'Native country',
              'Probability', 'Id')

def comb(n): # the argument n is the number of items to select
    res = list(itertools.combinations(attributes, n)) # create a list from the iterator
    return res

def main():
    p = Pool(4)
    times = range(0, len(attributes)+1)                                                
    values = p.map(comb, times) # pass the range as the sequence of arguments!
    p.close()
    p.join()
    print(values)

if __name__ == '__main__':
    main()

This code still takes a while to complete if the list of attributes is large, but that's simply because there are an awful lot of values to print out (the powerset of a set with n values has 2^n subsets). My IDE told me that the output was over 88000 lines (thankfully it didn't show them all). I wouldn't be surprised if the multiprocessing part of the problem is less of an issue than the output part!
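That size claim is easy to check without printing anything: an n-element set has sum over k of C(n, k) = 2^n subsets, so for the 16-element attributes tuple in the code above:

```python
from math import comb as choose  # n-choose-k; available since Python 3.8

n = 16  # length of the attributes tuple in the code above
total = sum(choose(n, k) for k in range(n + 1))
print(total)  # 65536, i.e. 2 ** 16
```

So the combined result holds tens of thousands of tuples regardless of how the generation is parallelized.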



Source: https://stackoverflow.com/questions/25909189/parallelizing-combinations-python
