Parallelize brute force generation

深忆病人 2021-01-26 02:50

I've searched a lot for this, but was unable to find something I could wrap my head around. What I'm searching for is how to make this algorithm parallel. It doesn't matter…

2 Answers
  •  梦谈多话
    2021-01-26 03:45

    First thing you have to do is get disenchanted with 1-liner disease ;-) That is, parallel processing adds many complexities of its own, so you want to keep your code as simple and transparent as possible. Here's a way that generalizes @BasSwinckels's suggestion. It's not short! But it's very effective: it will nail your CPU meter to the wall no matter how many cores you have.

    CHARSET = "abcdefghijklmnopqrstuvwxyz"
    MAX_LENGTH = 6  # generate all strings from CHARSET with length 1 thru MAX_LENGTH
    NUM_PROCESSES = None # defaults to all available cores
    
    from itertools import product
    
    # string_gen is what the workers run.  Everything else
    # runs in the main program.
    def string_gen(prefix, suffix_len, length):
        # Generate all strings of length `length` starting with `prefix`.
        # If length > suffix_len, only the last suffix_len characters
        # need to be generated.
        num_done = 0
        if length <= suffix_len:
            assert prefix == ""
            for t in product(CHARSET, repeat=length):
                result = "".join(t)
                # do something with result
                num_done += 1
        else:
            assert len(prefix) + suffix_len == length
            for t in product(CHARSET, repeat=suffix_len):
                result = prefix + "".join(t)
                # do something with result
                num_done += 1
        return num_done
    
    def record_done(num):
        global num_done
        num_done += num
        print(num_done, "done so far")
    
    def do_work(pool, strings_per_chunk=1000000):
        # What's the most chars we can cycle through without
        # exceeding strings_per_chunk?  Could do this with
        # logs, but I'm over-reacting to 1-liner disease ;-)
        N = len(CHARSET)
        suffix_len = 1
        while N**suffix_len <= strings_per_chunk:
            suffix_len += 1
        suffix_len -= 1
        print "workers will cycle through the last", suffix_len, "chars"
    
        # There's no point to splitting up very short strings.
        max_short_len = min(suffix_len, MAX_LENGTH)
        for length in range(1, max_short_len + 1):
            pool.apply_async(string_gen, args=("", suffix_len, length),
                             callback=record_done)
        # And now the longer strings.
        for length in range(max_short_len + 1, MAX_LENGTH + 1):
            for t in product(CHARSET, repeat=length-suffix_len):
                prefix = "".join(t)
                pool.apply_async(string_gen, args=(prefix, suffix_len, length),
                                 callback=record_done)
    
    if __name__ == "__main__":
        import multiprocessing
        pool = multiprocessing.Pool(NUM_PROCESSES)
        num_done = 0
        do_work(pool)
        pool.close()
        pool.join()
        expected = sum(len(CHARSET)**i
                       for i in range(1, MAX_LENGTH + 1))
        assert num_done == expected, (num_done, expected)
    

    There are multiple pieces to this because what you want is "lumpy": strings of a variety of sizes. Parallel gimmicks are usually easier the more utterly uniform the structure of the problem. But lumpiness can be handled - it just takes more code.
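    Whatever the eventual "do something with result" is, it runs inside those lumpy chunks. A minimal hypothetical fill-in for that placeholder (the target digest and `find_match` are my invention, assuming a password-recovery style search against a known MD5 hash, not something from the answer):

```python
import hashlib
from itertools import product

CHARSET = "abcdefghijklmnopqrstuvwxyz"

# Hypothetical target: in a real run this digest would come from
# whatever you are actually trying to recover.
TARGET = hashlib.md5(b"cat").hexdigest()

def find_match(prefix, suffix_len):
    # Same shape as string_gen's inner loop, with the
    # "# do something with result" placeholder filled in.
    for t in product(CHARSET, repeat=suffix_len):
        result = prefix + "".join(t)
        if hashlib.md5(result.encode()).hexdigest() == TARGET:
            return result
    return None
```

    In the parallel version each worker would run this over its own prefix, and the callback would collect any hit instead of just a count.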

    Do note the assert statements, and the "useless" computations of num_done. Parallel code adds brand new dimensions of complexity, so do yourself a favor and code defensively from the start. You're going to try many things that simply won't work - that happens to everyone.
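    In that defensive spirit, the `suffix_len` loop in `do_work` can be cross-checked against the log-based 1-liner its comment alludes to (a sketch; the floating-point version is shown only for comparison, and its rounding near exact powers of `len(CHARSET)` is a good reason to prefer the integer loop):

```python
import math

CHARSET = "abcdefghijklmnopqrstuvwxyz"

def suffix_len_by_loop(strings_per_chunk):
    # The loop from do_work(): largest suffix_len such that
    # len(CHARSET) ** suffix_len <= strings_per_chunk.
    N = len(CHARSET)
    suffix_len = 1
    while N**suffix_len <= strings_per_chunk:
        suffix_len += 1
    return suffix_len - 1

def suffix_len_by_log(strings_per_chunk):
    # The 1-liner version; floating-point log can round the
    # wrong way near exact powers of N.
    return int(math.log(strings_per_chunk, len(CHARSET)))

assert suffix_len_by_loop(1000000) == suffix_len_by_log(1000000) == 4
```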

    Note too that breaking free of 1-liner disease also gives a more efficient approach, even without multiple cores: computing and joining the prefix just once for the longer strings will save countless billions of redundant joins over the course of longer runs.
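    To see that saving concretely, here is a small sketch (the function names are mine, not from the answer) comparing the naive join-everything approach with the join-the-prefix-once structure that `string_gen` uses; both produce identical strings in the same order:

```python
from itertools import product

CHARSET = "abc"  # tiny charset so the example stays small

def gen_naive(length):
    # Joins all `length` characters for every single string.
    return ["".join(t) for t in product(CHARSET, repeat=length)]

def gen_prefix_once(length, suffix_len):
    # Joins each shared prefix exactly once, then only the short
    # suffix inside the inner loop -- the structure string_gen uses.
    out = []
    for p in product(CHARSET, repeat=length - suffix_len):
        prefix = "".join(p)  # one join per chunk, not per string
        for s in product(CHARSET, repeat=suffix_len):
            out.append(prefix + "".join(s))
    return out

assert gen_naive(4) == gen_prefix_once(4, 2)
```

    With a 26-character set and length-6 strings, that is one prefix join per 26**suffix_len strings instead of one full join per string.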

    Have fun :-)
