How to multiprocess, multithread a big file by dividing into small chunks based on values of a particular column?

╄→尐↘猪︶ㄣ posted on 2019-12-02 10:35:14

An issue to handle when using multiple processes is preserving order. Python has a rather nice way of dealing with this: multiprocessing.Pool, whose map method applies a function to each input item in worker processes and returns the results in input order.

However, the processing itself may still finish out of order, so to use this properly only the computation, and no IO, should run in the subprocesses. Therefore, to apply it to your case, your code needs a small rewrite so that all IO operations happen in the main process:

from multiprocessing import Pool
from time import sleep
from random import randint

my_data = '''chr\tpos\tidx\tvals
2\t23\t4\tabcd
2\t25\t7\tatg
2\t29\t8\tct
2\t35\t1\txylfz
3\t37\t2\tmnost
3\t39\t3\tpqr
3\t41\t6\trtuv
3\t45\t5\tlfghef
3\t39\t3\tpqr
3\t41\t6\trtu
3\t45\t5\tlfggg
4\t25\t3\tpqrp
4\t32\t6\trtu
4\t38\t5\tlfgh
4\t51\t3\tpqr
4\t57\t6\trtus
'''

def manipulate_lines(vals):
    # the sleep simulates compute-heavy work and shows that order is preserved
    sleep(randint(0, 2))
    vals_len = len(vals[3])
    return vals[0:3], vals_len

def write_to_file(a, b):
    print(a, b)
    # append each processed row; called only from the main process
    with open('write_multiprocessData.txt', 'a') as to_file:
        to_file.write('\t'.join(['\t'.join(a), str(b), '\n']))

def line_generator(data):
    for line in data:
        # skip the header row, split the rest into columns
        if line.startswith('chr'):
            continue
        yield line.split('\t')

def main():
    p = Pool(5)

    with open('write_multiprocessData.txt', 'w') as to_file:
        to_file.write('\t'.join(['chr', 'pos', 'idx', 'vals', '\n']))

    data = my_data.rstrip('\n').split('\n')

    lines = line_generator(data)
    results = p.map(manipulate_lines, lines)

    for result in results:
        write_to_file(*result)

if __name__ == '__main__':
    main()

This program does not split the list by its different chr values; instead it processes the list entry by entry, directly, in at most 5 (the argument to Pool) subprocesses.

To show that the data is still in the expected order, I added a random sleep delay to the manipulate_lines function. This demonstrates the concept but may not give an accurate picture of the speedup, since a sleeping process lets another run in parallel, whereas a compute-heavy process uses the CPU for its entire run time.

As can be seen, the writing to file then has to be done once the map call returns, which ensures that all subprocesses have terminated and returned their results. There is quite some overhead in this behind-the-scenes communication, so for it to be beneficial the compute part must take substantially longer than the write phase, and it must not generate too much data to write to file.
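If you want writing to start before every row has been processed, Pool.imap yields each result as soon as it is ready while still preserving input order. A minimal sketch (measure and run are illustrative names, not from the question):

```python
from multiprocessing import Pool

def measure(vals):
    # same shape of work as manipulate_lines: keep key columns, return a length
    return vals[0:3], len(vals[3])

def run(rows):
    # imap yields results one at a time, in input order, so the main
    # process could write each row out before the last one is computed
    with Pool(2) as p:
        return list(p.imap(measure, rows))

if __name__ == '__main__':
    for cols, length in run([['2', '23', '4', 'abcd'], ['2', '25', '7', 'atg']]):
        print(cols, length)
```

With imap the trade-off shifts: slightly more communication per item, but the write phase overlaps the compute phase instead of waiting for all of it.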

In addition, I have broken the for loop out into a generator, so that input to the multiprocessing.Pool is produced on request. Another way would be to pre-process the data list and pass that list directly to the Pool. I find the generator solution nicer, though, and it has smaller peak memory consumption.
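For comparison, a sketch of the pre-processed variant; it materialises every split row at once, which is why peak memory is higher:

```python
# Build the full list of split rows up front instead of using a generator.
data = 'chr\tpos\tidx\tvals\n2\t23\t4\tabcd\n2\t25\t7\tatg'.split('\n')
rows = [line.split('\t') for line in data if not line.startswith('chr')]
# rows could now be passed directly: results = p.map(manipulate_lines, rows)
print(rows)
```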

Also, a comment on multithreading vs. multiprocessing: for compute-heavy operations you should use multiprocessing, which allows the work to run in parallel on different CPU cores. In addition, in CPython - the most used Python implementation - threads hit another issue: the global interpreter lock (GIL). This means that only one thread can execute Python bytecode at a time, since the interpreter blocks access for all other threads. (There are some exceptions, e.g. when using modules written in C, like numpy; the GIL can be released during numpy calculations, but in general it is not.) Thus, threads are mainly for situations where your program is stuck waiting on slow, out-of-order IO (sockets, terminal input, etc.).

I've only used threading a few times and have not tested the code below, but from a quick glance, the for loop is really the only place that could benefit from threading.

I'll let other people decide though.

import threading

my_data = '''chr\tpos\tidx\tvals
2\t23\t4\tabcd
2\t25\t7\tatg
2\t29\t8\tct
2\t35\t1\txylfz
3\t37\t2\tmnost
3\t39\t3\tpqr
3\t41\t6\trtuv
3\t45\t5\tlfghef
3\t39\t3\tpqr
3\t41\t6\trtu
3\t45\t5\tlfggg
4\t25\t3\tpqrp
4\t32\t6\trtu
4\t38\t5\tlfgh
4\t51\t3\tpqr
4\t57\t6\trtus
'''


def manipulate_lines(vals):
    vals_len = len(vals[3])
    return write_to_file(vals[0:3], vals_len)

write_lock = threading.Lock()

def write_to_file(a, b):
    print(a, b)
    # serialize appends so concurrent threads do not interleave their writes
    with write_lock:
        with open('write_multiprocessData.txt', 'a') as to_file:
            to_file.write('\t'.join(['\t'.join(a), str(b), '\n']))

def main():
    with open('write_multiprocessData.txt', 'w') as to_file:
        to_file.write('\t'.join(['chr', 'pos', 'idx', 'vals', '\n']))

    data = my_data.rstrip('\n').split('\n')

    threads = []
    for line in data:
        if line.startswith('chr'):
            continue  # skip the header row
        # note the trailing comma: args must be a tuple
        t = threading.Thread(target=manipulate_lines, args=(line.split('\t'),))
        t.start()
        threads.append(t)
    for t in threads:
        t.join()


if __name__ == '__main__':
    main()