Gzip issue with multiprocessing pool

Submitted by 白昼怎懂夜的黑 on 2021-02-08 08:48:22

Question


I have a gzip file handle that I'm writing to from a multiprocessing pool. Unfortunately, the output file seems to become corrupted after a certain point, so doing something like zcat out | wc gives:

gzip: out: invalid compressed data--format violated

I'm dealing with the problem by not using gzip. But I'm curious as to why this is happening and if there is any solution.

Not sure if it matters, but I'm running the code on a remote Linux machine that I don't control; my guess is that it's an Ubuntu machine. The Python version is 2.7.3.

And here's the slightly simplified code:

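import gzip
from multiprocessing import Lock, Pool

lock = Lock()
ohandle = gzip.open("out", "w")

def process(fn):
  # Transform every line of one input file
  rv = []
  for l in open(fn):
    sometext = dosomething(l)
    rv.append(sometext)

  # Write the results to the shared gzip handle while holding the lock
  lock.acquire()
  for sometext in rv:
    print >> ohandle, sometext
  lock.release()

pool = Pool(processes=4)
pm = pool.map(process, some_file_list)
ohandle.close()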

Answer 1:


See http://docs.python.org/2/library/multiprocessing.html#programming-guidelines

  • Guard the calling part with if __name__ == '__main__':. Otherwise that part can also be run by the child processes (for example on Windows, where each child re-imports the main module).
  • Explicitly pass resources (ohandle, lock) to the child processes.
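
The two guidelines above can be sketched as follows. This is only a minimal sketch, written for Python 3 rather than the Python 2.7 of the question. The names init_worker, transform_line and the (in_path, out_path) argument tuple are hypothetical and introduced purely for illustration; transform_line stands in for dosomething. Instead of sharing one open gzip handle, each worker receives the lock through the pool's initializer and, while holding it, opens the output in append mode, writes its batch, and closes it again. Each append is then a complete gzip member, and a file made of concatenated members is still valid gzip data.

```python
import gzip
import multiprocessing
import os
import tempfile

_lock = None  # set in every worker process by init_worker()


def init_worker(lock):
    """Pool initializer: keep the explicitly passed lock in this worker."""
    global _lock
    _lock = lock


def transform_line(line):
    # Hypothetical stand-in for dosomething() from the question.
    return line.rstrip("\n").upper()


def process(args):
    """Transform one input file, then append the result as one gzip member."""
    in_path, out_path = args
    with open(in_path) as src:
        lines = [transform_line(line) for line in src]
    payload = "".join(line + "\n" for line in lines)
    with _lock:
        # Open, write and close while holding the lock, so that every
        # append produces a complete gzip member of its own.
        with gzip.open(out_path, "at") as out:
            out.write(payload)
    return len(lines)


if __name__ == "__main__":
    # Demo with throwaway input files (all paths here are made up).
    workdir = tempfile.mkdtemp()
    inputs = []
    for i in range(3):
        path = os.path.join(workdir, "in%d.txt" % i)
        with open(path, "w") as f:
            f.write("line a %d\nline b %d\n" % (i, i))
        inputs.append(path)
    out_path = os.path.join(workdir, "out.gz")

    lock = multiprocessing.Lock()
    with multiprocessing.Pool(2, initializer=init_worker,
                              initargs=(lock,)) as pool:
        counts = pool.map(process, [(p, out_path) for p in inputs])

    with gzip.open(out_path, "rt") as f:
        print(sum(counts), "lines written,",
              len(f.read().splitlines()), "lines read back")
```

With this layout the order of the members in the output follows the order in which workers obtain the lock, not the order of the input list, so it only fits cases where that order does not matter.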

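I modified your code so that it neither uses a lock nor shares ohandle. Instead, each worker writes to a temporary file (fn + '.temp'), and the parent process copies those files into the gzip output.

Caution: Check your filenames first. If a file with the '.temp' suffix already exists, my code will overwrite it and then delete it, so you could lose data.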


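import gzip
import os
from multiprocessing import Pool


def process(fn):
    # Runs in a worker: write the processed lines to a per-input temporary file
    out_fn = fn + '.temp'
    with open(fn) as f, open(out_fn, 'w') as f2:
        for l in f:
            sometext = dosomething(l)
            print >> f2, sometext
    return out_fn

if __name__ == '__main__':
    some_file_list = ...
    pool = Pool(processes=4)

    # Only the parent process touches the gzip handle
    ohandle = gzip.open('out.gz', 'w')
    for fn in pool.map(process, some_file_list):
        with open(fn) as f:
            while True:
                data = f.read(1<<12)
                if not data: break
                ohandle.write(data)
        os.unlink(fn)
    ohandle.close()  # flush the remaining data and write the gzip trailer
    pool.close()
    pool.join()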
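
If the processed output of each file fits comfortably in memory, a variation without temporary files might look like the sketch below. It is an assumption-based illustration in Python 3, not part of the original answer: transform_line (standing in for dosomething), write_results and main are hypothetical names. The workers return their lines, and only the parent process ever touches the gzip handle.

```python
import gzip
from multiprocessing import Pool


def transform_line(line):
    # Hypothetical stand-in for dosomething() from the question.
    return line.rstrip("\n").upper()


def process(fn):
    """Runs in a worker: return the transformed lines instead of writing them."""
    with open(fn) as f:
        return [transform_line(line) for line in f]


def write_results(results, out_path):
    """Runs in the parent only: write every batch of lines to one gzip file."""
    total = 0
    with gzip.open(out_path, "wt") as out:
        for lines in results:
            for line in lines:
                out.write(line + "\n")
                total += 1
    return total


def main(file_list, out_path="out.gz", processes=4):
    # imap yields the batches in input order as they become available, so
    # the parent can start writing before every worker has finished.
    with Pool(processes=processes) as pool:
        return write_results(pool.imap(process, file_list), out_path)
```

A call such as main(some_file_list) belongs under the if __name__ == '__main__': guard. The trade-off is that every batch is pickled and sent back to the parent, so the temporary-file version is likely the better choice for very large outputs.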


Source: https://stackoverflow.com/questions/17016029/gzip-issue-with-multiprocessing-pool
