Reading txt files with multiple threads in Python

庸人自扰 2020-12-15 05:44

I'm trying to read a file in Python (scan its lines and look for terms) and write the results, say, counters for each term. I need to do that for a big amount of files (

2 Answers
  •  独厮守ぢ
    2020-12-15 06:19

    I agree with @aix, multiprocessing is definitely the way to go. Regardless, you will be I/O bound: you can only read so fast, no matter how many parallel processes you have running. But there can easily be some speedup.

    Consider the following (input/ is a directory that contains several .txt files from Project Gutenberg).

    import os
    from multiprocessing import Pool
    import time

    def process_file(name):
        ''' Process one file: count its lines and words '''
        linecount = 0
        wordcount = 0
        with open(name, 'r') as inp:
            for line in inp:
                linecount += 1
                wordcount += len(line.split())
        return name, linecount, wordcount

    def list_files(dirname):
        ''' Collect the paths of all files under dirname '''
        return [os.path.join(root, name)
                for root, _, names in os.walk(dirname) for name in names]

    def process_files_parallel(dirname):
        ''' Process each file in parallel via Pool.map() '''
        with Pool() as pool:
            return pool.map(process_file, list_files(dirname))

    def process_files(dirname):
        ''' Process each file sequentially via map() '''
        return list(map(process_file, list_files(dirname)))

    if __name__ == '__main__':
        start = time.time()
        process_files('input/')
        print("process_files()", time.time() - start)

        start = time.time()
        process_files_parallel('input/')
        print("process_files_parallel()", time.time() - start)


    When I run this on my dual-core machine, there is a noticeable (but not 2x) speedup:

    $ python process_files.py
    process_files() 1.71218085289
    process_files_parallel() 1.28905105591
    

    If the files are small enough to fit in memory, and you have lots of processing to be done that isn't I/O bound, then you should see even better improvement.
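
    For the asker's actual task (counting occurrences of search terms per file), the per-file work fits the same Pool.map() pattern. Here is a minimal sketch along those lines; the TERMS list and the exact-word-match rule are assumptions for illustration, not part of the original question:

    import os
    from collections import Counter
    from multiprocessing import Pool

    # Hypothetical list of search terms; substitute your own.
    TERMS = ['fish', 'whale', 'ship']

    def count_terms(path):
        ''' Count exact-word occurrences of each term in one file '''
        counts = Counter()
        with open(path, 'r') as inp:
            for line in inp:
                words = line.split()
                for term in TERMS:
                    counts[term] += words.count(term)
        return path, counts

    if __name__ == '__main__':
        paths = [os.path.join(root, name)
                 for root, _, names in os.walk('input/') for name in names]
        with Pool() as pool:
            for path, counts in pool.map(count_terms, paths):
                print(path, dict(counts))

    The more work count_terms() does per line relative to the time spent reading, the closer the speedup gets to the number of cores.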
