What is the fastest/most efficient way to loop through a large collection of files and save a plot of the data?

Submitted by 一个人想着一个人 on 2019-12-08 08:42:56

Question


So I have this program that loops through about 2000+ data files, performs a Fourier transform, plots the transform, and then saves the figure. The program seems to get slower the longer it runs. Is there any way to make it run faster or cleaner with a simple change in the code below?

Previously, I had the Fourier transform defined as a function, but I read somewhere here that Python has high function-call overhead, so I did away with the function and run everything inline now. I also read that clf() keeps a record of previous figures that can grow quite large and slow things down when you loop through a lot of plots, so I've changed it to close(). Were these good changes as well?

from numpy import *
from pylab import *

for filename in filelist:

    t,f = loadtxt(filename, unpack=True)

    dt = t[1]-t[0]
    fou = absolute(fft.fft(f))
    frq = absolute(fft.fftfreq(len(t),dt))

    ymax = median(fou)*30

    figure(figsize=(15,7))
    plot(frq,fou,'k')

    xlim(0,400)
    ylim(0,ymax)

    iname = filename.replace('.dat','.png')
    savefig(iname,dpi=80)
    close()

Answer 1:


Have you considered using the multiprocessing module to parallelize processing the files? Assuming that you're actually CPU-bound here (meaning it's the Fourier transform that's eating up most of the running time, not reading/writing the files), that should speed up execution time without actually needing to speed up the loop itself.

Edit:

For example, something like this (untested, but should give you the idea):

import multiprocessing

from numpy import *
from pylab import *

def do_transformation(filename):
    # Load one data file, compute the FFT, plot it, and save the figure.
    t,f = loadtxt(filename, unpack=True)

    dt = t[1]-t[0]
    fou = absolute(fft.fft(f))
    frq = absolute(fft.fftfreq(len(t),dt))

    ymax = median(fou)*30

    figure(figsize=(15,7))
    plot(frq,fou,'k')

    xlim(0,400)
    ylim(0,ymax)

    iname = filename.replace('.dat','.png')
    savefig(iname,dpi=80)
    close()

pool = multiprocessing.Pool(multiprocessing.cpu_count())
for filename in filelist:
    pool.apply_async(do_transformation, (filename,))
pool.close()
pool.join()

You may need to tweak what work actually gets done in the worker processes; trying to parallelize the disk I/O portions, for example, may not help much (or may even hurt).
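As one hypothetical way to make that split (not from the original answer, and note that pickling large arrays between processes adds its own overhead): keep the file reads and the savefig calls in the parent process and hand only the CPU-bound FFT work to the pool.

import multiprocessing
from numpy import *
from pylab import *

def transform_only(data):
    # CPU-bound part only: FFT of one (t, f) pair that was loaded in the parent.
    t, f = data
    dt = t[1]-t[0]
    fou = absolute(fft.fft(f))
    frq = absolute(fft.fftfreq(len(t),dt))
    return frq, fou

pool = multiprocessing.Pool(multiprocessing.cpu_count())
# File reads stay in the parent; imap may read ahead of the plotting loop,
# so memory use can grow if the workers fall behind.
loaded = (loadtxt(fn, unpack=True) for fn in filelist)
for filename, (frq, fou) in zip(filelist, pool.imap(transform_only, loaded)):
    figure(figsize=(15,7))
    plot(frq, fou, 'k')
    xlim(0,400)
    ylim(0, median(fou)*30)
    savefig(filename.replace('.dat','.png'), dpi=80)
    close()
pool.close()
pool.join()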




Answer 2:


Yes, adding close was a good move. It should help plug the memory leak you had. I'd also recommend moving the figure, plotting, and close commands outside the loop - just update the Line2D instance created by plot. Check out this for more info.

Note: I think this should work, but I haven't tested it here.
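A minimal sketch of that approach (reusing the names, limits, and pylab-style calls from the question; untested here): create the figure and a single Line2D once, then update its data and rescale inside the loop.

from numpy import *
from pylab import *

fig = figure(figsize=(15,7))
line, = plot([], [], 'k')        # one Line2D instance, reused for every file
xlim(0,400)

for filename in filelist:
    t,f = loadtxt(filename, unpack=True)
    dt = t[1]-t[0]
    fou = absolute(fft.fft(f))
    frq = absolute(fft.fftfreq(len(t),dt))

    line.set_data(frq, fou)      # update the existing line instead of calling plot() again
    ylim(0, median(fou)*30)      # rescale the y-axis for this file

    savefig(filename.replace('.dat','.png'), dpi=80)

close(fig)                       # close the single figure once, after the loop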




Answer 3:


I tested something similar to what you are doing in IPython and noticed that the loop got considerably slower when a directory contained a lot of files. The file system seems to have overhead that grows with the number of files in the folder, perhaps related to the lookup time of:

loadtxt(filename, unpack=True)

You could try splitting your filelist into smaller chunks and saving the plots for each chunk into a different directory, for example:
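Here is a hypothetical layout (the chunk size and directory names are chosen only for illustration):

import os

chunk_size = 200                               # illustrative; tune to your file system
for i in range(0, len(filelist), chunk_size):
    chunk = filelist[i:i + chunk_size]
    outdir = 'plots_%03d' % (i // chunk_size)  # e.g. plots_000, plots_001, ...
    if not os.path.exists(outdir):
        os.makedirs(outdir)
    for filename in chunk:
        iname = os.path.join(outdir, os.path.basename(filename).replace('.dat', '.png'))
        # ...load, transform, and savefig(iname, dpi=80) exactly as in the question...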



Source: https://stackoverflow.com/questions/23572714/what-is-the-fastest-most-efficient-way-to-loop-through-a-large-collection-of-fil
