Reading a large number of files quickly


Question


I have a large number (>100k) of relatively small files (1 KB to 300 KB) that I need to read in and process. I'm currently looping through all the files, using File.ReadAllText to read the content, processing it, and then moving on to the next file. This is quite slow and I was wondering if there is a good way to optimize it.
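A minimal sketch of the sequential loop described above (the directory path and Process routine are placeholders):

    using System.IO;

    class SequentialReader
    {
        static void Main()
        {
            // Read and process each file one at a time.
            foreach (string path in Directory.EnumerateFiles(@"C:\data"))
            {
                string content = File.ReadAllText(path); // read the whole file
                Process(content);                        // then process it
            }
        }

        static void Process(string content)
        {
            // ... per-file work goes here ...
        }
    }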

I have already tried using multiple threads, but as this seems to be I/O bound I didn't see any improvement.


Answer 1:


You're most likely correct: reading that many files is probably going to limit your potential speedups, since disk I/O will be the limiting factor.

That being said, you can very likely get a small improvement by moving the processing of the data onto a separate thread.

I would recommend having a single "producer" thread that reads your files. This thread will be I/O limited. As it reads a file, it can push the "processing" onto a ThreadPool thread (.NET 4 Tasks work well for this, too), which lets it immediately read the next file.

This will at least take the "processing time" out of the total runtime, making the total time for your job nearly as short as the disk I/O alone, provided you've got an extra core or two to work with...
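A minimal sketch of that producer/consumer split (not code from the answer itself; the directory path and ProcessFile routine are assumptions), using .NET 4 tasks:

    using System.Collections.Generic;
    using System.IO;
    using System.Threading.Tasks;

    class ProducerConsumer
    {
        static void Main()
        {
            var tasks = new List<Task>();

            // Producer: read files one after another on this (I/O-bound) thread.
            foreach (string path in Directory.EnumerateFiles(@"C:\data"))
            {
                string file = path;                      // local copy, safe to capture
                string content = File.ReadAllText(file);

                // Hand the CPU-bound processing to a thread-pool task so the
                // next read can start immediately.
                tasks.Add(Task.Factory.StartNew(() => ProcessFile(file, content)));
            }

            Task.WaitAll(tasks.ToArray());
        }

        static void ProcessFile(string path, string content)
        {
            // ... parse / transform the file's content here ...
        }
    }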




Answer 2:


What I would do is do the processing in a separate thread. I would read in a file and store the data in a queue, then read in the next file, and so forth.

Then, in a second thread, have that thread read the data from the queue and process it. See if that helps!
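A minimal sketch of this reader/processor queue, assuming BlockingCollection<T> as the queue (the answer doesn't name a specific queue type) and a hypothetical Process routine:

    using System.Collections.Concurrent;
    using System.IO;
    using System.Threading.Tasks;

    class QueuedReader
    {
        static void Main()
        {
            // Bounded queue so the reader can't get too far ahead of the processor.
            var queue = new BlockingCollection<string>(boundedCapacity: 100);

            // Second thread: take file contents off the queue and process them.
            var consumer = Task.Factory.StartNew(() =>
            {
                foreach (string content in queue.GetConsumingEnumerable())
                {
                    Process(content);
                }
            });

            // First thread: read each file and put its contents on the queue.
            foreach (string path in Directory.EnumerateFiles(@"C:\data"))
            {
                queue.Add(File.ReadAllText(path));
            }

            queue.CompleteAdding();  // signal that no more items are coming
            consumer.Wait();
        }

        static void Process(string content)
        {
            // ... do the actual work here ...
        }
    }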




Answer 3:


It is probably disk seek time that is the limiting factor (this is one of the most common bottlenecks when doing a Make, which usually involves lots of small files). Dumb file system designs have a directory entry that insists on a pointer to the disk blocks for a file, and that guarantees a minimum of one seek per file.

If you are using Windows, I'd switch to NTFS, which stores small files in the directory entry, saving one disk seek per file. We use disk compression, too (more computation, but CPUs are cheap and fast, and less disk space means less read time); this may not be relevant if your files are all small. There may be a Linux file system equivalent, if that's where you are.

Yes, you should launch a bunch of threads to read the files:

     forall filename in list:   fork( open filename, process file, close filename)

You might have to throttle this to prevent running out of threads, but I'd shoot for hundreds, not 2 or 3. If you do that, you're telling the OS that it can read lots of places on the disk, and it can order the multiple requests by disk placement (the elevator algorithm), which will also help minimize head motion.
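A hedged sketch of issuing many reads at once, using Parallel.ForEach with a raised degree of parallelism (the limit of 100 and the directory path are assumptions; the answer only says to aim for hundreds):

    using System.IO;
    using System.Threading.Tasks;

    class ManyReaders
    {
        static void Main()
        {
            var files = Directory.EnumerateFiles(@"C:\data");

            // Allow many file reads to be outstanding at once so the OS can
            // reorder them by disk placement.
            var options = new ParallelOptions { MaxDegreeOfParallelism = 100 };

            Parallel.ForEach(files, options, path =>
            {
                string content = File.ReadAllText(path); // open, read, close
                Process(content);
            });
        }

        static void Process(string content)
        {
            // ... per-file processing ...
        }
    }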




Answer 4:


I would recommend "MultiThreading" to solve this problem. When I read your post answers, suddenly found that Reed Copsey`s answer will be so productive. You can found a sample for this solution which prepared by Elmue on this link. I hope this can be useful and thanks to Reed Copsey. Regards




Answer 5:


I agree with Reed's and Icemanind's comments. In addition, consider how to increase disk I/O throughput. For example, spread the files over multiple disks so they can be read in parallel, and use faster disks such as SSDs or maybe a RAM disk.



Source: https://stackoverflow.com/questions/3205898/reading-a-large-number-of-files-quickly
