Slow fread performance in OpenMP threads

Question

I use two Intel Xeon CPUs (24 cores) and Windows Server 2008.
I'm trying to parallelize my C++ program. Here is a simplified version of the code:

vector< string > files;
vector< vector< float > > data; 
...
data.resize( files.size() ); 

#pragma omp parallel for 
for (int i=0; i<(int)files.size(); i++) { // about 3000 files; the cast avoids a signed/unsigned comparison
    FILE *f = fopen(files[i].c_str(), "rb"); 
    if (!f) continue; // skip files that cannot be opened

    // every file is about 40 MB
    data[i].resize(someSize);
    fread(&data[i][0], sizeof(float), someSize, f); 

    fclose(f);
    ...
    performCalculations();  
}

CPU usage stays between 0 and 5%.
When I replace fread(&data[i][0], sizeof(float), someSize, f) with:

for (int j=0; j<(int)data[i].size(); j++) { // fill this file's buffer, not the outer vector
    data[i][j] = rand(); 
}

CPU usage rises to 100%.
I have already tried fstream and the WinAPI ReadFile function, but neither made a big difference.

What am I doing wrong? I don't believe that disk reading can be so slow...

Answer 1:

"I don't believe that disk reading can be so slow..."

Then you had better start believing. Disks are incredibly slow compared to CPUs. Parallel I/O usually only helps when you're reading from multiple sources, such as separate disks or network connections. It can help hide latency, but it cannot increase bandwidth.
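A back-of-the-envelope estimate makes the point concrete. Assuming (this figure is not from the question) a single conventional HDD that sustains roughly 100 MB/s of sequential reads, just reading the input takes about twenty minutes, and adding CPU cores does not shorten this term:

```latex
\[
  3000 \times 40\,\text{MB} = 120\,000\,\text{MB} \approx 120\,\text{GB},
  \qquad
  t_{\text{read}} \approx \frac{120\,000\,\text{MB}}{100\,\text{MB/s}} = 1200\,\text{s} = 20\,\text{min}.
\]
```

With many threads reading different files at once, the effective rate on an HDD is typically lower than the sequential figure, because the head has to seek between files.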

Try reading all your data in one go, serially, and then processing it in a parallelized loop.
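Since 3000 files of about 40 MB come to roughly 120 GB, which likely won't fit in RAM, a practical form of "read serially, then process in parallel" is to do it batch by batch. The sketch below assumes the files hold raw floats; processAll, readFile and batchSize are hypothetical names, and performCalculations here is only a stand-in that sums the floats. Without OpenMP enabled the pragma is ignored and the compute loop runs serially.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdio>
#include <string>
#include <vector>

// Hypothetical stand-in for performCalculations(): sums one file's floats.
static double performCalculations(const std::vector<float>& buf) {
    double s = 0.0;
    for (float x : buf) s += x;
    return s;
}

// Reads up to `count` floats; returns an empty vector if the file won't open.
static std::vector<float> readFile(const std::string& path, size_t count) {
    std::vector<float> buf(count);
    std::FILE* f = std::fopen(path.c_str(), "rb");
    if (!f) return {};
    size_t got = std::fread(buf.data(), sizeof(float), count, f);
    std::fclose(f);
    buf.resize(got);  // keep only what was actually read
    return buf;
}

// For each batch: read serially (one sequential stream for the disk),
// then process the whole batch in parallel. batchSize bounds memory use.
std::vector<double> processAll(const std::vector<std::string>& files,
                               size_t floatsPerFile, size_t batchSize) {
    assert(batchSize > 0);
    std::vector<double> results(files.size());
    for (size_t start = 0; start < files.size(); start += batchSize) {
        size_t end = std::min(start + batchSize, files.size());
        std::vector<std::vector<float>> batch;
        for (size_t i = start; i < end; ++i)  // serial I/O
            batch.push_back(readFile(files[i], floatsPerFile));

        int n = static_cast<int>(batch.size());
        // Parallel compute over the batch that is already in memory.
        #pragma omp parallel for
        for (int k = 0; k < n; ++k)
            results[start + k] = performCalculations(batch[k]);
    }
    return results;
}
```

A batch size of a few dozen files keeps memory at a few GB while still giving every core something to do.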

Answer 2:

Disk reads cannot be parallelized*: whether you have 1 or 24 CPU cores won't change your disk I/O throughput.

If one performCalculations() call takes less time than reading one of your 40 MB files, there is no point in parallelizing across several CPUs: your program's execution is limited by your disk bandwidth. Have you measured this?

*: They can, but that requires hardware. Just as parallelizing execution requires multiple physical CPUs, parallelizing disk I/O requires more disk I/O hardware.
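One rough way to answer "Have you measured this?" is to time the read and the computation separately for a single file. This is a minimal sketch using std::chrono; timeOneFile and Timing are hypothetical names, and the summing loop is a stand-in for the real performCalculations(). The operating system caches file data, so repeat runs on the same file will look much faster than a cold read from disk; measure on files that have not been read recently.

```cpp
#include <chrono>
#include <cstdio>
#include <string>
#include <vector>

// Returns the elapsed wall-clock seconds for calling f().
template <class F>
double secondsFor(F&& f) {
    auto t0 = std::chrono::steady_clock::now();
    f();
    auto t1 = std::chrono::steady_clock::now();
    return std::chrono::duration<double>(t1 - t0).count();
}

struct Timing { double readSec; double computeSec; size_t floatsRead; };

// Times reading one file of floats and one (stand-in) computation on it.
Timing timeOneFile(const std::string& path, size_t count) {
    std::vector<float> buf(count);
    size_t got = 0;
    double readSec = secondsFor([&] {
        std::FILE* f = std::fopen(path.c_str(), "rb");
        if (f) {
            got = std::fread(buf.data(), sizeof(float), count, f);
            std::fclose(f);
        }
    });
    volatile double sink = 0.0;  // keeps the compiler from removing the loop
    double computeSec = secondsFor([&] {
        double s = 0.0;
        for (size_t i = 0; i < got; ++i) s += buf[i];
        sink = s;
    });
    (void)sink;
    return {readSec, computeSec, got};
}
```

If readSec is consistently larger than computeSec, the loop is disk-bound and extra threads will mostly wait.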

Answer 3:

If you are using a conventional HDD, you won't see any visible speedup, because the loop issues many concurrent file reads, and an HDD handles concurrent reads poorly: every switch between files costs a head seek. That is why CPU usage is only 0-5%: most of the parallel loop iterations are simply waiting for file operations. (Note that disk reads can be parallelized as long as the concurrent reads go to different physical disks or platters.)
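A related mitigation, not from the answer itself: keep the OpenMP loop but let only one thread read at a time, so the HDD serves one file sequentially instead of seeking among up to 24 competing reads, while the calculations still run concurrently. This is a minimal sketch; readSerializedComputeParallel and the critical-section name disk_io are hypothetical, and performCalculations is a stand-in that sums the floats. Without OpenMP enabled the pragmas are ignored and the loop runs serially.

```cpp
#include <cstdio>
#include <string>
#include <vector>

// Hypothetical stand-in for the question's performCalculations().
static double performCalculations(const std::vector<float>& buf) {
    double s = 0.0;
    for (float x : buf) s += x;
    return s;
}

std::vector<double> readSerializedComputeParallel(
        const std::vector<std::string>& files, size_t floatsPerFile) {
    std::vector<double> results(files.size());
    int n = static_cast<int>(files.size());
    #pragma omp parallel for schedule(dynamic)
    for (int i = 0; i < n; ++i) {
        std::vector<float> buf(floatsPerFile);  // one buffer per iteration
        size_t got = 0;
        // Only one thread at a time enters this block, so the disk reads
        // one file sequentially instead of seeking between many.
        #pragma omp critical(disk_io)
        {
            std::FILE* f = std::fopen(files[i].c_str(), "rb");
            if (f) {
                got = std::fread(buf.data(), sizeof(float), floatsPerFile, f);
                std::fclose(f);
            }
        }
        buf.resize(got);
        // The computation runs outside the critical section, in parallel.
        results[i] = performCalculations(buf);
    }
    return results;
}
```

This only helps if the computation per file is long enough to overlap with the next thread's read; otherwise total time stays close to the pure read time.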

There are a couple of solutions:

Try an SSD, which supports random and concurrent access much better.
Try pipeline parallelism (it's not easy to explain everything in this answer). OpenMP isn't well suited to pipelining, but TBB provides an easy-to-use pipeline template. A pipeline lets the file-reading step overlap with the calculation steps, so you could get a decent speedup, provided there is enough computation to overlap.
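TBB's pipeline template is not reproduced here. As a dependency-free sketch of the same two-stage idea, the code below uses only the C++11 standard library: one reader thread loads files in order into a bounded queue, and a pool of worker threads runs the calculation while the next files are being read. All names (BoundedQueue, Item, pipeline, workers, capacity) are hypothetical, and performCalculations is a stand-in that sums the floats.

```cpp
#include <cassert>
#include <condition_variable>
#include <cstdio>
#include <deque>
#include <mutex>
#include <string>
#include <thread>
#include <utility>
#include <vector>

struct Item { size_t index; std::vector<float> data; };

// Bounded queue: the reader blocks once `capacity` buffers are waiting,
// which caps memory at roughly capacity x file size.
class BoundedQueue {
public:
    explicit BoundedQueue(size_t capacity) : cap_(capacity) {}
    void push(Item item) {
        std::unique_lock<std::mutex> lk(m_);
        notFull_.wait(lk, [&] { return q_.size() < cap_; });
        q_.push_back(std::move(item));
        notEmpty_.notify_one();
    }
    void close() {
        std::lock_guard<std::mutex> lk(m_);
        closed_ = true;
        notEmpty_.notify_all();
    }
    // Returns false once the queue is closed and fully drained.
    bool pop(Item& out) {
        std::unique_lock<std::mutex> lk(m_);
        notEmpty_.wait(lk, [&] { return !q_.empty() || closed_; });
        if (q_.empty()) return false;
        out = std::move(q_.front());
        q_.pop_front();
        notFull_.notify_one();
        return true;
    }
private:
    size_t cap_;
    bool closed_ = false;
    std::deque<Item> q_;
    std::mutex m_;
    std::condition_variable notFull_, notEmpty_;
};

// Hypothetical stand-in for performCalculations().
static double performCalculations(const std::vector<float>& buf) {
    double s = 0.0;
    for (float x : buf) s += x;
    return s;
}

// Stage 1: a single reader streams files in order (sequential disk access).
// Stage 2: `workers` threads compute on buffers that are already loaded.
std::vector<double> pipeline(const std::vector<std::string>& files,
                             size_t floatsPerFile, unsigned workers,
                             size_t capacity) {
    assert(workers > 0 && capacity > 0);  // otherwise the reader could block forever
    std::vector<double> results(files.size());
    BoundedQueue queue(capacity);
    std::thread reader([&] {
        for (size_t i = 0; i < files.size(); ++i) {
            std::vector<float> buf(floatsPerFile);
            size_t got = 0;
            if (std::FILE* f = std::fopen(files[i].c_str(), "rb")) {
                got = std::fread(buf.data(), sizeof(float), floatsPerFile, f);
                std::fclose(f);
            }
            buf.resize(got);
            queue.push({i, std::move(buf)});
        }
        queue.close();
    });
    std::vector<std::thread> pool;
    for (unsigned w = 0; w < workers; ++w)
        pool.emplace_back([&] {
            Item item;
            while (queue.pop(item))
                results[item.index] = performCalculations(item.data);
        });
    reader.join();
    for (auto& t : pool) t.join();
    return results;
}
```

Each worker writes a distinct results index, so no extra locking is needed there; the results are read only after every thread has been joined.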

Source: https://stackoverflow.com/questions/8121077/fread-slow-performance-in-openmp-threads

Tags: c++, performance, file, openmp