How to parallelize reading lines from an input file when lines get independently processed?

China☆狼群 提交于 2019-12-08 20:53:53

问题


I just started off with OpenMP using C++. My serial code in C++ looks something like this:

#include <iostream>
#include <string>
#include <sstream>
#include <vector>
#include <fstream>
#include <stdlib.h>

int main(int argc, char* argv[]) {
    string line;
    std::ifstream inputfile(argv[1]);

    if(inputfile.is_open()) {
        while(getline(inputfile, line)) {
            // Line gets processed and written into an output file
        }
    }
}

Because each line is pretty much independently processed, I was attempting to use OpenMP to parallelize this because the input file is in the order of gigabytes. So I'm guessing that first I need to get the number of lines in the input file and then parallelize the code this way. Can someone please help me out here?

#include <iostream>
#include <string>
#include <sstream>
#include <vector>
#include <fstream>
#include <stdlib.h>

#ifdef _OPENMP
#include <omp.h>
#endif

int main(int argc, char* argv[]) {
    string line;
    std::ifstream inputfile(argv[1]);

    if(inputfile.is_open()) {
        //Calculate number of lines in file?
        //Set an output filename and open an ofstream
        #pragma omp parallel num_threads(8)
        {
            #pragma omp for schedule(dynamic, 1000)
            for(int i = 0; i < lines_in_file; i++) {
                 //What do I do here? I cannot just read any line because it requires random access
            }
        }
    }
}

EDIT:

Important Things

  1. Each line is independently processed
  2. Order of the results don't matter

回答1:


Not a direct OpenMP answer - but what you are probably looking for is Map/Reduce approach. Take a look at Hadoop - it's done in Java, but there's some C++ API at least.

In general, you want to process this amount of data on different machines, not in multiple threads in the same process (virtual address space limitations, lack of physical memory, swapping, etc.) Also the kernel will have to bring the disk file in sequentially anyway (which you want - otherwise the hard-drive will just have to do extra seeks for each of your threads).



来源:https://stackoverflow.com/questions/3860316/how-to-parallelize-reading-lines-from-an-input-file-when-lines-get-independently

易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!