Read a big file by lines in C++

Submitted by 耗尽温柔 on 2019-12-08 10:19:49

Question


I have a big file, nearly 800 MB, and I want to read it line by line.

At first I wrote my program in Python, using linecache.getlines:

lines = linecache.getlines(fname)

It takes about 1.2 seconds.

Now I want to port my program to C++.

I wrote this code:

    std::ifstream DATA(fname);
    std::string line;
    std::vector<std::string> lines;

    while (std::getline(DATA, line)) {
        lines.push_back(line);
    }

But it's slow (it takes minutes). How can I improve it?

  • Joachim Pileborg mentioned mmap(); on Windows, CreateFileMapping() will work (a sketch follows below).
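A minimal POSIX sketch of the mmap() idea (the file name input.txt and the (pointer, length) index are illustrative assumptions, not part of the original question):

    #include <fcntl.h>
    #include <sys/mman.h>
    #include <sys/stat.h>
    #include <unistd.h>
    #include <cstddef>
    #include <cstring>
    #include <utility>
    #include <vector>

    int main()
    {
        int fd = open("input.txt", O_RDONLY);  // assumed file name
        if (fd < 0) return 1;

        struct stat sb;
        if (fstat(fd, &sb) < 0) { close(fd); return 1; }

        // Map the whole file read-only; the kernel pages it in on demand.
        void* mapped = mmap(nullptr, sb.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
        if (mapped == MAP_FAILED) { close(fd); return 1; }
        const char* data = static_cast<const char*>(mapped);

        // Index lines as (pointer, length) pairs instead of copying them.
        std::vector<std::pair<const char*, std::size_t>> lines;
        const char* p = data;
        const char* end = data + sb.st_size;
        while (p < end) {
            const char* nl = static_cast<const char*>(std::memchr(p, '\n', end - p));
            if (!nl) nl = end;
            lines.emplace_back(p, static_cast<std::size_t>(nl - p));
            p = nl + 1;
        }

        munmap(mapped, sb.st_size);
        close(fd);
        return 0;
    }

The Windows analogue of this sketch would use CreateFileMapping() and MapViewOfFile().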

My code runs under VS2013. In "DEBUG" mode it takes 162 seconds; in "RELEASE" mode, only 7 seconds!

(Great thanks to @DietmarKühl and @Andrew!)


Answer 1:


First of all, you should probably make sure you are compiling with optimizations enabled. This might not matter for such a simple algorithm, but that really depends on your vector/string library implementations.

As suggested by @angew, std::ios_base::sync_with_stdio(false) makes a big difference on routines like the one you have written.

Another, lesser, optimization would be to use lines.reserve() to preallocate your vector so that push_back() doesn't trigger repeated reallocations and copies as the vector grows. However, this is most useful if you happen to know in advance approximately how many lines you are likely to receive (see the sketch below).
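Combining those two suggestions with the asker's loop gives something like this sketch (the reserve() figure is only a guess at the line count, and the std::move is an extra micro-optimization, not part of the answer itself):

    std::ios_base::sync_with_stdio(false);  // detach C++ streams from C stdio

    std::ifstream DATA(fname);
    std::string line;
    std::vector<std::string> lines;
    lines.reserve(8000000);  // guess: ~800 MB at ~100 bytes per line

    while (std::getline(DATA, line)) {
        lines.push_back(std::move(line));  // move, don't copy, each line
    }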

Using the optimizations suggested above, I get the following results for reading an 800MB text stream:

 20 seconds ## if average line length = 10 characters
  3 seconds ## if average line length = 100 characters
  1 second  ## if average line length = 1000 characters

As you can see, the speed is dominated by per-line overhead. This overhead is primarily occurring inside the std::string class.

It is likely that any approach based on storing a large quantity of std::string will be suboptimal in terms of memory allocation overhead. On a 64-bit system, std::string will require a minimum of 16 bytes of overhead per string. In fact, it is very possible that the overhead will be significantly greater than that -- and you could find that memory allocation (inside of std::string) becomes a significant bottleneck.

For optimal memory use and performance, consider writing your own routine that reads the file in large blocks rather than using getline(). Then you could apply something similar to the flyweight pattern to manage the indexing of the individual lines using a custom string class.
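A rough sketch of that approach (the LineRef struct and the file name are illustrative assumptions; a production version might use std::string_view or a flyweight wrapper instead): read the whole file into one buffer, then index the lines as (offset, length) records so no per-line std::string is allocated:

    #include <cstddef>
    #include <fstream>
    #include <iterator>
    #include <string>
    #include <vector>

    struct LineRef { std::size_t offset, length; };  // hypothetical record type

    int main()
    {
        std::ifstream in("input.txt", std::ios::binary);  // assumed file name
        std::string buffer((std::istreambuf_iterator<char>(in)),
                           std::istreambuf_iterator<char>());

        // One pass to index line boundaries; no per-line allocation.
        std::vector<LineRef> index;
        std::size_t start = 0;
        for (std::size_t i = 0; i < buffer.size(); ++i) {
            if (buffer[i] == '\n') {
                index.push_back({start, i - start});
                start = i + 1;
            }
        }
        if (start < buffer.size())
            index.push_back({start, buffer.size() - start});

        // Line k is then the byte range starting at buffer.data() + index[k].offset,
        // of length index[k].length; it can be inspected without any copy.
        return 0;
    }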

P.S. Another relevant factor will be the physical disk I/O, which might or might not be bypassed by caching.




Answer 2:


For C++ you could try something like this:

#include <fstream>
#include <string>
#include <vector>
#include <boost/algorithm/string.hpp>

using namespace std;

void processData(const string& str)
{
  vector<string> arr;
  boost::split(arr, str, boost::is_any_of(" \n"));
  do_some_operation(arr);  // user-defined processing
}

int main()
{
 const size_t read_bytes = 45 * 1024 * 1024;
 const char* fname = "input.txt";
 ifstream fin(fname, ios::in | ios::binary);
 vector<char> memblock(read_bytes);  // allocate once, reuse across iterations

 // Loop on the read itself instead of checking eof() up front;
 // gcount() reports how many bytes were actually read, so the final
 // short block is handled and the string gets an explicit length
 // (the buffer is not NUL-terminated).
 while (fin.read(memblock.data(), read_bytes) || fin.gcount() > 0)
 {
    string str(memblock.data(), static_cast<size_t>(fin.gcount()));
    processData(str);  // note: a line straddling two blocks is split here
 }
 return 0;
}


Source: https://stackoverflow.com/questions/32006765/read-a-big-file-by-lines-in-c
