std::ifstream buffer caching

In my application I'm trying to merge sorted files (keeping them sorted of course), so I have to iterate through each element in both files to write the minimal to the third one. This works pretty much slow on big files, as far as I don't see any other choice (the iteration has to be done) I'm trying to optimize file loading. I can use some amount of RAM, which I can use for buffering. I mean instead of reading 4 bytes from both files every time I can read once something like 100Mb and work with that buffer after that, until there will be no element in buffer, then I'll refill the buffer again. But I guess ifstream is already doing that, will it give me more performance and is there any reason? If fstream does, maybe I can change size of that buffer?

added

My current code looks like that (pseudocode)

// this is done in loop
int i1 = input1.read_integer();
int i2 = input2.read_integer();
if (!input1.eof() && !input2.eof())
{
   if (i1 < i2)
   {
      output.write(i1);
      input2.seek_back(sizeof(int));
   } else
      input1.seek_back(sizeof(int));
      output.write(i2);
   }
} else {
   if (input1.eof())
      output.write(i2);
   else if (input2.eof())
      output.write(i1);
}

What I don't like here is

seek_back - I have to seek back to previous position as there is no way to peek 4 bytes
too much reading from file
if one of the streams is in EOF it still continues to check that stream instead of putting contents of another stream directly to output, but this is not a big issue, because chunk sizes are almost always equal.

Can you suggest improvement for that?

Thanks.

davka

Without getting into the discussion on stream buffers, you can get rid of the seek_back and generally make the code much simpler by doing:

using namespace std;
merge(istream_iterator<int>(file1), istream_iterator<int>(),
           istream_iterator<int>(file2), istream_iterator<int>(),
           ostream_iterator<int>(cout));

Edit:

Added binary capability

#include <algorithm>
#include <iterator>
#include <fstream>
#include <iostream>

struct BinInt
{
    int value;
    operator int() const { return value; }
    friend std::istream& operator>>(std::istream& stream, BinInt& data)
    {
        return stream.read(reinterpret_cast<char*>(&data.value),sizeof(int));
    }
};

int main()
{
    std::ifstream   file1("f1.txt");
    std::ifstream   file2("f2.txt");

    std::merge(std::istream_iterator<BinInt>(file1), std::istream_iterator<BinInt>(),
               std::istream_iterator<BinInt>(file2), std::istream_iterator<BinInt>(),
               std::ostream_iterator<int>(std::cout));
}

In decreasing order of performance (best first):

memory-mapped I/O
OS-specific ReadFile or read calls.
fread into a large buffer
ifstream.read into a large buffer
ifstream and extractors

Mike Dunlavey

A program like this should be I/O bound, meaning it should be spending at least 80% of it's time waiting for completion of reading or writing a buffer, and if the buffers are reasonably big, it should be keeping the disk heads busy. That's what you want.

Don't assume it is I/O bound, without proof. A way to prove it is by taking several stackshots. If it is, most of the samples will show the program waiting for I/O completion.

It is possible that it is not I/O bound, meaning you may find other things going on in some of the samples that you never expected. If so, then you know what to fix to speed it up. I have seen some code like this spending much more time than necessary in the merge loop, testing for end-of-file, getting data to compare, etc. for example.

You can just use the read function of an ifstream to read large blocks.

http://www.cplusplus.com/reference/iostream/istream/read/

The second parameter is the number of bytes. You should make this a multiple of 4 in your case - maybe 4096? :)

Simply read a chunk at a time and work on it.

As martin-york said, this may not have any beneficial effect on your performance, but try it and find out.

I think it is very likely that you can improve performance by reading big chunks.

Try opening the file with ios::binary as an argument, then use istream::read to read the data.

If you need maximum performance, I would actually suggest skipping iostreams altogether, and using cstdio instead. But I guess this is not what you want.

Unless there is something very special about your data it is unlikely that you will improve on the buffering that is built into the std::fstream object.

The std::fstream objects are designed to be very effecient for general purpose file access. It does not sound like you are doing anything special by accessing the data 4 bytes at a time. You can always profile your code to see where the actual time is spent in your code.

Maybe if you share the code with ous we could spot some major inefficiencies.

Edit:

I don't like your algorithm. Seeking back and forward may be hard on the stream especially of the number lies over a buffer boundary. I would only read one number each time through the loop.

Try this:
Note: This is not optimal (and it assumes stream input of numbers (while yours looks binary)) But I am sure you can use it as a starting point.

#include <fstream>
#include <iostream>

// Return the current val (that was the smaller value)
// and replace it with the next value in the stream.
int getNext(int& val, std::istream& str)
{
    int result = val;
    str >> val;

    return result;
}

int main()
{
    std::ifstream   f1("f1.txt");
    std::ifstream   f2("f2.txt");
    std::ofstream   re("result");

    int v1;
    int v2;

    f1 >> v1;
    f2 >> v2;

    // While there are values in both stream
    // Output one value and replace it using getNext()
    while(f1 && f2)
    {
        re << (v1 < v2)? getNext(v1, f1) : getNext(v2, f2);
    }
    // At this point one (or both) stream(s) is(are) empty.
    // So dump the other stream.
    for(;f1;f1 >> v1)
    {
        // Note if the stream is at the end it will
        // never enter the loop
        re << v1;
    }
    for(;f2;f2 >> v2)
    {
        re << v2;
    }
}

来源：https://stackoverflow.com/questions/4558068/stdifstream-buffer-caching

标签

c++

performance

file

sorting

buffering