I\'ve always assumed it to be more efficient, when processing text files, to first read the contents (or part of it) into an std::string or char array, as — from my limited
A great deal here depends on exactly how critical performance really is for you/your application. That, in turn tends to depend upon how large of files you're dealing with -- if you're dealing with something like tens or hundreds of kilobytes, you should generally just write the simplest code that will work, and not worry much about it -- anything you can do is going to be essentially instantaneous, so optimizing the code won't really accomplish much.
On the other hand, if you're processing a lot of data -- on the order of tens of megabytes or more, differences in efficiency can become fairly substantial. Unless you take fairly specific steps to bypass it (such as using read
) all your reads are going to be buffered -- but that does not mean they'll all be the same speed (or necessarily even very close to the same speed).
For example, let's try a quick test of a few different methods for doing essentially what you've asked about:
#include
#include
#include
#include
#include
#include
#include
#include
unsigned count1(FILE *infile, char c) {
int ch;
unsigned count = 0;
while (EOF != (ch=getc(infile)))
if (ch == c)
++count;
return count;
}
unsigned int count2(FILE *infile, char c) {
static char buffer[4096];
int size;
unsigned int count = 0;
while (0 < (size = fread(buffer, 1, sizeof(buffer), infile)))
for (int i=0; i(infile),
std::istreambuf_iterator(), c);
}
unsigned count4(std::istream &infile, char c) {
return std::count(std::istream_iterator(infile),
std::istream_iterator(), c);
}
template
void timer(F f, T &t, std::string const &title) {
unsigned count;
clock_t start = clock();
count = f(t, 'N');
clock_t stop = clock();
std::cout << std::left << std::setw(30) << title << "\tCount: " << count;
std::cout << "\tTime: " << double(stop-start)/CLOCKS_PER_SEC << "\n";
}
int main() {
char const *name = "test input.txt";
FILE *infile=fopen(name, "r");
timer(count1, infile, "ignore");
rewind(infile);
timer(count1, infile, "using getc");
rewind(infile);
timer(count2, infile, "using fread");
fclose(infile);
std::ifstream in2(name);
in2.sync_with_stdio(false);
timer(count3, in2, "ignore");
in2.clear();
in2.seekg(0);
timer(count3, in2, "using streambuf iterators");
in2.clear();
in2.seekg(0);
timer(count4, in2, "using stream iterators");
return 0;
}
I ran this with a file of approximately 44 megabytes as input. When compiled with VC++2012, I got the following results:
ignore Count: 400000 Time: 2.08
using getc Count: 400000 Time: 2.034
using fread Count: 400000 Time: 0.257
ignore Count: 400000 Time: 0.607
using streambuf iterators Count: 400000 Time: 0.608
using stream iterators Count: 400000 Time: 5.136
Using the same input, but compiled with g++ 4.7.1:
ignore Count: 400000 Time: 0.359
using getc Count: 400000 Time: 0.339
using fread Count: 400000 Time: 0.243
ignore Count: 400000 Time: 0.697
using streambuf iterators Count: 400000 Time: 0.694
using stream iterators Count: 400000 Time: 1.612
So, even though all the reads are buffered, we're seeing a variation of about 8:1 with g++ and about 20:1 with VC++. Of course, I haven't tested (even close to) every possible way of reading the input. I doubt we'd see a lot wider range of times if we tested more techniques for reading, but I could be wrong about that. Whether we do or not, we're seeing enough variation that at least if you're processing a lot of data, you could well be justified in choosing one technique over another to improve processing speed.