I've always assumed it to be more efficient, when processing text files, to first read the contents (or part of it) into an std::string or char array, as — from my limited
A great deal here depends on exactly how critical performance really is for you/your application. That, in turn, tends to depend on how large the files you're dealing with are -- if it's something like tens or hundreds of kilobytes, you should generally just write the simplest code that works and not worry much about it -- anything you do is going to be essentially instantaneous, so optimizing the code won't really accomplish much.
On the other hand, if you're processing a lot of data -- on the order of tens of megabytes or more -- differences in efficiency can become fairly substantial. Unless you take fairly specific steps to bypass it (such as using read directly, sketched below), all your reads are going to be buffered -- but that does not mean they'll all be the same speed (or necessarily even very close to the same speed).
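Just for concreteness, bypassing the stdio/iostreams buffering with the POSIX read call looks roughly like this (a minimal sketch, assuming a POSIX system; the count_read name and the 64 KiB buffer size are arbitrary choices for illustration, not anything from the test program below):

#include <fcntl.h>
#include <unistd.h>

// count occurrences of c, reading straight from the OS with no stdio buffering
unsigned count_read(char const *name, char c) {
    int fd = open(name, O_RDONLY);
    if (fd < 0)
        return 0;

    char buffer[65536];
    unsigned count = 0;
    ssize_t size;

    while (0 < (size = read(fd, buffer, sizeof(buffer))))
        for (ssize_t i = 0; i < size; i++)
            if (buffer[i] == c)
                ++count;
    close(fd);
    return count;
}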
For example, let's try a quick test of a few different methods for doing essentially what you've asked about:
#include <stdio.h>
#include <iostream>
#include <fstream>
#include <iomanip>
#include <string>
#include <algorithm>
#include <iterator>
#include <ctime>
unsigned count1(FILE *infile, char c) {
    int ch;
    unsigned count = 0;

    while (EOF != (ch = getc(infile)))
        if (ch == c)
            ++count;
    return count;
}
unsigned int count2(FILE *infile, char c) {
    static char buffer[4096];
    int size;
    unsigned int count = 0;
    while (0 < (size = fread(buffer, 1, sizeof(buffer), infile)))
        for (int i = 0; i < size; i++)
            if (buffer[i] == c)
                ++count;
    return count;
}

unsigned count3(std::istream &infile, char c) {
    return std::count(std::istreambuf_iterator<char>(infile),
                      std::istreambuf_iterator<char>(), c);
}
unsigned count4(std::istream &infile, char c) {
    return std::count(std::istream_iterator<char>(infile),
                      std::istream_iterator<char>(), c);
}
template <class F, class T>
void timer(F f, T &t, std::string const &title) {
    unsigned count;
    clock_t start = clock();
    count = f(t, 'N');
    clock_t stop = clock();

    std::cout << std::left << std::setw(30) << title << "\tCount: " << count;
    std::cout << "\tTime: " << double(stop - start) / CLOCKS_PER_SEC << "\n";
}
int main() {
    char const *name = "test input.txt";

    FILE *infile = fopen(name, "r");

    timer(count1, infile, "ignore");
    rewind(infile);
    timer(count1, infile, "using getc");
    rewind(infile);
    timer(count2, infile, "using fread");
    fclose(infile);

    std::ifstream in2(name);
    in2.sync_with_stdio(false);
    timer(count3, in2, "ignore");
    in2.clear();
    in2.seekg(0);
    timer(count3, in2, "using streambuf iterators");
    in2.clear();
    in2.seekg(0);
    timer(count4, in2, "using stream iterators");
    return 0;
}
I ran this with a file of approximately 44 megabytes as input. When compiled with VC++ 2012, I got the following results:
ignore                          Count: 400000   Time: 2.08
using getc                      Count: 400000   Time: 2.034
using fread                     Count: 400000   Time: 0.257
ignore                          Count: 400000   Time: 0.607
using streambuf iterators       Count: 400000   Time: 0.608
using stream iterators          Count: 400000   Time: 5.136
Using the same input, but compiled with g++ 4.7.1:
ignore                          Count: 400000   Time: 0.359
using getc                      Count: 400000   Time: 0.339
using fread                     Count: 400000   Time: 0.243
ignore                          Count: 400000   Time: 0.697
using streambuf iterators       Count: 400000   Time: 0.694
using stream iterators          Count: 400000   Time: 1.612
So, even though all the reads are buffered, we're seeing a variation of more than 6:1 with g++ and about 20:1 with VC++. Of course, I haven't tested (even close to) every possible way of reading the input. I doubt we'd see a much wider range of times if we tested more techniques, but I could be wrong about that. Either way, we're seeing enough variation that, at least if you're processing a lot of data, you could well be justified in choosing one technique over another to improve processing speed.
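One technique the timings above don't include is the one you describe -- reading the whole file (or a large chunk of it) into a std::string first and then scanning the in-memory copy. A minimal sketch of how that could be written (count5 is just a hypothetical name; I haven't timed it here, so I'm not claiming where it would land relative to the others):

#include <algorithm>
#include <fstream>
#include <iterator>
#include <string>

// slurp the whole stream into a string, then count in memory
unsigned count5(std::istream &infile, char c) {
    std::string contents((std::istreambuf_iterator<char>(infile)),
                         std::istreambuf_iterator<char>());
    return std::count(contents.begin(), contents.end(), c);
}

It drops into the same harness as the others, e.g. timer(count5, in2, "using whole-file string");, at the cost of holding the entire file in memory at once.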