Accessing individual characters in a file inefficient? (C++)

Asked by 梦谈多话, 2020-12-31 19:13

I've always assumed it to be more efficient, when processing text files, to first read the contents (or part of them) into a std::string or char array, as, from my limited …

3 Answers

Answer by 轻奢々, 2020-12-31 19:40

    A great deal here depends on exactly how critical performance really is for you and your application. That, in turn, tends to depend on how large the files are that you're dealing with -- if it's something like tens or hundreds of kilobytes, you should generally just write the simplest code that will work and not worry much about it. Anything you can do is going to be essentially instantaneous, so optimizing the code won't really accomplish much.

    On the other hand, if you're processing a lot of data -- on the order of tens of megabytes or more -- differences in efficiency can become fairly substantial. Unless you take fairly specific steps to bypass it (such as using the low-level read() call), all your reads are going to be buffered -- but that does not mean they'll all be the same speed, or necessarily even close to it.
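
    To make "bypass it" concrete: on a POSIX system you could drop down to the raw read() call, so the only buffering is whatever you manage yourself. Here's a minimal sketch of that idea (it's not part of the timings below; count_raw and the 4 KB buffer size are just illustrative choices):

    #include <fcntl.h>    // open, O_RDONLY
    #include <unistd.h>   // read, close, ssize_t
    
    unsigned count_raw(char const *name, char c) { 
        int fd = open(name, O_RDONLY);
        if (fd < 0)
            return 0;                          // error handling elided
        char buffer[4096];
        unsigned count = 0;
        ssize_t size;
    
        // Each read() goes straight to the kernel; no stdio buffering in between.
        while (0 < (size = read(fd, buffer, sizeof(buffer))))
            for (ssize_t i = 0; i < size; i++)
                if (buffer[i] == c)
                    ++count;
        close(fd);
        return count;
    }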

    For example, let's try a quick test of a few different methods for doing essentially what you've asked about:

    #include <stdio.h>
    #include <time.h>
    #include <string>
    #include <iostream>
    #include <iomanip>
    #include <fstream>
    #include <iterator>
    #include <algorithm>
    
    unsigned count1(FILE *infile, char c) { 
        int ch;
        unsigned count = 0;
    
        while (EOF != (ch=getc(infile)))
            if (ch == c)
                ++count;
        return count;
    }
    
    unsigned int count2(FILE *infile, char c) { 
        static char buffer[4096];
        int size;
        unsigned int count = 0;
    
        while (0 < (size = fread(buffer, 1, sizeof(buffer), infile)))
            for (int i=0; i<size; i++)
                if (buffer[i] == c)
                    ++count;
        return count;
    }
    
    unsigned count3(std::istream &infile, char c) { 
        return std::count(std::istreambuf_iterator<char>(infile), 
                          std::istreambuf_iterator<char>(), c);
    }
    
    unsigned count4(std::istream &infile, char c) {    
        return std::count(std::istream_iterator<char>(infile), 
                          std::istream_iterator<char>(), c);
    }
    
    template <class F, class T>
    void timer(F f, T &t, std::string const &title) { 
        unsigned count;
        clock_t start = clock();
        count = f(t, 'N');
        clock_t stop = clock();
        std::cout << std::left << std::setw(30) << title << "\tCount: " << count;
        std::cout << "\tTime: " << double(stop-start)/CLOCKS_PER_SEC << "\n";
    }
    
    int main() {
        char const *name = "test input.txt";
    
        FILE *infile=fopen(name, "r");
    
        timer(count1, infile, "ignore");
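    // (The runs labeled "ignore", here and below, are presumably just warm-ups
    // so the file is in the OS cache before the timed passes.)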
    
        rewind(infile);
        timer(count1, infile, "using getc");
    
        rewind(infile);
        timer(count2, infile, "using fread");
    
        fclose(infile);
    
        std::ifstream in2(name);
        in2.sync_with_stdio(false);
        timer(count3, in2, "ignore");
    
        in2.clear();
        in2.seekg(0);
        timer(count3, in2, "using streambuf iterators");
    
        in2.clear();
        in2.seekg(0);
        timer(count4, in2, "using stream iterators");
    
        return 0;
    }
    

    I ran this with a file of approximately 44 megabytes as input. When compiled with VC++ 2012, I got the following results:

    ignore                          Count: 400000   Time: 2.08
    using getc                      Count: 400000   Time: 2.034
    using fread                     Count: 400000   Time: 0.257
    ignore                          Count: 400000   Time: 0.607
    using streambuf iterators       Count: 400000   Time: 0.608
    using stream iterators          Count: 400000   Time: 5.136
    

    Using the same input, but compiled with g++ 4.7.1:

    ignore                          Count: 400000   Time: 0.359
    using getc                      Count: 400000   Time: 0.339
    using fread                     Count: 400000   Time: 0.243
    ignore                          Count: 400000   Time: 0.697
    using streambuf iterators       Count: 400000   Time: 0.694
    using stream iterators          Count: 400000   Time: 1.612
    

    So, even though all the reads are buffered, we're seeing a variation of about 8:1 with g++ and about 20:1 with VC++. Of course, I haven't tested (even close to) every possible way of reading the input. I doubt we'd see a much wider range of times if we tested more techniques, but I could be wrong about that. Either way, we're seeing enough variation that, at least if you're processing a lot of data, you could well be justified in choosing one technique over another to improve processing speed.
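
    One technique that isn't in the table, but is essentially what the question describes, is slurping the whole file into a std::string first and then scanning it in memory. A sketch of that, untimed here and assuming the file fits comfortably in RAM (count5 is just an illustrative name; it only needs the headers already in the listing above):

    unsigned count5(std::istream &infile, char c) { 
        // Read the entire stream into a string up front...
        std::string contents((std::istreambuf_iterator<char>(infile)), 
                             std::istreambuf_iterator<char>());
        // ...then count occurrences purely in memory.
        return static_cast<unsigned>(std::count(contents.begin(), contents.end(), c));
    }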
