I'm reading a file, line by line, and extracting integers from it. Some noteworthy points:
- the input file is not in binary;
- I cannot load the whole file into memory.
The file format (only integers, separated by some delimiter) is:
x1 x2 x3 x4 ... y1 y2 y3 ... z1 z2 z3 z4 z5 ... ...
Just to add context: I'm reading the integers, and counting them, using an std::unordered_map<unsigned int, unsigned int>.
Simply looping through lines, and allocating useless stringstreams, like this:
std::string line;
std::fstream infile(<inpath>, std::ios::in);
while (std::getline(infile, line)) {
    std::stringstream ss(line);
}
gives me ~2.7s for a 700MB file.
Parsing each line:
std::string line;
unsigned int item;
std::fstream infile(<inpath>, std::ios::in);
while (std::getline(infile, line)) {
    std::stringstream ss(line);
    while (ss >> item);
}
gives me ~17.8s for the same file.
If I change the extraction operator to a std::getline + atoi combination:
std::string line, token;
unsigned int item;
std::fstream infile(<inpath>, std::ios::in);
while (std::getline(infile, line)) {
    std::stringstream ss(line);
    while (std::getline(ss, token, ' ')) item = atoi(token.c_str());
}
It gives ~14.6s.
Is there anything faster than these approaches? I don't think it's necessary to speed up the file reading, just the parsing itself -- though improving either wouldn't hurt (:
This program
#include <iostream>
int main()
{
    int num;
    while (std::cin >> num) ;
}
needs about 17 seconds to read a file. This code
#include <iostream>
int main()
{
    int lc = 0;        // line count
    int item = 0;      // value currently being accumulated
    char buf[2048];
    do
    {
        std::cin.read(buf, sizeof(buf));
        std::streamsize k = std::cin.gcount();
        for (std::streamsize i = 0; i < k; ++i)
        {
            switch (buf[i])
            {
            case '\r':
                break;
            case '\n':
                item = 0; lc++;
                break;
            case ' ':
                item = 0;
                break;
            case '0': case '1': case '2': case '3':
            case '4': case '5': case '6': case '7':
            case '8': case '9':
                item = 10*item + buf[i] - '0';   // accumulate digit by digit
                break;
            default:
                std::cerr << "Bad format\n";
            }
        }
    } while (std::cin);
}
needs about 1.25 seconds for the same file. Make of it what you will...
Streams are slow. If you really want to do stuff fast, load the entire file into memory and parse it in memory. If you really can't load it all into memory, load it in chunks, making those chunks as large as possible, and parse the chunks in memory.
When parsing in memory, replace the spaces and line endings with nulls so you can use atoi to convert to integers as you go.
Oh, and you'll get problems at the end of each chunk, because you don't know whether the chunk boundary cuts off a number. To solve this easily, stop a small distance (16 bytes should do) before the chunk end and copy this tail to the start before loading the next chunk after it.
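Here's a minimal sketch of that in-memory pass, assuming the chunking guarantees each chunk ends on a delimiter (the tail-copy trick above takes care of that); parse_in_mem and the counts map are illustrative names, not part of the answer:
#include <cstddef>
#include <cstdlib>
#include <unordered_map>

// Parses one chunk that is guaranteed to end on a delimiter (space or
// newline), counting occurrences as in the question's unordered_map.
void parse_in_mem(char* buf, std::size_t n,
                  std::unordered_map<unsigned int, unsigned int>& counts)
{
    char* p = buf;
    char* end = buf + n;
    while (p < end)
    {
        while (p < end && (*p == ' ' || *p == '\n' || *p == '\r')) ++p;  // skip delimiters
        if (p == end) break;
        char* tok = p;
        // the chunk ends on a delimiter, so this scan always stops before `end`
        while (*p != ' ' && *p != '\n' && *p != '\r') ++p;
        *p++ = '\0';                       // null-terminate in place for atoi
        ++counts[static_cast<unsigned int>(std::atoi(tok))];
    }
}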
Have you tried input iterators? They skip the creation of the intermediate strings:
std::istream_iterator<int> begin(infile);   // requires <iterator>
std::istream_iterator<int> end;
int item = 0;
while (begin != end)
    item = *begin++;
Why don't you skip the stream and the line buffers and read from the file stream directly?
template<class T, class CharT, class CharTraits>
std::vector<T> read(std::basic_istream<CharT, CharTraits> &in) {
    std::vector<T> ret;
    T x;
    while (in >> x)          // stops at EOF or the first malformed token
        ret.push_back(x);
    return ret;
}
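Usage would then be as simple as (file name hypothetical):
std::ifstream infile("numbers.txt");
std::vector<unsigned int> items = read<unsigned int>(infile);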
Following up Jack Aidley's answer (can't put code in the comments), here's some pseudo-code:
vector<char> buff( chunk_size );
char* chunk = &buff[0];
size_t tail = 0;                          // unparsed bytes carried over from the previous chunk
while( not done with file )
{
    fread( chunk + tail, ... );           // read a sizable chunk into memory, filling in after the tail
    last = find_last_eol( chunk );        // find where the last full line ends
    parse_in_mem( chunk, last );          // process up to the last full line
    tail = chunk_size - last;             // length of the cut-off remainder
    move_unprocessed_to_front( chunk, last ); // don't re-read what's already in mem
}
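Fleshed out into compilable C++ (the chunk size, the digit-by-digit parsing, and the counting map are my own choices here, not part of the original answer), the loop might look like this:
#include <cstdio>
#include <cstring>
#include <unordered_map>
#include <vector>

int main(int argc, char** argv)
{
    if (argc < 2) return 1;
    const size_t chunk_size = 1 << 20;  // 1 MiB; assumes no line is longer than this
    std::vector<char> buff(chunk_size);
    std::unordered_map<unsigned int, unsigned int> counts;

    FILE* f = std::fopen(argv[1], "rb");
    if (!f) return 1;

    size_t tail = 0;                    // unparsed bytes carried over from the previous chunk
    for (;;)
    {
        size_t got = std::fread(buff.data() + tail, 1, chunk_size - tail, f);
        size_t used = tail + got;
        if (used == 0) break;

        // stop at the last full line, unless we've hit EOF
        size_t last = used;
        if (got > 0)
            while (last > 0 && buff[last - 1] != '\n') --last;

        // parse the complete lines digit by digit
        unsigned int item = 0;
        bool in_number = false;
        for (size_t i = 0; i < last; ++i)
        {
            char c = buff[i];
            if (c >= '0' && c <= '9') { item = 10 * item + (c - '0'); in_number = true; }
            else if (in_number) { ++counts[item]; item = 0; in_number = false; }
        }
        if (in_number) ++counts[item];  // file didn't end with a newline

        tail = used - last;
        std::memmove(buff.data(), buff.data() + last, tail);  // carry the tail forward
        if (got == 0) break;
    }
    std::fclose(f);
}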
Source: https://stackoverflow.com/questions/15163751/speed-up-integer-reading-from-file-in-c