I need to read the number of lines in a file before doing some operations on that file. When I try to read the file and increment the line_count variable at each iteration u
Don't use C++ stl strings and getline
( or C's fgets), just C style raw pointers and either block read in page-size chunks or mmap the file.
Then scan the block at the native word size of your system ( ie either uint32_t
or uint64_t
) using one of the magic algorithms 'SIMD Within A Register (SWAR) Operations' for testing the bytes within the word. An example is here; the loop with the 0x0a0a0a0a0a0a0a0aLL
in it scans for line breaks. ( that code gets to around 5 cycles per input byte matching a regex on each line of a file )
If the file is only a few tens or a hundred or so megabytes, and it keeps growing (ie something keeps writing to it), then there's a good likelihood that linux has it cached in memory, so it won't be disk IO limited, but memory bandwidth limited.
If the file is only ever being appended to, you could also remember the number of lines and previous length, and start from there.
It has been pointed out that you could use mmap with C++ stl algorithms, and create a functor to pass to std::foreach. I suggested that you shouldn't do it not because you can't do it that way, but there is no gain in writing the extra code to do so. Or you can use boost's mmapped iterator, which handles it all for you; but for the problem the code I linked to was written for this was much, much slower, and the question was about speed not style.
The thing that takes time is loading 40+ MB into memory. The fastest way to do that is to either memorymap it, or load it in one go into a big buffer. Once you have it in memory, one way or another, a loop traversing the data looking for \n
characters is almost instantaneous, no matter how it is implemented.
So really, the most important trick is to load the file into memory as fast as possible. And the fastest way to do that is to do it as a single operation.
Otherwise, plenty of tricks may exist to speed up the algorithm. If lines are only added, never modified or removed, and if you're reading the file repeatedly, you can cache the lines read previously, and the next time you have to read the file, only read the newly added lines.
Or perhaps you can maintain a separate index file showing the location of known '\n' characters, so those parts of the file can be skipped over.
Reading large amounts of data from the harddrive is slow. There's no way around that.