I need to read the number of lines in a file before doing some operations on that file. When I try to read the file and increment the line_count variable at each iteration u
There's a difference between counting lines and counting line separators. Some common gotchas to watch out for if getting an exact line count is important:
What's the file encoding? The byte-by-byte solutions will work for ASCII and UTF-8, but watch out if you have UTF-16 or some multibyte encoding that doesn't guarantee that a byte with the value of a line feed necessarily encodes a line feed.
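To make the encoding point concrete, here is a minimal sketch showing how a byte-by-byte count can go wrong on UTF-16. The helper names `count_lf_bytes` and `utf16le_sample` are made up for illustration. The sample is the UTF-16LE encoding of three code units: "A", U+010A (LATIN CAPITAL LETTER C WITH DOT ABOVE), and a line feed. The low byte of U+010A is 0x0A, so a naive scan sees two line feeds where the text contains only one.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>

// Hypothetical helper: counts raw bytes equal to 0x0A, which is what a
// naive byte-by-byte line counter does regardless of encoding.
std::size_t count_lf_bytes(const std::string& bytes) {
    return static_cast<std::size_t>(
        std::count(bytes.begin(), bytes.end(), '\n'));
}

// Hypothetical sample data: UTF-16LE for "A", U+010A, "\n".
//   'A'    -> 41 00
//   U+010A -> 0A 01   (low byte happens to equal a line feed)
//   '\n'   -> 0A 00
std::string utf16le_sample() {
    const char raw[] = {'\x41', '\x00', '\x0A', '\x01', '\x0A', '\x00'};
    return std::string(raw, sizeof raw);
}
```

Calling `count_lf_bytes(utf16le_sample())` reports two line-feed bytes even though the text has a single line feed. For UTF-16 input you would need to scan 16-bit code units (with the right byte order) rather than bytes.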
Many text files don't have a line separator at the end of the last line. So if your file says "Hello, World!", you could end up with a count of 0 instead of 1. Rather than just counting the line separators, you'll need a simple state machine to keep track.
Some very obscure files use Unicode U+2028 LINE SEPARATOR (or even U+2029 PARAGRAPH SEPARATOR) as line separators instead of the more common carriage return and/or line feed. You might also want to watch out for U+0085 NEXT LINE (NEL).
You'll have to consider whether you want to count some other control characters as line breakers. For example, should a U+000C FORM FEED or U+000B LINE TABULATION (a.k.a. vertical tab) be considered going to a new line?
Text files from older versions of Mac OS (before OS X) use carriage returns (U+000D) rather than line feeds (U+000A) to separate lines. If you're reading the raw bytes into a buffer (e.g., with your stream in binary mode) and scanning them for line feeds, you'll come up with a count of 0 on these files. You can't count both carriage returns and line feeds, because PC files generally end a line with both. Again, you'll need a simple state machine. (Alternatively, you can read the file in text mode rather than binary mode. The text interfaces will normalize line separators to '\n' for files that conform to the convention used on your platform. If you're reading files from other platforms, you'll be back to binary mode with a state machine.)
If you ever have a super long line in the file, the getline() approach can throw an exception, causing your simple line counter to fail on a small number of files. (This is particularly true if you're reading an old Mac file on a non-Mac platform, causing getline() to see the entire file as one gigantic line.) By reading chunks into a fixed-size buffer and using a state machine, you can make it bulletproof.
The code in the accepted answer suffers from most of these traps. Make it right before you make it fast.