After looking around for a while, I found quite a few discussions on how to figure out the number of lines in a file.
For example these three:
c# how do I count l
Yes, reading lines like that is the fastest and easiest way in any practical sense.
There are no shortcuts here. Files are not line based, so you have to read every single byte from the file to determine how many lines there are.
As TomTom pointed out, creating the strings is not strictly needed to count the lines, but a vast majority of the time spent will be waiting for the data to be read from the disk. Writing a much more complicated algorithm would perhaps shave off a percent of the execution time, and it would dramatically increase the time for writing and testing the code.
StreamReader is not the fastest way to read files in general because of the small overhead of decoding the bytes into characters, so reading the file into a byte array is faster.
The results I get are a bit different each time due to caching and other processes, but here is one of the results I got (in milliseconds) with a 16 MB file:
75 ReadLines
82 ReadLine
22 ReadAllBytes
23 Read 32K
21 Read 64K
27 Read 128K
In general, File.ReadLines should be a little bit slower than a StreamReader.ReadLine loop.
File.ReadAllBytes is slower with bigger files and will throw an OutOfMemoryException with huge files.
The default buffer size for FileStream is 4K, but on my machine 64K seemed the fastest.
private static int countWithReadLines(string filePath)
{
    int count = 0;
    var lines = File.ReadLines(filePath);
    foreach (var line in lines) count++;
    return count;
}

private static int countWithReadLine(string filePath)
{
    int count = 0;
    using (var sr = new StreamReader(filePath))
        while (sr.ReadLine() != null)
            count++;
    return count;
}

private static int countWithFileStream(string filePath, int bufferSize = 1024 * 4)
{
    using (var fs = new FileStream(filePath, FileMode.Open, FileAccess.Read))
    {
        int count = 0;
        byte[] array = new byte[bufferSize];
        while (true)
        {
            int length = fs.Read(array, 0, bufferSize);
            // Read returns 0 only at end of file; a short read does not mean the file has ended
            if (length == 0) return count;
            for (int i = 0; i < length; i++)
                if (array[i] == 10) // 10 = line feed ('\n')
                    count++;
        }
    } // end of using
}
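The test code calls countWithReadAllBytes, which is not shown in the answer. A minimal sketch of what it might look like, assuming it simply loads the whole file and counts line-feed bytes (the body and the wrapper class name are assumptions, not the original code):

```csharp
using System.IO;

static class ReadAllBytesSketch
{
    // Assumed implementation: the answer calls this method but does not show it.
    public static int countWithReadAllBytes(string filePath)
    {
        // Loads the entire file into memory, so it is unsuitable for very large files.
        byte[] bytes = File.ReadAllBytes(filePath);
        int count = 0;
        foreach (byte b in bytes)
            if (b == 10) // 10 = line feed ('\n')
                count++;
        return count;
    }
}
```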
and tested with:
var path = "1234567890.txt"; Stopwatch sw; string s = "";
File.WriteAllLines(path, Enumerable.Repeat("1234567890abcd", 1024 * 1024 )); // 16MB (16 bytes per line)
sw = Stopwatch.StartNew(); countWithReadLines(path) ; sw.Stop(); s += sw.ElapsedMilliseconds + " ReadLines \n";
sw = Stopwatch.StartNew(); countWithReadLine(path) ; sw.Stop(); s += sw.ElapsedMilliseconds + " ReadLine \n";
sw = Stopwatch.StartNew(); countWithReadAllBytes(path); sw.Stop(); s += sw.ElapsedMilliseconds + " ReadAllBytes \n";
sw = Stopwatch.StartNew(); countWithFileStream(path, 1024 * 32); sw.Stop(); s += sw.ElapsedMilliseconds + " Read 32K \n";
sw = Stopwatch.StartNew(); countWithFileStream(path, 1024 * 64); sw.Stop(); s += sw.ElapsedMilliseconds + " Read 64K \n";
sw = Stopwatch.StartNew(); countWithFileStream(path, 1024 *128); sw.Stop(); s += sw.ElapsedMilliseconds + " Read 128K \n";
MessageBox.Show(s);
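Since single runs vary with caching and other processes, one way to get steadier numbers is to repeat each measurement and keep the fastest run. A rough sketch, where the helper name BestOf and the default of 5 runs are hypothetical choices:

```csharp
using System;
using System.Diagnostics;

static class TimingSketch
{
    // Hypothetical helper: calls 'count' on 'path' several times and
    // returns the fastest run in milliseconds.
    public static long BestOf(Func<string, int> count, string path, int runs = 5)
    {
        count(path); // untimed warm-up run (JIT compilation, OS file cache)
        long best = long.MaxValue;
        for (int i = 0; i < runs; i++)
        {
            var sw = Stopwatch.StartNew();
            count(path);
            sw.Stop();
            best = Math.Min(best, sw.ElapsedMilliseconds);
        }
        return best;
    }
}
```

It could be used as s += TimingSketch.BestOf(countWithReadLines, path) + " ReadLines \n"; a method with an optional parameter needs a lambda, for example p => countWithFileStream(p, 1024 * 64).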
I tried multiple methods and tested their performance:
The one that reads a single byte at a time is about 50% slower than the other methods. The other methods all take about the same amount of time. You could try creating threads and doing this asynchronously, so while you are waiting for a read you can start processing a previous read. That sounds like a headache to me.
I would go with the one-liner File.ReadLines(filePath).Count(); it performs as well as the other methods I tested.
private static int countFileLines(string filePath)
{
    using (StreamReader r = new StreamReader(filePath))
    {
        int i = 0;
        while (r.ReadLine() != null)
        {
            i++;
        }
        return i;
    }
}
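
private static int countFileLines2(string filePath)
{
    using (Stream s = new FileStream(filePath, FileMode.Open, FileAccess.Read))
    {
        int i = 0;
        int last = -1; // last byte read, -1 while nothing has been read
        int b = s.ReadByte();
        while (b >= 0)
        {
            if (b == 10)
            {
                i++;
            }
            last = b;
            b = s.ReadByte();
        }
        // Count a final line without a trailing newline; an empty file has no lines
        return (last >= 0 && last != 10) ? i + 1 : i;
    }
}

// bufferSize was not declared in the original listing; the parameter and its default are assumptions
private static int countFileLines3(string filePath, int bufferSize = 1024 * 64)
{
    using (Stream s = new FileStream(filePath, FileMode.Open, FileAccess.Read))
    {
        int i = 0;
        int last = -1; // last byte read, -1 while nothing has been read
        byte[] b = new byte[bufferSize];
        int n = s.Read(b, 0, bufferSize);
        while (n > 0)
        {
            i += countByteLines(b, n);
            last = b[n - 1];
            n = s.Read(b, 0, bufferSize);
        }
        // Count a final line without a trailing newline; an empty file has no lines
        return (last >= 0 && last != 10) ? i + 1 : i;
    }
}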

private static int countByteLines(byte[] b, int n)
{
    int i = 0;
    for (int j = 0; j < n; j++)
    {
        if (b[j] == 10)
        {
            i++;
        }
    }
    return i;
}

private static int countFileLines4(string filePath)
{
    return File.ReadLines(filePath).Count();
}
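The idea mentioned in this answer of processing one read while waiting for the next can be sketched with two buffers and ReadAsync instead of explicit threads. This is only an illustration under assumptions: the class and method names are hypothetical, it counts line-feed bytes only (no + 1 for an unterminated last line), and nothing here shows that it is actually faster.

```csharp
using System.IO;
using System.Threading.Tasks;

static class OverlappedSketch
{
    // Hypothetical sketch: scans one buffer while the next read is in flight.
    public static async Task<int> CountLinesOverlappedAsync(string filePath, int bufferSize = 64 * 1024)
    {
        using (var fs = new FileStream(filePath, FileMode.Open, FileAccess.Read,
                                       FileShare.Read, bufferSize, useAsync: true))
        {
            byte[] current = new byte[bufferSize];
            byte[] next = new byte[bufferSize];
            int count = 0;
            int n = await fs.ReadAsync(current, 0, bufferSize);
            while (n > 0)
            {
                // Start the next read before scanning the bytes already in memory
                Task<int> pending = fs.ReadAsync(next, 0, bufferSize);
                for (int i = 0; i < n; i++)
                    if (current[i] == 10) // 10 = line feed ('\n')
                        count++;
                n = await pending;
                // Swap the buffers so 'current' holds the data that was just read
                byte[] tmp = current;
                current = next;
                next = tmp;
            }
            return count;
        }
    }
}
```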
The best way to know how to do this fast is to think about the fastest way to do it without using C/C++.
In assembly there is a CPU-level operation that scans memory for a character (REPNE SCASB on x86, for example), so in assembly you would scan the buffer for the newline character with it.
So, in C# you want the compiler to get as close to that as possible.
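public static int CountLines(Stream stm)
{
    // Not disposing the reader leaves the caller's stream open
    StreamReader _reader = new StreamReader(stm);
    int c = 0, count = 0;
    // Read() returns the next character, or -1 at the end of the stream
    while ((c = _reader.Read()) != -1)
    {
        if (c == '\n')
        {
            count++;
        }
    }
    return count;
}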
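A variant that may get closer to a hardware-level memory scan is to read raw bytes and search them with ReadOnlySpan<byte>.IndexOf, which recent .NET runtimes typically implement with vectorized instructions where available. This is a sketch under assumptions: it needs .NET Core 2.1 or later (or the System.Memory package), the class name and the bufferSize parameter are hypothetical, and it counts line-feed bytes rather than decoded characters.

```csharp
using System;
using System.IO;

static class SpanScanSketch
{
    // Hypothetical sketch: counts '\n' bytes by repeatedly searching the buffer.
    public static int CountLines(Stream stm, int bufferSize = 64 * 1024)
    {
        byte[] buffer = new byte[bufferSize];
        int count = 0;
        int n;
        while ((n = stm.Read(buffer, 0, buffer.Length)) > 0)
        {
            ReadOnlySpan<byte> rest = new ReadOnlySpan<byte>(buffer, 0, n);
            int pos;
            // IndexOf returns -1 when the remaining bytes contain no line feed
            while ((pos = rest.IndexOf((byte)'\n')) >= 0)
            {
                count++;
                rest = rest.Slice(pos + 1); // continue after the match
            }
        }
        return count;
    }
}
```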
No, it is not. The point is that it materializes the strings, which is not needed.
To COUNT the lines you are much better off ignoring the "string" part and going for the "line" part.
A LINE is a series of bytes ending with \r\n (13, 10 - CR LF) or another marker.
Just run along the bytes, in a buffered stream, counting the number of appearances of your end of line marker.
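A minimal sketch of that approach for the two-byte CR LF marker, assuming the caller passes an already buffered stream such as a FileStream (the class and method names are hypothetical). The only subtlety is that the marker can be split across two reads, so the "previous byte was CR" state is carried over between buffers:

```csharp
using System.IO;

static class MarkerCountSketch
{
    // Hypothetical sketch: counts occurrences of CR LF (13, 10) in a stream.
    public static long CountCrLf(Stream stream, int bufferSize = 64 * 1024)
    {
        byte[] buffer = new byte[bufferSize];
        long count = 0;
        bool previousWasCr = false; // carried across buffer boundaries
        int n;
        while ((n = stream.Read(buffer, 0, buffer.Length)) > 0)
        {
            for (int i = 0; i < n; i++)
            {
                if (buffer[i] == 10 && previousWasCr)
                    count++;
                previousWasCr = buffer[i] == 13;
            }
        }
        return count;
    }
}
```

For files that use a bare \n as the marker, the check reduces to counting bytes equal to 10.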