I tried unxutils' wc -l, but it crashed on 1 GB files. I then tried this C# code:
long count = 0;
using (StreamReader r = new StreamReader(f))
{
st
Your approach looks good. The only thing I would add is to experiment with the buffer size, since performance may vary with it.
For more on buffer sizes, see the question "Optimum file buffer read size?"
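To make the buffer-size experiment concrete, here is a minimal sketch in C, not a tuned implementation. It reads the file in blocks with fread and counts newline bytes with memchr. The function name count_newlines and its buf_size parameter are my own names for illustration; they do not come from the question.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Count '\n' bytes in a file, reading it in blocks of buf_size bytes.
 * Returns the count, or -1 if the file cannot be opened or memory runs out.
 * buf_size is the value to experiment with (hypothetical parameter name). */
long long count_newlines(const char *path, size_t buf_size)
{
    assert(buf_size > 0);

    FILE *fp = fopen(path, "rb");   /* binary mode: no newline translation */
    if (fp == NULL)
        return -1;

    char *buf = malloc(buf_size);
    if (buf == NULL) {
        fclose(fp);
        return -1;
    }

    long long count = 0;
    size_t got;
    while ((got = fread(buf, 1, buf_size, fp)) > 0) {
        /* memchr finds the next newline in the block; repeat until none is left */
        const char *p = buf;
        const char *end = buf + got;
        while ((p = memchr(p, '\n', (size_t)(end - p))) != NULL) {
            count++;
            p++;
        }
    }

    free(buf);
    fclose(fp);
    return count;
}
```

Call it with a few different sizes, for example count_newlines(path, 4096) and count_newlines(path, 1 << 20), and time each run on your own files. Like wc -l, it counts newline characters, so a last line without a trailing newline is not counted.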
If you really want speed, consider C.
If this is a command-line utility, it will be faster because it won't have to initialize the CLR or .NET. It also won't allocate a new string for each line read from the file, which probably improves throughput.
I don't have any files with 1g lines, so I cannot compare. You can try this, though:
/*
* LineCount.c
*
* count lines...
*
* compile with:
*
* c:\vc10\bin\cl.exe /O2 -Ic:\vc10\Include -I\winsdk\Include
* LineCount.c -link /debug /SUBSYSTEM:CONSOLE /LIBPATH:c:\vc10\Lib
* /LIBPATH:\winsdk\Lib /out:LineCount.exe
*/
#include <stdio.h>
#include <string.h>
#include <stdlib.h>

void Usage(char *appname)
{
    printf("\nLineCount.exe\n");
    printf(" count lines in a text file...\n\n");
    printf("usage:\n");
    printf(" %s <filename>\n\n", appname);
}

/* returns the number of newline-terminated lines, or -1 on error */
long long linecnt(char *file)
{
    int sz = 2048;
    char *buf = (char *) malloc(sz);
    FILE *fp = NULL;
    long long n = 0;   /* an int could overflow on very large files */
    errno_t rc;

    if (buf == NULL) {
        fprintf(stderr, "%s: malloc(%d) failed\n", __FILE__, sz);
        return -1;
    }
    rc = fopen_s(&fp, file, "r");
    if (rc) {
        fprintf(stderr, "%s: fopen(%s) failed: ecode(%d)\n",
                __FILE__, file, rc);
        free(buf);
        return -1;
    }
    while (fgets(buf, sz, fp)) {
        size_t r = strlen(buf);
        /* a line longer than the buffer arrives in several chunks;
           only the chunk that ends in '\n' is counted */
        if (r > 0 && buf[r-1] == '\n')
            n++;
    }
    fclose(fp);
    free(buf);
    return n;
}

int main(int argc, char **argv)
{
    if (argc == 2) {
        long long n = linecnt(argv[1]);
        if (n < 0)
            exit(1);
        printf("Lines: %lld\n", n);
    }
    else {
        Usage(argv[0]);
        exit(1);
    }
    return 0;
}
Are you just looking for a tool that counts the lines in a file efficiently? If so, try MS LogParser.
Something like the following will give you the number of lines:
LogParser "SELECT count(*) FROM file" -i:TEXTLINE
Have you tried flex?
%{
long num_lines = 0;
%}
%option 8bit outfile="scanner.c"
%option nounput nomain noyywrap
%option warn
%%
.+ { }
\n { ++num_lines; }
%%
int main(int argc, char **argv)
{
    yylex();
    /* num_lines is a long, so it needs %ld rather than %d */
    printf("# of lines = %ld\n", num_lines);
    return 0;
}
Just compile with:
flex -Cf scanner.l
gcc -O -o lineCount.exe scanner.c
It accepts input on stdin and outputs the number of lines.
File.ReadLines was introduced in .NET 4.0:
// requires: using System.IO; using System.Linq;
var count = File.ReadLines(file).Count();
It runs in 4 seconds, the same time as the first code snippet.
Your first approach already looks like the optimal solution. Keep in mind that you are mostly not CPU bound but limited by the hard disk's read speed, which at 500 MB / 4 s = 125 MB/s is already quite fast. The only way to get faster than that is with RAID or SSDs, not so much with a better algorithm.
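The arithmetic behind that estimate can be sketched as two small C helpers. The 500 MB and 4 s figures are the ones quoted above. The function names are hypothetical, and the 1 GB extrapolation in the usage note assumes 1 GB = 1000 MB and that the disk sustains the same rate, which real hardware may not.

```c
#include <assert.h>

/* Throughput in MB/s for `megabytes` read in `seconds` of wall-clock time. */
double throughput_mb_per_s(double megabytes, double seconds)
{
    assert(seconds > 0.0);
    return megabytes / seconds;
}

/* Estimated seconds needed to read `megabytes` at a sustained `mb_per_s`. */
double estimated_seconds(double megabytes, double mb_per_s)
{
    assert(mb_per_s > 0.0);
    return megabytes / mb_per_s;
}
```

throughput_mb_per_s(500.0, 4.0) gives 125.0, and estimated_seconds(1000.0, 125.0) gives 8.0. If your measured time for a file is already close to such an estimate for your disk, the line-counting code is probably not the bottleneck.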