How to count lines fast?

感情败类 2020-12-09 15:34

I tried unxutils' wc -l, but it crashed on 1 GB files. I then tried this C# code:

long count = 0;
using (StreamReader r = new StreamReader(f))
{
    st         


        
6 Answers
  • 2020-12-09 15:46

    Your approach looks good. The one thing I would add is to experiment with the buffer size, since read performance can change with it.

    For guidance on choosing a buffer size, see the question "Optimum file buffer read size?".
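
    To make the buffer-size experiment concrete, here is a minimal sketch in C (the language used in another answer here) rather than the asker's C#. The helper name count_newlines and its bufsize parameter are made up for illustration. It reads the file in chunks of the requested size and counts '\n' bytes, so the only thing that changes between runs is the buffer size.

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: counts '\n' bytes in `path`, reading `bufsize` bytes
 * at a time. Returns -1 if the file cannot be opened or read, if bufsize
 * is 0, or if the buffer cannot be allocated. */
long long count_newlines(const char *path, size_t bufsize)
{
    FILE *fp;
    char *buf;
    long long n = 0;
    size_t got;

    if (bufsize == 0)
        return -1;

    fp = fopen(path, "rb");   /* binary mode: bytes are read untranslated */
    if (fp == NULL)
        return -1;

    buf = malloc(bufsize);
    if (buf == NULL) {
        fclose(fp);
        return -1;
    }

    while ((got = fread(buf, 1, bufsize, fp)) > 0) {
        const char *p = buf;
        const char *end = buf + got;

        /* memchr jumps from one newline to the next within this chunk */
        while ((p = memchr(p, '\n', (size_t)(end - p))) != NULL) {
            n++;
            p++;
        }
    }

    if (ferror(fp))
        n = -1;               /* a read error ended the loop, not EOF */

    free(buf);
    fclose(fp);
    return n;
}
```

    Calling it with, say, 4096, 65536 and 1048576 and timing each run on the same file would show whether the buffer size matters on a given machine. Those sizes are only starting points, not recommendations.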

  • 2020-12-09 15:48

    If you really want speed, consider C code.

    As a command-line utility it will start faster, because it doesn't have to initialize the CLR. It also doesn't allocate a new string for each line read from the file, which probably helps throughput.

    I don't have any files that large, so I can't compare, but you can try this:

    /*
     * LineCount.c
     *
     * count lines...
     *
     * compile with: 
     *
     *  c:\vc10\bin\cl.exe /O2 -Ic:\vc10\Include -I\winsdk\Include 
     *          LineCount.c -link /debug /SUBSYSTEM:CONSOLE /LIBPATH:c:\vc10\Lib
     *          /LIBPATH:\winsdk\Lib /out:LineCount.exe
     */
    
    #include <stdio.h>
    #include <string.h>
    #include <stdlib.h>
    
    
    void Usage(char *appname)
    {
        printf("\nLineCount.exe\n");
        printf("  count lines in a text file...\n\n");
        printf("usage:\n");
        printf("  %s <filename>\n\n", appname);
    }
    
    
    
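    /* Returns the number of newline-terminated lines, or -1 on error. */
    long long linecnt(const char *file)
    {
        int sz = 2048;
        char *buf = (char *) malloc(sz);
        FILE *fp = NULL;
        long long n = 0;
        errno_t rc;
    
        if (buf == NULL) {
            fprintf(stderr, "%s: malloc(%d) failed\n", __FILE__, sz);
            return -1;
        }
    
        /* binary mode: no CR/LF translation, and Ctrl-Z is not treated as EOF */
        rc = fopen_s(&fp, file, "rb");
        if (rc) {
            fprintf(stderr, "%s: fopen(%s) failed: ecode(%d)\n",
                    __FILE__, file, rc);
            free(buf);
            return -1;
        }
    
        while (fgets(buf, sz, fp)) {
            size_t r = strlen(buf);
            /* A line longer than the buffer arrives in several pieces; only
             * the last piece ends in '\n', so each line is counted once. */
            if (r > 0 && buf[r-1] == '\n')
                n++;
        }
        fclose(fp);
        free(buf);
        return n;
    }
    
    int main(int argc, char **argv)
    {
        if (argc==2) {
            long long n = linecnt(argv[1]);
            if (n < 0)
                return 1;   /* linecnt already reported the error */
            printf("Lines: %lld\n", n);
        }
        else {
            Usage(argv[0]);
            exit(1);
        }
        return 0;
    }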
    
  • 2020-12-09 15:52

    Are you just looking for a tool that counts the lines in a file efficiently? If so, try MS LogParser.

    Something like the following will give you the number of lines:

    LogParser "SELECT count(*) FROM file" -i:TEXTLINE
    
  • 2020-12-09 15:55

    Have you tried flex?

    %{
    long num_lines = 0;
    %}
    %option 8bit outfile="scanner.c"
    %option nounput nomain noyywrap
    %option warn
    
    %%
    .+ { }
    \n { ++num_lines; }
    %%
    int main(int argc, char **argv)
    {
        (void) argc;   /* unused: the scanner reads from stdin */
        (void) argv;
        yylex();
        /* num_lines is a long, so print it with %ld */
        printf("# of lines = %ld\n", num_lines);
        return 0;
    }
    

    Just compile with:

    flex -Cf scanner.l 
    gcc -O -o lineCount.exe scanner.c
    

    It accepts input on stdin and outputs the number of lines.

  • 2020-12-09 16:00

    File.ReadLines was introduced in .NET 4.0. Combined with LINQ's Count() (which needs using System.Linq;), it gives a one-liner:

    var count = File.ReadLines(file).Count();
    

    This runs in 4 seconds, the same time as the first code snippet.

  • 2020-12-09 16:01

    Your first approach already looks close to optimal. Keep in mind that you are mostly not CPU bound but limited by the hard disk's read speed, which at 500 MB / 4 s = 125 MB/s is already quite fast. The way to go faster than that is RAID or SSDs, not so much a better algorithm.
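
    The arithmetic behind that estimate can be written out as a small C sketch. The helper names are made up for illustration, and the 1024 MB figure is an assumption based on the 1 GB files mentioned in the question. It computes the read rate from the numbers above and then the time a larger file would need at the same rate.

```c
#include <stdio.h>

/* Hypothetical helper: average read rate in MB/s, given the amount read
 * and the elapsed time. `seconds` must be greater than zero. */
double read_rate_mb_per_s(double megabytes, double seconds)
{
    return megabytes / seconds;
}

/* Hypothetical helper: seconds needed to read `megabytes` at a given rate.
 * `rate_mb_per_s` must be greater than zero. */
double seconds_at_rate(double megabytes, double rate_mb_per_s)
{
    return megabytes / rate_mb_per_s;
}

/* Prints the rate from the figures above and one extrapolation. */
void print_estimate(void)
{
    double rate = read_rate_mb_per_s(500.0, 4.0);   /* 500 MB in 4 s */

    printf("rate: %.0f MB/s\n", rate);
    /* 1024 MB is an assumed file size, used only for illustration */
    printf("1024 MB at that rate: %.1f s\n", seconds_at_rate(1024.0, rate));
}
```

    If the disk cannot deliver data faster than this, a cleverer counting loop cannot bring an uncached read below roughly that time. Only faster storage can.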
