Count lines in large files

挽巷 2020-12-02 08:53

I commonly work with text files of ~20 Gb size and I find myself counting the number of lines in a given file very often.

The way I do it now is just cat fn
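For reference, pointing wc -l at the file directly does the count without the extra cat process (a minimal sketch; /tmp/fn_sample.txt is a stand-in for the real ~20 GB file):

```shell
# Small stand-in file for the large input described above.
printf 'line1\nline2\nline3\n' > /tmp/fn_sample.txt

# wc -l counts newline characters; redirecting with < keeps the
# filename out of the output, so only the number is printed.
wc -l < /tmp/fn_sample.txt
```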

13 Answers
  •  借酒劲吻你
    2020-12-02 09:28

    From my tests, I can verify that spark-shell (Scala-based) is considerably faster than the other tools (grep, sed, awk, perl, wc). Here are the results of the tests I ran on a file with 23,782,409 lines:

    time grep -c '$' my_file.txt

    real    0m44.96s
    user    0m41.59s
    sys     0m3.09s

    time wc -l my_file.txt

    real    0m37.57s
    user    0m33.48s
    sys     0m3.97s

    time sed -n '$=' my_file.txt

    real    0m38.22s
    user    0m28.05s
    sys     0m10.14s

    time perl -ne 'END { $_=$.;if(!/^[0-9]+$/){$_=0;};print "$_" }' my_file.txt

    real    0m23.38s
    user    0m20.19s
    sys     0m3.11s

    time awk 'END { print NR }' my_file.txt

    real    0m19.90s
    user    0m16.76s
    sys     0m3.12s

    spark-shell

    import org.joda.time._
    val t_start = DateTime.now()
    sc.textFile("file:///my_file.txt").count()
    val t_end = DateTime.now()
    new Period(t_start, t_end).toStandardSeconds()

    res1: org.joda.time.Seconds = PT15S
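The single-machine timings above can be re-run as one small harness (a sketch: it generates a 100,000-line sample rather than assuming the answer's 23,782,409-line file; swap in your own path to reproduce the numbers):

```shell
# Build a sample file; replace with your own large file for real timings.
FILE=/tmp/count_sample.txt
seq 1 100000 > "$FILE"

time wc -l "$FILE"                   # counts newline characters
time awk 'END { print NR }' "$FILE"  # NR holds the number of records read
time sed -n '$=' "$FILE"             # $= prints the final line number
time grep -c '' "$FILE"              # the empty pattern matches every line
```

All four commands should report the same count; the time(1) lines show where they differ.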
