How to count the total number of lines in a text file using Python

你的背包 2020-12-01 04:44

For example, if my text file is:

blue
green
yellow
black

Here there are four lines, and I want to get four as the result. How can I do this?

11 answers
  •  南笙 (original poster)
    2020-12-01 05:25

    I am not new to Stack Overflow; I just never had an account and usually came here for answers, so I can't comment or upvote an answer yet. I wanted to say that the code from Michael Bacon above works really well. I am new to Python but not to programming. I have been reading Python Crash Course and wanted a few small projects to break up reading it cover to cover. One utility that is useful from an ETL or data quality perspective is capturing the row count of a file independently of any ETL process. If the file has X rows and you import it into SQL or Hadoop, you should end up with X rows, so you can validate the row count of a raw data file at the lowest level.

    I have been testing his code, and it has been very efficient so far. I created several CSV files of various sizes and row counts. My code is below, and the comments give the timings and details. In these tests, the code Michael Bacon provided ran roughly 5 to 10 times faster than the usual Python approach of looping over the lines (10 times faster on the 475 MB file, about 5 times faster on the larger ones).

    Hope this helps someone.


    import time
    from itertools import takewhile, repeat
    
    def readfilesimple(myfile):
        # Baseline: iterate over the file in text mode and count each line
        linecounter = 0
        with open(myfile, 'r') as file_object:
            for _ in file_object:
                linecounter += 1
    
        return linecounter
    
    def readfileadvanced(myfile):
        # Read raw 1 MB chunks in binary mode and count the newline bytes.
        # Note: a final line with no trailing newline is not counted.
        with open(myfile, 'rb') as f:
            # takewhile stops at the first empty read, i.e. at end of file
            bufgen = takewhile(lambda x: x, (f.raw.read(1024 * 1024) for _ in repeat(None)))
            return sum(buf.count(b'\n') for buf in bufgen)
    
    
    # ************************************
    # Main
    # ************************************
    
    #start the clock
    
    start_time = time.time()
    
    # 6.7 seconds to read a 475MB file that has 24 million rows and 3 columns
    #mycount = readfilesimple("c:/junk/book1.csv")
    
    # 0.67 seconds to read a 475MB file that has 24 million rows and 3 columns
    #mycount = readfileadvanced("c:/junk/book1.csv")
    
    # 25.9 seconds to read a 3.9 GB file that has 3.25 million rows and 104 columns
    #mycount = readfilesimple("c:/junk/WideCsvExample/ReallyWideReallyBig1.csv")
    
    # 5.7 seconds to read a 3.9 GB file that has 3.25 million rows and 104 columns
    #mycount = readfileadvanced("c:/junk/WideCsvExample/ReallyWideReallyBig1.csv")
    
    
    # 292.92 seconds to read a 43 GB file that has 35.7 million rows and 104 columns
    mycount = readfilesimple("c:/junk/WideCsvExample/ReallyWideReallyBig.csv")
    
    # 57 seconds to read a 43 GB file that has 35.7 million rows and 104 columns
    #mycount = readfileadvanced("c:/junk/WideCsvExample/ReallyWideReallyBig.csv")
    
    
    #stop the clock
    elapsed_time = time.time() - start_time
    
    
    print("\nCode Execution: " + str(elapsed_time) + " seconds\n")
    print("File contains: " + str(mycount) + " lines of text.")
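
One caveat with counting newline bytes the way `readfileadvanced` does: a last line with no trailing newline is missed, so the four-colour file from the question would report 3 if it ends right after `black`. Below is a minimal sketch of a variant that also counts that final line, assuming `\n` or `\r\n` line endings. The name `count_lines` and the `chunk_size` parameter are illustrative and do not come from the answer above.

```python
import os
import tempfile


def count_lines(path, chunk_size=1024 * 1024):
    """Count lines by scanning fixed-size binary chunks for newline bytes.

    A final line that lacks a trailing newline is counted as well.
    """
    newlines = 0
    last_byte = b""
    with open(path, "rb") as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break  # end of file
            newlines += chunk.count(b"\n")
            last_byte = chunk[-1:]
    # Non-empty file whose last byte is not a newline: one more line.
    if last_byte and last_byte != b"\n":
        newlines += 1
    return newlines


# The example from the question, written without a trailing newline.
fd, example_path = tempfile.mkstemp(suffix=".txt")
with os.fdopen(fd, "wb") as tmp:
    tmp.write(b"blue\ngreen\nyellow\nblack")
try:
    print(count_lines(example_path))  # prints 4
finally:
    os.remove(example_path)
```

This uses an ordinary buffered `read` rather than `f.raw.read`, so it may not reproduce the timings quoted above; it is meant only to show the trailing-line handling.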
    
