Count how many records are in a CSV Python?

无人共我 2020-11-29 16:43

I'm using Python (Django framework) to read a CSV file. I pull just 2 lines out of this CSV, as you can see. What I have been trying to do is store in a variable the total n

16 answers
  •  余生分开走
    2020-11-29 17:16

    2018-10-29 EDIT

    Thank you for the comments.

    I tested several ways of counting the lines in a CSV file and compared their speed. The fastest method is below.

    with open(filename) as f:
        row_count = sum(1 for line in f)
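
The one-liner above counts every line, including a header line if the file has one. As a minimal sketch (not part of the original answer), it can be wrapped in a function that optionally excludes the header. The name `count_lines` and the `has_header` flag are hypothetical, chosen only for illustration, and the demo file is a throwaway created on the spot.

```python
import os
import tempfile


def count_lines(path, has_header=False):
    """Count the lines in a text file by streaming it once.

    has_header is a hypothetical convenience flag: when True, the first
    line is treated as a header and excluded from the count.
    """
    with open(path) as f:
        total = sum(1 for _ in f)
    if has_header and total > 0:
        return total - 1
    return total


# Small self-contained demo on a temporary file.
with tempfile.NamedTemporaryFile("w", suffix=".csv", delete=False, newline="") as tmp:
    tmp.write("id,value\n1,a\n2,b\n3,c\n")
    demo_path = tmp.name

line_total = count_lines(demo_path)                  # header + 3 data rows
data_rows = count_lines(demo_path, has_header=True)  # data rows only
os.remove(demo_path)
print(line_total, data_rows)
```

Because the generator yields one line at a time, the file is never held in memory as a whole, unlike `f.readlines()`.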
    

    Here is the code I tested.

    import timeit
    import csv
    import pandas as pd
    
    filename = './sample_submission.csv'
    
    def talktime(filename, funcname, func):
        """Print the mean time of 100 calls to func and the count it returns."""
        print(f"# {funcname}")
        t = timeit.timeit(f'{funcname}("{filename}")', setup=f'from __main__ import {funcname}', number=100) / 100
        print('Elapsed time : ', t)
        print('n = ', func(filename))
        print('\n')
    
    def sum1forline(filename):
        with open(filename) as f:
            return sum(1 for line in f)
    talktime(filename, 'sum1forline', sum1forline)
    
    def lenopenreadlines(filename):
        with open(filename) as f:
            return len(f.readlines())
    talktime(filename, 'lenopenreadlines', lenopenreadlines)
    
    def lenpd(filename):
        # read_csv treats the first line as the header, so add 1 to match the raw line count
        return len(pd.read_csv(filename)) + 1
    talktime(filename, 'lenpd', lenpd)
    
    def csvreaderfor(filename):
        cnt = 0
        with open(filename) as f:
            cr = csv.reader(f)
            for row in cr:
                cnt += 1
        return cnt
    talktime(filename, 'csvreaderfor', csvreaderfor)
    
    def openenum(filename):
        cnt = 0
        with open(filename) as f:
            for i, line in enumerate(f,1):
                cnt += 1
        return cnt
    talktime(filename, 'openenum', openenum)
    

    The results were as follows.

    # sum1forline
    Elapsed time :  0.6327946722068599
    n =  2528244
    
    
    # lenopenreadlines
    Elapsed time :  0.655304473598555
    n =  2528244
    
    
    # lenpd
    Elapsed time :  0.7561274056295324
    n =  2528244
    
    
    # csvreaderfor
    Elapsed time :  1.5571560935772661
    n =  2528244
    
    
    # openenum
    Elapsed time :  0.773000013928679
    n =  2528244
    

    In conclusion, sum(1 for line in f) is the fastest, although the difference from len(f.readlines()) (about 0.63 s versus 0.66 s per call) may not be significant.
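
One caveat when choosing between these methods: counting lines and counting CSV records only agree when no quoted field contains a line break. All five functions returned the same n for the test file, so it presumably had no such fields. The sketch below uses made-up sample data to show the two counts diverging; it is an illustration, not part of the benchmark.

```python
import csv
import io

# Three records: a header plus two data rows; one quoted field contains a newline.
sample = 'id,comment\n1,"first line\nsecond line"\n2,plain\n'

# Counting physical lines, as sum(1 for line in f) does.
physical_lines = sum(1 for _ in io.StringIO(sample))

# Counting parsed records; csv.reader joins the quoted multi-line field into one row.
records = sum(1 for _ in csv.reader(io.StringIO(sample)))

print(physical_lines, records)
```

Here the line count is 4 while the record count is 3. When reading a real file with `csv.reader`, the csv documentation recommends opening it with `newline=''` so that embedded line breaks are handled correctly.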

    sample_submission.csv is 30.2 MB and contains 31 million characters.
