How to get line count of a large file cheaply in Python?


Question


I need to get the line count of a large file (hundreds of thousands of lines) in Python. What is the most efficient way, both memory- and time-wise?

At the moment I do:

def file_len(fname):
    with open(fname) as f:
        for i, l in enumerate(f):
            pass
    return i + 1

Is it possible to do any better?


Answer 1:


You can't get any better than that.

After all, any solution will have to read the entire file, figure out how many \n you have, and return that result.

Do you have a better way of doing that without reading the entire file? Not sure... The best solution will always be I/O-bound; the best you can do is make sure you don't use unnecessary memory, and it looks like you have that covered.




Answer 2:


One line, probably pretty fast:

num_lines = sum(1 for line in open('myfile.txt'))
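If you prefer the file handle to be closed deterministically instead of being left to the garbage collector, the same idea works inside a with block (a small variant, not part of the original answer):

with open('myfile.txt') as f:
    num_lines = sum(1 for line in f)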



Answer 3:


I believe that a memory-mapped file will be the fastest solution. I tried four functions: the function posted by the OP (opcount); a simple iteration over the lines in the file (simplecount); readline with a memory-mapped file (mmap) (mapcount); and the buffer read solution offered by Mykola Kharechko (bufcount).

I ran each function five times, and calculated the average run-time for a 1.2 million-line text file.

Windows XP, Python 2.5, 2GB RAM, 2 GHz AMD processor

Here are my results:

mapcount : 0.465599966049
simplecount : 0.756399965286
bufcount : 0.546800041199
opcount : 0.718600034714

Edit: numbers for Python 2.6:

mapcount : 0.471799945831
simplecount : 0.634400033951
bufcount : 0.468800067902
opcount : 0.602999973297

So the buffer read strategy seems to be the fastest for Windows/Python 2.6

Here is the code:

from __future__ import with_statement
import time
import mmap
import random
from collections import defaultdict

def mapcount(filename):
    f = open(filename, "r+")
    buf = mmap.mmap(f.fileno(), 0)
    lines = 0
    readline = buf.readline
    while readline():
        lines += 1
    return lines

def simplecount(filename):
    lines = 0
    for line in open(filename):
        lines += 1
    return lines

def bufcount(filename):
    f = open(filename)                  
    lines = 0
    buf_size = 1024 * 1024
    read_f = f.read # loop optimization

    buf = read_f(buf_size)
    while buf:
        lines += buf.count('\n')
        buf = read_f(buf_size)

    return lines

def opcount(fname):
    with open(fname) as f:
        for i, l in enumerate(f):
            pass
    return i + 1


counts = defaultdict(list)

for i in range(5):
    for func in [mapcount, simplecount, bufcount, opcount]:
        start_time = time.time()
        assert func("big_file.txt") == 1209138
        counts[func].append(time.time() - start_time)

for key, vals in counts.items():
    print key.__name__, ":", sum(vals) / float(len(vals))
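Note that the benchmark code above is Python 2 (and mapcount opens the file with "r+", which requires write permission). A rough Python 3 port of the mmap-based counter, offered as a sketch rather than part of the original benchmark (the name mapcount_py3 is mine), might look like this:

import mmap

def mapcount_py3(filename):
    # Read-only mapping; count lines with readline on the mapped bytes.
    with open(filename, 'rb') as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            lines = 0
            while mm.readline():
                lines += 1
            return lines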



Answer 4:


I had to post this on a similar question until my reputation score jumped a bit (thanks to whoever bumped me!).

All of these solutions ignore one way to make this run considerably faster: using the unbuffered (raw) interface, using bytearrays, and doing your own buffering. (This only applies in Python 3. In Python 2, the raw interface may or may not be used by default, but in Python 3 you will be decoding to Unicode by default.)

Using a modified version of the timing tool, I believe the following code is faster (and marginally more pythonic) than any of the solutions offered:

def rawcount(filename):
    f = open(filename, 'rb')
    lines = 0
    buf_size = 1024 * 1024
    read_f = f.raw.read

    buf = read_f(buf_size)
    while buf:
        lines += buf.count(b'\n')
        buf = read_f(buf_size)

    return lines

Using a separate generator function, this runs a smidge faster:

def _make_gen(reader):
    b = reader(1024 * 1024)
    while b:
        yield b
        b = reader(1024*1024)

def rawgencount(filename):
    f = open(filename, 'rb')
    f_gen = _make_gen(f.raw.read)
    return sum( buf.count(b'\n') for buf in f_gen )

This can be done completely with generator expressions in-line using itertools, but it gets pretty weird looking:

from itertools import (takewhile,repeat)

def rawincount(filename):
    f = open(filename, 'rb')
    bufgen = takewhile(lambda x: x, (f.raw.read(1024*1024) for _ in repeat(None)))
    return sum( buf.count(b'\n') for buf in bufgen )

Here are my timings:

function      average, s  min, s   ratio
rawincount        0.0043  0.0041   1.00
rawgencount       0.0044  0.0042   1.01
rawcount          0.0048  0.0045   1.09
bufcount          0.008   0.0068   1.64
wccount           0.01    0.0097   2.35
itercount         0.014   0.014    3.41
opcount           0.02    0.02     4.83
kylecount         0.021   0.021    5.05
simplecount       0.022   0.022    5.25
mapcount          0.037   0.031    7.46
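On Python 3.8+, the same buffered raw-read loop can also be written with an assignment expression. A quick sketch, not part of the timings above (the name rawcount_walrus is mine):

def rawcount_walrus(filename):
    lines = 0
    with open(filename, 'rb') as f:
        # read raw 1 MiB chunks until EOF, counting newline bytes
        while chunk := f.raw.read(1024 * 1024):
            lines += chunk.count(b'\n')
    return lines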



Answer 5:


You could execute a subprocess and run wc -l filename

import subprocess

def file_len(fname):
    p = subprocess.Popen(['wc', '-l', fname], stdout=subprocess.PIPE, 
                                              stderr=subprocess.PIPE)
    result, err = p.communicate()
    if p.returncode != 0:
        raise IOError(err)
    return int(result.strip().split()[0])
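With the newer subprocess.run API (Python 3.7+ for capture_output=True), roughly the same helper can be sketched like this; the name file_len_run is mine:

import subprocess

def file_len_run(fname):
    # check=True raises CalledProcessError on a non-zero exit status
    proc = subprocess.run(['wc', '-l', fname], capture_output=True, check=True)
    return int(proc.stdout.split()[0])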



Answer 6:


Here is a Python program that uses the multiprocessing library to distribute the line counting across cores. In my test it improves counting a 20-million-line file from 26 seconds to 7 seconds on an 8-core Windows 64-bit server. Note: not using memory mapping makes everything much slower.

import multiprocessing, sys, time, os, mmap
import logging, logging.handlers

def init_logger(pid):
    console_format = 'P{0} %(levelname)s %(message)s'.format(pid)
    logger = logging.getLogger()  # New logger at root level
    logger.setLevel( logging.INFO )
    logger.handlers.append( logging.StreamHandler() )
    logger.handlers[0].setFormatter( logging.Formatter( console_format, '%d/%m/%y %H:%M:%S' ) )

def getFileLineCount( queues, pid, processes, file1 ):
    init_logger(pid)
    logging.info( 'start' )

    physical_file = open(file1, "r")
    #  mmap.mmap(fileno, length[, tagname[, access[, offset]]]

    m1 = mmap.mmap( physical_file.fileno(), 0, access=mmap.ACCESS_READ )

    #work out file size to divide up line counting

    fSize = os.stat(file1).st_size
    chunk = (fSize / processes) + 1

    lines = 0

    #get where I start and stop
    _seedStart = chunk * (pid)
    _seekEnd = chunk * (pid+1)
    seekStart = int(_seedStart)
    seekEnd = int(_seekEnd)

    if seekEnd < int(_seekEnd + 1):
        seekEnd += 1

    if _seedStart < int(seekStart + 1):
        seekStart += 1

    if seekEnd > fSize:
        seekEnd = fSize

    #find where to start
    if pid > 0:
        m1.seek( seekStart )
        #read next line
        l1 = m1.readline()  # need to use readline with memory mapped files
        seekStart = m1.tell()

    #tell previous rank my seek start to make their seek end

    if pid > 0:
        queues[pid-1].put( seekStart )
    if pid < processes-1:
        seekEnd = queues[pid].get()

    m1.seek( seekStart )
    l1 = m1.readline()

    while len(l1) > 0:
        lines += 1
        l1 = m1.readline()
        if m1.tell() > seekEnd or len(l1) == 0:
            break

    logging.info( 'done' )
    # add up the results
    if pid == 0:
        for p in range(1,processes):
            lines += queues[0].get()
        queues[0].put(lines) # the total lines counted
    else:
        queues[0].put(lines)

    m1.close()
    physical_file.close()

if __name__ == '__main__':
    init_logger( 'main' )
    if len(sys.argv) > 1:
        file_name = sys.argv[1]
    else:
        logging.fatal( 'parameters required: file-name [processes]' )
        exit()

    t = time.time()
    processes = multiprocessing.cpu_count()
    if len(sys.argv) > 2:
        processes = int(sys.argv[2])
    queues=[] # a queue for each process
    for pid in range(processes):
        queues.append( multiprocessing.Queue() )
    jobs=[]
    prev_pipe = 0
    for pid in range(processes):
        p = multiprocessing.Process( target = getFileLineCount, args=(queues, pid, processes, file_name,) )
        p.start()
        jobs.append(p)

    jobs[0].join() #wait for counting to finish
    lines = queues[0].get()

    logging.info( 'finished {} Lines:{}'.format( time.time() - t, lines ) )
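For comparison, the same idea can be sketched much more compactly with multiprocessing.Pool: split the file into non-overlapping byte ranges and count the b'\n' bytes in each range in parallel. The ranges do not need to be aligned to line boundaries, since every newline byte falls into exactly one range. This is my own simplification, not the original author's code; the names parallel_line_count and _count_range are made up, and the count follows wc -l semantics (a final line without a trailing newline is not counted):

import os
from multiprocessing import Pool

def _count_range(args):
    path, start, size = args
    lines = 0
    with open(path, 'rb') as f:
        f.seek(start)
        remaining = size
        while remaining > 0:
            buf = f.read(min(1024 * 1024, remaining))
            if not buf:  # past the end of the file
                break
            lines += buf.count(b'\n')
            remaining -= len(buf)
    return lines

def parallel_line_count(path, workers=None):
    workers = workers or os.cpu_count() or 1
    total = os.path.getsize(path)
    chunk = total // workers + 1
    ranges = [(path, i * chunk, chunk) for i in range(workers)]
    with Pool(workers) as pool:
        return sum(pool.map(_count_range, ranges))

if __name__ == '__main__':
    import sys
    print(parallel_line_count(sys.argv[1]))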



Answer 7:


I would use Python's file object method readlines, as follows:

with open(input_file) as foo:
    lines = len(foo.readlines())

This opens the file, creates a list of the lines in the file, takes the length of that list, saves it to a variable, and closes the file again. Keep in mind that readlines loads the whole file into memory, so this is simple but not memory-cheap for very large files.




Answer 8:


def file_len(full_path):
  """ Count number of lines in a file."""
  f = open(full_path)
  nr_of_lines = sum(1 for line in f)
  f.close()
  return nr_of_lines



Answer 9:


Here is what I use, seems pretty clean:

import subprocess

def count_file_lines(file_path):
    """
    Counts the number of lines in a file using wc utility.
    :param file_path: path to file
    :return: int, no of lines
    """
    num = subprocess.check_output(['wc', '-l', file_path])
    num = num.split()  # split on whitespace; wc may pad the count with spaces
    return int(num[0])

UPDATE: This is marginally faster than using pure Python, but at the cost of memory usage: subprocess will fork a new process with the same memory footprint as the parent process while it executes your command.




Answer 10:


I got a small (4-8%) improvement with this version which re-uses a constant buffer so it should avoid any memory or GC overhead:

lines = 0
buffer = bytearray(2048)
with open(filename, 'rb') as f:  # binary mode, so readinto() is available
    n = f.readinto(buffer)
    while n > 0:
        lines += buffer.count(b'\n', 0, n)  # only count the bytes actually read
        n = f.readinto(buffer)

You can play around with the buffer size and maybe see a little improvement.




Answer 11:


Kyle's answer

num_lines = sum(1 for line in open('my_file.txt'))

is probably the best; an alternative is

num_lines =  len(open('my_file.txt').read().splitlines())

Here is a comparison of the performance of both:

In [20]: timeit sum(1 for line in open('Charts.ipynb'))
100000 loops, best of 3: 9.79 µs per loop

In [21]: timeit len(open('Charts.ipynb').read().splitlines())
100000 loops, best of 3: 12 µs per loop



Answer 12:


A one-line bash solution similar to this answer, using the modern subprocess.check_output function:

import subprocess

def line_count(file):
    return int(subprocess.check_output('wc -l {}'.format(file), shell=True).split()[0])



Answer 13:


One-line solution:

import os
os.system("wc -l filename")

Note that os.system prints the count to stdout and returns the shell's exit status, not the number itself; to capture the value in Python, use one of the subprocess-based answers instead.

My snippet:

>>> os.system('wc -l *.txt')

0 bar.txt
1000 command.txt
3 test_file.txt
1003 total



Answer 14:


This is the fastest thing I have found using pure Python. You can use whatever amount of memory you want by setting buffer, though 2**16 appears to be a sweet spot on my computer.

from functools import partial

buffer = 2**16
with open(myfile) as f:
    print sum(x.count('\n') for x in iter(partial(f.read, buffer), ''))

I found the answer here: Why is reading lines from stdin much slower in C++ than Python? and tweaked it just a tiny bit. It's a very good read for understanding how to count lines quickly, though wc -l is still about 75% faster than anything else.
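For anyone on Python 3, where print is a function and raw byte counting is easiest in binary mode, a rough port of the same iter(partial(...), sentinel) trick could look like this (the name count_lines is mine):

from functools import partial

def count_lines(path, buffer=2**16):
    # iter() calls f.read(buffer) until it returns the sentinel b'' at EOF
    with open(path, 'rb') as f:
        return sum(chunk.count(b'\n') for chunk in iter(partial(f.read, buffer), b''))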




Answer 15:


Just to complete the above methods, I tried a variant with the fileinput module:

import fileinput as fi

def filecount(fname):
    for line in fi.input(fname):
        pass
    return fi.lineno()

Then I passed a 60-million-line file to all the methods stated above:

mapcount : 6.1331050396
simplecount : 4.588793993
opcount : 4.42918205261
filecount : 43.2780818939
bufcount : 0.170812129974

It's a bit of a surprise to me that fileinput performs that badly and scales far worse than all the other methods...




Answer 16:


This code is shorter and clearer. It's probably the best way:

num_lines = open('yourfile.ext').read().count('\n')



Answer 17:


For me, this variant will be the fastest:

#!/usr/bin/env python

def main():
    f = open('filename')                  
    lines = 0
    buf_size = 1024 * 1024
    read_f = f.read # loop optimization

    buf = read_f(buf_size)
    while buf:
        lines += buf.count('\n')
        buf = read_f(buf_size)

    print lines

if __name__ == '__main__':
    main()

Reasons: buffered reading is faster than reading line by line, and string.count is also very fast.




Answer 18:


I have modified the buffer case like this:

def CountLines(filename):
    f = open(filename)
    try:
        lines = 0
        buf_size = 1024 * 1024
        read_f = f.read  # loop optimization
        buf = read_f(buf_size)

        # Empty file
        if not buf:
            return 0

        while buf:
            lines += buf.count('\n')
            last_buf = buf
            buf = read_f(buf_size)

        # Count a final line that does not end with a newline
        if not last_buf.endswith('\n'):
            lines += 1

        return lines
    finally:
        f.close()

Now also empty files and the last line (without \n) are counted.




Answer 19:


print open('file.txt', 'r').read().count("\n") + 1



Answer 20:


Simple method:

1)

>>> f = len(open("myfile.txt").readlines())
>>> f
430

2)

>>> f = open("myfile.txt").read().count('\n')
>>> f
430
>>>

3)

num_lines = len(list(open('myfile.txt')))



Answer 21:


What about this?

import itertools

def file_len(fname):
    counts = itertools.count()
    with open(fname) as f:
        for _ in f:
            counts.next()  # Python 2; on Python 3 use next(counts)
    return counts.next()



Answer 22:


count = max(enumerate(open(filename)))[0] + 1




Answer 23:


def line_count(path):
    count = 0
    with open(path) as lines:
        for count, l in enumerate(lines, start=1):
            pass
    return count



Answer 24:


If you want to get the line count cheaply in Python on Linux, I recommend this method:

import os
print os.popen("wc -l file_path").readline().split()[0]

file_path can be either an absolute or a relative path. Hope this helps.




Answer 25:


How about this?

import fileinput
import sys

counter = 0
for line in fileinput.input([sys.argv[1]]):
    counter += 1

fileinput.close()
print counter



Answer 26:


How about this one-liner:

file_length = len(open('myfile.txt','r').read().split('\n'))

It takes 0.003 seconds on a 3900-line file, timed with this method:

def c():
  import time
  s = time.time()
  file_length = len(open('myfile.txt','r').read().split('\n'))
  print time.time() - s



Answer 27:


The result of opening a file is an iterator, which can be converted to a sequence, which has a length:

with open(filename) as f:
   return len(list(f))

This is more concise than your explicit loop, and it avoids the enumerate.




Answer 28:


You can use the subprocess module in the following way:

import subprocess

Number_lines = int(subprocess.Popen('wc -l {0}'.format(Filename), shell=True,
                                    stdout=subprocess.PIPE).stdout.readlines()[0].split()[0])

where Filename is the absolute path of the file.




Answer 29:


Another possibility:

import subprocess

def num_lines_in_file(fpath):
    return int(subprocess.check_output('wc -l %s' % fpath, shell=True).strip().split()[0])



Answer 30:


def count_text_file_lines(path):
    with open(path, 'rt') as file:
        line_count = sum(1 for _line in file)
    return line_count


Source: https://stackoverflow.com/questions/845058/how-to-get-line-count-of-a-large-file-cheaply-in-python
