How do you split reading a large csv file into evenly-sized chunks in Python?

miku

Just make your reader subscriptable by wrapping it in a list. Obviously this will break on really large files (see the alternatives in the Updates below):

>>> import csv
>>> reader = csv.reader(open('big.csv', newline=''))
>>> lines = list(reader)
>>> print(lines[:100])
...

Further reading: How do you split a list into evenly sized chunks in Python?
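For reference, the usual slicing recipe from that question looks like this (a minimal sketch; `lines` is the list built above, and `chunks` is just a name made up here):

def chunks(lst, n):
    """Split lst into successive n-sized slices (the last may be shorter)."""
    return [lst[i:i + n] for i in range(0, len(lst), n)]

for chunk in chunks(lines, 100):
    ...  # process 100 rows at a time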


Update 1 (list version): Another possibility would be to process each chunk as it arrives while iterating over the lines:

#!/usr/bin/env python

import csv
reader = csv.reader(open('4956984.csv', newline=''))

chunk, chunksize = [], 100

def process_chunk(chunk):
    print(len(chunk))
    # do something useful ...

for i, line in enumerate(reader):
    if i % chunksize == 0 and i > 0:
        process_chunk(chunk)
        del chunk[:]
    chunk.append(line)

# process the remainder
process_chunk(chunk)

Update 2 (generator version): I haven't benchmarked it, but maybe you can increase performance by using a chunk generator:

#!/usr/bin/env python

import csv
reader = csv.reader(open('4956984.csv', 'rb'))

def gen_chunks(reader, chunksize=100):
    """
    Chunk generator. Take a CSV `reader` and yield
    `chunksize` sized slices.
    """
    chunk = []
    for i, line in enumerate(reader):
        if i % chunksize == 0 and i > 0:
            yield chunk
            chunk = []  # start a fresh list; del chunk[:] would clobber the chunk just yielded
        chunk.append(line)
    if chunk:  # skip an empty trailing chunk
        yield chunk

for chunk in gen_chunks(reader):
    print(chunk)  # process chunk

# test gen_chunks on some dummy sequence:
for chunk in gen_chunks(range(10), chunksize=3):
    print(chunk)  # process chunk

# => yields
# [0, 1, 2]
# [3, 4, 5]
# [6, 7, 8]
# [9]
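An alternative, untested sketch: itertools.islice pulls fixed-size slices straight off the iterator, which sidesteps the index bookkeeping (and the list-reuse pitfall) entirely:

import csv
from itertools import islice

def gen_chunks(reader, chunksize=100):
    """Yield lists of up to chunksize rows until the reader is exhausted."""
    it = iter(reader)  # make sure we consume a single shared iterator
    while True:
        chunk = list(islice(it, chunksize))
        if not chunk:
            return
        yield chunk

with open('4956984.csv', newline='') as f:
    for chunk in gen_chunks(csv.reader(f)):
        print(len(chunk))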

There isn't a good way to do this for all .csv files. You should be able to divide the file into chunks using file.seek to skip a section of the file, then scan one byte at a time to find the end of the row. Then you can process the two chunks independently. Something like the following (untested) code should get you started.

import csv

file_one = open('foo.csv')
file_two = open('foo.csv')
file_two.seek(0, 2)      # seek to the end of the file
sz = file_two.tell()     # fetch the offset
file_two.seek(sz // 2)   # seek back to the middle (integer division for a byte offset)
c = ''                   # don't shadow the built-in chr()
while c != '\n':
    c = file_two.read(1)
    if not c:            # hit EOF before a newline: nothing left for the second segment
        break
# file_two is now positioned at the start of a record
segment_one = csv.reader(file_one)
segment_two = csv.reader(file_two)

I'm not sure how you can tell that you have finished traversing segment_one. If you have a column in the CSV that is a row id, then you can stop processing segment_one when you encounter the row id from the first row in segment_two.
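A rough, untested sketch of that idea without needing a row id: compute the split offset up front and make each segment stop once it has consumed its share of bytes (read_segment is a name made up for this example; it assumes no field contains an embedded newline, i.e. no quoted multi-line values):

import csv

def read_segment(path, start, stop=None):
    """Yield CSV rows whose raw lines begin in [start, stop)."""
    def lines():
        with open(path, 'rb') as f:
            f.seek(start)
            pos = start
            while stop is None or pos < stop:
                line = f.readline()
                if not line:
                    break
                pos += len(line)
                yield line.decode()
    return csv.reader(lines())

# find a row boundary near the middle of the file
with open('foo.csv', 'rb') as f:
    f.seek(0, 2)             # jump to the end to measure the size
    f.seek(f.tell() // 2)    # back to the middle
    f.readline()             # skip the rest of the current row
    offset = f.tell()

for row in read_segment('foo.csv', 0, offset):
    pass                     # process the first half
for row in read_segment('foo.csv', offset):
    pass                     # process the second half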

We can use the pandas module to handle these big CSV files.

import pandas as pd

# read the file in 1000-row pieces, then stitch them back into a single DataFrame
temp = pd.read_csv('BIG_File.csv', iterator=True, chunksize=1000)
df = pd.concat(temp, ignore_index=True)
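Note that pd.concat pulls everything back into memory at once; if memory is the problem you are trying to avoid, you can instead process each chunk as it streams in (untested sketch; process is a placeholder for your own logic):

import pandas as pd

for chunk in pd.read_csv('BIG_File.csv', chunksize=1000):
    process(chunk)  # placeholder for whatever you do with 1000 rows at a time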