How do you split reading a large csv file into evenly-sized chunks in Python?

囚心锁ツ 2020-12-01 03:16

In a basic script I have the following process.

import csv
reader = csv.reader(open('huge_file.csv', 'rb'))

for line in reader:
    process_line(line)
3 Answers
  • 2020-12-01 03:29

    There isn't a good way to do this for all .csv files. You should be able to divide the file into chunks using file.seek to skip over a section of the file. Then you have to scan one byte at a time to find the end of the row, and then you can process the two chunks independently. Something like the following (untested) code should get you started.

    import csv

    file_one = open('foo.csv')
    file_two = open('foo.csv')
    file_two.seek(0, 2)      # seek to the end of the file
    sz = file_two.tell()     # fetch the total size in bytes
    file_two.seek(sz // 2)   # seek back to (roughly) the middle
    ch = ''
    while ch != '\n':        # scan forward to the end of the current row
        ch = file_two.read(1)
    # file_two is now positioned at the start of a record
    segment_one = csv.reader(file_one)
    segment_two = csv.reader(file_two)
    

    I'm not sure how you can tell that you have finished traversing segment_one. If you have a column in the CSV that is a row id, then you can stop processing segment_one when you encounter the row id from the first row in segment_two.
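
    A more general version of the same idea, as an untested sketch in Python 3 syntax (chunk_offsets, rows_in_chunk and the n_chunks parameter are made-up names, and like the code above it assumes no fields contain embedded newlines): pick approximate byte offsets, advance each one to the next line boundary, then parse each byte range independently.

    import csv
    import io

    def chunk_offsets(path, n_chunks):
        """Byte offsets that split `path` into n_chunks pieces,
        each piece starting at the beginning of a line."""
        with open(path, 'rb') as f:
            f.seek(0, 2)                      # seek to the end of the file
            size = f.tell()
            offsets = [0]
            for i in range(1, n_chunks):
                f.seek(i * size // n_chunks)  # jump to an approximate boundary
                f.readline()                  # discard the partial row
                offsets.append(f.tell())      # the next row starts here
            offsets.append(size)
        return offsets

    def rows_in_chunk(path, start, end):
        """Yield parsed CSV rows for the byte range [start, end)."""
        with open(path, 'rb') as f:
            f.seek(start)
            blob = f.read(end - start).decode('utf-8')
        yield from csv.reader(io.StringIO(blob))

    # process each piece independently, e.g. the two halves:
    offsets = chunk_offsets('foo.csv', 2)
    for start, end in zip(offsets, offsets[1:]):
        for row in rows_in_chunk('foo.csv', start, end):
            pass  # process the row

    This also sidesteps the question of where segment_one ends, since every piece has an explicit end offset.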

  • 2020-12-01 03:32

    Just make your reader subscriptable by wrapping it into a list. Obviously this will break on really large files (see alternatives in the Updates below):

    >>> reader = csv.reader(open('big.csv', 'rb'))
    >>> lines = list(reader)
    >>> print lines[:100]
    ...
    

    Further reading: How do you split a list into evenly sized chunks in Python?
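
    If the file is too large to wrap in a list, a lazier variant of the same idea (a sketch in Python 3 syntax; read_in_chunks is a made-up name) is to pull fixed-size chunks from the reader with itertools.islice:

    import csv
    from itertools import islice

    def read_in_chunks(path, chunksize=100):
        """Yield lists of up to `chunksize` parsed rows without
        loading the whole file into memory."""
        with open(path, newline='') as f:
            reader = csv.reader(f)
            while True:
                chunk = list(islice(reader, chunksize))
                if not chunk:
                    break
                yield chunk

    for chunk in read_in_chunks('big.csv'):
        print(len(chunk))  # process chunk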


    Update 1 (list version): Another possibility is to just process each chunk as it arrives while iterating over the lines:

    #!/usr/bin/env python
    
    import csv
    reader = csv.reader(open('4956984.csv', 'rb'))
    
    chunk, chunksize = [], 100
    
    def process_chunk(chunk):
        print len(chunk)
        # do something useful ...
    
    for i, line in enumerate(reader):
        if (i % chunksize == 0 and i > 0):
            process_chunk(chunk)
            del chunk[:]  # or: chunk = []
        chunk.append(line)
    
    # process the remainder
    process_chunk(chunk)
    

    Update 2 (generator version): I haven't benchmarked it, but maybe you can increase performance by using a chunk generator:

    #!/usr/bin/env python
    
    import csv
    reader = csv.reader(open('4956984.csv', 'rb'))
    
    def gen_chunks(reader, chunksize=100):
        """ 
        Chunk generator. Take a CSV `reader` and yield
        `chunksize` sized slices. 
        """
        chunk = []
        for i, line in enumerate(reader):
            if (i % chunksize == 0 and i > 0):
                yield chunk
                del chunk[:]  # or: chunk = []
            chunk.append(line)
        yield chunk
    
    for chunk in gen_chunks(reader):
        print chunk # process chunk
    
    # test gen_chunk on some dummy sequence:
    for chunk in gen_chunks(range(10), chunksize=3):
        print chunk # process chunk
    
    # => yields
    # [0, 1, 2]
    # [3, 4, 5]
    # [6, 7, 8]
    # [9]
    

    There is a minor gotcha, as @totalhack points out:

    Be aware that this yields the same object over and over with different contents. This works fine if you plan on doing everything you need to with the chunk between each iteration.
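
    If chunks need to outlive the loop body (for example, when collecting them or handing them to worker threads), a small variation of the generator above avoids the shared-object issue by rebinding a fresh list instead of clearing it in place (a sketch, otherwise the same logic as Update 2):

    def gen_chunks(reader, chunksize=100):
        """Yield `chunksize`-row lists; every yielded chunk is a new
        list, so callers can safely keep references to old chunks."""
        chunk = []
        for i, line in enumerate(reader):
            if i % chunksize == 0 and i > 0:
                yield chunk
                chunk = []   # rebind instead of del chunk[:]
            chunk.append(line)
        if chunk:            # skip an empty trailing chunk
            yield chunk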

  • 2020-12-01 03:34

    We can use the pandas module to handle these big CSV files:

    import pandas as pd

    # read the file in 1000-row chunks, then stitch them back into one DataFrame
    temp = pd.read_csv('BIG_File.csv', iterator=True, chunksize=1000)
    df = pd.concat(temp, ignore_index=True)
    
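    Note that pd.concat still builds the whole file into one DataFrame in memory. If the goal is to work chunk by chunk instead, the chunksize iterator can be consumed directly (a minimal sketch; process_chunk stands in for whatever per-chunk work is needed):

    import pandas as pd

    # stream the file as a sequence of 1000-row DataFrames
    for chunk in pd.read_csv('BIG_File.csv', chunksize=1000):
        process_chunk(chunk)  # placeholder for the real per-chunk logic
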