Splitting a CSV file into equal parts?

面向向阳花 2020-12-11 11:00

I have a large CSV file that I would like to split into a number of parts equal to the number of CPU cores in the system. I then want to use multiprocessing to have all the cores work on their own part simultaneously.

1 Answer
  • 2020-12-11 11:26

    As I said in a comment, CSV files need to be split on row (or line) boundaries. Your code doesn't do this and potentially breaks a row somewhere in the middle, which I suspect is the cause of your _csv.Error.
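    To illustrate the problem (a standalone sketch, not from the original answer), cutting at an arbitrary byte offset usually lands in the middle of a record, leaving a fragment of the same row at the end of one chunk and the start of the next:

```python
data = b"name,value\nalpha,1\nbeta,2\ngamma,3\n"

# Naive split at the byte midpoint, ignoring line boundaries.
midpoint = len(data) // 2
chunk1, chunk2 = data[:midpoint], data[midpoint:]

# The cut lands inside a record: chunk1's last line and chunk2's first
# line are two halves of the same row, so neither chunk parses cleanly.
print(chunk1.splitlines()[-1])  # incomplete tail of a row
print(chunk2.splitlines()[0])   # the rest of that row
```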

    The following avoids doing that by processing the input file as a series of lines. I've tested it, and it works standalone in the sense that it divided the sample file up into approximately equal-size chunks; the sizes are only approximate because it's unlikely that a whole number of rows will fit exactly into a chunk.

    Update

    This is a substantially faster version of the code than I originally posted. The improvement comes from using the temp file's own tell() method to determine the constantly changing length of the file as it's being written, instead of calling os.path.getsize(), which eliminates the need to flush() the file and call os.fsync() on it after each row is written.
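    The difference can be seen in a small sketch of my own (assumes a writable temp directory): tell() reflects buffered writes immediately, while os.path.getsize() only reports what has reached the operating system.

```python
import os
import tempfile

tf = tempfile.NamedTemporaryFile(delete=False)
tf.write(b"x" * 1000)

# tell() reflects every buffered write immediately, with no extra syscalls.
buffered_size = tf.tell()

# os.path.getsize() only sees what has reached the OS, which is why the
# original version had to flush()/fsync() before each size check.
tf.flush()
os.fsync(tf.fileno())
on_disk_size = os.path.getsize(tf.name)

print(buffered_size, on_disk_size)
tf.close()
os.unlink(tf.name)
```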

    import csv
    import multiprocessing
    import os
    import tempfile
    
    def split(infilename, num_chunks=multiprocessing.cpu_count()):
        READ_BUFFER = 2**13
        in_file_size = os.path.getsize(infilename)
        print('in_file_size:', in_file_size)
        chunk_size = in_file_size // num_chunks
        print('target chunk_size:', chunk_size)
        files = []
        with open(infilename, 'rb', READ_BUFFER) as infile:
            for _ in range(num_chunks):
                temp_file = tempfile.TemporaryFile()
                while temp_file.tell() < chunk_size:
                    try:
                        temp_file.write(next(infile))  # copy one whole line
                    except StopIteration:  # end of infile
                        break
                temp_file.seek(0)  # rewind
                files.append(temp_file)
        return files
    
    files = split("sample_simple.csv", num_chunks=4)
    print('number of files created: {}'.format(len(files)))
    
    for i, ifile in enumerate(files, start=1):
        ifile.seek(0, os.SEEK_END)  # temp files can be anonymous, so use
        size = ifile.tell()         # seek/tell instead of os.path.getsize()
        ifile.seek(0)
        print('size of temp file {}: {}'.format(i, size))
        print('contents of file {}:'.format(i))
        reader = csv.reader(line.decode() for line in ifile)
        for row in reader:
            print(row)
        print()
    