I have a large CSV file that I would like to split into a number that is equal to the number of CPU cores in the system. I want to then use multiprocess to have all the core
As I said in a comment, csv files would need to be split on row (or line) boundaries. Your code doesn't do this and potentially breaks them up somewhere in the middle of one — which I suspect is the cause of your _csv.Error
.
The following avoids doing that by processing the input file as a series of lines. I've tested it and it seems to work standalone in the sense that it divided the sample file up into approximately equally size chunks because it's unlikely that an whole number of rows will fit exactly into a chunk.
Update
This it is a substantially faster version of the code than I originally posted. The improvement is because it now uses the temp file's own tell()
method to determine the constantly changing length of the file as it's being written instead of calling os.path.getsize()
, which eliminated the need to flush()
the file and call os.fsync()
on it after each row is written.
import csv
import multiprocessing
import os
import tempfile
def split(infilename, num_chunks=multiprocessing.cpu_count()):
READ_BUFFER = 2**13
in_file_size = os.path.getsize(infilename)
print 'in_file_size:', in_file_size
chunk_size = in_file_size // num_chunks
print 'target chunk_size:', chunk_size
files = []
with open(infilename, 'rb', READ_BUFFER) as infile:
for _ in xrange(num_chunks):
temp_file = tempfile.TemporaryFile()
while temp_file.tell() < chunk_size:
try:
temp_file.write(infile.next())
except StopIteration: # end of infile
break
temp_file.seek(0) # rewind
files.append(temp_file)
return files
files = split("sample_simple.csv", num_chunks=4)
print 'number of files created: {}'.format(len(files))
for i, ifile in enumerate(files, start=1):
print 'size of temp file {}: {}'.format(i, os.path.getsize(ifile.name))
print 'contents of file {}:'.format(i)
reader = csv.reader(ifile)
for row in reader:
print row
print ''