I have a 7GB csv
file which I\'d like to split into smaller chunks, so it is readable and faster for analysis in Python on a notebook. I would like to grab a sm
You don't need Python to split a csv file. Using your shell:
$ split -l 100 data.csv
Would split data.csv
in chunks of 100 lines.
I had to do a similar task, and used the pandas package:
for i,chunk in enumerate(pd.read_csv('bigfile.csv', chunksize=500000)):
chunk.to_csv('chunk{}.csv'.format(i), index=False)
See the Python docs on file
objects (the object returned by open(filename)
- you can choose to read
a specified number of bytes, or use readline
to work through one line at a time.
I agree with @jonrsharpe readline should be able to read one line at a time even for big files.
If you are dealing with big csv files might I suggest using pandas.read_csv. I often use it for the same purpose and always find it awesome (and fast). Takes a bit of time to get used to idea of DataFrames. But once you get over that it speeds up large operations like yours massively.
Hope it helps.
Maybe something like this?
#!/usr/local/cpython-3.3/bin/python
import csv
divisor = 10
outfileno = 1
outfile = None
with open('big.csv', 'r') as infile:
for index, row in enumerate(csv.reader(infile)):
if index % divisor == 0:
if outfile is not None:
outfile.close()
outfilename = 'big-{}.csv'.format(outfileno)
outfile = open(outfilename, 'w')
outfileno += 1
writer = csv.writer(outfile)
writer.writerow(row)