I'm using Python 2.6 on a Mac Mini with 1GB RAM. I want to read in a huge text file
$ ls -l links.csv; file links.csv; tail links.csv
-rw-r--r-- 1 user u
You might want to look at mmap:
http://docs.python.org/library/mmap.html
It'll let you treat the file like a big array/string and will get the OS to handle shuffling data into and out of memory to let it fit.
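To make the "big array/string" idea concrete, here is a minimal sketch. It assumes a small throwaway file created with `tempfile.mkstemp` (a stand-in for your real data file, not anything from the question), maps it, and then slices and overwrites it much like a string:

```python
import mmap
import os
import tempfile

# Hypothetical demo file; in practice this would be your own data file.
fd, path = tempfile.mkstemp()
os.write(fd, b"hello,world\n" * 3)
os.close(fd)

f = open(path, "r+b")
mm = mmap.mmap(f.fileno(), 0)   # a length of 0 maps the whole file
first_line = mm[0:11]           # slicing works like it does on a string
size = len(mm)                  # len() gives the mapped size in bytes
mm[0:5] = b"HELLO"              # slice assignment writes through to the file
mm.flush()
mm.close()
f.close()

# Read the file back normally to confirm the change reached it.
f = open(path, "rb")
contents = f.read()
f.close()
os.remove(path)
```

Only the pages you actually touch need to be resident in memory, which is why this can cope with files bigger than RAM.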
So you could read the CSV file one line at a time, write the results out to an mmap'd file (in a suitable binary format), and then work on the mmap'd file. Since the mmap'd file is only temporary, you could just create a temp file for this purpose.
Here's some code that demos using mmap with a tempfile to read in CSV data and store it as pairs of integers:
import sys
import mmap
import array
from tempfile import TemporaryFile

def write_int(buffer, i):
    # convert i to 4 bytes and write into buffer
    buffer.write(array.array('i', [i]).tostring())

def read_int(buffer, pos):
    # get the 4 bytes at pos and convert to integer
    offset = 4*pos
    return array.array('i', buffer[offset:offset+4])[0]

def get_edge(edges, lineno):
    pos = lineno*2
    i, j = read_int(edges, pos), read_int(edges, pos+1)
    return i, j

infile = open("links.csv", "r")

count = 0
# count the total number of lines in the file
for line in infile:
    count = count + 1

total = count
print "Total number of lines: ", total

infile.seek(0)

# make mmap'd file that's long enough to contain all data
# assuming two integers (4 bytes each) per line
tmp = TemporaryFile()
file_len = 2*4*count
# increase tmp file size
tmp.seek(file_len-1)
tmp.write(' ')
tmp.seek(0)
edges = mmap.mmap(tmp.fileno(), file_len)

for line in infile:
    i, j = tuple(map(int, line.strip().split(",")))
    write_int(edges, i)
    write_int(edges, j)

# now confirm we can read the ints back out ok
for i in xrange(count):
    print get_edge(edges, i)
It's a bit rough, though. Really you'd probably want to wrap all of that up in a nice class, so that your edges can be accessed in a way that makes them behave like a list (with indexing, len, etc.). Hopefully, though, it gives you a starting point.
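As a rough idea of what such a wrapper could look like, here is a minimal sketch. The class name `EdgeList` and its methods are my own invention for illustration, and it swaps the `array` calls for the `struct` module with an explicit 4-byte little-endian layout. Treat it as one possible shape, not a finished implementation (there's no slicing support, for instance):

```python
import mmap
import struct
from tempfile import TemporaryFile

class EdgeList(object):
    """Hypothetical fixed-size, list-like store of (i, j) int pairs in an mmap'd temp file."""

    RECORD = struct.Struct("<ii")  # two 4-byte little-endian signed ints

    def __init__(self, count):
        if count <= 0:
            raise ValueError("count must be positive")  # an empty file can't be mapped
        self._count = count
        size = count * self.RECORD.size
        self._tmp = TemporaryFile()
        # grow the temp file to its final size before mapping it
        self._tmp.seek(size - 1)
        self._tmp.write(b"\0")
        self._tmp.flush()
        self._map = mmap.mmap(self._tmp.fileno(), size)

    def __len__(self):
        return self._count

    def _offset(self, index):
        # translate an edge index (negative allowed) into a byte offset
        if index < 0:
            index += self._count
        if not 0 <= index < self._count:
            raise IndexError("edge index out of range")
        return index * self.RECORD.size

    def __getitem__(self, index):
        off = self._offset(index)
        return self.RECORD.unpack(self._map[off:off + self.RECORD.size])

    def __setitem__(self, index, edge):
        off = self._offset(index)
        self._map[off:off + self.RECORD.size] = self.RECORD.pack(*edge)

    def close(self):
        self._map.close()
        self._tmp.close()

# Usage with a few in-memory lines standing in for links.csv
edges = EdgeList(3)
for n, line in enumerate(["1,2", "3,4", "5,6"]):
    edges[n] = tuple(int(x) for x in line.strip().split(","))
```

Because `__getitem__` raises `IndexError` past the end, plain `for edge in edges` iteration works too. You'd still need to count the lines first (as in the code above) so you know what `count` to pass in.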