Python: How to read huge text file into memory

自闭症患者 2020-11-29 21:36

I'm using Python 2.6 on a Mac Mini with 1GB RAM. I want to read in a huge text file:

$ ls -l links.csv; file links.csv; tail links.csv 
-rw-r--r--  1 user  u         


        
6 answers
  •  抹茶落季
    2020-11-29 22:30

    You might want to look at mmap:

    http://docs.python.org/library/mmap.html

    It'll let you treat the file like a big array/string and will get the OS to handle shuffling data into and out of memory to let it fit.
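
As a rough illustration of the "big array/string" idea, here is a minimal sketch. It assumes Python 3 (the question itself uses 2.6), and the sample file name and contents are made up so the snippet runs on its own:

```python
import mmap
import os
import tempfile

# Create a tiny sample file so the sketch is self-contained.
# The name and contents are hypothetical stand-ins for links.csv.
path = os.path.join(tempfile.mkdtemp(), "sample.csv")
with open(path, "wb") as f:
    f.write(b"1,2\n3,4\n5,6\n")

with open(path, "rb") as f:
    # Length 0 maps the whole file; ACCESS_READ makes the mapping read-only.
    mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    try:
        first_bytes = mm[:3]              # slicing works as on a bytes object
        second_line_at = mm.find(b"3,4")  # search without loading the file into a str
        total_size = len(mm)              # size of the mapping in bytes
    finally:
        mm.close()

print(first_bytes, second_line_at, total_size)
```

The OS pages data in only as the slices and searches touch it, so the same calls should work on a file much larger than RAM, address space permitting.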

    So you could read in the CSV file one line at a time, write the results out to a mmap'd file (in a suitable binary format), and then work on the mmap'd file. Since the mmap'd file is only temporary, you could just create a temp file for this purpose.

    Here's some code that demos using mmap with a tempfile to read in CSV data and store it as pairs of integers:

    
    import sys
    import mmap
    import array
    from tempfile import TemporaryFile
    
    def write_int(buffer, i):
        # convert i to 4 bytes and write into buffer
        buffer.write(array.array('i', [i]).tostring())
    
    def read_int(buffer, pos):
        # get the 4 bytes at pos and convert to integer
        offset = 4*pos
        return array.array('i', buffer[offset:offset+4])[0]
    
    def get_edge(edges, lineno):
        pos = lineno*2
        i, j = read_int(edges, pos), read_int(edges, pos+1)
        return i, j
    
    infile=open("links.csv", "r")
    
    count=0
    #count the total number of lines in the file
    for line in infile:
        count=count+1
    
    total=count
    print "Total number of lines: ",total
    
    infile.seek(0)
    
    # make a mmap'd file that's long enough to contain all the data,
    # assuming two integers of 4 bytes each per line
    tmp = TemporaryFile()
    file_len = 2*4*count
    # increase tmp file size
    tmp.seek(file_len-1)
    tmp.write(' ')
    tmp.seek(0)
    edges = mmap.mmap(tmp.fileno(), file_len)
    
    for line in infile:
        i, j=tuple(map(int,line.strip().split(",")))
        write_int(edges, i)
        write_int(edges, j)
    
    # now confirm we can read the ints back out ok
    for i in xrange(count):
        print get_edge(edges, i)
    

    It's a bit rough, though. Really you'd probably want to wrap all of that up in a nice class, so that your edges can be accessed in a way that makes them behave like a list (with indexing, len, etc.). Hopefully, though, it gives you a starting point.
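
One possible shape for such a wrapper is sketched below. It is not part of the original answer: it assumes Python 3, swaps `array` for the `struct` module, and the class name `EdgeList` and the three sample lines are hypothetical. Only integer indexing is handled, not slices.

```python
import mmap
import struct
import tempfile


class EdgeList:
    """List-like view over (i, j) integer pairs stored in a mmap'd temp file."""

    # Two 4-byte signed ints, little-endian (an assumption about the value range).
    _PAIR = struct.Struct("<ii")

    def __init__(self, count):
        self._count = count
        self._file = tempfile.TemporaryFile()
        # Grow the file to its final size before mapping it;
        # an empty file cannot be mapped, so keep room for at least one pair.
        self._file.truncate(max(count, 1) * self._PAIR.size)
        self._map = mmap.mmap(self._file.fileno(), 0)

    def _offset(self, index):
        # Translate a (possibly negative) index into a byte offset.
        if not -self._count <= index < self._count:
            raise IndexError("edge index out of range")
        return (index % self._count) * self._PAIR.size

    def __len__(self):
        return self._count

    def __getitem__(self, index):
        return self._PAIR.unpack_from(self._map, self._offset(index))

    def __setitem__(self, index, edge):
        i, j = edge
        self._PAIR.pack_into(self._map, self._offset(index), i, j)

    def close(self):
        self._map.close()
        self._file.close()


# Usage with a few made-up lines standing in for links.csv.
lines = ["1,2", "3,4", "5,6"]
edges = EdgeList(len(lines))
for n, line in enumerate(lines):
    i, j = map(int, line.strip().split(","))
    edges[n] = (i, j)

print(len(edges), edges[0], edges[-1])
```

Because `__getitem__` raises `IndexError` past the end, plain `for edge in edges` iteration works too. Call `edges.close()` when you are done so the mapping and the temp file are released.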
