Python: How to read huge text file into memory

自闭症患者 2020-11-29 21:36

I'm using Python 2.6 on a Mac Mini with 1GB RAM. I want to read in a huge text file:

$ ls -l links.csv; file links.csv; tail links.csv 
-rw-r--r--  1 user  u         


6 Answers
  •  佛祖请我去吃肉
    2020-11-29 22:33

    There is a recipe for sorting files larger than RAM on this page, though you'd have to adapt it for your case involving CSV-format data. There are also links to additional resources there.
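    The recipe itself isn't reproduced here, but the general technique it relies on (an external merge sort: sort bounded chunks, spill each sorted run to disk, then k-way merge) can be sketched as below. The function name and the tiny chunk size are illustrative only, not taken from the recipe:

```python
import heapq
import tempfile

def external_sort(lines, chunk_size=100000):
    """Sort an iterable of text lines in bounded memory: sort fixed-size
    chunks, spill each sorted chunk to a temp file, then k-way merge."""
    chunk_files = []
    chunk = []

    def spill():
        chunk.sort()
        tf = tempfile.TemporaryFile(mode='w+')
        tf.writelines(l + '\n' for l in chunk)
        tf.seek(0)
        chunk_files.append(tf)
        del chunk[:]

    for line in lines:
        chunk.append(line)
        if len(chunk) >= chunk_size:
            spill()
    if chunk:
        spill()

    # heapq.merge lazily merges the already-sorted runs
    merged = heapq.merge(*[(l.rstrip('\n') for l in tf) for tf in chunk_files])
    return list(merged)

print(external_sort(['9', '3', '7', '1', '8', '2'], chunk_size=4))
# ['1', '2', '3', '7', '8', '9']
```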

    Edit: True, the file on disk is not "larger than RAM", but the in-memory representation can easily become much larger than available RAM. For one thing, your own program doesn't get the entire 1GB (OS overhead etc). For another, even if you stored this in the most compact form for pure Python (two lists of integers, assuming 32-bit machine etc), you'd be using 934MB for those 30M pairs of integers.
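    To see why two plain Python lists blow past 1GB, you can tally the per-element cost yourself. The figures below are for a 64-bit CPython 3, where the overhead is even worse than the 32-bit case cited above:

```python
import sys

n = 30 * 1000 * 1000                  # ~30M rows
pointer_bytes = 8                     # each list slot holds a pointer (64-bit)
int_obj_bytes = sys.getsizeof(12345)  # a small int object, ~28 bytes on CPython 3
per_pair = 2 * (pointer_bytes + int_obj_bytes)
print('~%d MB for %d pairs' % (n * per_pair // 1024**2, n))  # roughly 2GB
```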

    Using numpy you can also do the job, using only about 250MB. It isn't particularly fast to load this way, since you have to count the lines and pre-allocate the array, but it may be the fastest actual sort given that it's in-memory:

    import time
    import numpy as np
    import csv
    
    start = time.time()
    def elapsed():
        return time.time() - start
    
    # count data rows (one comma per row), to pre-allocate the array
    f = open('links.csv', 'rb')
    def count(f):
        while 1:
            block = f.read(65536)
            if not block:
                break
            yield block.count(',')
    
    linecount = sum(count(f))
    f.close()
    print '\n%.3fs: file has %s rows' % (elapsed(), linecount)
    
    # pre-allocate the structured array and load the data into it
    m = np.zeros(linecount, dtype=[('a', np.uint32), ('b', np.uint32)])
    reader = csv.reader(open('links.csv', 'rb'))
    for i, row in enumerate(reader):
        m[i] = int(row[0]), int(row[1])
    
    print '%.3fs: loaded' % elapsed()
    
    # sort in-place on column 'b'; the 'a' values move with their rows
    m.sort(order='b')
    
    print '%.3fs: sorted' % elapsed()
    

    Output on my machine with a sample file similar to what you showed:

    6.139s: file has 33253213 rows
    238.130s: loaded
    517.669s: sorted
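    The ~250MB figure follows directly from the structured dtype: each row is two uint32 values, i.e. 8 bytes, so the whole array is roughly 33M rows × 8 bytes:

```python
import numpy as np

dt = np.dtype([('a', np.uint32), ('b', np.uint32)])
rows = 33253213
print(dt.itemsize)                    # 8 bytes per row
print(rows * dt.itemsize // 1024**2)  # ~253 MB for the whole array
```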
    

    The default sort in numpy is quicksort. The ndarray.sort() routine (which sorts in-place) also accepts the keyword argument kind="mergesort" or kind="heapsort", but it appears that neither is capable of sorting a record array. Incidentally, the record array was the only way I could see to sort the two columns together, as opposed to the default behavior, which would sort them independently (completely scrambling your data).
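    A tiny demonstration of the point about record arrays keeping the columns paired (the values here are made up):

```python
import numpy as np

m = np.array([(3, 9), (1, 7), (2, 8)],
             dtype=[('a', np.uint32), ('b', np.uint32)])
m.sort(order='b')  # sort rows by column 'b'
print(m['b'])      # [7 8 9]
print(m['a'])      # [1 2 3] -- each 'a' stayed with its 'b'
```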
