Question
I have searched Stack Exchange extensively for a neat solution for loading a huge (~2GB) .dat file into a numpy array, but didn't find a proper one. So far I have managed to load it as a list really fast (< 1 min):
lines = []
f = open('myhugefile0')
for line in f:
    lines.append(line)
f.close()
Using np.loadtxt freezes my computer and takes several minutes (~10 min) to load. How can I open the file as an array without the allocation issue that seems to bottleneck np.loadtxt?
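For reference, the loadtxt call I tried was essentially the plain default one, roughly:
import numpy as np

# the straightforward call with no special arguments -- this is the slow path described above
data = np.loadtxt('myhugefile0')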
EDIT:
The input data is a float array of shape (200000, 5181). One example line:
2.27069e-15 2.40985e-15 2.22525e-15 2.1138e-15 1.92038e-15 1.54218e-15 1.30739e-15 1.09205e-15 8.53416e-16 7.71566e-16 7.58353e-16 7.58362e-16 8.81664e-16 1.09204e-15 1.27305e-15 1.58008e-15
and so on
Thanks
Answer 1:
Looking at the source, it appears that numpy.loadtxt contains a lot of code to handle many different formats. If you have a well-defined input file, it is not too difficult to write your own function optimized for your particular file format. Something like this (untested):
import numpy as np

def load_big_file(fname):
    '''Only works for a well-formed text file of space-separated doubles.'''
    rows = []                             # unknown number of lines, so collect into a list
    with open(fname) as f:
        for line in f:
            values = [float(s) for s in line.split()]
            rows.append(np.array(values, dtype=np.double))
    return np.vstack(rows)                # convert list of vectors to a 2D array
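As a rough usage sketch (untested, assuming the filename and dimensions from the question):
data = load_big_file('myhugefile0')
print(data.shape)   # expected (200000, 5181) for the file described in the question
print(data.dtype)   # float64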
An alternative solution, if the number of rows and columns is known beforehand, might be:
def load_known_size(fname, nrow, ncol):
    x = np.empty((nrow, ncol), dtype=np.double)   # preallocate the full array
    with open(fname) as f:
        for irow, line in enumerate(f):
            for icol, s in enumerate(line.split()):
                x[irow, icol] = float(s)
    return x
In this way, you don't have to allocate all the intermediate lists.
EDIT: It seems that the second solution is a bit slower; the list comprehension is probably faster than the explicit for loop. Combining the two solutions, and using the fact that NumPy implicitly converts strings to floats when assigning to an array (I only just discovered that), this might be faster still:
def load_known_size(fname, nrow, ncol):
    x = np.empty((nrow, ncol), dtype=np.double)
    with open(fname) as f:
        for irow, line in enumerate(f):
            x[irow, :] = line.split()   # NumPy converts the strings to floats on assignment
    return x
To get any further speedup, you would probably have to use some code written in C or Cython. I would be interested to know how much time these functions take to open your files.
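If you want to measure that, a minimal timing harness could look like this (just a sketch, using the filename and dimensions from the question; adjust to your file):
import time

def time_loader(loader, *args):
    start = time.time()
    data = loader(*args)
    print('%s: %.1f s, shape %s' % (loader.__name__, time.time() - start, data.shape))
    return data

time_loader(load_big_file, 'myhugefile0')
time_loader(load_known_size, 'myhugefile0', 200000, 5181)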
Source: https://stackoverflow.com/questions/26482209/fastest-way-to-load-huge-dat-into-array