Pythonic way to create a numpy array from a list of numpy arrays

Asked by 执笔经年 on 2020-12-13 06:24 · 6 answers · 1074 views

I generate a list of one-dimensional numpy arrays in a loop and later convert this list to a 2d numpy array. I would've preallocated a 2d numpy array if I knew the number of items ahead of time, but I don't.

6 answers
  • 2020-12-13 06:43

    A convenient way, using numpy.concatenate. I believe it's also faster than @unutbu's answer:

    In [32]: import numpy as np 
    
    In [33]: list_of_arrays = list(map(lambda x: x * np.ones(2), range(5)))
    
    In [34]: list_of_arrays
    Out[34]: 
    [array([ 0.,  0.]),
     array([ 1.,  1.]),
     array([ 2.,  2.]),
     array([ 3.,  3.]),
     array([ 4.,  4.])]
    
    In [37]: shape = list(list_of_arrays[0].shape)
    
    In [38]: shape
    Out[38]: [2]
    
    In [39]: shape[:0] = [len(list_of_arrays)]
    
    In [40]: shape
    Out[40]: [5, 2]
    
    In [41]: arr = np.concatenate(list_of_arrays).reshape(shape)
    
    In [42]: arr
    Out[42]: 
    array([[ 0.,  0.],
           [ 1.,  1.],
           [ 2.,  2.],
           [ 3.,  3.],
           [ 4.,  4.]])
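    
    The shape bookkeeping above can be folded into a single call by letting reshape infer the row width with -1; a minimal sketch of the same idea:
    
    import numpy as np
    
    list_of_arrays = [x * np.ones(2) for x in range(5)]
    # -1 lets reshape infer the row width, assuming every array in the
    # list shares the same 1-d shape
    arr = np.concatenate(list_of_arrays).reshape(len(list_of_arrays), -1)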
    
  • 2020-12-13 06:45

    Even simpler than @fnjn's answer:

    np.vstack(list_of_arrays)
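    
    For the 1-d arrays in this question, np.vstack treats each array as a row, so it yields the same (5, 2) result as the concatenate/reshape approach above; for example:
    
    import numpy as np
    
    list_of_arrays = [x * np.ones(2) for x in range(5)]
    arr = np.vstack(list_of_arrays)   # shape (5, 2)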
    
  • 2020-12-13 06:55

    Even simpler than @Gill Bates' answer, here is a one-line solution:

    np.stack(list_of_arrays, axis=0)
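    
    Note that np.stack joins the inputs along a new axis, so every array in the list must have exactly the same shape. A quick check, reusing the example data from above:
    
    import numpy as np
    
    list_of_arrays = [x * np.ones(2) for x in range(5)]
    assert np.stack(list_of_arrays, axis=0).shape == (5, 2)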
    
  • 2020-12-13 06:55

    I'll add my own version of @unutbu's answer. It is similar to numpy_all_the_way, but it resizes dynamically when you hit an IndexError. I thought it would be a little faster for small data sets, but it's a little slower; the bounds checking slows things down too much.

    initial_guess = 1000
    
    def my_numpy_all_the_way(k):
        # M comes from the test harness in @unutbu's answer below;
        # make_test_data is assumed to yield k rows of width M
        # (a stand-in is sketched after this snippet).
        arr = np.empty((initial_guess, M))
        for x, row in enumerate(make_test_data(k)):
            try:
                arr[x] = row
            except IndexError:
                # Out of room: double the row count and retry.
                arr.resize((arr.shape[0] * 2, arr.shape[1]))
                arr[x] = row
        arr.resize((k, M))  # trim back down to the k rows actually filled
        return arr
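    
    The snippet assumes M from the test harness below and a make_test_data generator that isn't shown anywhere in the thread; a hypothetical stand-in, just so the example runs:
    
    import numpy as np
    
    M = 10  # row width, matching the test harness
    
    def make_test_data(k):
        # Hypothetical helper: yields k rows of width M.
        for x in range(k):
            yield x * np.ones(M)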
    
  • 2020-12-13 06:58

    Suppose you know that the final array arr will never be larger than 5000x10. Then you could pre-allocate an array of maximum size, populate it with data as you go through the loop, and then use arr.resize to cut it down to the discovered size after exiting the loop.

    The tests below suggest that doing so is slightly faster than constructing intermediate Python lists, no matter what the ultimate size of the array is.

    Also, arr.resize de-allocates the unused memory, so the final (though maybe not the intermediate) memory footprint is smaller than what is used by python_lists_to_array.

    This shows numpy_all_the_way is faster:

    % python -mtimeit -s"import test" "test.numpy_all_the_way(100)"
    100 loops, best of 3: 1.78 msec per loop
    % python -mtimeit -s"import test" "test.numpy_all_the_way(1000)"
    100 loops, best of 3: 18.1 msec per loop
    % python -mtimeit -s"import test" "test.numpy_all_the_way(5000)"
    10 loops, best of 3: 90.4 msec per loop
    
    % python -mtimeit -s"import test" "test.python_lists_to_array(100)"
    1000 loops, best of 3: 1.97 msec per loop
    % python -mtimeit -s"import test" "test.python_lists_to_array(1000)"
    10 loops, best of 3: 20.3 msec per loop
    % python -mtimeit -s"import test" "test.python_lists_to_array(5000)"
    10 loops, best of 3: 101 msec per loop
    

    This shows numpy_all_the_way uses less memory:

    % test.py
    Initial memory usage: 19788
    After python_lists_to_array: 20976
    After numpy_all_the_way: 20348
    

    test.py:

    import numpy as np
    import os
    
    
    def memory_usage():
        # Report this process's VmSize (in kB) from /proc; Linux only.
        pid = os.getpid()
        return next(line for line in open('/proc/%s/status' % pid).read().splitlines()
                    if line.startswith('VmSize')).split()[-2]
    
    N, M = 5000, 10
    
    
    def python_lists_to_array(k):
        # Build a Python list of k row arrays, then copy once into a 2-d array.
        list_of_arrays = list(map(lambda x: x * np.ones(M), range(k)))
        arr = np.array(list_of_arrays)
        return arr
    
    
    def numpy_all_the_way(k):
        # Pre-allocate at the maximum size, fill, then trim with resize.
        arr = np.empty((N, M))
        for x in range(k):
            arr[x] = x * np.ones(M)
        arr.resize((k, M))
        return arr
    
    if __name__ == '__main__':
        print('Initial memory usage: %s' % memory_usage())
        arr = python_lists_to_array(5000)
        print('After python_lists_to_array: %s' % memory_usage())
        arr = numpy_all_the_way(5000)
        print('After numpy_all_the_way: %s' % memory_usage())
    
  • 2020-12-13 07:03

    What you are doing is the standard way. A property of numpy arrays is that they need contiguous memory. The only kind of "holes" I can think of is possible via the strides member of PyArrayObject, but that doesn't affect the discussion here. Since numpy arrays have contiguous memory and are "preallocated", adding a new row/column means allocating new memory, copying the data, and then freeing the old memory. If you do that a lot, it is not very efficient.

    One case where someone might not want to create a list and then convert it to a numpy array in the end is when the list contains a lot of numbers: a numpy array of numbers takes much less space than a native Python list of numbers (since the native Python list stores Python objects). For your typical array sizes, I don't think that is an issue.
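    
    As a rough illustration on CPython (exact sizes are implementation details and vary by version):
    
    import sys
    import numpy as np
    
    n = 100_000
    as_list = [float(i) for i in range(n)]
    as_array = np.arange(n, dtype=float)
    
    # The list holds a pointer per element plus a separate float object
    # for each; the array stores just the 8 raw bytes per element.
    list_bytes = sys.getsizeof(as_list) + sum(sys.getsizeof(x) for x in as_list)
    print(list_bytes)        # several megabytes in total
    print(as_array.nbytes)   # 800000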

    When you create your final array from a list of arrays, you are copying all the data to a new location for the new (2-d in your example) array. This is still much more efficient than having a numpy array and doing next = numpy.vstack((next, new_row)) every time you get new data. vstack() will copy all the data for every "row".
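    
    A sketch of the two growth patterns, with a hypothetical row source named rows():
    
    import numpy as np
    
    def rows(k, m=10):
        # Hypothetical data source: yields k rows of width m.
        for x in range(k):
            yield x * np.ones(m)
    
    # Quadratic: each vstack copies everything accumulated so far.
    arr = np.empty((0, 10))
    for row in rows(1000):
        arr = np.vstack((arr, row))
    
    # Linear: cheap list appends, then one copy at the end.
    arr = np.array(list(rows(1000)))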

    There was a thread on numpy-discussion mailing list some time ago which discussed the possibility of adding a new numpy array type that allows efficient extending/appending. It seems there was significant interest in this at that time, although I don't know if something came out of it. You might want to look at that thread.

    I would say that what you're doing is very Pythonic, and efficient, so unless you really need something else (more space efficiency, maybe?), you should be okay. That is how I create my numpy arrays when I don't know the number of elements in the array in the beginning.
