Pythonic way to create a numpy array from a list of numpy arrays

Asked by 执笔经年 on 2020-12-13 06:24 · 6 answers · 1074 views

I generate a list of one-dimensional numpy arrays in a loop and later convert this list to a 2d numpy array. I would've preallocated a 2d numpy array if I knew the number of items ahead of time, but I don't.

6 answers
  • 2020-12-13 06:43

    A convenient way, using numpy.concatenate. I believe it's also faster than @unutbu's answer:

    In [32]: import numpy as np 
    
    In [33]: list_of_arrays = list(map(lambda x: x * np.ones(2), range(5)))
    
    In [34]: list_of_arrays
    Out[34]: 
    [array([ 0.,  0.]),
     array([ 1.,  1.]),
     array([ 2.,  2.]),
     array([ 3.,  3.]),
     array([ 4.,  4.])]
    
    In [37]: shape = list(list_of_arrays[0].shape)
    
    In [38]: shape
    Out[38]: [2]
    
    In [39]: shape[:0] = [len(list_of_arrays)]
    
    In [40]: shape
    Out[40]: [5, 2]
    
    In [41]: arr = np.concatenate(list_of_arrays).reshape(shape)
    
    In [42]: arr
    Out[42]: 
    array([[ 0.,  0.],
           [ 1.,  1.],
           [ 2.,  2.],
           [ 3.,  3.],
           [ 4.,  4.]])
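    
    The shape bookkeeping above can be folded into a single call by letting reshape infer the row width with -1; a minimal sketch of the same idea:
    
    import numpy as np
    
    list_of_arrays = [x * np.ones(2) for x in range(5)]
    # -1 lets reshape infer the row width, assuming every array in the
    # list shares the same 1-d shape
    arr = np.concatenate(list_of_arrays).reshape(len(list_of_arrays), -1)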
    
  • 2020-12-13 06:45

    Even simpler than @fnjn's answer:

    np.vstack(list_of_arrays)
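    
    For the 1-d arrays in this question, np.vstack treats each array as a row, so it yields the same (5, 2) result as the concatenate/reshape approach above; for example:
    
    import numpy as np
    
    list_of_arrays = [x * np.ones(2) for x in range(5)]
    arr = np.vstack(list_of_arrays)   # shape (5, 2)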
    
  • 2020-12-13 06:55

    Even simpler than @Gill Bates' answer, here is a one-line solution:

    np.stack(list_of_arrays, axis=0)
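    
    Note that np.stack joins the inputs along a new axis, so every array in the list must have exactly the same shape. A quick check, reusing the example data from above:
    
    import numpy as np
    
    list_of_arrays = [x * np.ones(2) for x in range(5)]
    assert np.stack(list_of_arrays, axis=0).shape == (5, 2)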
    
  • 2020-12-13 06:55

    I'll add my own version of @unutbu's answer. It is similar to numpy_all_the_way, but it resizes dynamically when you hit an IndexError. I thought it would be a little faster for small data sets, but it's a little slower; the bounds checking slows things down too much.

    initial_guess = 1000
    
    def my_numpy_all_the_way(k):
        # M comes from the test harness in @unutbu's answer below;
        # make_test_data is assumed to yield k rows of width M
        # (a stand-in is sketched after this snippet).
        arr = np.empty((initial_guess, M))
        for x, row in enumerate(make_test_data(k)):
            try:
                arr[x] = row
            except IndexError:
                # Out of room: double the row count and retry.
                arr.resize((arr.shape[0] * 2, arr.shape[1]))
                arr[x] = row
        arr.resize((k, M))  # trim back down to the k rows actually filled
        return arr
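    
    The snippet assumes M from the test harness below and a make_test_data generator that isn't shown anywhere in the thread; a hypothetical stand-in, just so the example runs:
    
    import numpy as np
    
    M = 10  # row width, matching the test harness
    
    def make_test_data(k):
        # Hypothetical helper: yields k rows of width M.
        for x in range(k):
            yield x * np.ones(M)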
    
  • 2020-12-13 06:58

    Suppose you know that the final array arr will never be larger than 5000x10. Then you could pre-allocate an array of maximum size, populate it with data as you go through the loop, and then use arr.resize to cut it down to the discovered size after exiting the loop.

    The tests below suggest that doing so is slightly faster than constructing intermediate Python lists, no matter what the ultimate size of the array is.

    Also, arr.resize de-allocates the unused memory, so the final (though maybe not the intermediate) memory footprint is smaller than what is used by python_lists_to_array.

    This shows numpy_all_the_way is faster:

    % python -mtimeit -s"import test" "test.numpy_all_the_way(100)"
    100 loops, best of 3: 1.78 msec per loop
    % python -mtimeit -s"import test" "test.numpy_all_the_way(1000)"
    100 loops, best of 3: 18.1 msec per loop
    % python -mtimeit -s"import test" "test.numpy_all_the_way(5000)"
    10 loops, best of 3: 90.4 msec per loop
    
    % python -mtimeit -s"import test" "test.python_lists_to_array(100)"
    1000 loops, best of 3: 1.97 msec per loop
    % python -mtimeit -s"import test" "test.python_lists_to_array(1000)"
    10 loops, best of 3: 20.3 msec per loop
    % python -mtimeit -s"import test" "test.python_lists_to_array(5000)"
    10 loops, best of 3: 101 msec per loop
    

    This shows numpy_all_the_way uses less memory:

    % test.py
    Initial memory usage: 19788
    After python_lists_to_array: 20976
    After numpy_all_the_way: 20348
    

    test.py:

    import numpy as np
    import os
    
    
    def memory_usage():
        # Report this process's VmSize (in kB) from /proc; Linux only.
        pid = os.getpid()
        return next(line for line in open('/proc/%s/status' % pid).read().splitlines()
                    if line.startswith('VmSize')).split()[-2]
    
    N, M = 5000, 10
    
    
    def python_lists_to_array(k):
        # Build a Python list of k row arrays, then copy once into a 2-d array.
        list_of_arrays = list(map(lambda x: x * np.ones(M), range(k)))
        arr = np.array(list_of_arrays)
        return arr
    
    
    def numpy_all_the_way(k):
        # Pre-allocate at the maximum size, fill, then trim with resize.
        arr = np.empty((N, M))
        for x in range(k):
            arr[x] = x * np.ones(M)
        arr.resize((k, M))
        return arr
    
    if __name__ == '__main__':
        print('Initial memory usage: %s' % memory_usage())
        arr = python_lists_to_array(5000)
        print('After python_lists_to_array: %s' % memory_usage())
        arr = numpy_all_the_way(5000)
        print('After numpy_all_the_way: %s' % memory_usage())
    
  • 2020-12-13 07:03

    What you are doing is the standard way. A property of numpy arrays is that they need contiguous memory. The only kind of "holes" I can think of is possible via the strides member of PyArrayObject, but that doesn't affect the discussion here. Since numpy arrays have contiguous memory and are "preallocated", adding a new row/column means allocating new memory, copying the data, and then freeing the old memory. If you do that a lot, it is not very efficient.

    One case where someone might not want to create a list and then convert it to a numpy array in the end is when the list contains a lot of numbers: a numpy array of numbers takes much less space than a native Python list of numbers (since the native Python list stores Python objects). For your typical array sizes, I don't think that is an issue.
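    
    As a rough illustration on CPython (exact sizes are implementation details and vary by version):
    
    import sys
    import numpy as np
    
    n = 100_000
    as_list = [float(i) for i in range(n)]
    as_array = np.arange(n, dtype=float)
    
    # The list holds a pointer per element plus a separate float object
    # for each; the array stores just the 8 raw bytes per element.
    list_bytes = sys.getsizeof(as_list) + sum(sys.getsizeof(x) for x in as_list)
    print(list_bytes)        # several megabytes in total
    print(as_array.nbytes)   # 800000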

    When you create your final array from a list of arrays, you are copying all the data to a new location for the new (2-d in your example) array. This is still much more efficient than having a numpy array and doing next = numpy.vstack((next, new_row)) every time you get new data. vstack() will copy all the data for every "row".
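    
    A sketch of the two growth patterns, with a hypothetical row source named rows():
    
    import numpy as np
    
    def rows(k, m=10):
        # Hypothetical data source: yields k rows of width m.
        for x in range(k):
            yield x * np.ones(m)
    
    # Quadratic: each vstack copies everything accumulated so far.
    arr = np.empty((0, 10))
    for row in rows(1000):
        arr = np.vstack((arr, row))
    
    # Linear: cheap list appends, then one copy at the end.
    arr = np.array(list(rows(1000)))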

    There was a thread on numpy-discussion mailing list some time ago which discussed the possibility of adding a new numpy array type that allows efficient extending/appending. It seems there was significant interest in this at that time, although I don't know if something came out of it. You might want to look at that thread.

    I would say that what you're doing is very Pythonic, and efficient, so unless you really need something else (more space efficiency, maybe?), you should be okay. That is how I create my numpy arrays when I don't know the number of elements in the array in the beginning.
