Question
I'm writing a program in which I want to flatten an array, so I used the following code:
list_of_lists = [["a","b","c"], ["d","e","f"], ["g","h","i"]]
flattened_list = [i for j in list_of_lists for i in j]
This results in ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i'], the desired output.
I then found out that with a numpy array I could have done the same thing simply by using np.array(((1,2),(3,4),(5,6))).flatten().
I was wondering if there is any downside to always using numpy arrays in place of regular Python lists. In other words, is there something that Python lists can do which numpy arrays can't?
Answer 1:
With your small example, the list comprehension is faster than the array method, even when taking the array creation out of the timing loop:
In [204]: list_of_lists = [["a","b","c"], ["d","e","f"], ["g","h","i"]]
...: flattened_list = [i for j in list_of_lists for i in j]
In [205]: timeit [i for j in list_of_lists for i in j]
757 ns ± 17.3 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
In [206]: np.ravel(list_of_lists)
Out[206]: array(['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i'], dtype='<U1')
In [207]: timeit np.ravel(list_of_lists)
8.05 µs ± 12.4 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
In [208]: %%timeit x = np.array(list_of_lists)
...: np.ravel(x)
2.33 µs ± 22.8 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
With a much larger example, I expect [208] to fare better relative to the list comprehension.
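As a rough check of that claim (the 1000x1000 size below is an arbitrary choice), here is a minimal benchmark sketch. Note that np.ravel returns a view rather than a copy when the input array is already contiguous, so once the array exists, flattening it is nearly free, while the list comprehension still has to touch every element:

import numpy as np
import timeit

# Illustrative larger input: 1000 sublists of 1000 one-character strings
list_of_lists = [["x"] * 1000 for _ in range(1000)]
x = np.array(list_of_lists)  # array creation kept out of the timing, as in [208]

t_list = timeit.timeit(lambda: [i for j in list_of_lists for i in j], number=100)
t_ravel = timeit.timeit(lambda: np.ravel(x), number=100)
print(f"list comprehension: {t_list / 100 * 1e3:.3f} ms per loop")
print(f"np.ravel(x):        {t_ravel / 100 * 1e3:.3f} ms per loop")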
If the sublists differ in length, the result is not a 2d array but a 1d object array of lists, and ravel/flatten leaves the sublists unexpanded:
In [209]: list_of_lists = [["a","b","c",23], ["d",None,"f"], ["g","h","i"]]
...: flattened_list = [i for j in list_of_lists for i in j]
In [210]: flattened_list
Out[210]: ['a', 'b', 'c', 23, 'd', None, 'f', 'g', 'h', 'i']
In [211]: np.array(list_of_lists)
Out[211]:
array([list(['a', 'b', 'c', 23]), list(['d', None, 'f']),
list(['g', 'h', 'i'])], dtype=object)
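To actually flatten a ragged structure you have to work at the Python level anyway. A minimal sketch (itertools.chain.from_iterable is equivalent to the nested comprehension; dtype=object is passed explicitly since newer numpy versions refuse to guess a shape for ragged input):

import itertools
import numpy as np

list_of_lists = [["a", "b", "c", 23], ["d", None, "f"], ["g", "h", "i"]]

# ravel just returns the 1d object array; the sublists stay intact
arr = np.array(list_of_lists, dtype=object)
print(np.ravel(arr))
# [list(['a', 'b', 'c', 23]) list(['d', None, 'f']) list(['g', 'h', 'i'])]

# chain.from_iterable does the same job as the nested comprehension
flat = list(itertools.chain.from_iterable(list_of_lists))
print(flat)  # ['a', 'b', 'c', 23, 'd', None, 'f', 'g', 'h', 'i']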
Growing lists is more efficient:
In [217]: alist = []
In [218]: for row in list_of_lists:
...: alist.append(row)
...:
In [219]: alist
Out[219]: [['a', 'b', 'c', 23], ['d', None, 'f'], ['g', 'h', 'i']]
In [220]: np.array(alist)
Out[220]:
array([list(['a', 'b', 'c', 23]), list(['d', None, 'f']),
       list(['g', 'h', 'i'])], dtype=object)
We strongly discourage iterative concatenation. Collect the sublists or arrays in a list first.
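A minimal sketch of the two patterns (the shapes and names are illustrative): collect the pieces in a list and convert once at the end, rather than concatenating on every iteration, which reallocates and copies the accumulated array each time and makes the total work quadratic.

import numpy as np

rows = [np.arange(3) for _ in range(1000)]  # illustrative per-iteration results

# Recommended: append to a list, convert once at the end
collected = []
for row in rows:
    collected.append(row)
result = np.array(collected)  # one allocation, shape (1000, 3)

# Discouraged: concatenating inside the loop copies everything
# accumulated so far on every single iteration
acc = np.empty((0, 3), dtype=int)
for row in rows:
    acc = np.concatenate([acc, row[None, :]])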
Answer 2:
Yes, there are. The rule of thumb is that numpy.array is better for data of a single datatype (all integers, all double-precision floats, all booleans, strings of the same length, etc.) rather than a mixed bag of things. In the latter case you might just as well use a generic list, considering this:
In [93]: a = [b'5', 5, '55', 'ab', 'cde', 'ef', 4, 6]
In [94]: b = np.array(a)
In [95]: %timeit 5 in a
65.6 ns ± 0.79 ns per loop (mean ± std. dev. of 7 runs, 10000000 loops each)
In [96]: %timeit 6 in a # worst case
219 ns ± 5.48 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
In [97]: %timeit 5 in b
10.9 µs ± 217 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
Look at that performance difference of roughly two orders of magnitude, where the numpy.array is slower! Certainly this depends on the length of the list, and in this particular case on the position of 5 or 6 (membership testing is O(n) in the worst case), but you get the idea.
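The gap comes from how each container implements in: a list short-circuits at the first match, while x in b for an ndarray is essentially (b == x).any(), a full element-wise comparison with no early exit. A minimal sketch with a homogeneous integer array (the million-element size is arbitrary):

import numpy as np
import timeit

a = list(range(1_000_000))
b = np.array(a)

# List membership short-circuits: a value near the front is found fast;
# a missing value forces a full O(n) scan
print(timeit.timeit(lambda: 5 in a, number=100))
print(timeit.timeit(lambda: -1 in a, number=10))

# Array membership is roughly (b == 5).any(): a full element-wise
# comparison regardless of where (or whether) the value occurs
print(timeit.timeit(lambda: 5 in b, number=100))
print(timeit.timeit(lambda: (b == 5).any(), number=100))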
Answer 3:
Numpy arrays and functions are better for the most part. Here is an article if you want to look into it more: https://webcourses.ucf.edu/courses/1249560/pages/python-lists-vs-numpy-arrays-what-is-the-difference
Source: https://stackoverflow.com/questions/57897942/what-are-the-downsides-of-always-using-numpy-arrays-instead-of-python-lists