Convert Python sequence to NumPy array, filling missing values

Anonymous (unverified), submitted 2019-12-03 01:49:02

Question:

The implicit conversion of a Python sequence of variable-length lists into a NumPy array causes the array to be of type object.

    v = [[1], [1, 2]]
    np.array(v)
    >>> array([[1], [1, 2]], dtype=object)

Trying to force another type will cause an exception:

    np.array(v, dtype=np.int32)
    ValueError: setting an array element with a sequence.

What is the most efficient way to get a dense NumPy array of type int32, by filling the "missing" values with a given placeholder?

From my sample sequence v, I would like to get something like this, if 0 is the placeholder:

array([[1, 0], [1, 2]], dtype=int32)

Answer 1:

You can use itertools.zip_longest:

    import itertools
    np.array(list(itertools.zip_longest(*v, fillvalue=0))).T

    Out:
    array([[1, 0],
           [1, 2]])

Note: For Python 2, it is itertools.izip_longest.
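A minimal sketch of the same idea wrapped in a function, assuming Python 3. The name pad_with_zip_longest and its fillvalue and dtype parameters are hypothetical and not part of the answer. zip_longest walks the rows column by column, which is why the result has to be transposed.

```python
import itertools

import numpy as np


def pad_with_zip_longest(rows, fillvalue=0, dtype=np.int32):
    """Pad ragged rows to equal length by zipping them column-wise."""
    # zip_longest yields one tuple per column and pads short rows with fillvalue.
    columns = list(itertools.zip_longest(*rows, fillvalue=fillvalue))
    # Transpose so that each original row is a row of the result again.
    return np.array(columns, dtype=dtype).T


v = [[1], [1, 2]]
print(pad_with_zip_longest(v))
```

Passing dtype explicitly gives the int32 result asked for in the question without a separate astype call.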



Answer 2:

Pandas and its DataFrame-s deal beautifully with missing data.

    import numpy as np
    import pandas as pd

    v = [[1], [1, 2]]
    print(pd.DataFrame(v).fillna(0).values.astype(np.int32))

    # array([[1, 0],
    #        [1, 2]], dtype=int32)


Answer 3:

Here's an almost* vectorized boolean-indexing based approach that I have used in several other posts -

    def boolean_indexing(v):
        lens = np.array([len(item) for item in v])
        mask = lens[:,None] > np.arange(lens.max())
        out = np.zeros(mask.shape,dtype=int)
        out[mask] = np.concatenate(v)
        return out

Sample run

    In [27]: v
    Out[27]: [[1], [1, 2], [3, 6, 7, 8, 9], [4]]

    In [28]: out
    Out[28]:
    array([[1, 0, 0, 0, 0],
           [1, 2, 0, 0, 0],
           [3, 6, 7, 8, 9],
           [4, 0, 0, 0, 0]])

*Please note that this is described as almost vectorized because the only looping performed here is at the start, where we get the lengths of the list elements. That part is not computationally demanding, so it should have minimal effect on the total runtime.
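The mask-based approach above can be sketched with a configurable placeholder and dtype. This is only an illustration under assumptions: the name boolean_indexing_fill and its parameters are hypothetical, and the input is assumed to contain at least one row.

```python
import numpy as np


def boolean_indexing_fill(rows, fillvalue=0, dtype=np.int32):
    """Pad ragged rows using a boolean mask of the valid positions."""
    # The only Python-level loop: collect the length of every row.
    lens = np.array([len(row) for row in rows])
    # mask[i, j] is True where row i actually has a j-th element.
    mask = lens[:, None] > np.arange(lens.max())
    # Start from an array that already holds the placeholder everywhere.
    out = np.full(mask.shape, fillvalue, dtype=dtype)
    # Boolean assignment fills in row-major order, matching the concatenated rows.
    out[mask] = np.concatenate(rows)
    return out


v = [[1], [1, 2], [3, 6, 7, 8, 9], [4]]
print(boolean_indexing_fill(v, fillvalue=-1))
```

Using np.full instead of np.zeros is what lets the placeholder be something other than 0.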

Runtime test

In this section I time the DataFrame-based solution by @Alberto Garcia-Raboso and the itertools-based solution by @ayhan, as they seem to scale well, along with the boolean-indexing approach from this post, on relatively large datasets with three levels of size variation across the list elements.

Case #1 : Larger size variation

    In [44]: v = [[1], [1,2,4,8,4],[6,7,3,6,7,8,9,3,6,4,8,3,2,4,5,6,6,8,7,9,3,6,4]]

    In [45]: v = v*1000

    In [46]: %timeit pd.DataFrame(v).fillna(0).values.astype(np.int32)
    100 loops, best of 3: 9.82 ms per loop

    In [47]: %timeit np.array(list(itertools.izip_longest(*v, fillvalue=0))).T
    100 loops, best of 3: 5.11 ms per loop

    In [48]: %timeit boolean_indexing(v)
    100 loops, best of 3: 6.88 ms per loop

Case #2 : Lesser size variation

    In [49]: v = [[1], [1,2,4,8,4],[6,7,3,6,7,8]]

    In [50]: v = v*1000

    In [51]: %timeit pd.DataFrame(v).fillna(0).values.astype(np.int32)
    100 loops, best of 3: 3.12 ms per loop

    In [52]: %timeit np.array(list(itertools.izip_longest(*v, fillvalue=0))).T
    1000 loops, best of 3: 1.55 ms per loop

    In [53]: %timeit boolean_indexing(v)
    100 loops, best of 3: 5 ms per loop

Case #3 : Larger number of elements (100 max) per list element

    In [139]: # Setup inputs
         ...: N = 10000 # Number of elems in list
         ...: maxn = 100 # Max. size of a list element
         ...: lens = np.random.randint(0,maxn,(N))
         ...: v = [list(np.random.randint(0,9,(L))) for L in lens]
         ...:

    In [140]: %timeit pd.DataFrame(v).fillna(0).values.astype(np.int32)
    1 loops, best of 3: 292 ms per loop

    In [141]: %timeit np.array(list(itertools.izip_longest(*v, fillvalue=0))).T
    1 loops, best of 3: 264 ms per loop

    In [142]: %timeit boolean_indexing(v)
    10 loops, best of 3: 95.7 ms per loop

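To me, it seems itertools.izip_longest is doing pretty well: it is the fastest in the first two cases, while the boolean-indexing approach wins the third. So there is no clear winner, and the choice has to be made on a case-by-case basis.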
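For readers who want to repeat such a comparison outside IPython, here is a rough sketch that uses the standard timeit module under Python 3, so zip_longest replaces izip_longest. The helper best_time and the function name zip_longest_pad are hypothetical, and the numbers it prints depend on the machine and library versions, so they are not comparable with the figures above.

```python
import itertools
import timeit

import numpy as np


def zip_longest_pad(rows):
    # Python 3 spelling of the itertools-based answer.
    return np.array(list(itertools.zip_longest(*rows, fillvalue=0))).T


def boolean_indexing(rows):
    # The mask-based approach from this answer.
    lens = np.array([len(item) for item in rows])
    mask = lens[:, None] > np.arange(lens.max())
    out = np.zeros(mask.shape, dtype=int)
    out[mask] = np.concatenate(rows)
    return out


def best_time(func, rows, repeat=3, number=10):
    """Return the best per-call time in seconds over several repeats."""
    timings = timeit.repeat(lambda: func(rows), repeat=repeat, number=number)
    return min(timings) / number


# Same input as Case #2: small size variation, 3000 rows.
rows = [[1], [1, 2, 4, 8, 4], [6, 7, 3, 6, 7, 8]] * 1000
for func in (zip_longest_pad, boolean_indexing):
    print(f"{func.__name__}: {best_time(func, rows) * 1e3:.2f} ms per call")
```

Taking the minimum over the repeats mirrors the "best of 3" figure that %timeit reported in the runs above.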



Answer 4:

    max_len = max(len(sub_list) for sub_list in v)

    result = np.array([sub_list + [0] * (max_len - len(sub_list)) for sub_list in v])

    >>> result
    array([[1, 0],
           [1, 2]])

    >>> type(result)
    numpy.ndarray
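The list concatenation above needs every row to be a list. A variant, sketched here with the hypothetical name pad_rows, preallocates the target array and copies each row into it. This is a different technique from the answer's one-liner, and it also accepts tuples and empty rows.

```python
import numpy as np


def pad_rows(rows, fillvalue=0, dtype=np.int32):
    """Copy each ragged row into a preallocated, pre-filled array."""
    # default=0 keeps max() from failing when rows is empty.
    max_len = max((len(row) for row in rows), default=0)
    out = np.full((len(rows), max_len), fillvalue, dtype=dtype)
    for i, row in enumerate(rows):
        # Only the first len(row) cells are overwritten; the rest keep fillvalue.
        out[i, :len(row)] = row
    return out


v = [[1], [1, 2]]
print(pad_rows(v))
```

This avoids building a padded copy of every row as a Python list, at the cost of an explicit loop over the rows.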


Answer 5:

Here is a general way:

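    >>> v = [[1], [2, 3, 4], [5, 6], [7, 8, 9, 10], [11, 12]]
    >>> # Length of the longest row; np.argmax(v) returns an index, not a length.
    >>> max_len = max(len(i) for i in v)
    >>> # Pad each row, flatten, then reshape (np.insert on a ragged list fails on recent NumPy).
    >>> np.hstack([i + [0]*(max_len-len(i)) for i in v]).astype('int32').reshape(len(v), max_len)
    array([[ 1,  0,  0,  0],
           [ 2,  3,  4,  0],
           [ 5,  6,  0,  0],
           [ 7,  8,  9, 10],
           [11, 12,  0,  0]], dtype=int32)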

