Find all-zero columns in pandas sparse matrix

妖精的绣舞 提交于 2019-12-06 06:00:44

Approach #1 We could do something like this -

# Get the columns indices of the input sparse matrix
C = sp.find(A)[1]

# Use np.in1d to create a mask of non-zero columns. 
# So, we invert it and convert to int dtype for desired output.
out = (~np.in1d(np.arange(A.shape[1]),C)).astype(int)

Alternatively, to make the code shorter, we can use subtraction -

out = 1-np.in1d(np.arange(A.shape[1]),C)

Step-by-step run -

1) Input array and sparse matrix from it :

In [137]: arr             # Regular dense array
Out[137]: 
array([[3, 0, 3, 0],
       [0, 0, 2, 0],
       [2, 5, 1, 0],
       [0, 0, 0, 0]])

In [138]: A = sp.coo_matrix(arr) # Convert to sparse matrix as input here on

2) Get non-zero column indices :

In [139]: C = sp.find(A)[1]

In [140]: C
Out[140]: array([0, 2, 2, 0, 1, 2], dtype=int32)

3) Use np.in1d to get mask of non-zero columns :

In [141]: np.in1d(np.arange(A.shape[1]),C)
Out[141]: array([ True,  True,  True, False], dtype=bool)

4) Invert it :

In [142]: ~np.in1d(np.arange(A.shape[1]),C)
Out[142]: array([False, False, False,  True], dtype=bool)

5) Finally convert to int dtype :

In [143]: (~np.in1d(np.arange(A.shape[1]),C)).astype(int)
Out[143]: array([0, 0, 0, 1])

Alternative subtraction approach :

In [145]: 1-np.in1d(np.arange(A.shape[1]),C)
Out[145]: array([0, 0, 0, 1])

Approach #2 Here's another way and possibly a faster one using matrix-multiplication -

out = 1-np.ones(A.shape[0],dtype=bool)*A.astype(bool)

Runtime test

Let's test out all the posted approaches on a big and really sparse matrix -

In [29]: A = sp.coo_matrix((np.random.rand(4000,4000)>0.998).astype(int))

In [30]: %timeit 1-np.in1d(np.arange(A.shape[1]),sp.find(A)[1])
100 loops, best of 3: 4.12 ms per loop # Approach1

In [31]: %timeit 1-np.ones(A.shape[0],dtype=bool)*A.astype(bool)
1000 loops, best of 3: 771 µs per loop # Approach2

In [32]: %timeit 1 - (A.col==np.arange(A.shape[1])[:,None]).any(axis=1)
1 loops, best of 3: 236 ms per loop # @hpaulj's soln

In [33]: %timeit (A!=0).sum(axis=0)==0
1000 loops, best of 3: 1.03 ms per loop  # @jez's soln

In [34]: %timeit (np.sum(np.absolute(A.toarray()), 0) == 0) * 1
10 loops, best of 3: 86.4 ms per loop  # @wwii's soln 

The actual logical operation can be performed like this:

b = (A!=0).sum(axis=0)==0
# matrix([[False, False, False,  True]], dtype=bool)

Now, to ensure that I'm answering your question exactly, I'd better tell you how you could convert from booleans to integers (although really, for most applications I can think of, you can do a lot more in numpy and friends if you stick with an array of bools):

b = b.astype(int)
#matrix([[0, 0, 0, 1]])

Either way, to then convert from a matrix to a list, you could do this:

c = list(b.flat)
# [0, 0, 0, 1]

...although again, I'm not sure this is the best thing to do: for most applications I can imagine, I would perhaps just convert to a one-dimensional numpy.array with c = b.A.flatten() instead.

hpaulj

Recent

scipy.sparse.coo_matrix how to fast find all zeros column, fill with 1 and normalize

similar, except it wants to fill those columns with 1s and normalize them.

I immediately suggested the lil format of the transpose. All-0 columns will be empty lists in this format. But sticking with the coo format I suggested

np.nonzero(~(Mo.col==np.arange(Mo.shape[1])[:,None]).any(axis=1))[0]

or for this 1/0 format

1 - (Mo.col==np.arange(Mo.shape[1])[:,None]).any(axis=1)

which is functionally the same as:

1 - np.in1d(np.arange(Mo.shape[1]),Mo.col)

sparse.find converts the matrix to csr to sum duplicates and eliminate duplicates, and then back to coo to get the data, row, and col attributes (which it returns).

Mo.nonzero uses A.data != 0 to eliminate 0s before returning the col and row attributes.

The np.ones(A.shape[0],dtype=bool)*A.astype(bool) solution requires converting A to csr format for multiplication.

(A!=0).sum(axis=0) also converts to csr because column (or row) sum is done with a matrix multiplication.

So the no-convert requirement is unrealistic, at least within the bounds of sparse formats.

===============

For Divakar's test case my == version is quite slow; it's ok with small ones, but creates too large of test array with the 1000 columns.

Testing on a matrix that is sparse enough to have a number of 0 columns:

In [183]: Arr=sparse.random(1000,1000,.001)
In [184]: (1-np.in1d(np.arange(Arr.shape[1]),Arr.col)).any()
Out[184]: True
In [185]: (1-np.in1d(np.arange(Arr.shape[1]),Arr.col)).sum()
Out[185]: 367

In [186]: timeit 1-np.ones(Arr.shape[0],dtype=bool)*Arr.astype(bool)
1000 loops, best of 3: 334 µs per loop
In [187]: timeit 1-np.in1d(np.arange(Arr.shape[1]),Arr.col)
1000 loops, best of 3: 323 µs per loop
In [188]: timeit 1-(Arr.col==np.arange(Arr.shape[1])[:,None]).any(axis=1)
100 loops, best of 3: 3.9 ms per loop
In [189]: timeit (Arr!=0).sum(axis=0)==0
1000 loops, best of 3: 820 µs per loop

Convert to an array or dense matrix, sum the absolute value along the first axis, test the result against zero, convert to int

>>> import numpy as np
>>> (np.sum(np.absolute(a.toarray()), 0) == 0) * 1
array([0, 0, 0, 1])
>>> (np.sum(np.absolute(a.todense()), 0) == 0) * 1
matrix([[0, 0, 0, 1]])
>>> 
>>> np.asarray((np.sum(np.absolute(a.todense()), 0) == 0), dtype = np.int32)
array([[0, 0, 0, 1]])
>>>

The first is the fastest - 24 uS for your example on my machine.

For a matrix made with np.random.randint(0,3,(1000,1000)), all are right at 25 mS on my machine.

易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!