Finding entries containing a substring in a numpy array?

Question

I tried to find the entries of an array that contain a substring, using np.where with an in condition:

import numpy as np
foo = "aa"
bar = np.array(["aaa", "aab", "aca"])
np.where(foo in bar)

This returns only an empty array.
Why is that?
And is there a good alternative?

Answer 1:

We can use np.core.defchararray.find to get the position of foo in each element of bar; it returns -1 where foo is not found. Comparing the output of find against -1 therefore tells us which elements contain foo, and np.flatnonzero turns that into the indices of the matches. So, we would have an implementation like so -

np.flatnonzero(np.core.defchararray.find(bar,foo)!=-1)

Sample run -

In [91]: bar
Out[91]: 
array(['aaa', 'aab', 'aca'], 
      dtype='|S3')

In [92]: foo
Out[92]: 'aa'

In [93]: np.flatnonzero(np.core.defchararray.find(bar,foo)!=-1)
Out[93]: array([0, 1])

In [94]: bar[2] = 'jaa'

In [95]: np.flatnonzero(np.core.defchararray.find(bar,foo)!=-1)
Out[95]: array([0, 1, 2])
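
On Python 3 the array holds unicode strings (dtype '<U3' rather than '|S3'), and the same approach can be written through the shorter np.char alias. A minimal sketch, assuming a NumPy version where np.char.find is available; the names positions, mask, indices and matches are only illustrative:

```python
import numpy as np

foo = "aa"
bar = np.array(["aaa", "aab", "aca"])

# Position of the first occurrence of foo in each element, -1 where absent.
positions = np.char.find(bar, foo)

# Boolean mask of the matching elements, then their indices and values.
mask = positions != -1
indices = np.flatnonzero(mask)
matches = bar[mask]

print(indices)
print(matches)
```

Keeping the boolean mask around is handy when the matching strings are wanted and not just their positions.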

Answer 2:

The way you are trying to use np.where is incorrect. The first argument of np.where should be a boolean array, and you are simply passing it a boolean.

>>> foo in bar
False
>>> np.where(False)
(array([], dtype=int32),)
>>> np.where(np.array([True, True, False]))
(array([0, 1], dtype=int32),)

The problem is that numpy does not define the in operator as an element-wise boolean operation.
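
To see the difference concretely, here is a small sketch that compares the single boolean produced by in with an explicit element-wise mask. Building the mask with np.char.find is just one possible choice, and the variable names are illustrative:

```python
import numpy as np

foo = "aa"
bar = np.array(["aaa", "aab", "aca"])

# `in` asks whether any element EQUALS foo, so it yields one Python bool.
scalar_result = foo in bar

# An element-wise test yields one boolean per element instead.
elementwise_mask = np.char.find(bar, foo) >= 0

# np.where on the mask reports the positions of the True entries.
found_indices = np.where(elementwise_mask)[0]

print(scalar_result)
print(elementwise_mask)
print(found_indices)
```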

One way you could accomplish what you want is with a list comprehension.

foo = 'aa'
bar = np.array(['aaa', 'aab', 'aca'])
out = [i for i, v in enumerate(bar) if foo in v]
# out = [0, 1]

bar = ['aca', 'bba', 'baa', 'aaf', 'ccc']
out = [i for i, v in enumerate(bar) if foo in v]
# out = [2, 3]
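
If the matching elements are wanted rather than their positions, the same per-element test can feed a boolean mask. A sketch under that assumption, with mask, indices and matches as illustrative names:

```python
import numpy as np

foo = "aa"
bar = np.array(["aca", "bba", "baa", "aaf", "ccc"])

# One Python-level substring test per element, collected into a bool array.
mask = np.fromiter((foo in v for v in bar), dtype=bool, count=len(bar))

indices = np.flatnonzero(mask)  # positions of the matches
matches = bar[mask]             # the matching strings themselves

print(indices)
print(matches)
```

This still loops in Python, so it is no faster than the list comprehension; it only changes the shape of the result.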

Answer 3:

Look at some examples of using in:

In [19]: bar = np.array(["aaa", "aab", "aca"])

In [20]: 'aa' in bar
Out[20]: False

In [21]: 'aaa' in bar
Out[21]: True

In [22]: 'aab' in bar
Out[22]: True

In [23]: 'aab' in list(bar)
Out[23]: True

It looks like in, when used with an array, works as though the array were a list. ndarray does have a __contains__ method, so in works, but its implementation is probably simple.

But in any case, note that in applied to a list does not check for substrings. A string's __contains__ does the substring test, but I don't know of any builtin class that propagates the test down to the component strings.

As Divakar shows, there is a collection of numpy functions that apply string methods to the individual elements of an array.

In [42]: np.char.find(bar, 'aa')
Out[42]: array([ 0,  0, -1])

Docstring:
This module contains a set of functions for vectorized string operations and methods. The preferred alias for defchararray is numpy.char.

For operations like this I think the np.char speeds are about the same as with:

In [49]: np.frompyfunc(lambda x: x.find('aa'), 1, 1)(bar)
Out[49]: array([0, 0, -1], dtype=object)

In [50]: np.frompyfunc(lambda x: 'aa' in x, 1, 1)(bar)
Out[50]: array([True, True, False], dtype=object)
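
One practical detail: the frompyfunc result has dtype=object, so it needs a cast before it can serve as a boolean mask. A short sketch, where contains_aa is a hypothetical helper name:

```python
import numpy as np

bar = np.array(["aaa", "aab", "aca"])

# Wrap the Python substring test so it is applied to every element.
contains_aa = np.frompyfunc(lambda x: "aa" in x, 1, 1)

# frompyfunc returns an object array, so cast it to bool for indexing.
mask = contains_aa(bar).astype(bool)

print(np.flatnonzero(mask))
print(bar[mask])
```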

Further tests suggest that the ndarray __contains__ operates on the flat version of the array - that is, shape doesn't affect its behavior.
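
A quick way to check this is to compare in on a 1-d array and on a reshaped copy against an explicit equality test. This is only a sketch: the extra element "bbb" is added here so the array can be reshaped to 2x2, and the comparison with (grid == probe).any() is my reading of the behavior, not something taken from the numpy source:

```python
import numpy as np

flat = np.array(["aaa", "aab", "aca", "bbb"])
grid = flat.reshape(2, 2)

results = {}
for probe in ("aab", "aa"):
    in_flat = probe in flat                  # membership in the 1-d array
    in_grid = probe in grid                  # membership in the 2-d view
    any_equal = bool((grid == probe).any())  # explicit whole-element equality
    results[probe] = (in_flat, in_grid, any_equal)

print(results)
```

Both shapes agree, and a substring such as "aa" is never reported as present because only whole elements are compared.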

Answer 4:

You can also do something like this:

mask = np.array([foo in x for x in bar])  # one boolean per element
matches = bar[mask]                       # boolean indexing keeps the matching strings

Source: https://stackoverflow.com/questions/38974168/finding-entries-containing-a-substring-in-a-numpy-array

Tags: numpy, where, python-3.4, string-comparison