Question
I have an array traced_descIDs containing object IDs, and I want to identify which items are not unique in this array. Then, for each duplicated ID (each unique value that occurs more than once), I need to identify which indices of traced_descIDs are associated with it.
As an example, if we take the traced_descIDs here, I want the following process to occur:
traced_descIDs = [1, 345, 23, 345, 90, 1]
dupIds = [1, 345]
dupInds = [[0,5],[1,3]]
I'm currently finding out which objects have more than one entry with:
mentions = np.array([len(np.argwhere(traced_descIDs == i)) for i in traced_descIDs])
dupMask = (mentions > 1)
However, this takes too long, as len(traced_descIDs) is around 150,000. Is there a faster way to achieve the same result?
Any help greatly appreciated. Cheers.
Answer 1:
While dictionaries are O(n), the overhead of Python objects sometimes makes it more convenient to use numpy's functions, which use sorting and are O(n*log n). In your case, the starting point would be:
import numpy as np

a = [1, 345, 23, 345, 90, 1]
unq, unq_idx, unq_cnt = np.unique(a, return_inverse=True, return_counts=True)
If you are using a version of numpy earlier than 1.9, then that last line would have to be:
unq, unq_idx = np.unique(a, return_inverse=True)
unq_cnt = np.bincount(unq_idx)
The contents of the three arrays we have created are:
>>> unq
array([ 1, 23, 90, 345])
>>> unq_idx
array([0, 3, 1, 3, 2, 0])
>>> unq_cnt
array([2, 1, 1, 2])
To get the repeated items:
cnt_mask = unq_cnt > 1
dup_ids = unq[cnt_mask]
>>> dup_ids
array([ 1, 345])
Getting the indices is a little more involved, but pretty straightforward:
cnt_idx, = np.nonzero(cnt_mask)
idx_mask = np.in1d(unq_idx, cnt_idx)
idx_idx, = np.nonzero(idx_mask)
srt_idx = np.argsort(unq_idx[idx_mask])
dup_idx = np.split(idx_idx[srt_idx], np.cumsum(unq_cnt[cnt_mask])[:-1])
>>> dup_idx
[array([0, 5]), array([1, 3])]
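For reuse, the steps above can be collected into a single helper. This is a minimal sketch based on the answer's code; the name find_duplicates is just illustrative and not from the original post:

import numpy as np

def find_duplicates(values):
    # Unique values, the inverse mapping back onto `values`, and per-value counts.
    unq, unq_idx, unq_cnt = np.unique(values, return_inverse=True, return_counts=True)
    cnt_mask = unq_cnt > 1
    dup_ids = unq[cnt_mask]                      # values that occur more than once
    cnt_idx, = np.nonzero(cnt_mask)
    idx_mask = np.in1d(unq_idx, cnt_idx)         # positions belonging to a duplicated value
    idx_idx, = np.nonzero(idx_mask)
    srt_idx = np.argsort(unq_idx[idx_mask])      # group those positions by value
    dup_idx = np.split(idx_idx[srt_idx], np.cumsum(unq_cnt[cnt_mask])[:-1])
    return dup_ids, dup_idx

dup_ids, dup_idx = find_duplicates([1, 345, 23, 345, 90, 1])
# dup_ids -> array([  1, 345]); dup_idx -> [array([0, 5]), array([1, 3])]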
Answer 2:
There is scipy.stats.itemfreq, which gives the frequency of each item:
>>> xs = np.array([1, 345, 23, 345, 90, 1])
>>> ifreq = sp.stats.itemfreq(xs)
>>> ifreq
array([[ 1, 2],
[ 23, 1],
[ 90, 1],
[345, 2]])
>>> [(xs == w).nonzero()[0] for w in ifreq[ifreq[:,1] > 1, 0]]
[array([0, 5]), array([1, 3])]
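Note that scipy.stats.itemfreq has been deprecated and removed in newer SciPy releases; if it is not available to you, the same frequency table and indices can be built with plain numpy. A small sketch, not part of the original answer:

import numpy as np

xs = np.array([1, 345, 23, 345, 90, 1])
vals, counts = np.unique(xs, return_counts=True)
ifreq = np.column_stack((vals, counts))              # same value/count layout as itemfreq
dup_inds = [(xs == w).nonzero()[0] for w in vals[counts > 1]]
# dup_inds -> [array([0, 5]), array([1, 3])]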
Answer 3:
Your current approach is O(N**2); use a dictionary to do it in O(N) time:
>>> from collections import defaultdict
>>> traced_descIDs = [1, 345, 23, 345, 90, 1]
>>> d = defaultdict(list)
>>> for i, x in enumerate(traced_descIDs):
... d[x].append(i)
...
>>> for k, v in d.items():
... if len(v) == 1:
... del d[k]
...
>>> d
defaultdict(<type 'list'>, {1: [0, 5], 345: [1, 3]})
And to get the items and indices:
>>> from itertools import izip
>>> dupIds, dupInds = izip(*d.iteritems())
>>> dupIds, dupInds
((1, 345), ([0, 5], [1, 3]))
Note that if you want to preserve the order of items in dupIds, then use collections.OrderedDict and the dict.setdefault() method.
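The code above is Python 2: izip and iteritems no longer exist in Python 3, and deleting keys while iterating over d.items() raises a RuntimeError there. A rough Python 3 equivalent of the same dictionary approach, as a sketch:

from collections import defaultdict

traced_descIDs = [1, 345, 23, 345, 90, 1]

d = defaultdict(list)
for i, x in enumerate(traced_descIDs):
    d[x].append(i)

# Build a filtered dict instead of deleting keys while iterating.
dups = {k: v for k, v in d.items() if len(v) > 1}

dupIds, dupInds = zip(*dups.items())
# dupIds -> (1, 345); dupInds -> ([0, 5], [1, 3])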
Answer 4:
td = np.array(traced_descIDs)
si = np.argsort(td)
td[si][np.append(False, np.diff(td[si]) == 0)]
That gives you:
array([ 1, 345])
I haven't figured out the second part quite yet, but maybe this will be inspiration enough for you, or maybe I'll get back to it. :)
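For the second part, one way to stay within the same argsort-based idea (my sketch, not from the original answer) is to split the sorted order into one group of indices per distinct value and keep only the groups with more than one member:

import numpy as np

traced_descIDs = [1, 345, 23, 345, 90, 1]
td = np.array(traced_descIDs)
si = np.argsort(td, kind='stable')
sorted_td = td[si]

# One group of original indices per distinct value, in sorted-value order.
boundaries = np.nonzero(np.diff(sorted_td))[0] + 1
groups = np.split(si, boundaries)

dupInds = [g for g in groups if len(g) > 1]
dupIds = [td[g[0]] for g in dupInds]
# dupIds -> [1, 345]; dupInds -> [array([0, 5]), array([1, 3])]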
Answer 5:
A solution with the same vectorized efficiency as the one proposed by Jaime is included in the numpy_indexed package (disclaimer: I am its author):
import numpy as np
import numpy_indexed as npi

print(npi.group_by(traced_descIDs, np.arange(len(traced_descIDs))))
This gets us most of the way there; but if we also want to filter out singleton groups while avoiding any Python loops and staying entirely vectorized, we can go a little lower level and do:
g = npi.group_by(traced_descIDs)
unique = g.unique
idx = g.split_array_as_list(np.arange(len(traced_descIDs)))
duplicates = unique[g.count>1]
idx_duplicates = np.asarray(idx)[g.count>1]
print(duplicates, idx_duplicates)
Source: https://stackoverflow.com/questions/25264798/checking-for-and-indexing-non-unique-duplicate-values-in-a-numpy-array