Versions of this question have already been asked but I have not found a satisfactory answer.
Problem: given a large numpy vector, find indices of t
As an alternative to Ami Tavory's answer, you can use a Counter from the collections package to detect duplicates. On my computer it seems to be even faster. See the function below which can also find different duplicates.
import collections
import numpy as np
def find_duplicates_original(x):
d = []
for i in range(len(x)):
for j in range(i + 1, len(x)):
if x[i] == x[j]:
d.append(j)
return d
def find_duplicates_outer(x):
a = np.equal.outer(x, x)
np.fill_diagonal(a, 0)
return np.flatnonzero(np.any(a, axis=0))
def find_duplicates_counter(x):
counter = collections.Counter(x)
values = (v for v, c in counter.items() if c > 1)
return {v: np.flatnonzero(x == v) for v in values}
n = 10000
x = np.arange(float(n))
x[n // 2] = 1
x[n // 4] = 1
>>>> find_duplicates_counter(x)
{1.0: array([ 1, 2500, 5000], dtype=int64)}
>>>> %timeit find_duplicates_original(x)
1 loop, best of 3: 12 s per loop
>>>> %timeit find_duplicates_outer(x)
10 loops, best of 3: 84.3 ms per loop
>>>> %timeit find_duplicates_counter(x)
1000 loops, best of 3: 1.63 ms per loop