I've been looking for a way to efficiently check for duplicates in a numpy array and stumbled upon a question that contained an answer using this code.
What does th
The slices [1:] and [:-1] mean all but the first and all but the last elements of the array:
>>> import numpy as np
>>> s = np.array((1, 2, 2, 3)) # four element array
>>> s[1:]
array([2, 2, 3]) # last three elements
>>> s[:-1]
array([1, 2, 2]) # first three elements
therefore the comparison generates an array of boolean comparisons between each element s[x] and its "neighbour" s[x+1], which will be one shorter than the original array (as the last element has no neighbour):
>>> s[1:] == s[:-1]
array([False, True, False], dtype=bool)
and using that array to index the original array gets you the elements where the comparison is True, i.e. the elements that are the same as their neighbour:
>>> s[s[1:] == s[:-1]]
array([2])
Note that this only identifies adjacent duplicate values.
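To illustrate the point about adjacent values, here is a minimal sketch. The helper name `adjacent_duplicates` is made up for this example and is not part of NumPy. It indexes `a[:-1]` rather than `a`, so the mask and the indexed array have the same length.

```python
import numpy as np

def adjacent_duplicates(a):
    """Return the values that are equal to their right-hand neighbour."""
    a = np.asarray(a)
    mask = a[1:] == a[:-1]      # len(a) - 1 booleans
    return a[:-1][mask]         # a[:-1] has the same length as the mask

unsorted = np.array([2, 1, 3, 2])   # the two 2s are not neighbours

missed = adjacent_duplicates(unsorted)           # nothing is reported
found = adjacent_duplicates(np.sort(unsorted))   # sorting puts the 2s side by side

print(missed)
print(found)
```

Sorting first costs O(n log n), but it makes the neighbour comparison catch every repeated value, not just those that already sit next to each other.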
Check this out:
>>> s = np.array([1, 3, 5, 6, 7, 7, 8, 9])
>>> s[1:] == s[:-1]
array([False, False, False, False, True, False, False], dtype=bool)
>>> s[s[1:] == s[:-1]]
array([7])
So s[1:] gives all numbers but the first, and s[:-1] all but the last.
Now compare these two vectors, i.e. check whether two adjacent elements are the same. Finally, select those elements.
It will show duplicates in a sorted array.
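One detail worth knowing when using this on a sorted array: a value that occurs three or more times is reported once for every equal neighbouring pair. The sketch below assumes you want each duplicated value listed only once and uses `np.unique` to collapse the repeats. The variable names are illustrative.

```python
import numpy as np

s = np.array([1, 7, 7, 7, 9])       # already sorted, 7 occurs three times

mask = s[1:] == s[:-1]              # [False, True, True, False]
repeated = s[:-1][mask]             # one entry per equal neighbouring pair
distinct = np.unique(repeated)      # each duplicated value once

print(repeated)   # [7 7]
print(distinct)   # [7]
```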
Basically, the inner expression s[1:] == s[:-1] compares the array with its shifted version. Imagine this:
 1, [2, 3, ... n-1, n  ]
 == [1, 2, ... n-2, n-1]  n
 => [F, F, ... F,   F  ]
In a sorted array, the resulting array contains no True unless a value is repeated. The outer expression s[array] then keeps only the elements whose corresponding entry in the index array is True.
s[1:] == s[:-1] compares s without the first element with s without the last element, i.e. 0th with 1st, 1st with 2nd etc, giving you an array of len(s) - 1 boolean elements. s[boolarray] will select only those elements from s which have True at the corresponding place in boolarray. Thus, the code extracts all elements that are equal to the next element.
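Because the boolean array has len(s) - 1 elements, it is one shorter than s itself. As far as I know, recent NumPy releases raise an IndexError when a boolean mask does not match the length of the array it indexes, so the original one-liner may only run on older versions. A minimal sketch that sidesteps the mismatch by indexing the slice s[:-1], which has the same length as the mask:

```python
import numpy as np

s = np.array([1, 3, 5, 6, 7, 7, 8, 9])

mask = s[1:] == s[:-1]        # len(s) - 1 booleans: element i vs. element i + 1
dups = s[:-1][mask]           # elements that are equal to the next element

# If only a yes/no answer is needed, reduce the mask instead of indexing.
has_dups = bool(mask.any())

print(dups)       # [7]
print(has_dups)   # True
```

Indexing s[:-1] selects the same elements the original expression was meant to select, since position i of the mask describes the pair (s[i], s[i+1]).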