Efficiently determining if a large sorted NumPy array has only unique values

Question

I have a very large NumPy array that I want to sort and then test for uniqueness.

I'm aware of numpy.unique, but it sorts the array a second time to do its job.

I need the array sorted beforehand because the indices returned by argsort will also be used to reorder another array.

I'm looking for a way to do both (the argsort and the uniqueness test) without sorting the array again.

Example code:

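import numpy as np

# generate random 1-D arrays with 2 ** 27 elements (they can grow even bigger!)
# np.random.random_integers is deprecated; randint with an exclusive upper
# bound of 2 ** 32 + 1 draws from the same inclusive range [1, 2 ** 32]
slices = np.random.randint(1, 2 ** 32 + 1, size=2 ** 27, dtype=np.int64)
values = np.random.randint(1, 2 ** 32 + 1, size=2 ** 27, dtype=np.int64)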

# get an array of keys to sort slices AND values
# this operation takes a long time
sorted_slices = slices.argsort()

# sort both arrays
# it would be nice to make this operation in place
slices = slices[sorted_slices]
values = values[sorted_slices]

# test 'uniqueness'
# here, the np.unique function sorts the array again
if slices.shape[0] == np.unique(slices).shape[0]:
    print('it is unique!')
else:
    print('not unique!')

Both arrays, slices and values, are one-dimensional and have the same (huge) number of elements.

Thanks in advance.

Answer 1:

In a sorted array, equal values always sit next to each other, so you can detect non-unique values by checking whether any difference between adjacent elements is 0:

numpy.any(numpy.diff(slices) == 0)

This expression is True when the array contains duplicates. Be aware, though, that NumPy will create two intermediate arrays: one with the difference values and one with the boolean comparison results.
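As a rough illustration of the check above, here is a minimal sketch on tiny arrays. It assumes the input is already sorted, and the helper name has_adjacent_duplicates is made up for this example (it is not part of NumPy).

```python
import numpy as np

def has_adjacent_duplicates(sorted_arr):
    """Return True if a sorted 1-D array contains any repeated value.

    Hypothetical helper for illustration; assumes sorted_arr is sorted.
    """
    # np.diff allocates an array of n - 1 differences, and `== 0`
    # allocates a second boolean array of the same length.
    return bool(np.any(np.diff(sorted_arr) == 0))

unique_sorted = np.array([1, 3, 4, 9])
repeated_sorted = np.array([1, 3, 3, 9])

print(has_adjacent_duplicates(unique_sorted))    # False
print(has_adjacent_duplicates(repeated_sorted))  # True
```

Note that the result is only meaningful for sorted input: in an unsorted array, equal values need not be adjacent.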

Answer 2:

Here's an approach that uses slicing: instead of computing the differences, compare each element directly against the previous one, like so -

~((slices[1:] == slices[:-1]).any())
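The slice comparison above could be wrapped in a small function, sketched here under the assumption that the input is a sorted 1-D array. The name is_sorted_unique is hypothetical. This sketch uses `not` rather than `~`: `~` gives a logical negation on a NumPy boolean, but on a plain Python bool `~True` evaluates to -2, which is truthy, so `not` is the safer choice if the expression is ever refactored.

```python
import numpy as np

def is_sorted_unique(sorted_arr):
    """Return True if a sorted 1-D array has no repeated values.

    Hypothetical helper for illustration; assumes sorted_arr is sorted.
    """
    # sorted_arr[1:] and sorted_arr[:-1] are views, so the only new
    # allocation is the boolean array produced by `==`.
    return not (sorted_arr[1:] == sorted_arr[:-1]).any()

print(is_sorted_unique(np.array([2, 5, 7, 11])))  # True
print(is_sorted_unique(np.array([2, 5, 5, 11])))  # False
```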

Runtime test -

In [54]: slices = np.sort(np.random.randint(0,100000000,(10000000)))

# @Nils Werner's soln
In [55]: %timeit ~np.any(np.diff(slices) == 0)
100 loops, best of 3: 18.5 ms per loop

# @Marco's suggestion in comments
In [56]: %timeit np.diff(slices).all()
10 loops, best of 3: 20.6 ms per loop

# Proposed soln in this post
In [57]: %timeit ~((slices[1:] == slices[:-1]).any())
100 loops, best of 3: 6.12 ms per loop
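Putting this together with the question's workflow, the following is a small end-to-end sketch: sort once with argsort, reorder both arrays with the same indices, then test uniqueness by comparing adjacent elements. The tiny input arrays and the variable name order are assumptions chosen for illustration; the real arrays in the question are far larger.

```python
import numpy as np

# Small stand-ins for the huge arrays in the question.
slices = np.array([40, 10, 30, 20])
values = np.array([4, 1, 3, 2])

# The only sort: argsort returns the indices that would sort `slices`.
order = slices.argsort()

# Reorder both arrays with the same indices.
slices = slices[order]
values = values[order]

# Uniqueness test on the already sorted array; no second sort needed.
is_unique = not (slices[1:] == slices[:-1]).any()

print(slices)     # [10 20 30 40]
print(values)     # [1 2 3 4]
print(is_unique)  # True
```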

Source: https://stackoverflow.com/questions/42652023/efficiently-determining-if-large-sorted-numpy-array-has-only-unique-values

Tags

python, arrays, sorting, numpy, unique