Efficiently count the number of occurrences of unique subarrays in NumPy?

问题

I have an array of shape (128, 36, 8) and I'd like to find the number of occurrences of the unique subarrays of length 8 in the last dimension.

I'm aware of np.unique and np.bincount, but those seem to be for elements rather than subarrays. I've seen this question but it's about finding the first occurrence of a particular subarray, rather than the counts of all unique subarrays.

回答1:

The question states that the input array is of shape (128, 36, 8) and we are interested in finding unique subarrays of length 8 in the last dimension. So, I am assuming that the uniqueness is along the first two dimensions being merged together. Let us assume A as the input 3D array.

Get the number of unique subarrays

# Reshape the 3D array to a 2D array merging the first two dimensions
Ar = A.reshape(-1,A.shape[2])

# Perform lex sort and get the sorted indices and xy pairs
sorted_idx = np.lexsort(Ar.T)
sorted_Ar =  Ar[sorted_idx,:]

# Get the count of rows that have at least one TRUE value 
# indicating presence of unique subarray there
unq_out = np.any(np.diff(sorted_Ar,axis=0),1).sum()+1

Sample run -

In [159]: A # A is (2,2,3)
Out[159]: 
array([[[0, 0, 0],
        [0, 0, 2]],

       [[0, 0, 2],
        [2, 0, 1]]])

In [160]: unq_out
Out[160]: 3

Get the count of occurrences of unique subarrays

# Reshape the 3D array to a 2D array merging the first two dimensions
Ar = A.reshape(-1,A.shape[2])

# Perform lex sort and get the sorted indices and xy pairs
sorted_idx = np.lexsort(Ar.T)
sorted_Ar =  Ar[sorted_idx,:]

# Get IDs for each element based on their uniqueness
id = np.append([0],np.any(np.diff(sorted_Ar,axis=0),1).cumsum())

# Get counts for each ID as the final output
unq_count = np.bincount(id)

Sample run -

In [64]: A
Out[64]: 
array([[[0, 0, 2],
        [1, 1, 1]],

       [[1, 1, 1],
        [1, 2, 0]]])

In [65]: unq_count
Out[65]: array([1, 2, 1], dtype=int64)

回答2:

Here I've modified @Divakar's very useful answer to return the counts of the unique subarrays, as well as the subarrays themselves, so that the output is the same as that of collections.Counter.most_common():

# Get the array in 2D form.
arr = arr.reshape(-1, arr.shape[-1])

# Lexicographically sort
sorted_arr = arr[np.lexsort(arr.T), :]

# Get the indices where a new row appears
diff_idx = np.where(np.any(np.diff(sorted_arr, axis=0), 1))[0]

# Get the unique rows
unique_rows = [sorted_arr[i] for i in diff_idx] + [sorted_arr[-1]]

# Get the number of occurences of each unique array (the -1 is needed at
# the beginning, rather than 0, because of fencepost concerns)
counts = np.diff(
    np.append(np.insert(diff_idx, 0, -1), sorted_arr.shape[0] - 1))

# Return the (row, count) pairs sorted by count
return sorted(zip(unique_rows, counts), key=lambda x: x[1], reverse=True)

回答3:

I am not sure that it's the most efficient way to do it but this should work.

arr = arr.reshape(128*36,8)
unique_ = []
occurence_ = []

for sub in arr:
    if sub.tolist() not in unique_:
        unique_.append(sub.tolist())
        occurence_.append(1)
    else:
        occurence_[unique_.index(sub.tolist())]+=1
for index_,u in unique_:
   print u,"occurrence: %s"%occurence_[index_]

来源：https://stackoverflow.com/questions/30879446/efficiently-count-the-number-of-occurrences-of-unique-subarrays-in-numpy

标签

python

arrays

numpy

counting