Count how many times each row is present in numpy.array

Anonymous (unverified), submitted on 2019-12-03 01:18:02

Question:

I am trying to count the number of times each row appears in a np.array, for example:

    import numpy as np

    my_array = np.array([[1, 2, 0, 1, 1, 1],
                         [1, 2, 0, 1, 1, 1],  # duplicate of row 0
                         [9, 7, 5, 3, 2, 1],
                         [1, 1, 1, 0, 0, 0],
                         [1, 2, 0, 1, 1, 1],  # duplicate of row 0
                         [1, 1, 1, 1, 1, 0]])

Row [1, 2, 0, 1, 1, 1] shows up 3 times.

A simple naive solution would involve converting all my rows to tuples, and applying collections.Counter, like this:

    from collections import Counter

    def row_counter(my_array):
        list_of_tups = [tuple(ele) for ele in my_array]
        return Counter(list_of_tups)

Which yields:

    In [2]: row_counter(my_array)
    Out[2]: Counter({(1, 2, 0, 1, 1, 1): 3,
                     (1, 1, 1, 1, 1, 0): 1,
                     (9, 7, 5, 3, 2, 1): 1,
                     (1, 1, 1, 0, 0, 0): 1})

However, I am concerned about the efficiency of my approach, and there may be a library that provides a built-in way of doing this. I tagged the question as pandas because I think pandas might have the tool I am looking for.

Answer 1:

You can use the answer to this other question of yours to get the counts of the unique items.

Since numpy 1.9, np.unique accepts an optional return_counts keyword argument, so you can simply do:

    >>> my_array
    array([[1, 2, 0, 1, 1, 1],
           [1, 2, 0, 1, 1, 1],
           [9, 7, 5, 3, 2, 1],
           [1, 1, 1, 0, 0, 0],
           [1, 2, 0, 1, 1, 1],
           [1, 1, 1, 1, 1, 0]])
    >>> dt = np.dtype((np.void, my_array.dtype.itemsize * my_array.shape[1]))
    >>> b = np.ascontiguousarray(my_array).view(dt)
    >>> unq, cnt = np.unique(b, return_counts=True)
    >>> unq = unq.view(my_array.dtype).reshape(-1, my_array.shape[1])
    >>> unq
    array([[1, 1, 1, 0, 0, 0],
           [1, 1, 1, 1, 1, 0],
           [1, 2, 0, 1, 1, 1],
           [9, 7, 5, 3, 2, 1]])
    >>> cnt
    array([1, 1, 3, 1])

In earlier versions, you can do it as:

    >>> unq, _ = np.unique(b, return_inverse=True)
    >>> cnt = np.bincount(_)
    >>> unq = unq.view(my_array.dtype).reshape(-1, my_array.shape[1])
    >>> unq
    array([[1, 1, 1, 0, 0, 0],
           [1, 1, 1, 1, 1, 0],
           [1, 2, 0, 1, 1, 1],
           [9, 7, 5, 3, 2, 1]])
    >>> cnt
    array([1, 1, 3, 1])
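
On newer NumPy releases (1.13 and later, as far as I know), np.unique also takes an axis argument, so the void-view step may not be needed at all. A minimal sketch under that assumption; count_rows is a hypothetical helper name, not something from the answer above:

```python
import numpy as np

def count_rows(arr):
    """Return (unique_rows, counts) for a 2-D array (illustrative helper)."""
    # axis=0 treats each row as a single item; unique rows come back sorted
    unique_rows, counts = np.unique(arr, axis=0, return_counts=True)
    return unique_rows, counts

my_array = np.array([[1, 2, 0, 1, 1, 1],
                     [1, 2, 0, 1, 1, 1],
                     [9, 7, 5, 3, 2, 1],
                     [1, 1, 1, 0, 0, 0],
                     [1, 2, 0, 1, 1, 1],
                     [1, 1, 1, 1, 1, 0]])

unique_rows, counts = count_rows(my_array)
for row, n in zip(unique_rows, counts):
    print(row, n)
```

The result should match unq and cnt from the void-view version, since both sort the unique rows before counting.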


Answer 2:

(This assumes that the array is fairly small, e.g. fewer than 1000 rows.)

Here's a short NumPy way to count how many times each row appears in an array:

    >>> (my_array[:, np.newaxis] == my_array).all(axis=2).sum(axis=1)
    array([3, 3, 1, 1, 3, 1])

This counts how many times each row appears in my_array, returning an array where the first value shows how many times the first row appears, the second value shows how many times the second row appears, and so on.
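
To see why this only suits small arrays, it can help to split the one-liner into its intermediate steps. The variable names below are my own, chosen for illustration:

```python
import numpy as np

my_array = np.array([[1, 2, 0, 1, 1, 1],
                     [1, 2, 0, 1, 1, 1],
                     [9, 7, 5, 3, 2, 1],
                     [1, 1, 1, 0, 0, 0],
                     [1, 2, 0, 1, 1, 1],
                     [1, 1, 1, 1, 1, 0]])

# Shape (6, 1, 6) compared with (6, 6) broadcasts to (6, 6, 6):
# elementwise[i, j, k] is True when my_array[i, k] == my_array[j, k]
elementwise = my_array[:, np.newaxis] == my_array

# rows_equal[i, j] is True when rows i and j agree in every column
rows_equal = elementwise.all(axis=2)

# Summing over j counts the rows equal to row i, including row i itself
per_row_counts = rows_equal.sum(axis=1)

print(elementwise.shape, rows_equal.shape)
print(per_row_counts)
```

The intermediate boolean array holds n * n * m entries for n rows and m columns, so memory use grows quadratically with the number of rows.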



Answer 3:

Your solution is not bad, but if your matrix is large you will probably want to use a more efficient hash (compared to the default one Counter uses) for the rows before counting. You can do that with joblib:

The pandas solution is extremely slow (about 2s per loop) with this many columns. For a small matrix like the one you showed your method is faster than joblib hashing but slower than numpy:

If you have a large number of rows then you can probably find a better substitute for Counter to find hash frequencies.
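
The joblib snippets and benchmarks from this answer are not reproduced on this page. As a rough illustration of the general idea only (cheaper keys for Counter), here is a sketch that uses each row's raw bytes as the key instead of a tuple of elements. It relies on the standard library plus NumPy rather than joblib, and row_counter_bytes is a hypothetical name:

```python
from collections import Counter

import numpy as np

def row_counter_bytes(arr):
    """Count rows using each row's raw bytes as the key (illustrative helper)."""
    # Ensure a contiguous layout before serialising each row
    arr = np.ascontiguousarray(arr)
    return Counter(row.tobytes() for row in arr)

my_array = np.array([[1, 2, 0, 1, 1, 1],
                     [1, 2, 0, 1, 1, 1],
                     [9, 7, 5, 3, 2, 1],
                     [1, 1, 1, 0, 0, 0],
                     [1, 2, 0, 1, 1, 1],
                     [1, 1, 1, 1, 1, 0]])

counts = row_counter_bytes(my_array)

# Decode each key back into a row for display
for key, n in counts.items():
    print(np.frombuffer(key, dtype=my_array.dtype), n)
```

Comparing bytes is only equivalent to comparing values for integer-like dtypes; with floats, values such as 0.0 and -0.0 have different byte patterns, so treat this as a sketch rather than a drop-in replacement.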

Edit: Added numpy benchmarks from @acjr's solution in my system so that it is easier to compare. The numpy solution is the fastest one in both cases.



Answer 4:

A pandas approach might look like this:

    import pandas as pd

    df = pd.DataFrame(my_array, columns=['c1', 'c2', 'c3', 'c4', 'c5', 'c6'])
    df.groupby(['c1', 'c2', 'c3', 'c4', 'c5', 'c6']).size()

Note: supplying column names is not necessary
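
For instance, a version without explicit column names might group on whatever default labels pandas assigns. This is a sketch of that variant, not code from the answer:

```python
import numpy as np
import pandas as pd

my_array = np.array([[1, 2, 0, 1, 1, 1],
                     [1, 2, 0, 1, 1, 1],
                     [9, 7, 5, 3, 2, 1],
                     [1, 1, 1, 0, 0, 0],
                     [1, 2, 0, 1, 1, 1],
                     [1, 1, 1, 1, 1, 0]])

# Without a columns argument the labels default to the integers 0..5
df = pd.DataFrame(my_array)

# Group on every column; size() gives the number of rows in each group
counts = df.groupby(list(df.columns)).size()
print(counts)
```

The result is a Series indexed by the row values, so the count for a given row can be looked up by its tuple of values.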



Answer 5:

A solution identical to Jaime's can be found in the numpy_indexed package (disclaimer: I am its author):

    import numpy_indexed as npi

    npi.count(my_array)

