Fast way to get the indices of the top-k elements of every column in a pandas DataFrame

Submitted by 梦想与她 on 2019-12-04 08:33:06

I think NumPy has a good solution for this: it's fast, and you can format the output however you want.

In [1]: import numpy as np
   ...: import pandas as pd

In [2]: df = pd.DataFrame(data=np.random.randint(0, 1000, (200, 500000)),
   ...:                   columns=range(500000), index=range(200))

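In [3]: def top_k(x, k):
   ...:     # positions of the k largest values, in arbitrary order (partial sort)
   ...:     ind = np.argpartition(x, -k)[-k:]
   ...:     # order those k positions by value, smallest first
   ...:     return ind[np.argsort(x[ind])]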

In [69]: %time np.apply_along_axis(lambda x: top_k(x, 2), 0, df.to_numpy())  # as_matrix() was removed in pandas 1.0
CPU times: user 5.91 s, sys: 40.7 ms, total: 5.95 s
Wall time: 6 s

Out[69]:
array([[ 14,  54],
       [178, 141],
       [ 49, 111],
       ...,
       [ 24, 122],
       [ 55,  89],
       [  9, 175]])

That's pretty fast (about 6 s versus about 4 min) compared to the pandas solution (which is cleaner IMO, but we're going for speed here):

In [41]: %time np.array([df[c].nlargest(2).index.values for c in df])
CPU times: user 3min 43s, sys: 6.58 s, total: 3min 49s
Wall time: 4min 8s

Out[41]:
array([[ 54,  14],
       [141, 178],
       [111,  49],
       ...,
       [122,  24],
       [ 89,  55],
       [175,   9]])

The two results list each pair in opposite order: the NumPy version puts the largest value last, while nlargest puts it first. You can easily fix this by reversing the sort in the NumPy version.
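
Reversing the sort in the NumPy version could look like the sketch below. The name top_k_desc and the sample array are made up for illustration; the only change from top_k is the [::-1] on the final argsort.

```python
import numpy as np

def top_k_desc(x, k):
    """Return positions of the k largest entries of 1-D array x, largest first."""
    # positions of the k largest values, in arbitrary order
    ind = np.argpartition(x, -k)[-k:]
    # sort those k candidates by value, then reverse to get descending order
    return ind[np.argsort(x[ind])[::-1]]

# illustrative data with no ties: 9 sits at position 2, 7 at position 4
x = np.array([5, 1, 9, 3, 7])
result = top_k_desc(x, 2)
print(result)  # [2 4]
```

With ties among the top k, the order of the tied positions is not guaranteed to match what nlargest returns.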

Note that because the example uses randomly generated integers, a column is likely to contain more than k values tied for the maximum. The indices returned may therefore differ between methods, but every method yields a valid result: k indices that point to the largest values in the column.
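
One way to check that a result is valid despite ties is to compare the selected values rather than the indices. The sketch below is an assumption-laden illustration, not part of the original timings: it uses a small made-up array with a deliberately narrow value range, and it calls argpartition along axis 0 directly instead of going through apply_along_axis.

```python
import numpy as np

rng = np.random.default_rng(0)
# narrow value range on purpose, so ties at the maximum are very likely
arr = rng.integers(0, 5, size=(200, 50))
k = 2

# positions of the k largest entries in each column, shape (k, n_cols)
idx = np.argpartition(arr, -k, axis=0)[-k:]
# the values found at those positions
picked = np.take_along_axis(arr, idx, axis=0)

# reference: the k largest values per column from a full sort
expected = np.sort(arr, axis=0)[-k:]

# the indices may differ between methods, but the selected values must agree
values_match = np.array_equal(np.sort(picked, axis=0), expected)
print(values_match)
```

If values_match is True, every column's selected indices point at its k largest values, whichever of the tied positions were chosen.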

Pandas has an efficient nlargest operation that is faster than a full sort. It will still take a while to apply across 500,000 columns.

In [1]: df = pd.DataFrame(data=np.random.randint(0, 100, (200, 500000)), 
                          columns=range(500000), index=range(200))

In [2]: %time np.array([df[c].nlargest(2).index.values for c in df])
Wall time: 2min 57s
Out[2]: 
array([[171,   1],
       [ 42,  78],
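
For reference, here is what nlargest does on a single column. The tiny Series is made up for illustration; the loop above applies the same call to every column of df.

```python
import pandas as pd

# illustrative column: 9 is at index label 1, 7 is at index label 3
s = pd.Series([3, 9, 1, 7])

# the two largest values, largest first, keeping their index labels
top2 = s.nlargest(2)
top2_index = top2.index.to_numpy()
print(top2_index)  # [1 3]
```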

As @EdChum noted, you probably don't want to store the results as tuples; it would be a lot more efficient to use two arrays or some other strategy.
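
A minimal sketch of the two-array idea, assuming you want one array of index labels and one of the matching values, each with one row per column. The function name top_k_per_column and the sample frame are hypothetical, and the code swaps the per-column loop for argpartition along axis 0.

```python
import numpy as np
import pandas as pd

def top_k_per_column(df, k):
    """Return (labels, values), each shaped (n_columns, k), largest value first."""
    arr = df.to_numpy()
    # positions of the k largest entries per column, in arbitrary order
    part = np.argpartition(arr, -k, axis=0)[-k:]
    vals = np.take_along_axis(arr, part, axis=0)
    # order the k candidates within each column, descending by value
    order = np.argsort(vals, axis=0)[::-1]
    pos = np.take_along_axis(part, order, axis=0)
    vals = np.take_along_axis(vals, order, axis=0)
    # translate row positions into index labels
    labels = df.index.to_numpy()[pos]
    # transpose so that each row corresponds to one DataFrame column
    return labels.T, vals.T

# illustrative frame with no ties
df_small = pd.DataFrame({"a": [3, 9, 1, 7], "b": [8, 2, 6, 4]})
labels, values = top_k_per_column(df_small, 2)
print(labels)  # [[1 3]
               #  [0 2]]
print(values)  # [[9 7]
               #  [8 6]]
```

Keeping labels and values in two plain integer arrays avoids creating a Python tuple per column, which matters at 500,000 columns.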
