I have many large lists of integers (>35,000,000 entries each) that will contain duplicates. I need to get a count for each distinct integer in a list. The following code works, but seems slow.
This is a fairly old thread, but I thought I'd mention that there's a small improvement to be made on the currently-accepted solution:
def group_by_edge():
    import numpy as np
    values = np.array(np.random.randint(0, 1 << 32, size=35000000), dtype='u4')
    values.sort()
    # Indices where the sorted value changes; each marks the start of a new group.
    edges = (values[1:] != values[:-1]).nonzero()[0] + 1
    idx = np.concatenate(([0], edges, [len(values)]))
    index = np.empty(len(idx) - 1, dtype='u4,u2')
    index['f0'] = values[idx[:-1]]  # each distinct value
    index['f1'] = np.diff(idx)      # its count
    return index
This tested about half a second faster on my machine; not a huge improvement, but worth something. Additionally, I think it's clearer what's happening here; the two-step diff approach is a bit opaque at first glance.
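A quick way to sanity-check the grouping logic is to run it on a small array and compare against `np.unique(..., return_counts=True)`, which computes the same value/count pairs directly. This is my own sketch (the function name and the tiny input are just for illustration, not from the answer above):

```python
import numpy as np

def group_counts(values):
    """Return a structured array of (value, count) pairs for a 1-D array."""
    values = np.sort(np.asarray(values))
    # Same edge-detection trick as above: starts of each run of equal values.
    edges = (values[1:] != values[:-1]).nonzero()[0] + 1
    idx = np.concatenate(([0], edges, [len(values)]))
    index = np.empty(len(idx) - 1, dtype='u4,u2')
    index['f0'] = values[idx[:-1]]
    index['f1'] = np.diff(idx)
    return index

data = np.array([3, 1, 3, 7, 1, 3], dtype='u4')
index = group_counts(data)
uniq, counts = np.unique(data, return_counts=True)
assert (index['f0'] == uniq).all()
assert (index['f1'] == counts).all()
```

Note that `np.unique` with `return_counts=True` is itself a one-line alternative for this whole problem, though the `u2` count field above assumes no value repeats more than 65,535 times.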