I have many large lists of integers (>35,000,000 entries each) that will contain duplicates. I need to get a count for each integer in a list. The following code works, but seems slow.
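
(The question's code isn't reproduced above; judging from the answer's reference below to an in-place, sort-based solution, it was presumably something along these lines. The NumPy array, the placeholder data, and the run-counting via np.diff are assumptions for illustration, not the original code.)

    import numpy as np

    values = np.array([3, 1, 3, 2, 1, 3])          # placeholder data; the real lists are huge
    values.sort()                                   # in-place O(N log N) sort
    change = np.nonzero(np.diff(values))[0] + 1     # positions where a new value starts
    starts = np.concatenate(([0], change, [len(values)]))
    distinct = values[starts[:-1]]                  # each distinct integer
    counts = np.diff(starts)                        # how many times each one occurs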
Sorting is Θ(N log N); I'd go for the amortized O(N) behaviour of Python's hash-table implementation instead. Use collections.defaultdict(int) to keep a count per integer and iterate over the array once:
    import collections

    counts = collections.defaultdict(int)
    for v in values:
        counts[v] += 1    # missing keys default to 0, so no membership check is needed
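
On a small illustrative input such as values = [3, 1, 3, 2, 1, 3], dict(counts) comes out as {3: 3, 1: 2, 2: 1}.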
This is theoretically faster; unfortunately I have no way to benchmark it right now. Allocating the additional memory for the hash table might actually make it slower than your solution, which works in place.
Edit: If you need to save memory, try radix sort, which is much faster on integers than quicksort (which I believe is what NumPy uses).
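
A minimal pure-Python LSD radix sort sketch for non-negative integers, just to make the idea concrete; the function name and the 8-bit digit size are illustrative choices, and this version allocates temporary buckets rather than sorting strictly in place:

    def radix_sort(values, num_bits=8):
        """Least-significant-digit radix sort for non-negative integers."""
        if not values:
            return values
        mask = (1 << num_bits) - 1
        shift = 0
        max_val = max(values)
        while (max_val >> shift) > 0:
            # stable bucket pass on the current num_bits-wide digit
            buckets = [[] for _ in range(1 << num_bits)]
            for v in values:
                buckets[(v >> shift) & mask].append(v)
            values = [v for bucket in buckets for v in bucket]
            shift += num_bits
        return values

Once the list is sorted, the per-integer counts can be read off in one linear pass, e.g. with itertools.groupby.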