Python - Count occurrences of certain ranges in a list

Posted by 北慕城南 on 2019-11-30 15:39:27

To group the data, divide each value by the interval width. To count the number in each group, consider using collections.Counter. Here's a worked-out example with documentation and a test:

from collections import Counter

def histogram(iterable, low, high, bins):
    '''Count elements from the iterable into evenly spaced bins

        >>> scores = [82, 85, 90, 91, 70, 87, 45]
        >>> histogram(scores, 0, 100, 10)
        [0, 0, 0, 0, 1, 0, 0, 1, 3, 2]

    '''
    step = (high - low + 0.0) / bins
    dist = Counter((float(x) - low) // step for x in iterable)
    return [dist[b] for b in range(bins)]

if __name__ == '__main__':
    import doctest
    print(doctest.testmod())
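
One edge case in the function above: a value exactly equal to high (a score of 100 here) computes to bin index `bins`, which the returned list does not cover, so it is silently dropped. The following is a minimal sketch of one way to handle that, assuming you want the top value counted in the last bin. The name histogram_closed and the explicit range check are illustrative assumptions, not part of the original answer.

```python
from collections import Counter

def histogram_closed(iterable, low, high, bins):
    """Count elements into evenly spaced bins; x == high goes in the last bin."""
    step = (high - low) / bins
    dist = Counter()
    for x in iterable:
        if low <= x <= high:  # ignore values outside [low, high]
            # min() folds x == high into bin (bins - 1) instead of bin `bins`
            dist[min(int((x - low) // step), bins - 1)] += 1
    return [dist[b] for b in range(bins)]

scores = [82, 85, 90, 91, 70, 87, 45, 100]
counts = histogram_closed(scores, 0, 100, 10)
print(counts)  # [0, 0, 0, 0, 1, 0, 0, 1, 3, 3]
```

Whether 100 belongs in the 90s bin or in a bin of its own is a choice that depends on how the ranges are meant to be reported.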

If you are fine with using the external library NumPy, then you just need to call numpy.histogram():

>>> import numpy
>>> data = [82, 85, 90, 91, 70, 87, 45]
>>> counts, bins = numpy.histogram(data, bins=10, range=(0, 100))
>>> counts
array([0, 0, 0, 0, 1, 0, 0, 1, 3, 2])
>>> bins
array([   0.,   10.,   20.,   30.,   40.,   50.,   60.,   70.,   80.,
         90.,  100.])

decs = [int(x/10) for x in scores]

This maps scores 0-9 -> 0, 10-19 -> 1, et cetera. Then just count the occurrences of 0, 1, 2, 3, and so on (via something like collections.Counter), and map back to ranges from there.
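
The count-and-map-back step could look roughly like this. It is only a sketch: the sample scores are borrowed from the first answer, and the names width and ranges are assumptions made for illustration.

```python
from collections import Counter

scores = [82, 85, 90, 91, 70, 87, 45]
width = 10  # assumed bin width

# Bucket index for each score: 0-9 -> 0, 10-19 -> 1, ...
decs = Counter(x // width for x in scores)

# Map each bucket index back to its (low, high) range, inclusive on both ends.
ranges = {(d * width, d * width + width - 1): n for d, n in sorted(decs.items())}
print(ranges)  # {(40, 49): 1, (70, 79): 1, (80, 89): 3, (90, 99): 2}
```

Only non-empty ranges appear in the result. If empty bins should show up as zeros, look up every index you care about in decs instead (a Counter returns 0 for missing keys).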

This method uses bisect, which can be more efficient, but it requires that you sort the scores first.

from bisect import bisect
import random

scores = [random.randint(0,100) for _ in range(100)]
bins = [20, 40, 60, 80, 100]

scores.sort()
counts = []
last = 0
for range_max in bins:
    i = bisect(scores, range_max, last)
    counts.append(i - last)
    last = i

I wouldn't expect you to install numpy just for this, but if you already have numpy you can use numpy.histogram.

UPDATE

First, using bisect is more flexible. Using [i//n for i in scores] requires that all the bins are the same size, while bisect allows the bins to have arbitrary limits. Also, i//n means the ranges are [lo, hi). With bisect the ranges are (lo, hi], but you can use bisect_left if you want [lo, hi).
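
To make the boundary behaviour concrete, here is a small sketch with uneven limits and scores that sit exactly on them. The function name count_by_limits and the sample data are assumptions for illustration. bisect is an alias for bisect_right, which is spelled out here for clarity.

```python
from bisect import bisect_left, bisect_right

def count_by_limits(scores, limits, find=bisect_right):
    """Count scores per bin, where `limits` holds each bin's upper limit."""
    scores = sorted(scores)
    counts, last = [], 0
    for range_max in limits:
        # Search only from `last` onward; earlier items are already counted.
        i = find(scores, range_max, last)
        counts.append(i - last)
        last = i
    return counts

scores = [5, 20, 20, 35, 60, 75, 100]
limits = [20, 60, 100]  # uneven bin widths

closed_right = count_by_limits(scores, limits)              # (lo, hi]
closed_left = count_by_limits(scores, limits, bisect_left)  # [lo, hi)
print(closed_right)  # [3, 2, 2]
print(closed_left)   # [1, 3, 2]
```

With bisect_left the two scores of 20 move into the second bin, 60 moves into the third, and the score of 100 is not counted at all because it is not strictly below the last limit.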

Second, bisect is faster; see the timings below. I've replaced scores.sort() with the slower sorted(scores) because sorting is the slowest step and I didn't want to bias the times with a pre-sorted array. The OP says his/her array is already sorted, so bisect could make even more sense in that case.

setup="""
from bisect import bisect_left
import random
from collections import Counter

def histogram(iterable, low, high, bins):
    step = (high - low) / bins
    dist = Counter(((x - low + 0.) // step for x in iterable))
    return [dist[b] for b in range(bins)]

def histogram_bisect(scores, groups):
    scores = sorted(scores)
    counts = []
    last = 0
    for range_max in groups:
        i = bisect_left(scores, range_max, last)
        counts.append(i - last)
        last = i
    return counts

def histogram_simple(scores, bin_size):
    scores = [i//bin_size for i in scores]
    return [scores.count(i) for i in range(max(scores)+1)]

scores = [random.randint(0,100) for _ in range(100)]
bins = range(10, 101, 10)
"""
from timeit import repeat
t = repeat('C = histogram(scores, 0, 100, 10)', setup=setup, number=10000)
print(min(t))
# 0.95 s (original poster's run)
t = repeat('C = histogram_bisect(scores, bins)', setup=setup, number=10000)
print(min(t))
# 0.22 s (original poster's run)
t = repeat('histogram_simple(scores, 10)', setup=setup, number=10000)
print(min(t))
# 0.36 s (original poster's run)