So basically I want to count how many of the floating-point values in a given list fall into each of several ranges. For example: a list of grades (all scores out of 100) is entered by the user, and the grades are sorted into groups of ten. How many times do scores from 0-10, 10-20, 20-30, etc. appear? Like a test score distribution. I know I can use the count function, but since I'm not looking for specific numbers I'm having trouble. Is there a way to combine count and range? Thanks for any help.
To group the data, divide each value by the interval width. To count the number in each group, consider using collections.Counter. Here's a worked-out example with documentation and a test:
from collections import Counter

def histogram(iterable, low, high, bins):
    '''Count elements from the iterable into evenly spaced bins

    >>> scores = [82, 85, 90, 91, 70, 87, 45]
    >>> histogram(scores, 0, 100, 10)
    [0, 0, 0, 0, 1, 0, 0, 1, 3, 2]

    '''
    # Width of one bin, forced to a float so the division is exact
    step = (high - low + 0.0) / bins
    # Floor-divide each offset by the bin width to get its bin index
    dist = Counter((float(x) - low) // step for x in iterable)
    return [dist[b] for b in range(bins)]

if __name__ == '__main__':
    import doctest
    print(doctest.testmod())
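One edge case worth checking in the function above: a score exactly equal to high (for example 100) maps to bin index 10, which the final list comprehension never reads, so it is silently dropped. Below is a minimal sketch of one way to handle that, assuming you want the last bin to be closed on the right, which is also how numpy.histogram treats its last bin. The name histogram_closed is made up for illustration and is not part of the answer above.

```python
from collections import Counter

def histogram_closed(iterable, low, high, bins):
    """Like histogram(), but a value equal to `high` lands in the last bin.

    Hypothetical variant for illustration; values outside [low, high]
    are ignored.
    """
    step = (high - low) / float(bins)
    dist = Counter()
    for x in iterable:
        if low <= x <= high:
            # Clamp so that x == high falls in bin (bins - 1), not bin `bins`.
            index = min(int((x - low) // step), bins - 1)
            dist[index] += 1
    return [dist[b] for b in range(bins)]

scores = [82, 85, 90, 91, 70, 87, 45, 100]
print(histogram_closed(scores, 0, 100, 10))
# prints [0, 0, 0, 0, 1, 0, 0, 1, 3, 3]
```

With the original histogram() the same input would report only two scores in the last bin, because the 100 is not counted anywhere.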
If you are fine with using the external library NumPy, then you just need to call numpy.histogram():
>>> import numpy
>>> data = [82, 85, 90, 91, 70, 87, 45]
>>> counts, bins = numpy.histogram(data, bins=10, range=(0, 100))
>>> counts
array([0, 0, 0, 0, 1, 0, 0, 1, 3, 2])
>>> bins
array([ 0., 10., 20., 30., 40., 50., 60., 70., 80.,
       90., 100.])
decs = [int(x/10) for x in scores]
maps scores from 0-9 -> 0, 10-19 -> 1, et cetera. Then just count the occurrences of 0, 1, 2, 3, and so on (via something like collections.Counter), and map back to ranges from there.
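To make the counting and the "map back to ranges" step concrete, here is a small sketch of that idea. It uses floor division (x // 10) rather than int(x/10) so that it behaves the same on Python 2 and Python 3 for non-negative scores. The sample scores and the label format are assumptions chosen for illustration.

```python
from collections import Counter

scores = [82, 85.5, 90, 91, 70, 87, 45]

# Bucket index for each score: 0-9 -> 0, 10-19 -> 1, and so on.
decs = [int(x // 10) for x in scores]
counts = Counter(decs)

# Map each bucket index back to the half-open range it stands for.
for dec in sorted(counts):
    print("[%d, %d): %d" % (dec * 10, dec * 10 + 10, counts[dec]))
# prints:
# [40, 50): 1
# [70, 80): 1
# [80, 90): 3
# [90, 100): 2
```

Note that a score of exactly 100 gets its own bucket (index 10) with this scheme, and buckets with no scores simply do not appear in the Counter.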
This method uses bisect, which can be more efficient, but it requires that you sort the scores first.
from bisect import bisect
import random

scores = [random.randint(0, 100) for _ in range(100)]
bins = [20, 40, 60, 80, 100]  # upper limit of each bin

scores.sort()
counts = []

last = 0
for range_max in bins:
    # Index just past the last score <= range_max, searching from `last` on
    i = bisect(scores, range_max, last)
    counts.append(i - last)
    last = i
I wouldn't expect you to install numpy just for this, but if you already have numpy you can use numpy.histogram.
UPDATE
First, using bisect is more flexible. Using [i//n for i in scores] requires that all the bins are the same size. Using bisect allows the bins to have arbitrary limits. Also, i//n means the ranges are [lo, hi). Using bisect the ranges are (lo, hi], but you can use bisect_left if you want [lo, hi).
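The difference between the two interval conventions only shows up for scores that sit exactly on a bin limit. Here is a small sketch with made-up scores showing how the same data lands in different bins depending on which bisect function is used. The helper name count_bins is hypothetical; it just wraps the loop from the snippet above so the search function can be swapped.

```python
from bisect import bisect_left, bisect_right

def count_bins(scores, limits, find):
    """Count sorted scores per bin, using `find` to locate each limit."""
    counts = []
    last = 0
    for limit in limits:
        i = find(scores, limit, last)
        counts.append(i - last)
        last = i
    return counts

scores = sorted([5, 20, 20, 35, 40, 99])
limits = [20, 40, 100]  # unevenly sized bins are fine

# bisect is an alias for bisect_right: bins are (lo, hi],
# so the two 20s count in the first bin.
print(count_bins(scores, limits, bisect_right))  # prints [3, 2, 1]

# bisect_left: bins are [lo, hi), so the 20s move to the second bin
# and the 40 moves to the third.
print(count_bins(scores, limits, bisect_left))   # prints [1, 3, 2]
```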
Second, bisect is faster; see the timings below. I've replaced scores.sort() with the slower sorted(scores) because the sorting is the slowest step and I didn't want to bias the times with a pre-sorted array. The OP says his/her array is already sorted, so bisect could make even more sense in that case.
setup = """
from bisect import bisect_left
import random
from collections import Counter

def histogram(iterable, low, high, bins):
    step = (high - low) / bins
    dist = Counter(((x - low + 0.) // step for x in iterable))
    return [dist[b] for b in range(bins)]

def histogram_bisect(scores, groups):
    scores = sorted(scores)
    counts = []
    last = 0
    for range_max in groups:
        i = bisect_left(scores, range_max, last)
        counts.append(i - last)
        last = i
    return counts

def histogram_simple(scores, bin_size):
    scores = [i // bin_size for i in scores]
    return [scores.count(i) for i in range(max(scores) + 1)]

scores = [random.randint(0, 100) for _ in range(100)]
bins = range(10, 101, 10)
"""

from timeit import repeat

t = repeat('C = histogram(scores, 0, 100, 10)', setup=setup, number=10000)
print(min(t))
# 0.95

t = repeat('C = histogram_bisect(scores, bins)', setup=setup, number=10000)
print(min(t))
# 0.22

t = repeat('histogram_simple(scores, 10)', setup=setup, number=10000)
print(min(t))
# 0.36
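Since sorting the scores is the slowest step in histogram_bisect, one variation is to leave the scores unsorted and bisect each score against the sorted list of bin limits instead. This is only a sketch of that alternative, not one of the functions timed above, and it was not measured here, so no speed claim is made for it. The name histogram_by_limits is made up for illustration.

```python
from bisect import bisect_left

def histogram_by_limits(scores, limits):
    """Count unsorted scores into (lo, hi] bins given sorted upper limits."""
    counts = [0] * len(limits)
    for score in scores:
        # Index of the first limit that is >= score
        i = bisect_left(limits, score)
        if i < len(limits):  # scores above the last limit are skipped
            counts[i] += 1
    return counts

scores = [99, 20, 5, 40, 35, 20]  # deliberately unsorted
limits = [20, 40, 100]
print(histogram_by_limits(scores, limits))  # prints [3, 2, 1]
```

Swapping bisect_left for bisect_right here turns the bins into [lo, hi) intervals, which is the opposite of the effect the swap has when bisecting the sorted scores.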
Source: https://stackoverflow.com/questions/9543935/python-count-occurrences-of-certain-ranges-in-a-list