Question
Suppose that I have a structured array of students (strings) and test scores (ints), where each entry is the score that a specific student received on a specific test. Each student has multiple entries in this array, naturally.
Example
import numpy
grades = numpy.array([('Mary', 96), ('John', 94), ('Mary', 88), ('Edgar', 89), ('John', 84)],
                     dtype=[('student', 'U50'), ('score', 'i')])
print(grades)
#[('Mary', 96) ('John', 94) ('Mary', 88) ('Edgar', 89) ('John', 84)]
How do I easily compute the average score of each student? In other words, how do I take the mean of the array in the 'score' dimension? I'd like to do
grades.mean('score')
and have Numpy return
[('Mary', 92), ('John', 89), ('Edgar', 89)]
but Numpy complains
TypeError: an integer is required
Is there a Numpy-esque way to do this easily? I think it might involve taking a view of the structured array with a different dtype. Any help would be appreciated. Thanks.
Edit
>>> grades = numpy.zeros(5, dtype=[('student', 'U50'), ('score', 'i'), ('testid', 'i')])
>>> grades[0] = ('Mary', 96, 1)
>>> grades[1] = ('John', 94, 1)
>>> grades[2] = ('Mary', 88, 2)
>>> grades[3] = ('Edgar', 89, 1)
>>> grades[4] = ('John', 84, 2)
>>> numpy.mean(grades, 'testid')
TypeError: an integer is required
Answer 1:
NumPy isn't designed to be able to group rows together and apply aggregate functions to those groups. You could:
- use itertools.groupby and reconstruct the array;
- use Pandas, which is based on NumPy and is great at grouping; or
- add another dimension to the array for the test id (so this case would be a 2x3 array, because it looks like there were two tests).
Here's the itertools solution, but as you can see it's quite complicated and inefficient. I'd recommend one of the other two methods.
from itertools import groupby
from operator import itemgetter
import numpy as np
np.array([(k, np.array(list(g), dtype=grades.dtype).view(np.recarray)['score'].mean())
          for k, g in groupby(np.sort(grades, order='student').view(np.recarray),
                              itemgetter('student'))], dtype=grades.dtype)
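As a rough illustration of the third option (an extra dimension for the test id), here is a minimal sketch. The layout is an assumption, not something from the question: rows are tests, columns are students, and a missing score is stored as NaN so that np.nanmean can skip it. The names students, scores and student_means are hypothetical.

```python
import numpy as np

# Hypothetical layout: rows are tests (testid 1 and 2), columns are students.
students = np.array(['Mary', 'John', 'Edgar'])
scores = np.array([[96.0, 94.0, 89.0],      # test 1
                   [88.0, 84.0, np.nan]])   # test 2 (Edgar has no score here)

# Average over the test axis, ignoring missing scores.
student_means = np.nanmean(scores, axis=0)

for name, mean in zip(students, student_means):
    print(name, mean)
```

This only works if the scores are stored as floats, since integers cannot hold NaN.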
Answer 2:
matplotlib.mlab.rec_groupby was exactly what I was looking for.
Answer 3:
A slightly faster and simpler solution based on itertools, without using view(), is
from itertools import groupby
[(k, grades['score'][list(g)].mean()) for k, g in groupby(numpy.argsort(grades), grades['student'].__getitem__)]
This is the same idea as ecatmur's answer, but it works in terms of indices, using argsort() instead of sort().
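The index-based idea can also be expressed without itertools. The sketch below is a different technique from the groupby one-liner above: it assumes np.unique with return_inverse=True to label each row with a group index, and np.bincount to sum and count per group. The 'U50' dtype and the names names, inverse, totals, counts, means and result are choices made for this example only.

```python
import numpy as np

grades = np.array([('Mary', 96), ('John', 94), ('Mary', 88), ('Edgar', 89), ('John', 84)],
                  dtype=[('student', 'U50'), ('score', 'i4')])

# names: sorted unique students; inverse: for each row, the index of its student in names
names, inverse = np.unique(grades['student'], return_inverse=True)

# Sum of scores and number of entries per student, then the per-student mean
totals = np.bincount(inverse, weights=grades['score'])
counts = np.bincount(inverse)
means = totals / counts

result = list(zip(names.tolist(), means.tolist()))
print(result)
```

Note that the students come back in sorted order rather than in order of first appearance.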
Answer 4:
collapseByField(grades, 'student') gives what you want, after defining:
import numpy as np
import matplotlib.mlab as mlab  # rec_groupby is not available in recent Matplotlib releases

def collapseByField(e, collapsefield, keepFields=None, agg=None):
    assert isinstance(e, np.ndarray)  # structured array
    if agg is None:
        agg = np.mean
    if keepFields is None:
        # by default, aggregate every field except the one being grouped on
        keepFields = [n for n in e.dtype.names if n != collapsefield]
    newf = [(n, agg, n) for n in keepFields]
    return mlab.rec_groupby(e, [collapsefield], newf)
Source: https://stackoverflow.com/questions/11989164/numpy-mean-structured-array