Question
I have a pandas DataFrame in which column B contains NumPy arrays of a fixed size.
|------|---------------|-------|
| A | B | C |
|------|---------------|-------|
| 0 | [2,3,5,6] | X |
|------|---------------|-------|
| 1 | [1,2,3,4] | X |
|------|---------------|-------|
| 2 | [2,3,6,5] | Y |
|------|---------------|-------|
| 3 | [2,3,2,3] | Y |
|------|---------------|-------|
| 4 | [2,3,4,4] | Y |
|------|---------------|-------|
| 5 | [2,3,5,6] | Z |
|------|---------------|-------|
I want to group the rows by column 'C' and compute the element-wise average of the values in 'B' as a list, as shown in the table below. I want to do this efficiently.
|----------------|-------|
| B | C |
|----------------|-------|
| [1.5,2.5,4,5] | X |
|----------------|-------|
| [2,3,4,4] | Y |
|----------------|-------|
| [2,3,5,6] | Z |
|----------------|-------|
I have considered splitting the NumPy arrays into individual columns, but that would be my last option.
How can I write a custom aggregate function? Right now column B is treated as non-numeric, and aggregating it raises:
DataError: No numeric types to aggregate
Answer 1:
What you need is possible by converting the values of each group to a 2D array and then using np.mean along axis 0:
f = lambda x: np.mean(np.array(x.tolist()), axis=0)
df2 = df.groupby('C')['B'].apply(f).reset_index()
print (df2)
C B
0 X [1.5, 2.5, 4.0, 5.0]
1 Y [2.0, 3.0, 4.0, 4.0]
2 Z [2.0, 3.0, 5.0, 6.0]
The last-option solution (splitting B into columns) is also possible, but less efficient (thank you @Abhik Sarkar for the test):
df1 = pd.DataFrame(df.B.tolist()).groupby(df['C']).mean()
df2 = pd.DataFrame({'B': df1.values.tolist(), 'C': df1.index})
print (df2)
B C
0 [1.5, 2.5, 4.0, 5.0] X
1 [2.0, 3.0, 4.0, 4.0] Y
2 [2.0, 3.0, 5.0, 6.0] Z
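As a self-contained check, the two approaches above can be run against the sample data from the question. This is a minimal sketch: the way the frame is built here and the names `by_apply`, `wide` and `same` are assumptions for illustration, not part of the original answer.

```python
import numpy as np
import pandas as pd

# Sample frame from the question; B holds fixed-length NumPy arrays.
df = pd.DataFrame({
    "A": range(6),
    "B": [np.array(v) for v in ([2, 3, 5, 6], [1, 2, 3, 4], [2, 3, 6, 5],
                                [2, 3, 2, 3], [2, 3, 4, 4], [2, 3, 5, 6])],
    "C": ["X", "X", "Y", "Y", "Y", "Z"],
})

# Approach 1: stack each group's arrays into a 2D array, average over rows.
by_apply = df.groupby("C")["B"].apply(
    lambda s: np.mean(np.array(s.tolist()), axis=0)
)

# Approach 2: expand B into one column per element, then use the built-in mean.
wide = pd.DataFrame(df["B"].tolist(), index=df.index).groupby(df["C"]).mean()

# Compare the per-group element-wise means from both approaches.
same = all(
    np.allclose(by_apply[key], wide.loc[key].to_numpy()) for key in wide.index
)
print(same)
```

Which variant is faster probably depends on the number of groups and the array length, so it is worth timing both on your own data.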
Answer 2:
Dummy data
import random
import numpy as np
import pandas as pd
size, list_size = 10, 5
data = [{'C': random.randint(95, 100),
         'B': [random.randint(0, 10) for i in range(list_size)]} for j in range(size)]
df = pd.DataFrame(data)
Custom Aggregation Using numpy
unique_C = df.C.unique()
data_calculated = []
axis = 0
for c in unique_C:
    # collect all B lists of this group into a 2D array of shape (n_rows, list_size)
    arr = np.reshape(np.hstack(df[df.C == c]['B']), (-1, list_size))
    mean, std = arr.mean(axis=axis), arr.std(axis=axis)  # other aggregations can also be added
    data_calculated.append(dict(C=c, B_mean=mean, B_std=std))
new_df = pd.DataFrame(data_calculated)
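The same idea can also be written by iterating over the groupby object, so each group is visited once rather than being filtered out of the frame for every unique value. This is a sketch under assumptions: the seeded generator `rng`, the use of `np.stack`, and the name `rows` are illustrative choices, not taken from the answer above.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # arbitrary seed, only for repeatability
size, list_size = 10, 5
df = pd.DataFrame({
    "C": rng.integers(95, 100, size=size),
    # one fixed-length array per row
    "B": list(rng.integers(0, 11, size=(size, list_size))),
})

rows = []
for c, group in df.groupby("C")["B"]:
    arr = np.stack(group.to_list())  # shape: (rows in group, list_size)
    rows.append({"C": c, "B_mean": arr.mean(axis=0), "B_std": arr.std(axis=0)})
new_df = pd.DataFrame(rows)
print(new_df)
```

Other element-wise aggregations (min, max, median) can be added as further keys in the dictionary.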
Source: https://stackoverflow.com/questions/61422670/applying-a-custom-groupby-aggregate-function-to-find-average-of-numpy-array