
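Contents

Chapter 3: Grouping

I. The SAC process

1. What it means
2. The apply step

II. The groupby function

1. Basics of grouping
2. Characteristics of groupby objects

III. Aggregation, filtration and transformation

1. Aggregation
2. Filtration
3. Transformation

IV. The apply function

1. Flexibility of apply
2. Computing several statistics at once with apply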

Chapter 3: Grouping

import numpy as np
import pandas as pd
df = pd.read_csv('data/table.csv',index_col='ID')
df

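I. The SAC process

1. What it means

1. SAC stands for the split-apply-combine process that underlies grouping operations.
2. split divides the data into groups according to some rule, apply runs a function on each group independently, and combine assembles the per-group results into a single data structure.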

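2. The apply step

In practice, the apply step usually falls into one of four kinds of problems:
1. Aggregation: computing a statistic per group (for example the group mean or the number of elements in each group)
2. Transformation: operating on every element within its group (for example standardizing the values)
3. Filtration: selecting whole groups by some rule (for example keeping the groups in which some indicator is below 50)
4. Combined problems: a mixture of the three kinds above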
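The three basic kinds of problems can be sketched on a small table. The data below is a hypothetical toy example (it is not the chapter's table.csv); only the column names School and Math are borrowed from it.

```python
import pandas as pd

# Hypothetical toy data; column names mirror those used in this chapter.
toy = pd.DataFrame({
    'School': ['S_1', 'S_1', 'S_2', 'S_2', 'S_2'],
    'Math': [40.0, 60.0, 30.0, 50.0, 70.0],
})
g = toy.groupby('School')['Math']

# Aggregation: one value per group.
agg_result = g.mean()

# Transformation: one value per original row (deviation from the group mean).
trans_result = g.transform(lambda s: s - s.mean())

# Filtration: keep every row of the groups whose minimum Math score is below 40.
filt_result = toy.groupby('School').filter(lambda d: d['Math'].min() < 40)

print(agg_result)
print(trans_result)
print(filt_result)
```

Aggregation shrinks the data to one row per group, transformation keeps the original shape, and filtration returns a subset of the original rows.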

II. The groupby function

1. Basics of grouping

(a) Grouping by a single column

grouped_single = df.groupby('School')

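Calling groupby produces a groupby object. The object displays no result by itself; computation only happens when one of its methods is called

For example, to retrieve a single group: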

grouped_single.get_group('S_2').head()

(b) Grouping by several columns

grouped_mul = df.groupby(['School','Class'])
grouped_mul.get_group(('S_1','C_3'))

(c) Group sizes and number of groups

grouped_single.size()

grouped_mul.size()

grouped_single.ngroups

grouped_mul.ngroups
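A minimal sketch of what size and ngroups return, on hypothetical toy data (not the chapter's dataset):

```python
import pandas as pd

# Hypothetical toy data; column names mirror those used in this chapter.
toy = pd.DataFrame({
    'School': ['S_1', 'S_1', 'S_2', 'S_2', 'S_2'],
    'Class': ['C_1', 'C_2', 'C_1', 'C_1', 'C_2'],
})
single = toy.groupby('School')
multi = toy.groupby(['School', 'Class'])

sizes_single = single.size()  # Series indexed by School
sizes_multi = multi.size()    # Series indexed by (School, Class)

print(sizes_single)
print(sizes_multi)
print(single.ngroups, multi.ngroups)
```

size gives the number of rows in each group, while ngroups is a single integer: the number of distinct groups.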

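(d) Iterating over groups

for name,group in grouped_single:
    print(name)
    display(group.head())  # display is available in Jupyter/IPython

(e) The level parameter (for multi-level indexes) and the axis parameter

# axis=0 is the default; recent pandas releases deprecate the axis argument of groupby
df.set_index(['Gender','School']).groupby(level=1,axis=0).get_group('S_1')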
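As a sketch of grouping by an index level, the hypothetical toy table below (not the chapter's dataset) shows that level accepts either the position or the name of the level. The axis argument is left out, since rows are grouped by default.

```python
import pandas as pd

# Hypothetical toy data; column names mirror those used in this chapter.
toy = pd.DataFrame({
    'Gender': ['M', 'F', 'M', 'F'],
    'School': ['S_1', 'S_1', 'S_2', 'S_2'],
    'Math': [40.0, 60.0, 30.0, 50.0],
})
indexed = toy.set_index(['Gender', 'School'])

# level can be given as a position (1) or as a level name ('School').
by_position = indexed.groupby(level=1)['Math'].mean()
by_name = indexed.groupby(level='School')['Math'].mean()

print(by_name)
```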

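2. Characteristics of groupby objects

(a) Listing all callable methods
The list printed below shows that a groupby object exposes a large number of methods, which makes it very flexible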

print([attr for attr in dir(grouped_single) if not attr.startswith('_')])

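(b) head and first on a grouped object
Calling head on a grouped object returns the first few rows of every group, not the first few rows of the whole dataset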

grouped_single.head(3)

first returns the first entry of every group (the first non-null value in each column), indexed by the group key

grouped_single.first()
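The difference between head and first is easiest to see on a small example. The table below is hypothetical toy data, not the chapter's dataset.

```python
import pandas as pd

# Hypothetical toy data with the two schools interleaved.
toy = pd.DataFrame({
    'School': ['S_1', 'S_2', 'S_1', 'S_2', 'S_1'],
    'Math': [40, 30, 60, 50, 80],
})
g = toy.groupby('School')

head_rows = g.head(2)   # first two rows of each group, original index kept
first_rows = g.first()  # one row per group, indexed by School

print(head_rows)
print(first_rows)
```

head keeps the original row labels and the grouping column, whereas first collapses each group to a single row labelled by the group key.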

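(c) What to group by

groupby is very liberal about the grouping key: any list-like with the same length as the DataFrame will do, and grouping by a function is supported as well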

np.random.choice(['a','b','c'],df.shape[0])

df.groupby(np.random.choice(['a','b','c'],df.shape[0])).get_group('b').head()
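# equivalent to adding np.random.choice(['a','b','c'],df.shape[0]) as a new column and grouping by it

When a function is used as the key, it is called with each index label, as the printed output of the next statement shows; this makes some fairly elaborate groupings possible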

df[:5].groupby(lambda x:print(x)).head(0)

Grouping by odd and even rows

# rows are counted from 1, so position 0 is an odd row
df.groupby(lambda x:'odd row' if df.index.get_loc(x)%2==0 else 'even row').groups
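A minimal sketch of function-based grouping, assuming a hypothetical toy table whose index holds integer IDs. The function receives one index label at a time and returns the group that label belongs to.

```python
import pandas as pd

# Hypothetical toy data; the index plays the role of the ID column.
toy = pd.DataFrame({'Math': [40, 60, 30, 50]},
                   index=pd.Index([1101, 1102, 1103, 1104], name='ID'))

# The lambda is called once per index label (an integer ID here).
parity = toy.groupby(lambda idx: 'even ID' if idx % 2 == 0 else 'odd ID')['Math'].sum()

print(parity)
```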

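With a multi-level index, the argument passed to the lambda is a tuple. The code below checks, for boys and girls at each of the two schools, whether the mean Math score is a pass

Note: this only demonstrates how groupby works; you would not write it this way in practice

math_score = df.set_index(['Gender','School'])['Math'].sort_index()
grouped_score = df.set_index(['Gender','School']).sort_index().groupby(lambda x:(x,'mean is a pass' if math_score[x].mean()>=60 else 'mean is a fail'))
for name,_ in grouped_score:
    print(name)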

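(d) The [] operation on groupby objects
[] selects one or several columns from a groupby object, so the mean-score comparison above can be written concisely as: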

df.groupby(['Gender','School'])['Math'].mean()>=60

A list selects several columns:

df.groupby(['Gender','School'])[['Math','Height']].mean()

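(e) Grouping by a continuous variable

For example, grouping Math scores with the cut function:

bins = [0,40,60,80,90,100]
cuts = pd.cut(df['Math'],bins=bins) # the optional labels argument adds custom labels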
df.groupby(cuts)['Math'].count()
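The labels argument of pd.cut can be sketched as follows. The scores, bin edges and label names are hypothetical, chosen only for illustration.

```python
import pandas as pd

# Hypothetical scores and labels, for illustration only.
scores = pd.Series([35, 55, 60, 75, 95], name='Math')
bins = [0, 60, 80, 100]
labels = ['fail', 'pass', 'excellent']

# Intervals are right-closed by default: (0, 60], (60, 80], (80, 100].
levels = pd.cut(scores, bins=bins, labels=labels)

# observed=False keeps every category in the result, even an empty one.
counts = scores.groupby(levels, observed=False).count()

print(counts)
```

Because the intervals are closed on the right, a score of exactly 60 falls into the first bin.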

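III. Aggregation, filtration and transformation

1. Aggregation

(a) Common aggregation functions

Aggregation reduces a set of values to a single scalar, so mean/sum/size/count/std/var/sem/describe/first/last/nth/min/max are all aggregation functions (describe returns several such statistics per group)

As an exercise, we can verify the standard error function sem, which is computed as \frac{\text{group standard deviation}}{\sqrt{\text{group size}}}: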

group_m = grouped_single['Math']
group_m.std().values/np.sqrt(group_m.count().values)== group_m.sem().values
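Comparing floating-point results with == can report False because of rounding, even when the formula is right. A more tolerant check, sketched here on hypothetical toy data, uses np.allclose.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; column names mirror those used in this chapter.
toy = pd.DataFrame({
    'School': ['S_1', 'S_1', 'S_1', 'S_2', 'S_2', 'S_2'],
    'Math': [40.0, 60.0, 80.0, 30.0, 50.0, 75.0],
})
g = toy.groupby('School')['Math']

# Standard error computed by hand: group std divided by sqrt(group size).
manual = g.std() / np.sqrt(g.count())
builtin = g.sem()

# Compare with a tolerance instead of exact equality.
same = np.allclose(manual.to_numpy(), builtin.to_numpy())
print(same)
```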

(b) Applying several aggregation functions at once

group_m.agg(['sum','mean','std'])

Renaming with tuples

group_m.agg([('rename_sum','sum'),('rename_mean','mean')])

Specifying which functions apply to which columns

grouped_mul.agg({'Math':['mean','max'],'Height':'var'})

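(c) Using custom functions

grouped_single['Math'].agg(lambda x:print(x.head(),'separator'))
# the output shows that agg passes the data to the function group by group, one column at a time; many things can be built on this behavior

pandas has no built-in function for the range (maximum minus minimum), but the within-group range is easy to compute with agg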

grouped_single['Math'].agg(lambda x:x.max()-x.min())

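(d) Multiple aggregations with NamedAgg

Note: depending on the pandas version, lambda functions may not be accepted here; functions defined separately with def work

def R1(x):
    return x.max()-x.min()
def R2(x):
    return x.max()-x.median()
# NamedAgg takes the source column and the function; each keyword names an output column
grouped_single.agg(min_score1=pd.NamedAgg(column='Math', aggfunc=R1),
                   max_score1=pd.NamedAgg(column='Math', aggfunc='max'),
                   range_score2=pd.NamedAgg(column='Math', aggfunc=R2)).head()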
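pd.NamedAgg is a namedtuple, so plain (column, function) tuples can be used in its place. The sketch below uses hypothetical toy data and a hypothetical helper score_range; the output column names are chosen for illustration.

```python
import pandas as pd

# Hypothetical toy data; column names mirror those used in this chapter.
toy = pd.DataFrame({
    'School': ['S_1', 'S_1', 'S_1', 'S_2', 'S_2', 'S_2'],
    'Math': [40.0, 60.0, 80.0, 30.0, 50.0, 70.0],
    'Height': [160.0, 170.0, 180.0, 150.0, 160.0, 170.0],
})

def score_range(s):
    """Range of a column within one group (hypothetical helper)."""
    return s.max() - s.min()

# Each keyword is an output column; each value is (source column, function).
summary = toy.groupby('School').agg(
    math_range=('Math', score_range),
    math_max=('Math', 'max'),
    height_mean=('Height', 'mean'),
)

print(summary)
```

Naming the output columns after what they actually contain keeps the result easier to read than generic names.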

(e) Aggregation functions with parameters

Check whether each group has at least one Math score between 50 and 52:

def f(s,low,high):
    return s.between(low,high).max()
grouped_single['Math'].agg(f,50,52)

To use several functions when at least one of them takes parameters, wrap it:

def f_test(s,low,high):
    return s.between(low,high).max()
def agg_f(f_mul,name,*args,**kwargs):
    def wrapper(x):
        return f_mul(x,*args,**kwargs)
    wrapper.__name__ = name
    return wrapper
new_f = agg_f(f_test,'at_least_one_in_50_52',50,52)
grouped_single['Math'].agg([new_f,'mean']).head()
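One possible alternative to a hand-written wrapper is functools.partial combined with keyword-style naming of the outputs. This is a sketch on hypothetical toy data, not the approach used above; any_between is a hypothetical helper.

```python
from functools import partial

import pandas as pd

# Hypothetical toy data; column names mirror those used in this chapter.
toy = pd.DataFrame({
    'School': ['S_1', 'S_1', 'S_1', 'S_2', 'S_2', 'S_2'],
    'Math': [40.0, 51.0, 80.0, 30.0, 60.0, 90.0],
})

def any_between(s, low, high):
    """True if at least one value lies in [low, high] (hypothetical helper)."""
    return s.between(low, high).any()

# partial fixes low and high; each keyword names an output column.
result = toy.groupby('School')['Math'].agg(
    any_in_50_52=partial(any_between, low=50, high=52),
    mean='mean',
)

print(result)
```

The keyword names take over the role of wrapper.__name__, so no manual renaming is needed.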

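2. Filtration

The filter function selects whole groups (remember that the result contains every row of the selected groups), so the function passed to it should return a boolean scalar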

grouped_single[['Math','Physics']].filter(lambda x:(x['Math']>32).all()).head()

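3. Transformation

(a) What gets passed in

The function passed to transform receives one column of a group at a time, and its return value must have exactly the same length as that column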

grouped_single[['Math','Height']].transform(lambda x:x-x.min()).head()

If a scalar is returned, it is broadcast to every element of the group

grouped_single[['Math','Height']].transform(lambda x:x.mean()).head()

(b) Within-group standardization with transform

grouped_single[['Math','Height']].transform(lambda x:(x-x.mean())/x.std()).head()
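A quick sanity check for within-group standardization is that every group ends up with mean 0 and standard deviation 1. The sketch below verifies this on hypothetical toy data.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; column names mirror those used in this chapter.
toy = pd.DataFrame({
    'School': ['S_1', 'S_1', 'S_1', 'S_2', 'S_2', 'S_2'],
    'Math': [40.0, 60.0, 80.0, 30.0, 50.0, 70.0],
    'Height': [160.0, 170.0, 180.0, 150.0, 160.0, 170.0],
})

# Standardize each column within its School group.
z = toy.groupby('School')[['Math', 'Height']].transform(
    lambda s: (s - s.mean()) / s.std())

# Regroup the standardized values by School and inspect mean and std.
z_mean = z.groupby(toy['School']).mean()
z_std = z.groupby(toy['School']).std()

print(z_mean)
print(z_std)
```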

(c) Filling missing values with the group mean using transform

df_nan = df[['Math','School']].copy().reset_index()
df_nan.loc[np.random.randint(0,df.shape[0],25),['Math']]=np.nan
df_nan.head()

df_nan.groupby('School').transform(lambda x: x.fillna(x.mean())).join(df.reset_index()['School']).head()
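Because the missing positions above are chosen at random, the result changes from run to run. A deterministic sketch on hypothetical toy data makes the fill values easy to check by hand; selecting the Math column before transform also avoids the extra join.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with two missing Math scores.
toy = pd.DataFrame({
    'School': ['S_1', 'S_1', 'S_1', 'S_2', 'S_2'],
    'Math': [40.0, np.nan, 80.0, 30.0, np.nan],
})

filled = toy.copy()
# Within each School, replace NaN by the mean of the non-missing scores.
filled['Math'] = toy.groupby('School')['Math'].transform(
    lambda s: s.fillna(s.mean()))

print(filled)
```

The missing score in S_1 becomes 60 (the mean of 40 and 80) and the one in S_2 becomes 30.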

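IV. The apply function

1. Flexibility of apply

apply is probably the most widely used of all the grouping functions, thanks to its flexibility:

As for the input, the printed output of the next statement shows that each group is passed to apply as a DataFrame: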

df.groupby('School').apply(lambda x:print(x.head(1)))

Much of the flexibility of apply comes from the variety of values it can return:

① Scalar return values (one per column)

df[['School','Math','Height']].groupby('School').apply(lambda x:x.max())

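② List-like return values

# select the numeric columns after grouping so that the string column School is not subtracted
df.groupby('School')[['Math','Height']].apply(lambda x:x-x.min()).head()

③ DataFrame return values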

df[['School','Math','Height']].groupby('School')\
.apply(lambda x:pd.DataFrame({'col1':x['Math']-x['Math'].max(),
'col2':x['Math']-x['Math'].min(),
'col3':x['Height']-x['Height'].max(),
'col4':x['Height']-x['Height'].min()})).head()

2. Computing several statistics at once with apply

OrderedDict makes this kind of summary quick to write:

from collections import OrderedDict
def f(df):
    data = OrderedDict()
    data['M_sum'] = df['Math'].sum()
    data['W_var'] = df['Weight'].var()
    data['H_mean'] = df['Height'].mean()
    return pd.Series(data)
grouped_single.apply(f)
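Since Python 3.7 a plain dict preserves insertion order, so the same pattern can be sketched without OrderedDict. The data and the helper summarize below are hypothetical; the statistic names follow the ones above.

```python
import pandas as pd

# Hypothetical toy data; column names mirror those used in this chapter.
toy = pd.DataFrame({
    'School': ['S_1', 'S_1', 'S_1', 'S_2', 'S_2', 'S_2'],
    'Math': [40.0, 60.0, 80.0, 30.0, 50.0, 70.0],
    'Height': [160.0, 170.0, 180.0, 150.0, 160.0, 170.0],
})

def summarize(group):
    """Return several statistics of one group as a Series (hypothetical helper)."""
    return pd.Series({
        'M_sum': group['Math'].sum(),
        'H_mean': group['Height'].mean(),
    })

# Selecting the needed columns first keeps the grouping column out of each group.
stats = toy.groupby('School')[['Math', 'Height']].apply(summarize)

print(stats)
```

Each returned Series becomes one row of the result, with the Series labels as column names.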

Code and data: https://github.com/XiangLinPro/pandas
