Groupby on columns with overlapping groups

强颜欢笑 提交于 2019-12-11 16:12:37

问题


Continuing from my previous question.

This produces a dafatrame with 81 columns and filled with random numbers:

import pandas as pd
import itertools
import numpy as np

col = "A,B,C".split(',')
col1 = "1,2,3,4,5,6,7,8,9".split(',')
col2 = "E,F,G".split(',')

all_dims = [col, col1, col2]
all_keys = ['.'.join(i) for i in itertools.product(*all_dims)]
rng = pd.date_range(end=pd.Timestamp.today().date(), periods=12, freq='M')
df = pd.DataFrame(np.random.randint(0, 1000, size=(len(rng), len(all_keys))), columns=all_keys, index=rng)

A dataframe with these 81 columns names:

 ['A.1.E', 'A.1.F', 'A.1.G', 'A.2.E', 'A.2.F', 'A.2.G', 'A.3.E', 'A.3.F', 'A.3.G', 'A.4.E', 
'A.4.F', 'A.4.G', 'A.5.E', 'A.5.F', 'A.5.G', 'A.6.E','A.6.F', 'A.6.G', 'A.7.E', 'A.7.F', 
'A.7.G', 'A.8.E', 'A.8.F', 'A.8.G', 'A.9.E', 'A.9.F', 'A.9.G', 'B.1.E', 'B.1.F', 'B.1.G', 
'B.2.E', 'B.2.F', 'B.2.G', 'B.3.E', 'B.3.F', 'B.3.G', 'B.4.E', 'B.4.F', 'B.4.G', 'B.5.E', 
'B.5.F', 'B.5.G', 'B.6.E', 'B.6.F', 'B.6.G', 'B.7.E', 'B.7.F', 'B.7.G', 'B.8.E', 'B.8.F', 
'B.8.G', 'B.9.E', 'B.9.F', 'B.9.G', 'C.1.E', 'C.1.F', 'C.1.G', 'C.2.E', 'C.2.F', 'C.2.G', 
'C.3.E', 'C.3.F', 'C.3.G', 'C.4.E', 'C.4.F', 'C.4.G', 'C.5.E', 'C.5.F', 'C.5.G', 'C.6.E', 
'C.6.F', 'C.6.G', 'C.7.E', 'C.7.F', 'C.7.G', 'C.8.E', 'C.8.F', 'C.8.G', 'C.9.E', 'C.9.F','C.9.G']

Using the solution from my previous question I know how to make a grouper function that for example takes all 'A.*.E' columns (that have any number in the middle), sums them and produces a named output column called 'A.SUM.E'. And then does the same for 'A.*.F', 'A.*.G' and so on:

def grouper(col):
    c = col.split('.')
    return f'{c[0]}.SUM.{c[-1]}'

df.groupby(grouper, axis=1).sum()

My question is can a function be written in a way to produce overlapping groups also? For example, can a grouper be made that produces a SUM1 for ('A.1.E' + 'A.2.E') and a SUM2 for ('A.1.E' + 'A.3.E'), so the column 'A.1.E' would appear in both sums.


回答1:


This won't be possible with groupby as any given column can only be in one group, not in multiple groups. To solve this case you can modify my first solution to the previous question:

cols = sorted([(x[0],x[1]) for x in set([(x.split('.')[0], x.split('.')[-1]) for x in df.columns])])
for c0,c1 in cols:
    for n in range(2,10):
        df[f'{c0}.SUM{n}.{c1}'] = df.filter(regex = f'{c0}\.(1|{n})\.{c1}').sum(axis=1)

(This will work for single-digit numbers (1 through 9) in the column headers as per your example. If there are numbers > 9 you'll have to modify the regex accordingly.)



来源:https://stackoverflow.com/questions/57687349/groupby-on-columns-with-overlapping-groups

易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!