Priority-based categorization using pandas/Python

执笔经年 2020-12-11 23:09

I have invoice-related data in the DataFrame below, along with lists of codes:

df = pd.DataFrame({
    'invoice':[1,1,2,2,2,3,3,3,4,4,4,5,5,6,6,6,7],
    'code':[10


        
1 Answer
  • 2020-12-11 23:51

    You can use np.select, listing the conditions from highest to lowest priority:

    import numpy as np
    import pandas as pd

    # Milk, Juice, Hot and Dessert are the code lists from the question.
    # Each groupby(...).transform(...) computes a per-invoice result and
    # broadcasts it back to every row of that invoice.
    df['category'] = np.select([
        df.groupby('invoice')['qty'].transform('sum') >= 10,
        df['code'].isin(Milk).groupby(df.invoice).transform('any'),
        (df['qty']*df['code'].isin(Juice)).groupby(df.invoice).transform('sum') == 1,
        (df['qty']*df['code'].isin(Juice)).groupby(df.invoice).transform('sum') > 1,
        df['code'].isin(Hot).groupby(df.invoice).transform('any'),
        df['code'].isin(Dessert).groupby(df.invoice).transform('any')
    ],
        ['Mega','Healthy','OneJuice','ManyJuice','HotLovers','DessertLovers'],
        'Other'
    )
    print(df)
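
To see why the order of the conditions matters, here is a minimal sketch on toy data. The code lists `MILK` and `JUICE` and the three invoices are made up for illustration and are not the lists from the question. `np.select` gives each row the label of the first condition that is true for it, and anything matching no condition gets the default.

```python
import numpy as np
import pandas as pd

# Hypothetical code lists, for illustration only (not the question's lists).
MILK = [201]
JUICE = [202, 203]

toy = pd.DataFrame({
    'invoice': [1, 1, 2, 2, 3],
    'code':    [201, 202, 202, 203, 299],
    'qty':     [1, 1, 1, 1, 1],
})

# Invoice-level results, broadcast back to every row of the invoice.
has_milk = toy['code'].isin(MILK).groupby(toy['invoice']).transform('any')
juice_qty = (toy['qty'] * toy['code'].isin(JUICE)).groupby(toy['invoice']).transform('sum')

# Conditions are checked in order; the first True one decides the label.
toy['category'] = np.select(
    [has_milk, juice_qty == 1, juice_qty > 1],
    ['Healthy', 'OneJuice', 'ManyJuice'],
    default='Other',
)
print(toy)
```

Invoice 1 contains both a milk code and exactly one juice item, yet it ends up as `Healthy` because that condition is listed first. Swapping the first two conditions would label it `OneJuice` instead.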
    

    Output

        invoice  code  qty       category
    0         1   101    2       OneJuice
    1         1   104    1       OneJuice
    2         2   105    1        Healthy
    3         2   101    3        Healthy
    4         2   106    2        Healthy
    5         3   106    4           Mega
    6         3   104    7           Mega
    7         3   101    1           Mega
    8         4   104    1      ManyJuice
    9         4   105    1      ManyJuice
    10        4   111    1      ManyJuice
    11        5   109    4      HotLovers
    12        5   111    2      HotLovers
    13        6   110    1  DessertLovers
    14        6   101    2  DessertLovers
    15        6   114    2  DessertLovers
    16        7   104    2      ManyJuice
    

    Micro-Benchmark

    pd.show_versions()
    
    commit           : None
    python           : 3.7.5.final.0
    python-bits      : 64
    OS               : Linux
    OS-release       : 4.4.0-18362-Microsoft
    machine          : x86_64
    processor        : x86_64
    byteorder        : little
    LC_ALL           : None
    LANG             : C.UTF-8
    LOCALE           : en_US.UTF-8
    
    pandas           : 0.25.3
    numpy            : 1.17.4
    

    Data was created with

    def make_data(n):
        return pd.DataFrame({
            'invoice': np.arange(n)//3,
            'code': np.random.choice(np.arange(101,112), n),
            'qty': np.random.choice(np.arange(1,8), n, p=[10/25,10/25,1/25,1/25,1/25,1/25,1/25])
        })
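
As a rough check of what this generator produces, the sketch below calls it on a small `n` and inspects the result. The seed and the variable names `sample` and `rows_per_invoice` are additions for this illustration and are not part of the original benchmark.

```python
import numpy as np
import pandas as pd

def make_data(n):
    return pd.DataFrame({
        'invoice': np.arange(n) // 3,                       # three rows per invoice
        'code': np.random.choice(np.arange(101, 112), n),   # codes 101..111
        'qty': np.random.choice(np.arange(1, 8), n,         # qty 1..7, mostly 1 or 2
                                p=[10/25, 10/25, 1/25, 1/25, 1/25, 1/25, 1/25]),
    })

np.random.seed(0)  # assumption: fixed seed only to make this sketch repeatable
sample = make_data(9)

rows_per_invoice = sample.groupby('invoice').size()
print(rows_per_invoice.tolist())  # [3, 3, 3]
print(sample['code'].min() >= 101, sample['code'].max() <= 111)
```

Because most quantities are 1 or 2 and each invoice has only three rows, relatively few invoices reach the `>= 10` total, so the lower-priority conditions are exercised as well.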
    

    Results

    perfplot.show(
        setup=make_data,
        kernels=[get_category, get_with_np_select],
        n_range=[2**k for k in range(8, 20)],
        logx=True,
        logy=True,
        equality_check=False,
        xlabel='len(df)')
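
The two kernels passed to perfplot are not shown in the answer. Below is a minimal sketch of what `get_with_np_select` could look like, assuming it simply wraps the `np.select` call from above. The code lists are placeholders chosen for this sketch, not the lists from the question, and the timing uses the standard-library `timeit` rather than perfplot so that it runs without extra dependencies.

```python
import timeit
import numpy as np
import pandas as pd

# Placeholder code lists: assumptions for this sketch, not the question's lists.
Milk, Juice, Hot, Dessert = [106], [104, 105], [109], [110]

def get_with_np_select(df):
    """Hypothetical kernel: one category per row, highest priority first."""
    inv = df['invoice']
    # Total juice quantity per invoice, computed once and reused.
    juice_qty = (df['qty'] * df['code'].isin(Juice)).groupby(inv).transform('sum')
    return np.select(
        [
            df.groupby('invoice')['qty'].transform('sum') >= 10,
            df['code'].isin(Milk).groupby(inv).transform('any'),
            juice_qty == 1,
            juice_qty > 1,
            df['code'].isin(Hot).groupby(inv).transform('any'),
            df['code'].isin(Dessert).groupby(inv).transform('any'),
        ],
        ['Mega', 'Healthy', 'OneJuice', 'ManyJuice', 'HotLovers', 'DessertLovers'],
        'Other',
    )

n = 3000
df = pd.DataFrame({
    'invoice': np.arange(n) // 3,
    'code': np.random.choice(np.arange(101, 112), n),
    'qty': np.random.choice(np.arange(1, 8), n),
})

result = get_with_np_select(df)
seconds = timeit.timeit(lambda: get_with_np_select(df), number=5) / 5
print(f'{seconds:.4f} s per call on {len(df)} rows')
```

A function with this shape (takes the DataFrame, returns the labels) can be passed to `kernels=[...]` as in the call above.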
    
