pandas create new column based on values from other columns / apply a function of multiple columns, row-wise

广开言路 2020-11-22 06:24

I want to apply my custom function (it uses an if-else ladder) to these six columns (ERI_Hispanic, ERI_AmerInd_AKNatv, ERI_Asian,

5 answers
  •  我在风中等你
    2020-11-22 06:25

    Try this:

    df.loc[df['eri_white']==1,'race_label'] = 'White'
    df.loc[df['eri_hawaiian']==1,'race_label'] = 'Haw/Pac Isl.'
    df.loc[df['eri_afr_amer']==1,'race_label'] = 'Black/AA'
    df.loc[df['eri_asian']==1,'race_label'] = 'Asian'
    df.loc[df['eri_nat_amer']==1,'race_label'] = 'A/I AK Native'
    df.loc[(df['eri_afr_amer'] + df['eri_asian'] + df['eri_hawaiian'] + df['eri_nat_amer'] + df['eri_white']) > 1,'race_label'] = 'Two Or More'
    df.loc[df['eri_hispanic']==1,'race_label'] = 'Hispanic'
    df['race_label'] = df['race_label'].fillna('Other')
    

    Output:

         lname   fname rno_cd  eri_afr_amer  eri_asian  eri_hawaiian  \
    0      MOST    JEFF      E             0          0             0   
    1    CRUISE     TOM      E             0          0             0   
    2      DEPP  JOHNNY    NaN             0          0             0   
    3     DICAP     LEO    NaN             0          0             0   
    4    BRANDO  MARLON      E             0          0             0   
    5     HANKS     TOM    NaN             0          0             0   
    6    DENIRO  ROBERT      E             0          1             0   
    7    PACINO      AL      E             0          0             0   
    8  WILLIAMS   ROBIN      E             0          0             1   
    9  EASTWOOD   CLINT      E             0          0             0   
    
       eri_hispanic  eri_nat_amer  eri_white rno_defined    race_label  
    0             0             0          1       White         White  
    1             1             0          0       White      Hispanic  
    2             0             0          1     Unknown         White  
    3             0             0          1     Unknown         White  
    4             0             0          0       White         Other  
    5             0             0          1     Unknown         White  
    6             0             0          1       White   Two Or More  
    7             0             0          1       White         White  
    8             0             0          0       White  Haw/Pac Isl.  
    9             0             0          1       White         White 
    

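    Use .loc instead of apply.

    The .loc assignments are vectorized: each one operates on a whole column at once instead of calling a Python function per row.

    .loc works in a simple manner: it masks the rows that match the condition, then assigns the value to the selected rows. Because later assignments overwrite earlier ones, the rules are listed in reverse priority order compared with an if-else ladder.

    For more details, see the .loc docs.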
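
As a rough illustration of the masking step described above, the sketch below runs the same pattern on a toy frame. The data values are made up for the example, and only three of the indicator columns are used.

```python
import pandas as pd

# Toy data (hypothetical values, not the asker's file).
df = pd.DataFrame({
    "eri_white":    [1, 1, 0, 0],
    "eri_asian":    [0, 1, 0, 0],
    "eri_hispanic": [0, 0, 1, 0],
})

# A condition on a column yields a boolean Series, one entry per row.
white_mask = df["eri_white"] == 1

# Lowest-priority rule first; each later assignment overwrites earlier ones.
df.loc[white_mask, "race_label"] = "White"
df.loc[df["eri_asian"] == 1, "race_label"] = "Asian"
df.loc[(df["eri_white"] + df["eri_asian"]) > 1, "race_label"] = "Two Or More"
df.loc[df["eri_hispanic"] == 1, "race_label"] = "Hispanic"

# Rows matched by no rule are still NaN; fill them last.
df["race_label"] = df["race_label"].fillna("Other")

labels = df["race_label"].tolist()
print(labels)
```

The second row matches both the White and the Asian rule, so it is overwritten twice before the "Two Or More" rule gives it its final label.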

    Performance metrics:

    Accepted Answer:

    import pandas as pd

    def label_race(row):
        # Rules are checked top to bottom; the first match wins.
        if row['eri_hispanic'] == 1:
            return 'Hispanic'
        if row['eri_afr_amer'] + row['eri_asian'] + row['eri_hawaiian'] + row['eri_nat_amer'] + row['eri_white'] > 1:
            return 'Two Or More'
        if row['eri_nat_amer'] == 1:
            return 'A/I AK Native'
        if row['eri_asian'] == 1:
            return 'Asian'
        if row['eri_afr_amer'] == 1:
            return 'Black/AA'
        if row['eri_hawaiian'] == 1:
            return 'Haw/Pac Isl.'
        if row['eri_white'] == 1:
            return 'White'
        return 'Other'

    df = pd.read_csv('dataser.csv')
    df = pd.concat([df]*1000)  # repeat the rows 1000 times for timing

    %timeit df.apply(lambda row: label_race(row), axis=1)
    

    1.15 s ± 46.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

    My Proposed Answer:

    def label_race(df):
        # Lowest-priority rule first; later assignments overwrite earlier ones.
        df.loc[df['eri_white']==1,'race_label'] = 'White'
        df.loc[df['eri_hawaiian']==1,'race_label'] = 'Haw/Pac Isl.'
        df.loc[df['eri_afr_amer']==1,'race_label'] = 'Black/AA'
        df.loc[df['eri_asian']==1,'race_label'] = 'Asian'
        df.loc[df['eri_nat_amer']==1,'race_label'] = 'A/I AK Native'
        df.loc[(df['eri_afr_amer'] + df['eri_asian'] + df['eri_hawaiian'] + df['eri_nat_amer'] + df['eri_white']) > 1,'race_label'] = 'Two Or More'
        df.loc[df['eri_hispanic']==1,'race_label'] = 'Hispanic'
        # Rows matched by no rule are still NaN at this point.
        df['race_label'] = df['race_label'].fillna('Other')

    df = pd.read_csv('s22.csv')
    df = pd.concat([df]*1000)  # repeat the rows 1000 times for timing

    %timeit label_race(df)
    

    24.7 ms ± 1.7 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
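
If keeping the rules in the same top-to-bottom order as the if-else ladder matters, numpy.select is one possible alternative, since it picks the first condition that matches for each row. It is not what was timed above. A minimal sketch, assuming the column names used in this answer; the helper name label_race_select is hypothetical, and the sample rows are copied from rows 0, 1, 4, 6 and 8 of the output shown earlier:

```python
import numpy as np
import pandas as pd

RACE_COLS = ["eri_afr_amer", "eri_asian", "eri_hawaiian", "eri_nat_amer", "eri_white"]

def label_race_select(df):
    """Return a Series of labels; the first matching condition wins."""
    conditions = [
        df["eri_hispanic"] == 1,
        df[RACE_COLS].sum(axis=1) > 1,   # more than one race flag set
        df["eri_nat_amer"] == 1,
        df["eri_asian"] == 1,
        df["eri_afr_amer"] == 1,
        df["eri_hawaiian"] == 1,
        df["eri_white"] == 1,
    ]
    choices = [
        "Hispanic",
        "Two Or More",
        "A/I AK Native",
        "Asian",
        "Black/AA",
        "Haw/Pac Isl.",
        "White",
    ]
    # Rows that match no condition get the default label.
    labels = np.select(conditions, choices, default="Other")
    return pd.Series(labels, index=df.index, name="race_label")

# Indicator values taken from five rows of the output table in this answer.
sample = pd.DataFrame({
    "eri_afr_amer": [0, 0, 0, 0, 0],
    "eri_asian":    [0, 0, 0, 1, 0],
    "eri_hawaiian": [0, 0, 0, 0, 1],
    "eri_hispanic": [0, 1, 0, 0, 0],
    "eri_nat_amer": [0, 0, 0, 0, 0],
    "eri_white":    [1, 0, 0, 1, 0],
})

result = label_race_select(sample)
print(result.tolist())
```

Unlike the .loc version, this returns a new Series rather than modifying the frame, so it would be assigned with df['race_label'] = label_race_select(df).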
