pandas create new column based on values from other columns / apply a function of multiple columns, row-wise

广开言路 2020-11-22 06:24

I want to apply my custom function (it uses an if-else ladder) to these six columns (ERI_Hispanic, ERI_AmerInd_AKNatv, ERI_Asian,

5 Answers
  •  一生所求
    2020-11-22 06:37

    The answers above are perfectly valid, but a vectorized solution exists, in the form of numpy.select. This allows you to define conditions, then define outputs for those conditions, much more efficiently than using apply:


    First, define conditions:

    conditions = [
        df['eri_hispanic'] == 1,
        df[['eri_afr_amer', 'eri_asian', 'eri_hawaiian', 'eri_nat_amer', 'eri_white']].sum(1).gt(1),
        df['eri_nat_amer'] == 1,
        df['eri_asian'] == 1,
        df['eri_afr_amer'] == 1,
        df['eri_hawaiian'] == 1,
        df['eri_white'] == 1,
    ]
    

    Now, define the corresponding outputs:

    outputs = [
        'Hispanic', 'Two Or More', 'A/I AK Native', 'Asian', 'Black/AA', 'Haw/Pac Isl.', 'White'
    ]
    

    Finally, using numpy.select:

    import numpy as np
    import pandas as pd
    res = np.select(conditions, outputs, default='Other')
    pd.Series(res)
    

    0           White
    1        Hispanic
    2           White
    3           White
    4           Other
    5           White
    6     Two Or More
    7           White
    8    Haw/Pac Isl.
    9           White
    dtype: object
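
    For anyone who wants to run this end to end, here is a minimal sketch that writes the labels into a new column. The sample rows are made up for illustration (the question's data is not reproduced here), and the column name race_label is an assumption. Note that np.select takes the first condition that matches, so the order of the list matters: a row flagged as both Hispanic and White is labelled 'Hispanic'.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data; only the column names come from the answer above.
df = pd.DataFrame({
    'eri_hispanic': [0, 1, 0, 0, 0],
    'eri_afr_amer': [0, 0, 1, 0, 0],
    'eri_asian':    [0, 0, 1, 0, 0],
    'eri_hawaiian': [0, 0, 0, 1, 0],
    'eri_nat_amer': [0, 0, 0, 0, 0],
    'eri_white':    [1, 1, 0, 0, 0],
})

race_cols = ['eri_afr_amer', 'eri_asian', 'eri_hawaiian', 'eri_nat_amer', 'eri_white']

# Conditions are checked in order; the first True one wins for each row.
conditions = [
    df['eri_hispanic'] == 1,
    df[race_cols].sum(axis=1).gt(1),
    df['eri_nat_amer'] == 1,
    df['eri_asian'] == 1,
    df['eri_afr_amer'] == 1,
    df['eri_hawaiian'] == 1,
    df['eri_white'] == 1,
]
outputs = [
    'Hispanic', 'Two Or More', 'A/I AK Native', 'Asian',
    'Black/AA', 'Haw/Pac Isl.', 'White',
]

# 'race_label' is an assumed column name; rows matching nothing get 'Other'.
df['race_label'] = np.select(conditions, outputs, default='Other')
print(df['race_label'].tolist())
# → ['White', 'Hispanic', 'Two Or More', 'Haw/Pac Isl.', 'Other']
```

    Assigning the array straight to a column keeps the labels aligned with df's own index, which pd.Series(res) on its own does not do if the index is not the default 0..n-1.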
    

    Why should numpy.select be used over apply? Here are some performance checks:

    df = pd.concat([df]*1000)
    
    In [42]: %timeit df.apply(lambda row: label_race(row), axis=1)
    1.07 s ± 4.16 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
    
    In [44]: %%timeit
        ...: conditions = [
        ...:     df['eri_hispanic'] == 1,
        ...:     df[['eri_afr_amer', 'eri_asian', 'eri_hawaiian', 'eri_nat_amer', 'eri_white']].sum(1).gt(1),
        ...:     df['eri_nat_amer'] == 1,
        ...:     df['eri_asian'] == 1,
        ...:     df['eri_afr_amer'] == 1,
        ...:     df['eri_hawaiian'] == 1,
        ...:     df['eri_white'] == 1,
        ...: ]
        ...:
        ...: outputs = [
        ...:     'Hispanic', 'Two Or More', 'A/I AK Native', 'Asian', 'Black/AA', 'Haw/Pac Isl.', 'White'
        ...: ]
        ...:
        ...: np.select(conditions, outputs, 'Other')
        ...:
        ...:
    3.09 ms ± 17 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
    

    In this benchmark (the 10 sample rows repeated 1000 times), numpy.select takes about 3 ms against about 1.07 s for apply, roughly a 350-fold speedup, and the absolute gap widens as the data grows.
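
    Before trusting a speedup it is worth checking that both approaches give the same labels. The sketch below does that on randomly generated data and then times each approach with the standard timeit module. The label_race function here is a hypothetical stand-in written to mirror the conditions in this answer (the original function from the question is not shown), and the data is synthetic, so the timings it prints will differ from the figures above.

```python
import timeit

import numpy as np
import pandas as pd

race_cols = ['eri_afr_amer', 'eri_asian', 'eri_hawaiian', 'eri_nat_amer', 'eri_white']
all_cols = ['eri_hispanic'] + race_cols

# Single-column checks in the same order as the conditions list in this answer.
single = [
    ('eri_nat_amer', 'A/I AK Native'),
    ('eri_asian', 'Asian'),
    ('eri_afr_amer', 'Black/AA'),
    ('eri_hawaiian', 'Haw/Pac Isl.'),
    ('eri_white', 'White'),
]

# Synthetic 0/1 flags (about 20% ones); not the question's data.
rng = np.random.default_rng(0)
df = pd.DataFrame((rng.random((2000, len(all_cols))) < 0.2).astype(int), columns=all_cols)


def label_race(row):
    """Hypothetical row-wise version, mirroring the np.select conditions."""
    if row['eri_hispanic'] == 1:
        return 'Hispanic'
    if row[race_cols].sum() > 1:
        return 'Two Or More'
    for col, label in single:
        if row[col] == 1:
            return label
    return 'Other'


def label_race_vectorized(frame):
    """Vectorized version using np.select (first matching condition wins)."""
    conditions = [frame['eri_hispanic'] == 1, frame[race_cols].sum(axis=1).gt(1)]
    conditions += [frame[col] == 1 for col, _ in single]
    outputs = ['Hispanic', 'Two Or More'] + [label for _, label in single]
    return np.select(conditions, outputs, default='Other')


via_apply = df.apply(label_race, axis=1)
via_select = label_race_vectorized(df)
assert via_apply.tolist() == via_select.tolist()  # same labels either way

runs = 3
t_apply = timeit.timeit(lambda: df.apply(label_race, axis=1), number=runs)
t_select = timeit.timeit(lambda: label_race_vectorized(df), number=runs)
print(f"apply:     {t_apply / runs:.4f} s per run")
print(f"np.select: {t_select / runs:.4f} s per run")
```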
