Using conditional to generate new column in pandas dataframe

-上瘾入骨i 2020-11-29 02:26

I have a pandas dataframe that looks like this:

   portion  used
0        1   1.0
1        2   0.3
2        3   0.0
3        4   0.8

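I'd like to create a new column, alert, whose value depends on used: 'Full' when used is 1.0, 'Empty' when it is 0.0, and 'Partial' otherwise.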

5 Answers
  • 2020-11-29 02:27
    df['TaxStatus'] = np.where(df.Public == 1, True, np.where(df.Public == 2, False))
    

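    This looks like it should work, but it raises ValueError: either both or neither of x and y should be given. The inner np.where receives a condition and only one value; np.where needs either both value arguments or neither.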
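
    One way to avoid the error is to give the inner np.where a third argument. A minimal sketch, assuming that values of Public other than 1 and 2 should end up as None; that fallback and the sample data are assumptions, since the answer shows neither:

```python
import numpy as np
import pandas as pd

# Hypothetical data: the answer does not show its DataFrame.
df = pd.DataFrame({'Public': [1, 2, 3, 1]})

# The inner np.where now receives both x and y.
# None as the final fallback is an assumption; use whatever value
# makes sense for rows where Public is neither 1 nor 2.
df['TaxStatus'] = np.where(df.Public == 1, True,
                           np.where(df.Public == 2, False, None))

print(df)
```

    Because of the None fallback the new column has object dtype rather than bool.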

  • 2020-11-29 02:28

    Use np.where, which is usually fast:

    In [845]: df['alert'] = np.where(df.used == 1, 'Full', 
                                     np.where(df.used == 0, 'Empty', 'Partial'))
    
    In [846]: df
    Out[846]:
       portion  used    alert
    0        1   1.0     Full
    1        2   0.3  Partial
    2        3   0.0    Empty
    3        4   0.8  Partial
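
    If more than two or three conditions are involved, nested np.where calls get hard to read. As a possible alternative (not part of the original answer), np.select takes a list of conditions and a matching list of values, plus a default. A small sketch using the question's data:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'portion': [1, 2, 3, 4], 'used': [1.0, 0.3, 0.0, 0.8]})

# Conditions are checked in order; the first one that is True wins.
conditions = [df['used'] == 1, df['used'] == 0]
choices = ['Full', 'Empty']

# Rows matching none of the conditions get the default value.
df['alert'] = np.select(conditions, choices, default='Partial')

print(df)
```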
    

    Timings

    In [848]: df.shape
    Out[848]: (100000, 3)
    
    In [849]: %timeit df['alert'] = np.where(df.used == 1, 'Full', np.where(df.used == 0, 'Empty', 'Partial'))
    100 loops, best of 3: 6.17 ms per loop
    
    In [850]: %%timeit
         ...: df.loc[df['used'] == 1.0, 'alert'] = 'Full'
         ...: df.loc[df['used'] == 0.0, 'alert'] = 'Empty'
         ...: df.loc[(df['used'] >0.0) & (df['used'] < 1.0), 'alert'] = 'Partial'
         ...:
    10 loops, best of 3: 21.9 ms per loop
    
    In [851]: %timeit df['alert'] = df.apply(alert, axis=1)
    1 loop, best of 3: 2.79 s per loop
    
  • 2020-11-29 02:28

    Can't comment so making a new answer: Improving on Ffisegydd's approach, you can use a dictionary and the dict.get() method to make the function to pass in to .apply() easier to manage:

    import pandas as pd
    
    def alert(c):
        mapping = {1.0: 'Full', 0.0: 'Empty'}
        return mapping.get(c['used'], 'Partial')
    
    df = pd.DataFrame(data={'portion':[1, 2, 3, 4], 'used':[1.0, 0.3, 0.0, 0.8]})
    
    df['alert'] = df.apply(alert, axis=1)
    

    Depending on the use case, you might like to define the dict outside of the function definition as well.
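
    A sketch of what that could look like, with the dict defined once at module level. The name ALERT_BY_USED is made up for illustration, and the Series.map variant at the end is an addition rather than part of this answer:

```python
import pandas as pd

# Hypothetical module-level mapping, built once instead of on every call.
ALERT_BY_USED = {1.0: 'Full', 0.0: 'Empty'}

def alert(c):
    # Any value that is not a key in the dict falls back to 'Partial'.
    return ALERT_BY_USED.get(c['used'], 'Partial')

df = pd.DataFrame(data={'portion': [1, 2, 3, 4], 'used': [1.0, 0.3, 0.0, 0.8]})

df['alert'] = df.apply(alert, axis=1)

# Variant without a row-wise function: map the column through the dict,
# then fill the values that had no matching key.
df['alert_mapped'] = df['used'].map(ALERT_BY_USED).fillna('Partial')
```

    Both columns hold the same labels here. Note that in either version a missing value in used also ends up as 'Partial'.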

  • 2020-11-29 02:32

    Alternatively you could do:

    import pandas as pd
    import numpy as np
    df = pd.DataFrame(data={'portion':np.arange(10000), 'used':np.random.rand(10000)})
    
    %%timeit
    df.loc[df['used'] == 1.0, 'alert'] = 'Full'
    df.loc[df['used'] == 0.0, 'alert'] = 'Empty'
    df.loc[(df['used'] >0.0) & (df['used'] < 1.0), 'alert'] = 'Partial'
    

    Which gives the same output but runs about 100 times faster on 10000 rows:

    100 loops, best of 3: 2.91 ms per loop
    

    Then using apply:

    %timeit df['alert'] = df.apply(alert, axis=1)
    
    1 loops, best of 3: 287 ms per loop
    

    I guess the choice depends on how big your dataframe is.
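
    To see where the difference starts to matter for your own data, you could time both approaches at a few sizes outside IPython. A rough sketch using the standard timeit module; the helper names with_loc and with_apply, the sizes, and the initial 'Undefined' fill are assumptions, and the numbers will vary by machine and pandas version:

```python
import timeit

import numpy as np
import pandas as pd

def alert(c):
    # Row-wise function in the style of the apply-based answer.
    if c['used'] == 1.0:
        return 'Full'
    elif c['used'] == 0.0:
        return 'Empty'
    elif 0.0 < c['used'] < 1.0:
        return 'Partial'
    else:
        return 'Undefined'

def with_loc(df):
    out = df.copy()
    # Assumption: start from a string column so every row has a label.
    out['alert'] = 'Undefined'
    out.loc[out['used'] == 1.0, 'alert'] = 'Full'
    out.loc[out['used'] == 0.0, 'alert'] = 'Empty'
    out.loc[(out['used'] > 0.0) & (out['used'] < 1.0), 'alert'] = 'Partial'
    return out

def with_apply(df):
    out = df.copy()
    out['alert'] = out.apply(alert, axis=1)
    return out

rng = np.random.default_rng(0)
for n in (100, 1000):
    df = pd.DataFrame({'portion': np.arange(n), 'used': rng.random(n)})
    t_loc = timeit.timeit(lambda: with_loc(df), number=3)
    t_apply = timeit.timeit(lambda: with_apply(df), number=3)
    print(f'{n} rows: loc {t_loc:.4f} s, apply {t_apply:.4f} s (3 runs each)')
```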

  • 2020-11-29 02:48

    You can define a function that returns the different states ("Full", "Partial", "Empty", etc.) and then use df.apply to apply it to each row. Note that you have to pass the keyword argument axis=1 so that the function is applied to rows rather than columns.

    import pandas as pd
    
    def alert(c):
        if c['used'] == 1.0:
            return 'Full'
        elif c['used'] == 0.0:
            return 'Empty'
        elif 0.0 < c['used'] < 1.0:
            return 'Partial'
        else:
            return 'Undefined'
    
    df = pd.DataFrame(data={'portion':[1, 2, 3, 4], 'used':[1.0, 0.3, 0.0, 0.8]})
    
    df['alert'] = df.apply(alert, axis=1)
    
    #    portion  used    alert
    # 0        1   1.0     Full
    # 1        2   0.3  Partial
    # 2        3   0.0    Empty
    # 3        4   0.8  Partial
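
    The final else branch is what catches values outside the 0 to 1 range as well as missing values, because every comparison with NaN is False. A small sketch with made-up data (the values 1.5 and NaN are not from the question):

```python
import numpy as np
import pandas as pd

def alert(c):
    if c['used'] == 1.0:
        return 'Full'
    elif c['used'] == 0.0:
        return 'Empty'
    elif 0.0 < c['used'] < 1.0:
        return 'Partial'
    else:
        # Reached for out-of-range values and for NaN.
        return 'Undefined'

# Hypothetical rows: one value above 1, one missing, one ordinary.
df = pd.DataFrame(data={'portion': [1, 2, 3], 'used': [1.5, np.nan, 0.5]})

df['alert'] = df.apply(alert, axis=1)

print(df)
```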
    