I have a pandas dataframe that looks like this:
   portion  used
0        1   1.0
1        2   0.3
2        3   0.0
3        4   0.8
I'd
df['TaxStatus'] = np.where(df.Public == 1, True, np.where(df.Public == 2, False))
This would appear to work, except for the ValueError: either both or neither of x and y should be given
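For what it's worth, the error seems to come from the inner call: np.where takes either a condition alone, or a condition together with both x and y, and np.where(df.Public == 2, False) supplies only x. Below is a minimal sketch of one way around it. It assumes a hypothetical Public column (the sample values are made up, since the real data is not shown) and maps the codes directly instead of nesting np.where; any unmapped code becomes NaN.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data; the real 'Public' column is not shown in the question.
df = pd.DataFrame({'Public': [1, 2, 1, 3]})

# The inner call passes a condition and x but no y, which NumPy rejects.
try:
    np.where(df.Public == 2, False)
    raised = False
except ValueError:
    raised = True

# One alternative: map each code to its flag; unmapped codes (3 here) become NaN.
df['TaxStatus'] = df['Public'].map({1: True, 2: False})
```

If every row is guaranteed to be 1 or 2, a single np.where(df.Public == 1, True, False) would also do.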
Use np.where, which is usually fast:
In [845]: df['alert'] = np.where(df.used == 1, 'Full',
                                 np.where(df.used == 0, 'Empty', 'Partial'))
In [846]: df
Out[846]:
   portion  used    alert
0        1   1.0     Full
1        2   0.3  Partial
2        3   0.0    Empty
3        4   0.8  Partial
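With more than two or three conditions the nesting gets hard to read. One alternative, which is not part of the answer above, is np.select: it takes a list of conditions, a matching list of choices, and a default. Here is a minimal sketch on the same sample data:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'portion': [1, 2, 3, 4], 'used': [1.0, 0.3, 0.0, 0.8]})

# Conditions are checked in order; the first match wins.
conditions = [df['used'] == 1, df['used'] == 0]
choices = ['Full', 'Empty']

# Rows matching no condition get the default label.
df['alert'] = np.select(conditions, choices, default='Partial')
```

Adding another state then means appending one condition and one choice rather than nesting another call.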
Timings
In [848]: df.shape
Out[848]: (100000, 3)
In [849]: %timeit df['alert'] = np.where(df.used == 1, 'Full', np.where(df.used == 0, 'Empty', 'Partial'))
100 loops, best of 3: 6.17 ms per loop
In [850]: %%timeit
     ...: df.loc[df['used'] == 1.0, 'alert'] = 'Full'
     ...: df.loc[df['used'] == 0.0, 'alert'] = 'Empty'
     ...: df.loc[(df['used'] > 0.0) & (df['used'] < 1.0), 'alert'] = 'Partial'
     ...:
10 loops, best of 3: 21.9 ms per loop
In [851]: %timeit df['alert'] = df.apply(alert, axis=1)
1 loop, best of 3: 2.79 s per loop
Can't comment, so I'm making a new answer. Improving on Ffisegydd's approach, you can use a dictionary and the dict.get() method to make the function passed to .apply() easier to manage:
import pandas as pd

def alert(c):
    # Exact matches map to their label; anything else falls back to 'Partial'.
    mapping = {1.0: 'Full', 0.0: 'Empty'}
    return mapping.get(c['used'], 'Partial')

df = pd.DataFrame(data={'portion': [1, 2, 3, 4], 'used': [1.0, 0.3, 0.0, 0.8]})
df['alert'] = df.apply(alert, axis=1)
Depending on the use case, you might like to define the dict outside of the function definition as well.
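As a rough sketch of that variation, the dict can live at module level so it is built once rather than on every call. The name ALERT_LABELS is just a placeholder chosen for this example:

```python
import pandas as pd

# Defined once at module level instead of being rebuilt on every call.
# ALERT_LABELS is a hypothetical name used only for this sketch.
ALERT_LABELS = {1.0: 'Full', 0.0: 'Empty'}

def alert(c):
    # Values without an exact match fall back to 'Partial'.
    return ALERT_LABELS.get(c['used'], 'Partial')

df = pd.DataFrame(data={'portion': [1, 2, 3, 4], 'used': [1.0, 0.3, 0.0, 0.8]})
df['alert'] = df.apply(alert, axis=1)
```

This also lets other code reuse or extend the mapping without touching the function.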
Alternatively you could do:
import pandas as pd
import numpy as np
df = pd.DataFrame(data={'portion':np.arange(10000), 'used':np.random.rand(10000)})
%%timeit
df.loc[df['used'] == 1.0, 'alert'] = 'Full'
df.loc[df['used'] == 0.0, 'alert'] = 'Empty'
df.loc[(df['used'] > 0.0) & (df['used'] < 1.0), 'alert'] = 'Partial'
Which gives the same output but runs about 100 times faster on 10000 rows:
100 loops, best of 3: 2.91 ms per loop
Then using apply:
%timeit df['alert'] = df.apply(alert, axis=1)
1 loops, best of 3: 287 ms per loop
I guess the choice depends on how big your dataframe is.
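If you want to check this on your own data outside IPython, something like the following sketch could work. It assumes the standard-library timeit module in place of the %timeit magic, and a made-up frame built by repeating the four sample values; the helper names with_loc and with_apply are placeholders.

```python
import timeit

import numpy as np
import pandas as pd

def alert(c):
    mapping = {1.0: 'Full', 0.0: 'Empty'}
    return mapping.get(c['used'], 'Partial')

def with_loc(frame):
    # Boolean-mask assignment, one statement per label.
    out = frame.copy()
    out.loc[out['used'] == 1.0, 'alert'] = 'Full'
    out.loc[out['used'] == 0.0, 'alert'] = 'Empty'
    out.loc[(out['used'] > 0.0) & (out['used'] < 1.0), 'alert'] = 'Partial'
    return out

def with_apply(frame):
    # Row-wise Python function call.
    out = frame.copy()
    out['alert'] = out.apply(alert, axis=1)
    return out

# Hypothetical test frame: the four sample values repeated 250 times (1000 rows).
n_repeats = 250
df = pd.DataFrame({'portion': np.arange(4 * n_repeats),
                   'used': np.tile([1.0, 0.3, 0.0, 0.8], n_repeats)})

# Total time for three runs of each approach; numbers vary by machine.
loc_seconds = timeit.timeit(lambda: with_loc(df), number=3)
apply_seconds = timeit.timeit(lambda: with_apply(df), number=3)
print(f"loc: {loc_seconds:.4f}s  apply: {apply_seconds:.4f}s")
```

Note that the two helpers only agree for values between 0 and 1 inclusive: outside that range the .loc version leaves the cell empty, while the dict version labels it 'Partial'.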
You can define a function that returns your different states ("Full", "Partial", "Empty", etc.) and then use df.apply to apply the function to each row. Note that you have to pass the keyword argument axis=1 to ensure that it applies the function to rows.
import pandas as pd

def alert(c):
    if c['used'] == 1.0:
        return 'Full'
    elif c['used'] == 0.0:
        return 'Empty'
    elif 0.0 < c['used'] < 1.0:
        return 'Partial'
    else:
        return 'Undefined'

df = pd.DataFrame(data={'portion': [1, 2, 3, 4], 'used': [1.0, 0.3, 0.0, 0.8]})

df['alert'] = df.apply(alert, axis=1)

#    portion  used    alert
# 0        1   1.0     Full
# 1        2   0.3  Partial
# 2        3   0.0    Empty
# 3        4   0.8  Partial