I want to add a column to a df. The values of this new column will depend on the values of the other columns, e.g.
dc = {'A':[0,9,4,5],'B':[6,0,10,12],
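(The snippet above is cut off; a minimal sketch of the setup, with the C values assumed from the output shown in the answers below:)

import pandas as pd

# Assumed full example frame; the C column is inferred from the answers' output below.
dc = {'A': [0, 9, 4, 5], 'B': [6, 0, 10, 12], 'C': [1, 3, 15, 18]}
df = pd.DataFrame(dc)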
Here's a start:
df['D'] = np.nan
df.loc[(df.A != 0) & (df.B != 0), 'D'] = df.A / df.B.astype(float) * df.C
Edit: you should probably just cast the whole thing to floats unless you really care about keeping integers for some reason:
df = df.astype(float)
and then you don't have to keep converting inside the calculation itself.
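Putting that together, a rough sketch of this approach (the C values and the fallback of 20 for the remaining rows are assumed from the other answers in this thread):

import numpy as np
import pandas as pd

# Example frame cast to floats up front, as suggested above.
df = pd.DataFrame({'A': [0, 9, 4, 5], 'B': [6, 0, 10, 12], 'C': [1, 3, 15, 18]}).astype(float)

df['D'] = np.nan                                              # placeholder for rows not covered below
df.loc[(df.A != 0) & (df.B != 0), 'D'] = df.A / df.B * df.C   # only where both A and B are non-zero
df['D'] = df['D'].fillna(20)                                  # assumed fallback, per the other answers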
apply should work well for you:
In [20]: def func(row):
   ....:     if (row == 0).all():
   ....:         return 250.0
   ....:     elif (row[['A', 'B']] != 0).all():
   ....:         return (float(row['A']) / row['B']) * row['C']
   ....:     else:
   ....:         return 20
   ....:
In [21]: df['D'] = df.apply(func, axis=1)
In [22]: df
Out[22]:
   A   B   C     D
0  0   6   1  20.0
1  9   0   3  20.0
2  4  10  15   6.0
3  5  12  18   7.5

[4 rows x 4 columns]
.where can be much faster than .apply, so if all you're doing is if/elses then I'd aim for .where. As you're returning scalars in some cases, np.where will be easier to use than pandas' own .where.
import pandas as pd
import numpy as np

df['D'] = np.where((df.A != 0) & (df.B != 0), (df.A / df.B) * df.C,
                   np.where((df.C == 0) & (df.A != 0) & (df.B == 0), 250,
                            20))
   A   B   C     D
0  0   6   1  20.0
1  9   0   3  20.0
2  4  10  15   6.0
3  5  12  18   7.5
For a tiny df like this, you wouldn't need to worry about speed. However, on a 10000 row df of randn, this is almost 2000 times faster than the .apply solution above: 3ms vs 5850ms. That said, if speed isn't a concern, .apply can often be easier to read.
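If you want to check the trade-off on your own data, a rough benchmark along these lines should do (the 10000-row frame of random values mirrors the timing claim above; the exact numbers you get are only illustrative):

import timeit
import numpy as np
import pandas as pd

# 10000-row frame of random floats, as in the timing comparison above.
df = pd.DataFrame(np.random.randn(10000, 3), columns=['A', 'B', 'C'])

def func(row):
    # Same row-wise logic as the .apply answer above.
    if (row == 0).all():
        return 250.0
    elif (row[['A', 'B']] != 0).all():
        return (float(row['A']) / row['B']) * row['C']
    else:
        return 20

print('apply:', timeit.timeit(lambda: df.apply(func, axis=1), number=10))
print('where:', timeit.timeit(
    lambda: np.where((df.A != 0) & (df.B != 0), (df.A / df.B) * df.C,
                     np.where((df.C == 0) & (df.A != 0) & (df.B == 0), 250, 20)),
    number=10))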