Consider this simple example
import pandas as pd
df = pd.DataFrame({'one' : [1,2,3],
'two' : [1,0,0]})
df
Out[9]:
one two
0 1 1
1 2 0
2 3 0
I want to write a function that takes as inputs a dataframe df
and a column mycol
.
Now this works:
df.groupby('one').two.sum()
Out[10]:
one
1 1
2 0
3 0
Name: two, dtype: int64
this works too:
def okidoki(df,mycol):
return df.groupby('one')[mycol].sum()
okidoki(df, 'two')
Out[11]:
one
1 1
2 0
3 0
Name: two, dtype: int64
but this FAILS
def megabug(df,mycol):
return df.groupby('one').mycol.sum()
megabug(df, 'two')
AttributeError: 'DataFrameGroupBy' object has no attribute 'mycol'
What is wrong here?
I am worried that okidoki
uses some chaining that might create some subtle bugs (https://pandas.pydata.org/pandas-docs/stable/indexing.html#why-does-assignment-fail-when-using-chained-indexing).
How can I still keep the syntax groupby('one').mycol
? Can the mycol
string be converted to something that might work that way?
Thanks!
You pass a string as the second argument. In effect, you're trying to do something like:
df.'two'
Which is invalid syntax. If you're trying to dynamically access a column, you'll need to use the index notation, [...]
because the dot/attribute accessor notation doesn't work for dynamic access.
Dynamic access on its own is possible. For example, you can use getattr
(but I don't recommend this, it's an antipattern):
In [674]: df
Out[674]:
one two
0 1 1
1 2 0
2 3 0
In [675]: getattr(df, 'one')
Out[675]:
0 1
1 2
2 3
Name: one, dtype: int64
Dynamically selecting by attribute from a groupby call can be done, something like:
In [677]: getattr(df.groupby('one'), mycol).sum()
Out[677]:
one
1 1
2 0
3 0
Name: two, dtype: int64
But don't do it. It is a horrid anti pattern, and much more unreadable than df.groupby('one')[mycol].sum()
.
I think you need []
for select column by column name what is general solution for selecting columns, because select by attributes have many exceptions:
- You can use this access only if the index element is a valid python identifier, e.g. s.1 is not allowed. See here for an explanation of valid identifiers.
- The attribute will not be available if it conflicts with an existing method name, e.g. s.min is not allowed.
- Similarly, the attribute will not be available if it conflicts with any of the following list: index, major_axis, minor_axis, items, labels.
- In any of these cases, standard indexing will still work, e.g. s['1'], s['min'], and s['index'] will access the corresponding element or column.
def megabug(df,mycol):
return df.groupby('one')[mycol].sum()
print (megabug(df, 'two'))
one
1 1
2 0
3 0
Name: two, dtype: int64
来源:https://stackoverflow.com/questions/45921561/dynamically-accessing-a-pandas-dataframe-column