Question
I have the following MCVE:
#!/usr/bin/env python3
import pandas as pd
df = pd.DataFrame([True, False, True])
print("Whole DataFrame:")
print(df)
print("\nFiltered DataFrame:")
print(df[df[0] == True])
The output is the following, which I expected:
Whole DataFrame:
0
0 True
1 False
2 True
Filtered DataFrame:
0
0 True
2 True
Okay, but the PEP8 checker flags this. It says: E712 comparison to True should be 'if cond is True:' or 'if cond:'. So I changed == True to is True, but now the filter fails. The output is:
Whole DataFrame:
0
0 True
1 False
2 True
Filtered DataFrame:
0 True
1 False
2 True
Name: 0, dtype: bool
What is going on?
Answer 1:
The catch here is that in df[df[0] == True], you are not comparing objects to True.
As the other answers say, == is overloaded in pandas to produce a Series instead of a bool as it normally does. [] is overloaded, too, to interpret the Series and give the filtered result. The code is essentially equivalent to:
series = df[0].__eq__(True)
df.__getitem__(series)
So, you're not violating PEP8 by leaving == here.
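Since column 0 already holds booleans, another way to keep the linter quiet is to drop the comparison and use the column itself as the mask. Here is a minimal sketch, assuming the same df as in the question; the variable names mask, filtered_explicit and filtered_direct are mine, not part of pandas:

```python
import pandas as pd

df = pd.DataFrame([True, False, True])

# What df[df[0] == True] does step by step: build a boolean mask, then index with it.
mask = df[0].__eq__(True)
filtered_explicit = df.__getitem__(mask)

# Column 0 is already boolean, so it can serve as the mask directly.
filtered_direct = df[df[0]]

print(filtered_direct)
```

Both spellings should select the rows labelled 0 and 2, and the second one contains no comparison to True for E712 to complain about.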
Essentially, pandas gives familiar syntax unusual semantics - that is what caused the confusion.
According to Stroustrup (sec. 3.3.3), operator overloading has been causing trouble of this kind ever since its invention (and he had to think hard about whether to include it in C++). Seeing even more abuse of it in C++, Gosling ran to the other extreme in Java and banned it completely, which proved to be exactly that, an extreme.
In conclusion, modern languages and code tend to have operator overloading, but take care not to overuse it and to keep its semantics consistent.
Answer 2:
In Python, is tests whether two names refer to the very same object.
== is defined by pandas.Series to act element-wise; is cannot be overloaded, so it does not.
Because of that, df[0] is True checks whether df[0] and True are the same object. The result is False, which in turn is equal to 0, so you get the column named 0 when doing df[df[0] is True].
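To see the difference in isolation, here is a small sketch on a standalone Series (s and the result names are made up for illustration). It only shows what the two operators return and why False can stand in for 0; it does not try to reproduce the DataFrame lookup itself:

```python
import pandas as pd

s = pd.Series([True, False, True])

# Identity check: one plain bool, asking whether s is the very object True.
identity_result = s is True

# Element-wise check: a boolean Series, the same result as s == True.
elementwise_result = s.eq(True)

# bool is a subclass of int, so False and 0 compare equal and hash alike.
false_equals_zero = (False == 0) and (hash(False) == hash(0))

print(identity_result)
print(elementwise_result.tolist())
print(false_equals_zero)
```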
Answer 3:
This is an elaboration on MaxNoe's answer, since it was too lengthy to include in the comments.
As he indicated, df[0] is True evaluates to False, which is then coerced to 0, which corresponds to a column name. What is interesting about this is that if you run
>>> df = pd.DataFrame([True, False, True])
>>> df[False]
KeyError Traceback (most recent call last)
<ipython-input-21-62b48754461f> in <module>()
----> 1 df[False]
>>> df[0]
0 True
1 False
2 True
Name: 0, dtype: bool
>>> df[False]
0 True
1 False
2 True
Name: 0, dtype: bool
This seems a bit perplexing at first (to me at least) but has to do with how pandas makes use of caching. If you look at how df[False] is resolved, it looks like
/home/matthew/anaconda/lib/python2.7/site-packages/pandas/core/frame.py(1975)__getitem__()
-> return self._getitem_column(key)
/home/matthew/anaconda/lib/python2.7/site-packages/pandas/core/frame.py(1999)_getitem_column()
-> return self._get_item_cache(key)
> /home/matthew/anaconda/lib/python2.7/site-packages/pandas/core/generic.py(1343)_get_item_cache()
-> res = cache.get(item)
Since cache is just a regular Python dict, after running df[0] the cache looks like
>>> cache
{0: 0 True
1 False
2 True
Name: 0, dtype: bool}
so when we look up False, Python finds the entry stored under 0, because False == 0 and both hash to the same value. If we have not already primed the cache using df[0], then res is None, which triggers a KeyError on line 1345 of generic.py:
def _get_item_cache(self, item):
1341 """Return the cached item, item represents a label indexer."""
1342 cache = self._item_cache
1343 -> res = cache.get(item)
1344 if res is None:
1345 values = self._data.get(item)
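The dict behaviour behind this can be shown without pandas at all. In the sketch below, a plain dict stands in for the item cache and a string stands in for the cached column; both are placeholders I made up, not pandas internals:

```python
# A plain dict standing in for the item cache.
cache = {}

# Nothing has been cached yet, so looking up False finds nothing.
before_priming = cache.get(False)

# Roughly what a df[0] lookup would leave behind (placeholder value).
cache[0] = "column 0"

# False == 0 and hash(False) == hash(0), so this hits the entry stored under 0.
after_priming = cache.get(False)

print(before_priming)
print(after_priming)
```

So the KeyError versus the successful lookup comes down to whether a key equal to 0 is already in the dict when False is looked up.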
Answer 4:
I think that in pandas, element-wise comparison only works with ==, and the result is a boolean Series. With is, the output is just False.
print(df[0] == True)
0 True
1 False
2 True
Name: 0, dtype: bool
print(df[df[0]])
0
0 True
2 True
print(df[df[0] == True])
0
0 True
2 True
print(df[0] is True)
False
print(df[df[0] is True])
0 True
1 False
2 True
Name: 0, dtype: bool
Answer 5:
One workaround for not having complaints from linters but still reasonable syntax for sub-setting could be:
s = pd.Series([True] * 10 + [False])
s.loc[s == True] # bad comparison in Python's eyes
s.loc[s.isin([True])] # valid comparison, not as ugly as s.__eq__(True)
Both take roughly the same time to run.
In addition, for DataFrames one can use query:
df = pd.DataFrame([
[True] * 10 + [False],
list(range(11))],
index=['T', 'N']).T
df.query("T == True") # also okay
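As a rough check that these spellings agree, here is a sketch on a smaller, made-up example. The column names flag and n are hypothetical, and I build the frame from a dict so the flag column has a real bool dtype:

```python
import pandas as pd

s = pd.Series([True, True, False])

by_mask = s.loc[s]                # the boolean Series used directly as the mask
by_eq = s.loc[s == True]          # noqa: E712 (the spelling linters complain about)
by_isin = s.loc[s.isin([True])]   # same rows, no == True in the source

df = pd.DataFrame({"flag": [True, True, False], "n": [0, 1, 2]})

by_query = df.query("flag == True")   # the comparison lives inside a string
by_column = df[df["flag"]]            # plain boolean-mask indexing

print(by_isin)
print(by_query)
```

When the column is already boolean, the plain mask (s.loc[s] or df[df["flag"]]) is probably the simplest choice. isin and query are mainly useful if you want the True to stay visible in the code.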
Source: https://stackoverflow.com/questions/36825925/expressions-with-true-and-is-true-give-different-results