Removing rows in Pandas based on multiple columns

隐身守侯 提交于 2021-02-11 12:40:19

问题


In Pandas, I have a dataframe with ZipCode, Age, and a bunch of columns that should all have values 1 or 0, ie:

ZipCode Age A B C D

12345   21  0 1 1 1
12345   22  1 0 1 4
23456   45  1 0 1 1
23456   21  3 1 0 0

I want to delete all rows in which 0 or 1 doesn't appear in columns A,B,C, or D as a way to clean up the data. In this case, I would remove the 2nd and 4th row because 4 appears in column D in row 2 and 3 appears in column A in row 4. I want to do this even if I have 100 columns to check such that I don't have to look up every column one by one in my conditional statement. How would I do this?


回答1:


Use isin to test for membership and all to test if all row values are True and use this boolean mask to filter the df:

In [12]:
df[df.ix[:,'A':].isin([0,1]).all(axis=1)]

Out[12]:
   ZipCode  Age  A  B  C  D
0    12345   21  0  1  1  1
2    23456   45  1  0  1  1



回答2:


You can opt for a vectorized solution:

In [64]: df[df[['A','B','C','D']].isin([0,1]).sum(axis=1)==4]

Out[64]:
   ZipCode  Age  A  B  C  D
0    12345   21  0  1  1  1
2    23456   45  1  0  1  1



回答3:


Other two solutions works well but if you interested in speed you should look at numpy in1d function:

data=df.loc[:, 'A':]

In [72]:  df[np.in1d(data.values,[0,1]).reshape(data.shape).all(axis=1)]
Out[72]:
   ZipCode  Age  A  B  C  D
0    12345   21  0  1  1  1
2    23456   45  1  0  1  1

Timing:

In [73]: %timeit data=df.loc[:, 'A':]; df[np.in1d(data.values,[0,1]).reshape(data.shape).all(axis=1)]
1000 loops, best of 3: 558 us per loop

In [74]: %timeit df[df.ix[:,'A':].isin([0,1]).all(axis=1)]
1000 loops, best of 3: 843 us per loop

In [75]: %timeit df[df[['A','B','C','D']].isin([0,1]).sum(axis=1)==4]
1000 loops, best of 3: 1.44 ms per loop


来源:https://stackoverflow.com/questions/37134193/removing-rows-in-pandas-based-on-multiple-columns

标签
易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!