Pandas - check if a value exists in multiple columns for each row

Posted by 北城以北 on 2021-01-24 11:41:40

Question


I have the following Pandas dataframe:

Index  Name  ID1  ID2  ID3
    1  A     Y    Y    Y
    2  B     Y    Y        
    3  B     Y              
    4  C               Y

I wish to add a new column 'Multiple' to indicate those rows where there is a value Y in more than one of the columns ID1, ID2, and ID3.

Index  Name  ID1  ID2  ID3 Multiple
    1  A     Y    Y    Y   Y
    2  B     Y    Y        Y
    3  B     Y             N
    4  C               Y   N

I'd normally use np.where or np.select, e.g.:

df['multiple'] = np.where(<more than one of ID1, ID2 or ID3 has a Y in it>, 'Y', 'N')

but I can't figure out how to write the conditional. The number of ID columns might grow, so I couldn't cover every combination as a separate condition, e.g. (ID1 = Y and ID3 = Y) or (ID2 = Y and ID3 = Y). I think I perhaps want something that counts the Y values across named columns?

Outside of Pandas, I would build a list, append the value for each column that holds Y, and then check whether the list has a length greater than 1.

But I can't think how to do it within the limitations of np.where, np.select or df.loc. Any pointers?


Answer 1:


Using numpy to sum the occurrences of Y in each row should do it:

df['multi'] = ['Y' if x > 1 else 'N' for x in np.sum(df.values == 'Y', 1)]

output:

      Name ID1   ID2   ID3 multi
Index                           
1        A   Y     Y     Y     Y
2        B   Y     Y  None     Y
3        B   Y  None  None     N
4        C   Y  None  None     N
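For reference, here is a self-contained sketch of the same row-wise count, restricted to the ID columns so that a 'Y' in any other column (Name, say) is not counted. The sample frame is rebuilt from the question's table, and the names `id_cols` and `counts` are illustrative assumptions rather than part of the original answer.

```python
import numpy as np
import pandas as pd

# Sample data rebuilt from the question; None marks an empty cell.
df = pd.DataFrame(
    {
        "Name": ["A", "B", "B", "C"],
        "ID1": ["Y", "Y", "Y", None],
        "ID2": ["Y", "Y", None, None],
        "ID3": ["Y", None, None, "Y"],
    },
    index=pd.Index([1, 2, 3, 4], name="Index"),
)

# Illustrative name: the columns to inspect.
id_cols = ["ID1", "ID2", "ID3"]

# Count the 'Y' cells in each row, then map the count to 'Y'/'N'.
counts = (df[id_cols].to_numpy() == "Y").sum(axis=1)
df["multi"] = np.where(counts > 1, "Y", "N")
```

Using np.where on the counts avoids the Python-level list comprehension, though for a frame this small either form works.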



Answer 2:


I would do it like this:

Get a list of the columns you want to check.

    cols = [x for x in testdf.columns if "ID" in x]

You could use the DataFrame filter method for this if you prefer, but I think explicitly building the list of columns is clearer, and it leaves you full flexibility to change your conditions later.
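As a rough sketch of the filter alternative mentioned above, assuming the ID columns are named as in the question (ID1, ID2, ...); the small `testdf` built here is only illustrative data:

```python
import pandas as pd

# Illustrative frame with the question's column naming.
testdf = pd.DataFrame(
    {"Name": ["A", "B"], "ID1": ["Y", "Y"], "ID2": ["Y", None]}
)

# like="ID" keeps every column whose label contains the substring "ID".
cols = testdf.filter(like="ID").columns.tolist()

# A regex is stricter: only labels of the form ID<digits> are kept.
cols_regex = testdf.filter(regex=r"^ID\d+$").columns.tolist()
```

Either list can be used as `cols` in the next step.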

After that, it's just:

    testdf["multiple"] = (testdf[cols]=="Y").sum(axis="columns") > 1

Explanation:

  • testdf[cols] returns a DataFrame consisting of just the columns selected in the first line.
  • testdf[cols]=="Y" returns a DataFrame populated with True or False according to the condition "==Y".
  • ().sum(axis="columns") adds up the True values across the columns of this DataFrame, giving, for each row, the number of selected columns that hold "Y". Comparing that count with > 1 returns True when more than one column matches, and False otherwise. (.any(axis="columns") would instead return True as soon as a single column matches, which is not what the question asks for.)

If you really want you can change the True values to "Y" and the False values to "N".
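A minimal sketch of that conversion, assuming the question's sample data and a boolean Series that is True where more than one ID column holds "Y". The name `is_multiple` is an illustrative assumption.

```python
import numpy as np
import pandas as pd

# Sample data rebuilt from the question; None marks an empty cell.
testdf = pd.DataFrame(
    {
        "Name": ["A", "B", "B", "C"],
        "ID1": ["Y", "Y", "Y", None],
        "ID2": ["Y", "Y", None, None],
        "ID3": ["Y", None, None, "Y"],
    }
)

cols = [c for c in testdf.columns if c.startswith("ID")]

# Boolean Series: True where more than one selected column equals "Y".
is_multiple = (testdf[cols] == "Y").sum(axis="columns") > 1

# Turn True/False into the "Y"/"N" labels the question asks for.
testdf["multiple"] = np.where(is_multiple, "Y", "N")
```

`is_multiple.map({True: "Y", False: "N"})` would give the same labels without numpy.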



Source: https://stackoverflow.com/questions/56739320/pandas-check-if-a-value-exists-in-multiple-columns-for-each-row
