Question
I have the following Pandas dataframe:
Index  Name  ID1  ID2  ID3
1      A     Y    Y    Y
2      B     Y    Y
3      B     Y
4      C     Y
I wish to add a new column 'Multiple' to indicate those rows where there is a value Y in more than one of the columns ID1, ID2, and ID3.
Index  Name  ID1  ID2  ID3  Multiple
1      A     Y    Y    Y    Y
2      B     Y    Y         Y
3      B     Y              N
4      C     Y              N
I'd normally use np.where or np.select, e.g.:
df['multiple'] = np.where(<more than one of ID1, ID2 or ID3 contains a Y>, 'Y', 'N')
but I can't figure out how to write the conditional. The number of ID columns might grow, so I can't cover every combination as a separate condition (e.g. (ID1 = Y and ID3 = Y) or (ID2 = Y and ID3 = Y)). I think I perhaps want something that counts the Y values across named columns?
Outside of pandas I would think about working with a list, appending the value of each column that holds a Y and then checking whether the list has a length greater than 1.
But I can't think how to do it within the limitations of np.where, np.select or df.loc.
Any pointers?
Answer 1:
Using NumPy to count the occurrences of Y in each row should do it:
df['multi'] = ['Y' if x > 1 else 'N' for x in np.sum(df.values == 'Y', 1)]
output:
      Name ID1   ID2   ID3 multi
Index
1        A   Y     Y     Y     Y
2        B   Y     Y  None     Y
3        B   Y  None  None     N
4        C   Y  None  None     N
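Note that comparing df.values against 'Y' looks at every column, including Name. The following is a minimal sketch, not a definitive implementation, that limits the count to the ID columns and uses np.where as the question asks. The sample frame is rebuilt from the question, and storing the blank cells as None is an assumption based on the output above.

```python
import numpy as np
import pandas as pd

# Sample frame rebuilt from the question; None for blank cells is an assumption.
df = pd.DataFrame(
    {
        "Name": ["A", "B", "B", "C"],
        "ID1": ["Y", "Y", "Y", "Y"],
        "ID2": ["Y", "Y", None, None],
        "ID3": ["Y", None, None, None],
    },
    index=pd.Index([1, 2, 3, 4], name="Index"),
)

id_cols = ["ID1", "ID2", "ID3"]  # only these columns are inspected

# Boolean frame of "cell equals 'Y'", summed per row to count the Y values.
y_count = df[id_cols].eq("Y").sum(axis=1)

# Flag rows where more than one ID column holds 'Y'.
df["Multiple"] = np.where(y_count > 1, "Y", "N")

print(df)
```

Because the column list is a plain Python list, further ID columns can be added without touching the condition.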
Answer 2:
I would do it like this:
Get a list of the columns you want to check.
cols = [x for x in testdf.columns if x.startswith("ID")]
You can use the filter method on DataFrame for this if you want, but I think explicitly selecting the list of columns is clearer, and you have full flexibility to change your conditions later.
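For comparison, here is a small sketch of both ways of picking the columns. It assumes the column names from the question (ID1, ID2, ID3); the two-row sample frame and the regular expression are illustrative choices, not part of the original answer.

```python
import pandas as pd

# Hypothetical two-row sample using the question's column names.
testdf = pd.DataFrame(
    {
        "Name": ["A", "B"],
        "ID1": ["Y", "Y"],
        "ID2": ["Y", None],
        "ID3": ["Y", None],
    }
)

# Explicit selection with a list comprehension.
cols_explicit = [c for c in testdf.columns if c.startswith("ID")]

# Same selection via DataFrame.filter with a regular expression on the labels.
cols_filter = testdf.filter(regex=r"^ID\d+$").columns.tolist()

print(cols_explicit)
print(cols_filter)
```

Both approaches pick up new ID columns automatically as long as they follow the same naming pattern.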
After that, it's just:
testdf["multiple"] = (testdf[cols] == "Y").sum(axis="columns") > 1
Explanation:
- testdf[cols] returns a DataFrame consisting of just the columns you selected in the first step.
- testdf[cols] == "Y" returns a DataFrame populated with True or False as per the condition == "Y".
- (...).sum(axis="columns") counts the True values in each row of this DataFrame, and > 1 then returns True for each row where more than one of the items is True, and False otherwise.
If you really want you can change the True values to "Y" and the False values to "N".
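A minimal sketch of that conversion, assuming a boolean column named multiple; the sample values are hypothetical. Either np.where or Series.map should work here.

```python
import numpy as np
import pandas as pd

# Hypothetical boolean flags standing in for the computed column.
testdf = pd.DataFrame({"multiple": [True, True, False, False]})

# Option 1: np.where picks "Y" where the flag is True, "N" otherwise.
testdf["multiple_np"] = np.where(testdf["multiple"], "Y", "N")

# Option 2: Series.map with an explicit lookup dict.
testdf["multiple_map"] = testdf["multiple"].map({True: "Y", False: "N"})

print(testdf)
```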
Source: https://stackoverflow.com/questions/56739320/pandas-check-if-a-value-exists-in-multiple-columns-for-each-row