Compare multiple pandas columns (1st and 2nd, after 3rd and 4rth, after etc) with vectorization (better) or other method

问题

This code compares based on condition the var1 and var2 and creates Results1 based on choices (this code works well):

# from: https://stackoverflow.com/questions/27474921/compare-two-columns-using-pandas?answertab=oldest#tab-top
# from: https://stackoverflow.com/questions/60099141/negation-in-np-select-condition

import pandas as pd
import numpy as np

# Creating one column from two columns. We asume that in every row there is one NaN and one value and that value fills new column.
df = pd.DataFrame({ 'var1': ['a', 'b', 'c',np.nan, np.nan],
                   'var2': [1, 2, np.nan , 4, np.nan], 
                   'var3': [np.nan , "x", np.nan, "y", "z"],
                   'var4': [np.nan , 4, np.nan, 5, 6],
                   'var5': ["a", np.nan , "b", np.nan, "c"],
                   'var6': [1, np.nan , 2, np.nan, 3]
                 })


#all conditions that are connected with logical operators (&, |, etc) should be in ().
conditions = [
    (df["var1"].notna()) & (df['var2'].notna()),
    (pd.isna(df["var1"])) & (pd.isna(df["var2"])),
    (df["var1"].notna()) & (pd.isna(df["var2"])),
    (pd.isna(df["var1"])) & (df['var2'].notna())]

choices = ["Both values", np.nan, df["var1"], df["var2"]]

df['Result1'] = np.select(conditions, choices, default=np.nan)

df looks like as it should:

|    | var1   |   var2 | var3   |   var4 | var5   |   var6 | Result1     |
|---:|:-------|-------:|:-------|-------:|:-------|-------:|:------------|
|  0 | a      |      1 | nan    |    nan | a      |      1 | Both values |
|  1 | b      |      2 | x      |      4 | nan    |    nan | Both values |
|  2 | c      |    nan | nan    |    nan | b      |      2 | c           |
|  3 | nan    |      4 | y      |      5 | nan    |    nan | 4           |
|  4 | nan    |    nan | z      |      6 | c      |      3 | nan         |

Now I want to compare multiple pandas columns (in my example var1 and var2, after var3 and var4, after var5 and var6) and based on condition and choices create corresponding Results column (in my example Result1, Result2, Result3). I thought the best way should be to use vectorization (because of better performance). The df I want to get should look like:

|    | var1   |   var2 | var3   |   var4 | var5   |   var6 | Result1     | Result2     | Result3     |
|---:|:-------|-------:|:-------|-------:|:-------|-------:|:------------|:------------|:------------|
|  0 | a      |      1 | nan    |    nan | a      |      1 | Both values | nan         | Both values |
|  1 | b      |      2 | x      |      4 | nan    |    nan | Both values | Both values | nan         |
|  2 | c      |    nan | nan    |    nan | b      |      2 | c           | nan         | Both values |
|  3 | nan    |      4 | y      |      5 | nan    |    nan | 4           | Both values | nan         |
|  4 | nan    |    nan | z      |      6 | c      |      3 | nan         | Both values | Both values |

I tried this:

import pandas as pd
import numpy as np

# Creating one column from two columns. We asume that in every row there is one NaN and one value and that value fills new column.
df = pd.DataFrame({ 'var1': ['a', 'b', 'c',np.nan, np.nan],
                   'var2': [1, 2, np.nan , 4, np.nan], 
                   'var3': [np.nan , "x", np.nan, "y", "z"],
                   'var4': [np.nan , 4, np.nan, 5, 6],
                   'var5': ["a", np.nan , "b", np.nan, "c"],
                   'var6': [1, np.nan , 2, np.nan, 3]
                 })


col1 = ["var1", "var3", "var5"]
col2 = ["var2", "var4", "var6"]
colR = ["Result1", "Result2", "Result3"]

#all conditions that are connected with logical operators (&, |, etc) should be in ().
conditions = [
    (df[col1].notna()) & (df[col2].notna()),
    (pd.isna(df[col1])) & (pd.isna(df[col2])),
    (df[col1].notna()) & (pd.isna(df[col2])),
    (pd.isna(df[col1])) & (df[col2].notna())]

choices = ["Both values", np.nan, df[col1], df[col2]]

df[colR] = np.select(conditions, choices, default=np.nan)

Buy it gave me error:

ValueError: shape mismatch: objects cannot be broadcast to a single shape

Question: How to achieve my goal with vectorization (preferable because of better performance) or other method?

回答1:

The issue is that pandas DataFrames force alignment on the index, but df[col1] and df[col2] have no overlapping columns.

In this case, you really want to work with the underlying numpy arrays. Also because .isnull() is the opposite of notnull you can simplify this a lot. We'll concat to add the new columns back.

col1 = ["var1", "var3", "var5"]
col2 = ["var2", "var4", "var6"]
colR = ["Result1", "Result2", "Result3"]

s1 = df[col1].isnull().to_numpy()
s2 = df[col2].isnull().to_numpy()

conditions = [~s1 & ~s2, s1 & s2, ~s1 & s2, s1 & ~s2]
choices = ["Both values", np.nan, df[col1], df[col2]]

df = pd.concat([df, pd.DataFrame(np.select(conditions, choices), columns=colR, index=df.index)], axis=1)

  var1  var2 var3  var4 var5  var6      Result1      Result2      Result3
0    a   1.0  NaN   NaN    a   1.0  Both values          NaN  Both values
1    b   2.0    x   4.0  NaN   NaN  Both values  Both values          NaN
2    c   NaN  NaN   NaN    b   2.0            c          NaN  Both values
3  NaN   4.0    y   5.0  NaN   NaN            4  Both values          NaN
4  NaN   NaN    z   6.0    c   3.0          NaN  Both values  Both values

来源：https://stackoverflow.com/questions/60101845/compare-multiple-pandas-columns-1st-and-2nd-after-3rd-and-4rth-after-etc-wit

标签

python

pandas

numpy

vectorization