Question
So I have a pandas dataframe that looks like this:
id_1  id_2  value1  value2
1     2     100     NAN
1     2     NAN     101
10    20    200     NAN
10    20    NAN     202
10    2     345     345
And I want a dataframe like this:
id_1  id_2  value1  value2
1     2     100     101
10    20    200     202
10    2     345     345
Basically, if both ID columns match up, then there will definitely be a value-NaN vs NaN-value situation, and I want to combine the rows by just replacing the NaNs.
Does pandas have a utility for this? It's not quite stacking, or melting. Maybe pivoting, but I'd need two indices. And I want to preserve any rows that don't have both indices matching.
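(For reference, a minimal sketch of how the sample frame can be rebuilt, assuming the NANs above are real np.nan values; if they are the literal string "NAN", replace them first as Answer 3 does below.)
import numpy as np
import pandas as pd

# Hypothetical reconstruction of the example frame, with np.nan for the gaps.
df = pd.DataFrame({
    "id_1":   [1, 1, 10, 10, 10],
    "id_2":   [2, 2, 20, 20, 2],
    "value1": [100, np.nan, 200, np.nan, 345],
    "value2": [np.nan, 101, np.nan, 202, 345],
})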
Answer 1:
I don't think there is a single command for this, and there are many different ways to accomplish it. However, you can use melt and pivot_table one after the other:
id_vars = ["id_1", "id_2"]
melted = df.melt(id_vars=id_vars).dropna()
pivoted = melted.pivot_table(index=id_vars, columns="variable", values="value")
print(pivoted)
variable     value1  value2
id_1 id_2
1    2        100.0   101.0
10   2        345.0   345.0
     20       200.0   202.0
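If you want the flat layout of the question back, a small follow-up (not part of the original answer) is to reset the index and drop the leftover column-axis name:
flat = pivoted.reset_index()
flat.columns.name = None  # remove the "variable" label left over from the pivot
print(flat)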
But the above solution is slower than the two following solutions.
First, you can use forward fill (ffill) to fill the NaNs and last to get the last row of each group, which contains all valid values thanks to the ffill:
ids = ["id_1", "id_2"]
df.groupby(ids).ffill() \
  .groupby(ids).last() \
  .reset_index()
   id_1  id_2  value1  value2
0     1     2     100     101
1    10     2     345     345
2    10    20     200     202
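Note: depending on the pandas version, groupby(...).ffill() may not keep the id columns in its output, which would make the second groupby fail with a KeyError. A more version-proof sketch of the same idea fills only the value columns and assigns them back before grouping again:
filled = df.copy()
# Fill NaNs within each (id_1, id_2) group, aligning back on the original index.
filled[["value1", "value2"]] = filled.groupby(ids)[["value1", "value2"]].ffill()
result = filled.groupby(ids).last().reset_index()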
Second, instead of grouping twice (since ffill returns a data frame), you may use a custom apply, which gives the same result:
def collapse(x):
    # Forward-fill within the group, then keep the last row's value columns
    # (position 2 onward, skipping the two id columns).
    return x.ffill().iloc[-1, 2:]

df.groupby(ids).apply(collapse).reset_index()
Even though we use an apply here, it is the fastest solution (at least for your provided dummy data - it may scale differently for larger datasets).
Answer 2:
One way (df is your initial dataframe):
df1 = df.dropna(subset=["value1"]).drop("value2", axis=1)
df2 = df.dropna(subset=["value2"]).drop("value1", axis=1)
dfNew = pd.merge(df1, df2, on=["id_1", "id_2"], how="outer")
Merging on the id columns aligns the two halves by id (a positional concat would pair rows by index and misalign them), and how="outer" keeps any row that appears in only one half.
Answer 3:
groupby + first (first takes the first non-null value per column within each group, so the NaNs drop out):
df=df.replace('NAN',np.nan) # make sure it is np.nan not string NAN
df.groupby(['id_1','id_2'],as_index=False).first()
Out[37]:
   id_1  id_2  value1  value2
0     1     2     100     101
1    10     2     345     345
2    10    20     200     202
Answer 4:
You can also sum the rows together, since np.nan is ignored by sum by default.
df = df.replace("NAN", np.nan)  # turn the string "NAN" into np.nan
df.groupby(["id_1", "id_2"])[["value1", "value2"]].sum().reset_index()
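One caveat, assuming a pandas version that supports min_count (0.22 or later): if a group has only NaNs in a column, sum collapses them to 0; passing min_count=1 keeps the NaN instead:
df.groupby(["id_1", "id_2"])[["value1", "value2"]].sum(min_count=1).reset_index()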
Source: https://stackoverflow.com/questions/48115481/pandas-combine-rows-based-on-certain-column-values-and-nan