pandas, combine rows based on certain column values and NAN

Submitted by 回眸只為那壹抹淺笑 on 2021-02-18 07:41:31

Question


So I have a pandas dataframe that looks like this:

id_1    id_2    value1    value2
1       2       100       NAN
1       2       NAN       101
10      20      200       NAN    
10      20      NAN       202
10      2       345       345

And I want a dataframe like this:

id_1    id_2    value1    value2
1       2       100       101
10      20      200       202
10      2       345       345

Basically, whenever both ID columns match, the rows are guaranteed to be in a value-NaN vs. NaN-value situation, and I want to combine them into a single row by filling in the NaNs.

Does pandas have a utility for this? It's not quite stacking or melting. Maybe pivoting, but I'd need two indices. I also want to preserve any rows whose pair of IDs does not match another row.


Answer 1:


I don't think there is a single command for this, and there are many different ways to accomplish it. One option is to use melt followed by pivot_table:

id_vars = ["id_1", "id_2"]
melted = df.melt(id_vars=id_vars).dropna()
pivoted = melted.pivot_table(index=id_vars, columns="variable", values="value")

print(pivoted)

    variable    value1  value2
id_1    id_2        
1       2       100.0   101.0
10      2       345.0   345.0
        20      200.0   202.0
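For a copy-pasteable version of the melt/pivot_table approach, here is a minimal sketch. It rebuilds the sample frame from the question under the assumption that the NAN entries are real np.nan values and that the value columns are numeric; the final reset_index/rename_axis step is an addition of this sketch to get back flat columns.

```python
import numpy as np
import pandas as pd

# Sample data from the question, assuming "NAN" means a real missing value.
df = pd.DataFrame({
    "id_1": [1, 1, 10, 10, 10],
    "id_2": [2, 2, 20, 20, 2],
    "value1": [100, np.nan, 200, np.nan, 345],
    "value2": [np.nan, 101, np.nan, 202, 345],
})

id_vars = ["id_1", "id_2"]

# Long format: one row per (id_1, id_2, variable); drop the missing values.
melted = df.melt(id_vars=id_vars).dropna()

# Back to wide format. pivot_table aggregates with the mean by default,
# which is harmless here because each cell holds exactly one value.
pivoted = melted.pivot_table(index=id_vars, columns="variable", values="value")

# Flatten the MultiIndex and drop the "variable" name on the column axis.
result = pivoted.reset_index().rename_axis(columns=None)
print(result)
```

Note that pivot_table sorts the ID pairs, so the row order can differ from the order in the original frame.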

However, the solution above is slower than the following two.

First, you can forward fill the NaNs within each group with ffill and then take the last row of each group with last; because of the ffill, that row contains all valid values:

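ids = ["id_1", "id_2"]

# Group on index levels: in recent pandas versions groupby(...).ffill() drops
# the grouping columns, so a second groupby on the column names would fail.
df.set_index(ids)\
  .groupby(level=ids).ffill()\
  .groupby(level=ids).last()\
  .reset_index()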

    id_1    id_2    value1  value2
0   1       2       100     101
1   10      2       345     345
2   10      20      200     202

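Second, instead of grouping twice (ffill returns a plain data frame, so it has to be regrouped), you can use a custom apply, which gives the same result:

value_cols = ["value1", "value2"]

def collapse(x):
    # Forward fill within the group, then keep the last (fully filled) row.
    return x.ffill().iloc[-1]

df.groupby(ids)[value_cols].apply(collapse).reset_index()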

Even though this uses apply, it was the fastest of the three solutions on the dummy data you provided; it may scale differently on larger datasets.
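To check the speed comparison on your own data, a rough timing sketch along the following lines may help. The helper names (melt_pivot, ffill_last, apply_collapse) are made up for this sketch, the sample frame assumes real np.nan values, and the timings depend heavily on data size and pandas version, so no numbers are claimed here.

```python
import timeit

import numpy as np
import pandas as pd

# Sample data from the question, assuming "NAN" means a real missing value.
df = pd.DataFrame({
    "id_1": [1, 1, 10, 10, 10],
    "id_2": [2, 2, 20, 20, 2],
    "value1": [100, np.nan, 200, np.nan, 345],
    "value2": [np.nan, 101, np.nan, 202, 345],
})
ids = ["id_1", "id_2"]
value_cols = ["value1", "value2"]


def melt_pivot(frame):
    # Reshape to long format, drop NaNs, reshape back to wide format.
    melted = frame.melt(id_vars=ids).dropna()
    return melted.pivot_table(index=ids, columns="variable", values="value").reset_index()


def ffill_last(frame):
    # Forward fill within each group, then take the last row per group.
    indexed = frame.set_index(ids)
    return indexed.groupby(level=ids).ffill().groupby(level=ids).last().reset_index()


def apply_collapse(frame):
    # Same idea as ffill_last, done in a single groupby via apply.
    return frame.groupby(ids)[value_cols].apply(lambda g: g.ffill().iloc[-1]).reset_index()


for fn in (melt_pivot, ffill_last, apply_collapse):
    seconds = timeit.timeit(lambda: fn(df), number=200)
    print(f"{fn.__name__}: {seconds:.3f} s for 200 runs")
```

Replacing df with a larger frame of the same shape is the simplest way to see how the ranking changes with size.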




Answer 2:


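One way (df is your initial dataframe) is to split the frame by value column and then join the pieces back together on the ID columns:

ids = ["id_1", "id_2"]
# Index both halves by the IDs so that concat aligns rows on them
# (this assumes each ID pair has at most one non-NaN entry per value column).
df1 = df.dropna(subset=["value1"]).drop("value2", axis=1).set_index(ids)
df2 = df.dropna(subset=["value2"]).drop("value1", axis=1).set_index(ids)
dfNew = pd.concat([df1, df2], axis=1).reset_index()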



Answer 3:


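Use groupby + first. Within each group, first returns the first non-null value of every column, so the NaNs are skipped.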

df=df.replace('NAN',np.nan) # make sure it is np.nan not string NAN

df.groupby(['id_1','id_2'],as_index=False).first()
Out[37]: 
   id_1  id_2 value1 value2
0     1     2    100    101
1    10     2    345    345
2    10    20    200    202
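As a self-contained illustration of the groupby + first idea, the sketch below starts from a hypothetical variant of the sample data in which the missing entries arrived as the literal string "NAN". It swaps the replace call for pd.to_numeric with errors="coerce", which is a substitution made here so that the value columns end up with a numeric dtype; it is not part of the answer above.

```python
import pandas as pd

# Hypothetical variant of the sample data: missing entries are the string "NAN".
df = pd.DataFrame({
    "id_1": [1, 1, 10, 10, 10],
    "id_2": [2, 2, 20, 20, 2],
    "value1": [100, "NAN", 200, "NAN", 345],
    "value2": ["NAN", 101, "NAN", 202, 345],
})

value_cols = ["value1", "value2"]

# Convert to numbers; anything that cannot be parsed becomes NaN.
df[value_cols] = df[value_cols].apply(pd.to_numeric, errors="coerce")

# first() picks the first non-null value of each column within each group.
merged = df.groupby(["id_1", "id_2"], as_index=False).first()
print(merged)
```

Because first skips nulls column by column, the row with IDs 10 and 2, which has no partner row, passes through unchanged.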



Answer 4:


You can also sum the rows of each group, as np.nan is skipped by sum by default.

df = df.replace("NAN", np.nan)  # turn "NAN" into np.nan
df.groupby(["id_1", "id_2"])[["value1", "value2"]].sum().reset_index()
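One caveat with the sum approach is worth sketching: if a group has no valid value at all in some column, a plain sum reports 0 rather than NaN. The example below uses a made-up ID pair (7, 7) that is not in the question's data to show this, along with the min_count=1 option that keeps such cells as NaN.

```python
import numpy as np
import pandas as pd

# The (7, 7) row is hypothetical: its value2 is never observed.
df = pd.DataFrame({
    "id_1": [1, 1, 7],
    "id_2": [2, 2, 7],
    "value1": [100, np.nan, 300],
    "value2": [np.nan, 101, np.nan],
})

grouped = df.groupby(["id_1", "id_2"])[["value1", "value2"]]

# Default: the sum over an all-NaN group is 0.
default_sum = grouped.sum().reset_index()

# min_count=1: require at least one valid value, otherwise return NaN.
strict_sum = grouped.sum(min_count=1).reset_index()

print(default_sum)
print(strict_sum)
```

Summing also only reproduces the original values when each ID pair has at most one non-NaN entry per column; if a value is repeated across rows, it is added up instead of kept.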


Source: https://stackoverflow.com/questions/48115481/pandas-combine-rows-based-on-certain-column-values-and-nan
