Question
So I have a pandas dataframe that looks like this:
id_1  id_2  value1  value2
1     2     100     NAN
1     2     NAN     101
10    20    200     NAN
10    20    NAN     202
10    2     345     345
And I want a dataframe like this:
id_1  id_2  value1  value2
1     2     100     101
10    20    200     202
10    2     345     345
Basically, if both ID columns match up, then there will definitely be a value-NaN vs NaN-value situation, and I want to combine the rows by just replacing the NaNs.
Does pandas have a utility for this? It's not quite stacking, or melting. Maybe pivoting, but I'd need two indices. And I want to preserve any rows that don't have both indices matching.
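(For reference, a minimal sketch of how the sample frame can be rebuilt, assuming the NANs above are real np.nan values; if they are the literal string "NAN", replace them first as Answer 3 does below.)
import numpy as np
import pandas as pd

# Hypothetical reconstruction of the example frame, with np.nan for the gaps.
df = pd.DataFrame({
    "id_1":   [1, 1, 10, 10, 10],
    "id_2":   [2, 2, 20, 20, 2],
    "value1": [100, np.nan, 200, np.nan, 345],
    "value2": [np.nan, 101, np.nan, 202, 345],
})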
Answer 1:
I don't think there is a single command for this, and there are many different ways to accomplish it. However, you can use melt and pivot_table one after the other:
id_vars = ["id_1", "id_2"]
melted = df.melt(id_vars=id_vars).dropna()
pivoted = melted.pivot_table(index=id_vars, columns="variable", values="value")
print(pivoted)
variable     value1  value2
id_1 id_2
1    2        100.0   101.0
10   2        345.0   345.0
     20       200.0   202.0
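If you want the flat layout of the question back, a small follow-up (not part of the original answer) is to reset the index and drop the leftover column-axis name:
flat = pivoted.reset_index()
flat.columns.name = None  # remove the "variable" label left over from the pivot
print(flat)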
But the above solution is slower than the two following solutions.
First, you can use forward fill (ffill) to fill the NaNs and last to get the last row of each group, which contains all valid values thanks to the ffill:
ids = ["id_1", "id_2"]
df.groupby(ids).ffill() \
  .groupby(ids).last() \
  .reset_index()
   id_1  id_2  value1  value2
0     1     2     100     101
1    10     2     345     345
2    10    20     200     202
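Note: depending on the pandas version, groupby(...).ffill() may not keep the id columns in its output, which would make the second groupby fail with a KeyError. A more version-proof sketch of the same idea fills only the value columns and assigns them back before grouping again:
filled = df.copy()
# Fill NaNs within each (id_1, id_2) group, aligning back on the original index.
filled[["value1", "value2"]] = filled.groupby(ids)[["value1", "value2"]].ffill()
result = filled.groupby(ids).last().reset_index()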
Second, instead of grouping twice (since ffill returns a data frame), you may use a custom apply, which gives the same result:
def collapse(x):
    # Forward-fill within the group, then keep the last row's value columns
    # (position 2 onward, skipping the two id columns).
    return x.ffill().iloc[-1, 2:]

df.groupby(ids).apply(collapse).reset_index()
Even though we use an apply here, it is the fastest solution (at least for your provided dummy data - it may scale differently for larger datasets).
Answer 2:
One way (df is your initial dataframe):
df1 = df.dropna(subset=["value1"]).drop("value2", axis=1)
df2 = df.dropna(subset=["value2"]).drop("value1", axis=1)
dfNew = pd.merge(df1, df2, on=["id_1", "id_2"], how="outer")
Merging on the id columns aligns the two halves by id (a positional concat would pair rows by index and misalign them), and how="outer" keeps any row that appears in only one half.
Answer 3:
groupby + first (first takes the first non-null value per column within each group, so the NaNs drop out):
df=df.replace('NAN',np.nan) # make sure it is np.nan not string NAN
df.groupby(['id_1','id_2'],as_index=False).first()
Out[37]:
   id_1  id_2  value1  value2
0     1     2     100     101
1    10     2     345     345
2    10    20     200     202
Answer 4:
You can also sum the rows together, since np.nan is ignored by sum by default.
df = df.replace("NAN", np.nan)  # turn the string "NAN" into np.nan
df.groupby(["id_1", "id_2"])[["value1", "value2"]].sum().reset_index()
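One caveat, assuming a pandas version that supports min_count (0.22 or later): if a group has only NaNs in a column, sum collapses them to 0; passing min_count=1 keeps the NaN instead:
df.groupby(["id_1", "id_2"])[["value1", "value2"]].sum(min_count=1).reset_index()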
Source: https://stackoverflow.com/questions/48115481/pandas-combine-rows-based-on-certain-column-values-and-nan