Stacking select columns as rows in pandas dataframe

☆樱花仙子☆ 提交于 2021-01-28 05:23:53

问题


Suppose I have df_in below:

df_in = pd.DataFrame({'X': ['a', 'b', 'c'], 'A': [1, 0, 0], 'B': [1, 1, 0]})

df_in:

+---+---+---+---+
|   | X | A | B |
+---+---+---+---+
| 0 | a | 1 | 1 |
| 1 | b | 0 | 1 |
| 2 | c | 0 | 0 |
+---+---+---+---+

I want to achieve something like the following:

df_out = pd.DataFrame({'X': ['a', 'a', 'b'], 'Y': ['A', 'B', 'B']})

df_out:

+---+---+---+
|   | X | Y |
+---+---+---+
| 0 | a | A |
| 1 | a | B |
| 2 | b | B |
+---+---+---+

I also have a list containing the columns: l = list(['A', 'B']). The logic is, for each column in df_in that is in l, repeat those observations where the column value == 1, and add the column name to a new column in df_out, this is Y in the example. In reality there are more columns in df_in and not all of them are in l, which is why I want to solve this without explicit references to columns A, B and X.

NOTE: This is not entirely covered by this answer since, as stated above, there are many columns in reality, and these can be of any type and data, so the solution, df_out, needs to take into account all of the original columns (X in this case). In theory, X can also be a binary 0/1 column, but should only affect the outcome in the same way as A and B if it's included in l. I hope this helps clarify.


回答1:


Use Index.difference for all columns without l pass to DataFrame.set_index, reshape by DataFrame.stack, filter only 1 and last convert MultiIndex.to_frame to new DataFrame with rename last column:

l = ['A', 'B']

c = df_in.columns.difference(l, sort=False).tolist()
s = df_in.set_index(c).stack()
df_out = s[s == 1].index.to_frame(index=False).rename(columns={len(c):'Y'})
print (df_out)
   X  Y
0  a  A
1  a  B
2  b  B


来源:https://stackoverflow.com/questions/60167436/stacking-select-columns-as-rows-in-pandas-dataframe

易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!