Pandas: occurrence matrix from one hot encoding from pandas dataframe

Submitted by 半腔热情 on 2020-01-21 16:43:47

Question


I have a dataframe in one-hot format:

import pandas as pd

dummy_data = {'a': [0,0,1,0],'b': [1,1,1,0], 'c': [0,1,0,1],'d': [1,1,1,0]}
data = pd.DataFrame(dummy_data)

Output:

   a  b  c  d
0  0  1  0  1
1  0  1  1  1
2  1  1  0  1
3  0  0  1  0

I am trying to get the occurrence matrix from this dataframe. If I have the column names as lists instead of the one-hot format, like this:

raw = [['b','d'],['b','c','d'],['a','b','d'],['c']]
unique_categories = ['a','b','c','d']

then I am able to find the occurrence matrix like this:

# Long format: one row per (list index, category) pair
df = pd.DataFrame(raw).stack().rename('val').reset_index().drop(columns='level_1')
df = df.loc[df.val.isin(unique_categories)]

# Self-join on the list index to pair up categories that occur together
df = df.merge(df, on='level_0').query('val_x != val_y')

# Count the pairs and order rows and columns by unique_categories
adj_matrix = (pd.crosstab(df.val_x, df.val_y)
              .reindex(unique_categories, axis=0).reindex(unique_categories, axis=1)).fillna(0)

Output:

val_y  a  b  c  d
val_x            
a      0  1  0  1
b      1  0  1  3
c      0  1  0  1
d      1  3  1  0

How can I get the occurrence matrix directly from the one-hot dataframe?


Answer 1:


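You can have some fun with matrix math! For a one-hot dataframe, the product of its transpose with itself counts, for each pair of columns, the rows in which both are 1. The diagonal of that product holds each column's own total, so it is masked out to leave only the co-occurrences.


import numpy as np

# data is the one-hot dataframe from the question
u = np.diag(np.ones(data.shape[1], dtype=bool))

data.T.dot(data) * (~u)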

   a  b  c  d
a  0  1  0  1
b  1  0  1  3
c  0  1  0  1
d  1  3  1  0
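
For anyone who wants to run this end to end, here is a minimal self-contained sketch of the same idea. It rebuilds the question's one-hot dataframe and, as a variation that is not part of the original answer, zeroes the diagonal with np.fill_diagonal instead of multiplying by a boolean mask. The names co, values and occurrence are made up for this example.

```python
import numpy as np
import pandas as pd

# One-hot dataframe from the question
data = pd.DataFrame({'a': [0, 0, 1, 0], 'b': [1, 1, 1, 0],
                     'c': [0, 1, 0, 1], 'd': [1, 1, 1, 0]})

# Entry (i, j) counts the rows where columns i and j are both 1
co = data.T.dot(data)

# The diagonal holds each column's own total; zero it on a copy of the values
values = co.to_numpy().copy()
np.fill_diagonal(values, 0)

# Re-attach the column labels on both axes
occurrence = pd.DataFrame(values, index=co.index, columns=co.columns)
print(occurrence)
```

Both variants should give the same matrix; the mask version is a one-liner, while fill_diagonal may read more clearly if the diagonal is the only thing being changed.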


Source: https://stackoverflow.com/questions/58887733/pandas-occurrence-matrix-from-one-hot-encoding-from-pandas-dataframe
