“Correlation matrix” for strings. Similarity of nominal data

只谈情不闲聊 submitted on 2019-12-31 04:56:06

Question


Here is my data frame, df:

  store_1  store_2    store_3    store_4
0 banana   banana     plum       banana
1 orange   tangerine  pear       orange
2 apple    pear       melon      apple
3 pear     raspberry  pineapple  plum
4 plum     tomato     peach      tomato
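
To make the question reproducible, the frame can be rebuilt roughly as follows. This is a minimal sketch that assumes pandas is available; the values are copied from the table above.

```python
import pandas as pd

# One column per store, one row per listed product (values taken from the table)
df = pd.DataFrame({
    "store_1": ["banana", "orange", "apple", "pear", "plum"],
    "store_2": ["banana", "tangerine", "pear", "raspberry", "tomato"],
    "store_3": ["plum", "pear", "melon", "pineapple", "peach"],
    "store_4": ["banana", "orange", "apple", "plum", "tomato"],
})
print(df)
```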

I'm looking for a way to count the number of co-occurrences between stores (to compare their similarity).


Answer 1:


You can try something like this:

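import itertools as it
import numpy as np
import pandas as pd

# Fraction of the items in store a that also appear in store b
corr = lambda a, b: len(set(a).intersection(b)) / len(a)
# Scores for every pair of stores (columns), including each store with itself
c = [corr(*x) for x in it.combinations_with_replacement(df.T.values.tolist(), 2)]

n = df.shape[1]  # number of stores
j = 0
x = []
for i in range(n, 0, -1):
    # Pad the lower triangle with NaN, then fill in the rest of the row
    x.append([np.nan] * (n - i) + c[j:j + i])
    j += i
pd.DataFrame(x, columns=df.columns, index=df.columns)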

Which yields

        store_1 store_2 store_3 store_4
store_1 1.0     0.4     0.4     0.8
store_2 NaN     1.0     0.2     0.4
store_3 NaN     NaN     1.0     0.2
store_4 NaN     NaN     NaN     1.0
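
If raw counts rather than fractions are wanted, as the question asks, a full symmetric matrix can be built from per-store sets. The following is only a sketch under the assumption that each column lists a store's products; the names `items` and `counts` are illustrative and not part of the original answer.

```python
import pandas as pd

df = pd.DataFrame({
    "store_1": ["banana", "orange", "apple", "pear", "plum"],
    "store_2": ["banana", "tangerine", "pear", "raspberry", "tomato"],
    "store_3": ["plum", "pear", "melon", "pineapple", "peach"],
    "store_4": ["banana", "orange", "apple", "plum", "tomato"],
})

# One set of products per store
items = {store: set(df[store]) for store in df.columns}

# counts.loc[a, b] = number of products that appear in both store a and store b
counts = pd.DataFrame(
    [[len(items[a] & items[b]) for b in df.columns] for a in df.columns],
    index=df.columns,
    columns=df.columns,
)
print(counts)
```

Dividing `counts` by the number of rows (5 here) gives the same fractions as the table above, mirrored across the diagonal.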



Answer 2:


If you wish to estimate the similarity of the stores with regard to their products, you could use:

One-hot encoding

Each store can then be described by a vector of length n, where n is the number of distinct products across all stores, with one position per product, such as:

banana orange apple pear plum tangerine raspberry tomato melon ...

store_1 is then described as 1 1 1 1 1 0 0 0 0 0 ...
store_2 is described as 1 0 0 1 0 1 1 1 0 ...

This leaves you with one numerical vector per store, on which you can compute a dissimilarity measure such as the Euclidean distance.
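
The one-hot idea can be sketched in pandas and NumPy roughly as follows. This is an illustration under stated assumptions, not the answerer's own code: the frame is rebuilt from the question, the product columns are sorted alphabetically (so their order differs from the listing above), and the names `onehot` and `dist` are made up for the example.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "store_1": ["banana", "orange", "apple", "pear", "plum"],
    "store_2": ["banana", "tangerine", "pear", "raspberry", "tomato"],
    "store_3": ["plum", "pear", "melon", "pineapple", "peach"],
    "store_4": ["banana", "orange", "apple", "plum", "tomato"],
})

# All distinct products across all stores, in a fixed (alphabetical) order
products = sorted(set(df.values.ravel()))

# Rows = stores, columns = products; 1 if the store lists the product, else 0
onehot = pd.DataFrame(
    [[int(p in set(df[s])) for p in products] for s in df.columns],
    index=df.columns,
    columns=products,
)

# Pairwise Euclidean distance between the store vectors via broadcasting
v = onehot.to_numpy(dtype=float)
diff = v[:, None, :] - v[None, :, :]
dist = pd.DataFrame(
    np.sqrt((diff ** 2).sum(axis=-1)),
    index=df.columns,
    columns=df.columns,
)
print(onehot)
print(dist)
```

For binary vectors the squared Euclidean distance equals the number of products that exactly one of the two stores carries, so a smaller distance means more similar stores.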



Source: https://stackoverflow.com/questions/54279080/correlation-matrix-for-strings-similarity-of-nominal-data
