Generating a similarity matrix from pandas dataframe

Question

I have a DataFrame df:

id    val1    val2    val3
100    aa      bb      cc
200    bb      cc      0
300    aa      cc      0
400    bb      aa      cc

From this I need to generate a DataFrame like this:

     100  200  300  400                    
100    3    2    2    3
200    2    2    1    2
300    2    1    2    2
400    3    2    2    3

Explanation: id 100 contains aa, bb, cc and id 200 contains bb, cc, 0.

They have 2 values in common (bb and cc), so the cell at index 100 and column 200 of the final matrix should hold 2.

Similarly, id 200 contains bb, cc, 0 and id 300 contains aa, cc, 0.

Only cc is shared (0 is a placeholder and does not count), so the similarity is 1 and the cell at index 200 and column 300 should hold 1.

Answer 1:

Some preprocessing first: make id the index and replace the '0' placeholders with NaN, since they should not count as matches.

df = df.set_index('id').replace('0', np.nan)

df    
    val1 val2 val3
id                
100   aa   bb   cc
200   bb   cc  NaN
300   aa   cc  NaN
400   bb   aa   cc

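Now combine pd.get_dummies and df.dot to get the similarity scores. One-hot encode the values, add up the indicator columns that belong to the same value, and take the dot product of the result with its transpose; each cell then counts the values two ids share.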

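x = pd.get_dummies(df, dtype=int)
# Dummy columns are named like 'val1_aa'; group them by the part after '_'.
# groupby(..., axis=1) is deprecated in pandas 2.1+, so group the transposed frame instead.
y = x.T.groupby(x.columns.str.split('_').str[1]).sum().T
y.dot(y.T)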

     100  200  300  400  
id                   
100    3    2    2    3
200    2    2    1    2
300    2    1    2    2
400    3    2    2    3
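
The same one-hot and dot-product idea can also be written without parsing the dummy column names. The sketch below is a variant, not the answer's original code: it stacks all value columns into a single Series before encoding, and it assumes the zeros in the question are the string '0' and that no value repeats within a row.

```python
import numpy as np
import pandas as pd

# Sample data from the question; the zeros are assumed to be the string '0'.
df = pd.DataFrame({
    'id': [100, 200, 300, 400],
    'val1': ['aa', 'bb', 'aa', 'bb'],
    'val2': ['bb', 'cc', 'cc', 'aa'],
    'val3': ['cc', '0', '0', 'cc'],
}).set_index('id').replace('0', np.nan)

# Stack the value columns into one Series indexed by (id, column name),
# one-hot encode it, then add the indicator rows back up per id.
# get_dummies gives missing values an all-zero row, so they never count.
onehot = pd.get_dummies(df.stack(), dtype=int)
y = onehot.groupby(level=0).sum()

# Each cell of y.dot(y.T) is the number of values two ids have in common.
# Caveat: a value repeated within one row would be counted more than once.
similarity = y.dot(y.T)
print(similarity)
```

Because the encoding works on the values themselves, this version does not break if a value happens to contain an underscore.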

Answer 2:

You can convert each row into a set and then intersect the sets pairwise:

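# Assumes id is still a regular column, as in the question; make it the index first
# so the ids label the result and are not counted as values.
df = df.set_index('id').replace('0', np.nan)
c = df.apply(lambda x: set(x.dropna()), axis=1)
df2 = pd.DataFrame([[len(x.intersection(y)) for x in c] for y in c],
                   columns=c.index, index=c.index)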

This gives the desired output:

     100  200  300  400
100    3    2    2    3
200    2    2    1    2
300    2    1    2    2
400    3    2    2    3
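
For a reproducible check, the set-based approach can be wrapped in a small function and run against the sample data. This is only a sketch: the helper name set_similarity and its parameters id_col and placeholder are made up for illustration, and the zeros are again assumed to be the string '0'.

```python
import numpy as np
import pandas as pd

# Sample data from the question; the zeros are assumed to be the string '0'.
df = pd.DataFrame({
    'id': [100, 200, 300, 400],
    'val1': ['aa', 'bb', 'aa', 'bb'],
    'val2': ['bb', 'cc', 'cc', 'aa'],
    'val3': ['cc', '0', '0', 'cc'],
})

def set_similarity(frame, id_col='id', placeholder='0'):
    """Count shared values for every pair of rows (hypothetical helper)."""
    values = frame.set_index(id_col).replace(placeholder, np.nan)
    # Build one set of non-missing values per id.
    sets = {idx: set(row.dropna()) for idx, row in values.iterrows()}
    ids = list(sets)
    # Pairwise intersection sizes, with rows and columns in the original id order.
    counts = [[len(sets[a] & sets[b]) for b in ids] for a in ids]
    return pd.DataFrame(counts, index=ids, columns=ids)

result = set_similarity(df)
print(result)
```

The nested loop compares every pair of rows, so the cost grows quadratically with the number of ids; for large frames the dot-product approach is likely the better fit.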

Source: https://stackoverflow.com/questions/46441705/generating-a-similarity-matrix-from-pandas-dataframe

Tags

python

pandas

dataframe

similarity