How to find the correlation between two strings in pandas

Submitted by 拟墨画扇 on 2019-12-25 04:21:58

Question


I have a DataFrame (df) of string values:

   Keyword
    plant
    cell
    cat
    Pandas

And I want to find the relationship or correlation between pairs of these string values.

I have tried pandas' corr = df1.corrwith(df2, axis=0), but that only computes the correlation between numerical values. I want to see whether two strings are related by computing a correlation distance between them. How can I do that?


Answer 1:


There are a few steps here. The first thing you need to do is extract some sort of vector for each word.

A good way is to use gensim with pretrained word2vec vectors (you need to download the vector files first):

from gensim.models import KeyedVectors

# Load the pretrained Google News vectors (binary word2vec format)
model = KeyedVectors.load_word2vec_format('data/GoogleNews-vectors-negative300.bin', binary=True)

After loading the pretrained vectors, you need to extract the vector for each word:

vector = model['plant']

Or, for the pandas column in the question:

df['Vectors'] = df['Keyword'].apply(lambda x: model[x])
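
Note that indexing the model with a word it does not contain raises a KeyError, which may well happen for a capitalised or rare keyword. Below is a minimal sketch of a guarded lookup. It uses a plain dict of made-up 3-dimensional vectors as a stand-in for the loaded model (an assumption for illustration only), and the helper name safe_vector is hypothetical:

```python
# Stand-in for the loaded word2vec model: a dict mapping word -> vector.
# The vectors are made up for illustration; real ones have 300 dimensions.
toy_model = {
    "plant": [0.9, 0.1, 0.0],
    "cell": [0.7, 0.3, 0.1],
    "cat": [0.1, 0.9, 0.2],
}

def safe_vector(model, word):
    """Return the vector for `word`, or None if the model does not know it."""
    return model[word] if word in model else None

keywords = ["plant", "cell", "cat", "Pandas"]
vectors = {word: safe_vector(toy_model, word) for word in keywords}

# Words without a vector have to be dropped or handled before computing distances.
missing = [word for word, vec in vectors.items() if vec is None]
print(missing)
```

With the real model, the same check could be used in the column example, e.g. df['Keyword'].apply(lambda x: safe_vector(model, x)), followed by dropping the rows where no vector was found.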

Once this is done, you can calculate the distance between two vectors using a number of methods, e.g. Euclidean distance:

from sklearn.metrics.pairwise import euclidean_distances
distances = euclidean_distances(list(df['Vectors']))

distances will be a square matrix holding the distance between every pair of words, with 0 on the diagonal (each word compared with itself). The closer a distance is to 0, the more similar the words are.
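
To make the shape of that matrix concrete, here is a small sketch that builds the same kind of pairwise Euclidean distance matrix by hand. The 3-dimensional vectors are made up for illustration and are not real word2vec output:

```python
import math

words = ["plant", "cell", "cat"]

# Made-up vectors standing in for the word2vec embeddings.
toy_vectors = [
    [0.9, 0.1, 0.0],  # plant
    [0.7, 0.3, 0.1],  # cell
    [0.1, 0.9, 0.2],  # cat
]

# Entry [i][j] is the Euclidean distance between word i and word j.
distances = [[math.dist(a, b) for b in toy_vectors] for a in toy_vectors]

# Print one labelled row per word.
for word, row in zip(words, distances):
    print(word, [round(d, 3) for d in row])
```

Each row lists one word's distance to every word, so the matrix is symmetric and the diagonal is 0. With these toy values, plant is closer to cell than to cat.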

You can swap in different models and different distance metrics, but this should work as a starting point.
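
As one example of a different metric, cosine similarity is commonly used with word vectors. It compares the direction of two vectors and ignores their length. The sketch below implements it in plain Python on made-up vectors (assumed values, not real embeddings):

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between vectors a and b (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Made-up vectors standing in for the word2vec embeddings.
plant = [0.9, 0.1, 0.0]
cell = [0.7, 0.3, 0.1]
cat = [0.1, 0.9, 0.2]

print(cosine_similarity(plant, cell))  # higher: the vectors point in similar directions
print(cosine_similarity(plant, cat))   # lower: the vectors point in different directions
```

Unlike a distance, a higher value here means the words are more similar. In practice you would probably not hand-roll this: sklearn.metrics.pairwise also provides cosine_similarity, and gensim's KeyedVectors has a similarity(w1, w2) method that returns the cosine similarity of two words.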



Source: https://stackoverflow.com/questions/55394673/how-to-find-the-correlation-between-two-strings-in-pandas
