How to find the correlation between two strings in pandas

Question

I have a DataFrame of string values:

   Keyword
    plant
    cell
    cat
    Pandas

I want to find the relationship or correlation between any two of these string values.

I have used pandas: corr = df1.corrwith(df2, axis=0). This is useful for finding the correlation between numerical values, but I want to see whether two strings are related by finding the correlation distance. How can I do that?

Answer 1:

There are a few steps here. The first is to extract some sort of vector for each word.

A good way to do this is with gensim and pretrained word2vec vectors (you need to download the files from here):

from gensim.models import KeyedVectors

model = KeyedVectors.load_word2vec_format('data/GoogleNews-vectors-negative300.bin', binary=True)

After loading the pretrained vectors, you need to extract the vector for each word:

vector = model['plant']

Or, for the pandas column in the example:

df['Vectors'] = df['Keyword'].apply(lambda x: model[x])
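One caveat with the lookup above: indexing the model with a word that is not in its vocabulary raises a KeyError, and lookups are exact string matches, so a keyword such as Pandas may not be found even if pandas is. A minimal sketch of a more forgiving lookup follows. Here a plain dict of made-up 3-dimensional vectors (toy_model) stands in for the loaded model, and lookup_vector is a hypothetical helper name, not part of gensim.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the loaded word-vector model (assumption: made-up values).
# The helper below only relies on `word in model` and `model[word]`.
toy_model = {
    "plant": np.array([0.9, 0.1, 0.0]),
    "cell": np.array([0.8, 0.3, 0.1]),
    "cat": np.array([0.1, 0.9, 0.2]),
}


def lookup_vector(word, model):
    """Return the vector for `word`, trying a lowercase fallback, else None."""
    for candidate in (word, word.lower()):
        if candidate in model:
            return model[candidate]
    return None


df = pd.DataFrame({"Keyword": ["plant", "cell", "cat", "Pandas"]})
df["Vectors"] = df["Keyword"].apply(lambda w: lookup_vector(w, toy_model))

# Keep only the keywords that actually have a vector before computing distances.
has_vector = df["Vectors"].apply(lambda v: v is not None)
known = df[has_vector].reset_index(drop=True)
print(known["Keyword"].tolist())
```

Whether to drop unknown words, lowercase them, or substitute a default vector depends on the data; dropping is simply the least surprising choice for a distance matrix.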

Once this is done, you can calculate the distance between two vectors using a number of methods, e.g. Euclidean distance:

from sklearn.metrics.pairwise import euclidean_distances
distances = euclidean_distances(list(df['Vectors']))

distances will be a square matrix with 0 on the diagonal, where entry (i, j) is the distance between word i and word j. The closer a distance is to 0, the more similar the words are.
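To make the matrix easier to read, it can be wrapped in a DataFrame labelled with the keywords. The sketch below uses made-up 3-dimensional vectors in place of real word2vec vectors (an assumption for illustration only, so the numbers carry no linguistic meaning) and computes the Euclidean distances with NumPy broadcasting rather than scikit-learn, to keep the example dependency-light.

```python
import numpy as np
import pandas as pd

# Made-up vectors standing in for real word2vec output (assumption).
df = pd.DataFrame({
    "Keyword": ["plant", "cell", "cat"],
    "Vectors": [
        np.array([0.0, 0.0, 0.0]),
        np.array([3.0, 4.0, 0.0]),
        np.array([0.0, 0.0, 1.0]),
    ],
})

labels = df["Keyword"].to_list()
X = np.vstack(df["Vectors"].to_list())  # shape: (n_words, n_dims)

# Pairwise Euclidean distances: subtract every row from every other row.
diff = X[:, None, :] - X[None, :, :]
distances = np.sqrt((diff ** 2).sum(axis=-1))

# Label rows and columns with the keywords.
dist_df = pd.DataFrame(distances, index=labels, columns=labels)
print(dist_df.round(3))

# For each word, find the closest other word (the zero diagonal is masked out).
off_diagonal = ~np.eye(len(labels), dtype=bool)
nearest = dist_df.where(off_diagonal).idxmin(axis=1)
print(nearest)
```

With labelled rows and columns, a single pair can be read off directly, for example dist_df.loc["plant", "cell"].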

You can use different models and different distance metrics, but this works as a starting point.
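Since the question asks specifically about a correlation distance, one alternative metric is 1 minus the Pearson correlation between two word vectors; cosine similarity is another common choice for word embeddings. A minimal sketch with NumPy, again using made-up vectors (an assumption) in place of real embeddings:

```python
import numpy as np
import pandas as pd

words = ["plant", "cell", "cat"]

# Made-up 4-dimensional vectors (assumption), chosen so the results are easy
# to check: "cell" is "plant" scaled by 2, "cat" is "plant" reversed.
X = np.array([
    [1.0, 2.0, 3.0, 4.0],
    [2.0, 4.0, 6.0, 8.0],
    [4.0, 3.0, 2.0, 1.0],
])

# Correlation distance: 1 - Pearson correlation between each pair of rows.
corr_dist = 1.0 - np.corrcoef(X)

# Cosine similarity: dot product of the length-normalised rows.
unit = X / np.linalg.norm(X, axis=1, keepdims=True)
cos_sim = unit @ unit.T

print(pd.DataFrame(corr_dist, index=words, columns=words).round(3))
print(pd.DataFrame(cos_sim, index=words, columns=words).round(3))
```

Correlation distance ranges from 0 (perfectly correlated) to 2 (perfectly anti-correlated), while cosine similarity runs the other way: values near 1 mean the vectors point in the same direction.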

Source: https://stackoverflow.com/questions/55394673/how-to-find-the-correlation-between-two-strings-in-pandas

Tags: python, string, pandas, dataframe, correlation